How to evaluate an AI support agent
A three-tier criteria stack to qualify any candidate. A metric framework purpose-built for this category. A reframe that protects you from optimizing for the wrong success measure.
The five functions in 2.1 tell you what a 2026 agent must do. This chapter tells you how to evaluate whether a specific candidate does it, with a measurement plan that doesn't silently inherit assumptions from the wrong category.
01 The three-tier criteria stack
Trust sits at the top, because failure there disqualifies a candidate regardless of how it performs everywhere else.
Tier 1: Entry. A pre-screen to filter out candidates that can't physically operate against your content.
- Ingests your real documentation. Not a curated sample. Scanned 2008 bulletins, multi-column guides with embedded schematics, inconsistent revision naming.
- Preserves layout and visual structure. Wiring diagrams aren't the sum of their words; torque tables lose everything flattened to prose.
- Isolates revisions. Same family, different revisions, different correct answers. Verify it doesn't silently average across them.
- Auditable output. Every query, retrieval, citation, and refusal logged and reproducible six months later. The defensibility layer if a warranty claim ever traces back to a conversation.
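As an illustration of what "logged and reproducible" could look like in practice, here is a minimal sketch of an audit record. The field names, the citation string format, and the use of a SHA-256 digest are assumptions for illustration, not a description of any particular vendor's log schema.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass
class AuditRecord:
    """One logged conversation turn (hypothetical schema)."""
    query: str
    retrieved_ids: list   # identifiers of the passages the agent retrieved
    citations: list       # e.g. "manual-id@revision#page" (assumed format)
    refused: bool
    timestamp: str        # ISO 8601 string supplied by the caller

    def to_json(self) -> str:
        # sort_keys gives a stable serialization, so the digest is reproducible.
        return json.dumps(asdict(self), sort_keys=True)

    def digest(self) -> str:
        # Content hash that lets a reviewer confirm the record was not altered.
        return hashlib.sha256(self.to_json().encode("utf-8")).hexdigest()
```

Storing the digest alongside each record gives a reviewer a cheap way to confirm, months later, that the logged retrieval and citation chain is the one the agent actually produced.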
Tier 2: Performance. For candidates that pass Tier 1, test real behavior against your hardest query types, not demo content.
- Symptom-to-fix. "Flashing red, no relay click" matched to a diagnostic flow. Twenty real symptoms from last quarter.
- Multi-source synthesis. A wiring callout from manual 1, sequence from manual 2, torque spec from manual 3. Every component cites its source.
- Version collision. A paired battery of revision-collision queries to verify no silent averaging.
- Ambiguous queries. Queries with too little context to disambiguate. A good agent asks one clarifying question; a bad one guesses.
- Out of scope. Queries deliberately outside your documentation, to verify a clean refusal with an explicit "not in verified documentation" signal.
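The five query types above can be run as a small labeled battery. The sketch below assumes each agent response has already been reduced to a hypothetical `Response` with an outcome kind and a list of citations; the labels, type names, and grading rule are illustrative, and a real evaluation would also check answer correctness by hand.

```python
from dataclasses import dataclass, field

# Hypothetical outcome labels; a real agent's response format will differ.
ANSWER, CLARIFY, REFUSE = "answer", "clarify", "refuse"

# Expected behavior per query type, following the list above.
EXPECTED = {
    "symptom_to_fix": ANSWER,
    "multi_source": ANSWER,
    "version_collision": ANSWER,
    "ambiguous": CLARIFY,     # one clarifying question, not a guess
    "out_of_scope": REFUSE,   # explicit "not in verified documentation"
}


@dataclass
class Response:
    kind: str                                   # ANSWER, CLARIFY, or REFUSE
    citations: list = field(default_factory=list)


def grade(query_type: str, response: Response) -> bool:
    """True if the response shows the behavior expected for this query type."""
    if response.kind != EXPECTED[query_type]:
        return False
    # An answer with no citation fails, whatever its content.
    if response.kind == ANSWER and not response.citations:
        return False
    return True


def score_battery(results):
    """results: iterable of (query_type, Response). Returns pass rate per type."""
    totals, passes = {}, {}
    for query_type, response in results:
        totals[query_type] = totals.get(query_type, 0) + 1
        passes[query_type] = passes.get(query_type, 0) + grade(query_type, response)
    return {t: passes[t] / totals[t] for t in totals}
```

Reporting a pass rate per query type, not one blended score, keeps a weak category (for example, ambiguous queries) from hiding behind a strong one.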
Tier 3: Trust. The veto tier. Every criterion is pass-or-no-launch, and a failure here disqualifies regardless of headline numbers.
- Refusal behavior. Run 100 out-of-scope queries. Every response must include an explicit "couldn't source" signal, not a confident hedge.
- Citation integrity. Spot-check citations across a full conversation. Patterns of broken citations are Trust-tier failures.
- Escalation pathway. On refusal, the handoff carries the full transcript and asset identifiers. The engineer steps in pre-educated.
- Output auditability. Every answer reproducible from logs: the retrieval, the citation chain, and why it answered or refused.
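The pass-or-no-launch rule can be written as a gate in which any single failed criterion vetoes the candidate. This is a minimal sketch: the marker phrases and criterion names are hypothetical, and string matching is only a first filter before human review of the refusals.

```python
# Hypothetical phrases that count as an explicit "couldn't source" signal.
REFUSAL_MARKERS = ("not in verified documentation", "couldn't source")


def is_explicit_refusal(text: str) -> bool:
    """True if the response carries an explicit refusal marker."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def refusal_check(out_of_scope_responses) -> bool:
    """Passes only if every out-of-scope response refuses explicitly."""
    responses = list(out_of_scope_responses)
    return bool(responses) and all(is_explicit_refusal(r) for r in responses)


def trust_gate(checks: dict) -> tuple:
    """checks maps criterion name -> bool. Any False is a veto.

    Returns (launch_ok, sorted list of failed criteria).
    """
    failed = sorted(name for name, ok in checks.items() if not ok)
    return (not failed, failed)
```

Note that the gate takes no scores as input. A candidate with strong performance numbers and one failed Trust criterion still returns `False`.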
02 The reframe that protects your success metric
Most evaluations inherit deflection rate from the call-center category. The metric measures whether AI kept the query off the human queue. It does not measure whether the customer's problem got solved.
In the built environment those are two different things. A tech whose query gets answered with a confidently wrong wiring diagram has been deflected (no ticket opened) and mis-served. The cost shows up six weeks later as a warranty claim, a callback, or brand defection. Set deflection as your success metric and you will hit the metric and lose the value.
Make resolution the primary success metric and treat deflection as a downstream indicator. Resolution pushes vendors toward citation integrity, refusal discipline, and content accuracy: the behaviors that prevent the silent failures deflection lets through.
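The gap between the two metrics is easy to see once each conversation carries both labels. The sketch below assumes a hypothetical `resolved` flag per conversation; how that label is obtained (field follow-up, a later callback, a warranty record) is left open and is the hard part in practice.

```python
from dataclasses import dataclass


@dataclass
class Conversation:
    ticket_opened: bool   # did the query reach the human queue?
    resolved: bool        # was the problem actually solved? (assumed label)


def deflection_rate(convs) -> float:
    """Share of conversations that never opened a ticket."""
    return sum(not c.ticket_opened for c in convs) / len(convs)


def resolution_rate(convs) -> float:
    """Share of conversations where the problem was solved."""
    return sum(c.resolved for c in convs) / len(convs)


def silent_failure_rate(convs) -> float:
    """Deflected but not resolved: no ticket, problem still unsolved."""
    return sum((not c.ticket_opened) and (not c.resolved) for c in convs) / len(convs)
```

The silent-failure rate is the quantity a deflection-only scorecard cannot see: those conversations count as successes under deflection and as failures under resolution.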
03 A metric framework built for this category
Most metric vocabularies come from chatbots at digital-native companies. They assume a chat-handler dynamic and a desk-based user who fills out surveys. Neither holds in the built environment.
What works instead: four layers. Each answers a different question, has a different audience, and runs on a different cadence.
Layer 0. Brand NPS, dealer NPS, cohort NPS where attribution is possible.
Lagging indicators. Causation is directional only; cohort slicing gets credible where the agent sits behind authentication.
Layer 1. Answer rate, source-pinned rate, refusal rate, in-conversation micro-CSAT, revision correctness.
Micro-CSAT is the single thumbs-up/down field techs actually click. Email surveys to installers get zero responses.
Layer 2. First-touch self-service, topic-by-topic deflection, source coverage, topic discovery.
Topic-by-topic deflection is far more actionable than an aggregate number.
Layer 3. First-time-fix, warranty avoidance, commissioning-time delta, cross-team insight pickup.
Cross-team insight pickup is the metric that proves the cascade is delivering. (Cascade: 3.4.)
Drop average handle time (no human handler). Drop post-resolution email CSAT (zero response). Drop aggregate deflection as a north star (use topic-by-topic and Layer 0 cohort NPS instead). Drop "automation rate" (it assumes a chat-handler dynamic). Keep resolution, not deflection.
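The conversation-level rates and the topic-by-topic deflection named above can be computed from a flat query log. The definitions below are assumptions, since the text does not fix them: source-pinned rate is taken as the share of answers carrying at least one citation, and micro-CSAT as the thumbs-up share among rated conversations. The record keys are hypothetical.

```python
from collections import defaultdict


def conversation_metrics(records):
    """records: list of dicts with keys 'outcome' ('answer' or 'refuse'),
    'cited' (bool), and 'thumb' (+1, -1, or None if nobody clicked)."""
    total = len(records)
    answered = [r for r in records if r["outcome"] == "answer"]
    rated = [r for r in records if r["thumb"] is not None]
    return {
        "answer_rate": len(answered) / total,
        "refusal_rate": sum(r["outcome"] == "refuse" for r in records) / total,
        # Share of answers that carry at least one citation.
        "source_pinned_rate": (
            sum(r["cited"] for r in answered) / len(answered) if answered else 0.0
        ),
        # Thumbs-up share among conversations that were rated at all.
        "micro_csat": (
            sum(r["thumb"] == 1 for r in rated) / len(rated) if rated else None
        ),
    }


def deflection_by_topic(records):
    """records need 'topic' and 'ticket_opened'. Returns deflection rate per topic."""
    counts = defaultdict(lambda: [0, 0])  # topic -> [deflected, total]
    for r in records:
        counts[r["topic"]][0] += not r["ticket_opened"]
        counts[r["topic"]][1] += 1
    return {topic: d / n for topic, (d, n) in counts.items()}
```

Breaking deflection out per topic shows which documentation areas send people to the human queue, which a single aggregate figure cannot.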
Two signature moves uniquely available here
Refusal rate as a discipline metric
Most scorecards treat refusal as a gap. Where a confident wrong answer causes physical damage, a non-zero refusal rate proves the agent is operating with the rigor the category demands. Publish 8 to 15 percent as your target, track it as discipline, and treat a refusal rate trending toward zero as a red flag.
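Tracking refusal as discipline means checking it against the published band and watching its direction. A minimal sketch, using the 8 to 15 percent band from the text; the four-period window and the "strictly falling" rule for the red flag are assumptions to tune against your own data.

```python
TARGET_LOW, TARGET_HIGH = 0.08, 0.15  # the 8 to 15 percent target band


def refusal_status(rate: float) -> str:
    """Classify a refusal rate against the target band."""
    if rate < TARGET_LOW:
        return "below_target"   # agent may be answering what it cannot source
    if rate > TARGET_HIGH:
        return "above_target"   # may point to a source-coverage gap (assumption)
    return "on_target"


def trending_toward_zero(period_rates, window: int = 4) -> bool:
    """Red flag: each of the last `window` periods is lower than the one
    before it, and the latest rate sits under the target floor."""
    recent = list(period_rates)[-window:]
    if len(recent) < window:
        return False
    falling = all(b < a for a, b in zip(recent, recent[1:]))
    return falling and recent[-1] < TARGET_LOW
```

For example, `trending_toward_zero([0.12, 0.10, 0.07, 0.04])` flags a steady slide through the floor, while a single low period inside an otherwise stable series does not.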
Cohort NPS where attribution holds
Most manufacturers know their NPS but can't tell which interactions moved it. A managed service with a connected query log and an authenticated session model can segment NPS by AI usage: dealers whose installers use it heavily versus lightly, customers whose query was answered versus who bounced. Open public chat surfaces don't always support this. Where the agent lives behind authentication, it does.
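Where sessions are authenticated, the segmentation described above reduces to grouping survey scores by usage. The sketch below uses the standard NPS definition (percentage of promoters scoring 9 or 10 minus percentage of detractors scoring 0 to 6); the input shape and the `heavy_threshold` cut between light and heavy usage are hypothetical.

```python
def nps(scores):
    """Net Promoter Score on the 0-10 scale:
    % promoters (9-10) minus % detractors (0-6). None if there are no scores."""
    scores = list(scores)
    if not scores:
        return None
    promoters = sum(s >= 9 for s in scores)
    detractors = sum(s <= 6 for s in scores)
    return 100 * (promoters - detractors) / len(scores)


def cohort_nps(responses, heavy_threshold=20):
    """responses: iterable of (nps_score, query_count) per authenticated account.
    heavy_threshold is an assumed cut between light and heavy usage."""
    cohorts = {"heavy": [], "light": []}
    for score, query_count in responses:
        key = "heavy" if query_count >= heavy_threshold else "light"
        cohorts[key].append(score)
    return {name: nps(scores) for name, scores in cohorts.items()}
```

A gap between the two cohorts is directional evidence only. Heavy users may differ from light users in ways unrelated to the agent, so read the difference as a signal to investigate, not as proof of cause.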
The metric set you commit to on day one decides what story you can tell on day 365. Inherit chatbot metrics and you'll defend deflection numbers to a board that wants business outcomes. Build the four-layer framework now and you'll tell a story about brand halo, field outcomes, and cross-team insight pickup.
- Uses the three-tier criteria stack: Entry to qualify, Performance to behave, Trust to disqualify on any fail
- Treats resolution as the primary success metric, deflection as downstream
- Runs the four-layer framework with the right audience and cadence per layer
- Publishes a non-zero refusal-rate target and treats refusal as discipline
- Tracks cohort NPS where attribution is possible, and is honest about limits where it isn't
- Has dropped inherited metrics that don't fit (AHT, post-resolution email CSAT, aggregate deflection)