How to evaluate an AI support agent

The five functions in 2.1 tell you what a 2026 agent must do. This chapter tells you how to evaluate whether a specific candidate does it, with a measurement plan that doesn't silently inherit assumptions from the wrong category.

The three-tier criteria stack

Trust sits at the top, because failure there disqualifies a candidate regardless of how it performs everywhere else.

Test Entry first to qualify. Test Performance for real behavior. Trust is the unappealable veto.

Tier 1 · Entry: Does this even fit?

A pre-screen that filters out candidates that cannot operate on your content at all.

Ingests your real documentation. Not a curated sample. Scanned 2008 bulletins, multi-column guides with embedded schematics, inconsistent revision naming.
Preserves layout and visual structure. Wiring diagrams aren't the sum of their words; torque tables lose everything flattened to prose.
Isolates revisions. Same family, different revisions, different correct answers. Verify it doesn't silently average across them.
Auditable output. Every query, retrieval, citation, and refusal logged and reproducible six months later. The defensibility layer if a warranty claim ever traces back to a conversation.
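As a rough illustration of the auditability criterion above, the sketch below shows one possible shape for a per-answer log record. The `AuditRecord` type and every field name in it are hypothetical; a real vendor's log schema will differ. The property worth testing is that a record written today can be read back unchanged later.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class AuditRecord:
    # All field names are illustrative assumptions, not a vendor schema.
    conversation_id: str
    timestamp_utc: str           # ISO 8601 string
    query: str
    retrieved_ids: list          # IDs of the passages the agent retrieved
    citations: list              # document, revision, and page per claim
    outcome: str                 # "answered" or "refused"
    refusal_reason: Optional[str] = None


def to_log_line(record: AuditRecord) -> str:
    """Serialize one record as a single JSON line."""
    return json.dumps(asdict(record), sort_keys=True)


def from_log_line(line: str) -> AuditRecord:
    """Rebuild the record from its JSON line."""
    return AuditRecord(**json.loads(line))


# Hypothetical example record.
record = AuditRecord(
    conversation_id="c-001",
    timestamp_utc="2026-01-15T14:02:11Z",
    query="Flashing red, no relay click",
    retrieved_ids=["guide-revB-p12"],
    citations=["Install guide, Rev B, p. 12"],
    outcome="answered",
)
```

During Entry screening, ask each candidate to export records like this for a sample of test conversations and confirm that nothing needed to reconstruct the answer is missing.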

Tier 2 · Performance: Does it work on the queries that matter?

For candidates that pass Tier 1, test real behavior against your hardest query types. Not demo content.

Symptom-to-fix. "Flashing red, no relay click" matched to a diagnostic flow. Twenty real symptoms from last quarter.
Multi-source synthesis. A wiring callout from manual 1, sequence from manual 2, torque spec from manual 3. Every component cites its source.
Version collision. A paired battery of revision-collision queries to verify no silent averaging.
Ambiguous queries. Queries with too little context to disambiguate. A good agent asks one clarifying question; a bad one guesses.
Out of scope. Queries deliberately outside your documentation, to verify a clean refusal with an explicit "not in verified documentation" signal.
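The version-collision test above can be sketched as a paired check: ask the same question against two revisions and confirm that each answer carries its own revision's value and none of the other's. This is a minimal sketch. The `CollisionPair` type, the `ask` callable, the two stub agents, and the torque values are hypothetical stand-ins for your own harness and data, and substring matching is a deliberately crude scoring rule.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class CollisionPair:
    question: str
    rev_a: str
    expected_a: str
    rev_b: str
    expected_b: str


def passes_collision(ask: Callable[[str, str], str], pair: CollisionPair) -> bool:
    """True only if each revision gets its own value and none of the other's."""
    answer_a = ask(pair.question, pair.rev_a)
    answer_b = ask(pair.question, pair.rev_b)
    correct = pair.expected_a in answer_a and pair.expected_b in answer_b
    leaked = pair.expected_b in answer_a or pair.expected_a in answer_b
    return correct and not leaked


# Hypothetical pair: same product family, two revisions, two torque specs.
PAIR = CollisionPair("Terminal block torque?", "Rev B", "18 Nm", "Rev D", "24 Nm")


def isolating_agent(question: str, revision: str) -> str:
    """Stub that keeps revisions separate."""
    specs = {"Rev B": "18 Nm", "Rev D": "24 Nm"}
    return f"{specs[revision]} (source: install guide {revision})"


def averaging_agent(question: str, revision: str) -> str:
    """Stub that blends revisions into one plausible-looking number."""
    return "About 21 Nm"
```

Here `passes_collision(isolating_agent, PAIR)` passes and `passes_collision(averaging_agent, PAIR)` fails. In a real battery, `ask` would call the candidate and the pairs would come from revisions you know disagree.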

Tier 3 · Trust: Would you put this in front of a real tech?

The veto tier. Every criterion is pass-or-no-launch, and a failure here disqualifies regardless of headline numbers.

Refusal behavior. Run 100 out-of-scope queries. Every response includes an explicit "couldn't source" signal, not a confident-sounding guess.
Citation integrity. Spot-check citations across a full conversation. Patterns of broken citations are Trust-tier failures.
Escalation pathway. On refusal, the handoff carries the full transcript and asset identifiers. The engineer steps in pre-educated.
Output auditability. Every answer reproducible from logs: the retrieval, the citation chain, and why it answered or refused.
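One way to keep the three tiers from collapsing into a single weighted score is to encode the gating order directly. The sketch below is an illustration under assumptions, not a prescribed scoring method. `CandidateResult`, the check names, and the 0-to-1 Performance score are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CandidateResult:
    name: str
    entry: dict         # Tier 1 check name -> passed?
    performance: float  # Tier 2 score, assumed to be on a 0.0 to 1.0 scale
    trust: dict         # Tier 3 check name -> passed?


def verdict(c: CandidateResult) -> str:
    """Entry qualifies, Trust vetoes, Performance only ranks what is left."""
    entry_fails = sorted(k for k, ok in c.entry.items() if not ok)
    if entry_fails:
        return "screened out at Entry: " + ", ".join(entry_fails)
    trust_fails = sorted(k for k, ok in c.trust.items() if not ok)
    if trust_fails:
        # A strong Performance score never offsets a Trust failure.
        return "no launch, Trust veto: " + ", ".join(trust_fails)
    return f"eligible, Performance score {c.performance:.2f}"


# Hypothetical candidates.
strong_but_untrusted = CandidateResult(
    name="Vendor A",
    entry={"ingests_real_docs": True, "isolates_revisions": True},
    performance=0.95,
    trust={"refusal_behavior": True, "citation_integrity": False},
)
solid = CandidateResult(
    name="Vendor B",
    entry={"ingests_real_docs": True, "isolates_revisions": True},
    performance=0.80,
    trust={"refusal_behavior": True, "citation_integrity": True},
)
```

The design choice is that Performance never enters the Trust decision: `verdict(strong_but_untrusted)` returns a no-launch result despite the higher score, while `verdict(solid)` is eligible.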

The reframe that protects your success metric

Most evaluations inherit deflection rate from the call-center category. The metric measures whether AI kept the query off the human queue. It does not measure whether the customer's problem got solved.

In the built environment those are two different things. A tech whose query gets answered with a confidently wrong wiring diagram has been deflected (no ticket opened) and mis-served. The cost shows up six weeks later as a warranty claim, a callback, or brand defection. Set deflection as your success metric and you will hit the metric and lose the value.

Optimize for deflection and the metric climbs while trust erodes. Optimize for resolution and both lines move together.

Make resolution the primary success metric and treat deflection as a downstream indicator. Resolution pushes vendors toward citation integrity, refusal discipline, and content accuracy: the behaviors that prevent the silent failures deflection lets through.
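The gap between the two metrics is easy to see on a small, made-up sample. In the sketch below the `Conversation` fields are hypothetical, and `problem_solved` assumes you have some way to confirm a fix (by the agent or after handoff), which is the hard part in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conversation:
    ticket_opened: bool   # did the query reach the human queue?
    problem_solved: bool  # was a fix confirmed, by the agent or after handoff?


def deflection_rate(convs) -> float:
    """Share of conversations that never opened a ticket."""
    return sum(not c.ticket_opened for c in convs) / len(convs)


def resolution_rate(convs) -> float:
    """Share of conversations where the problem was actually solved."""
    return sum(c.problem_solved for c in convs) / len(convs)


def deflected_unresolved(convs) -> int:
    """Silent failures: no ticket opened, problem not solved."""
    return sum(1 for c in convs if not c.ticket_opened and not c.problem_solved)


# Invented sample: 6 solved by the agent, 3 deflected but wrong, 1 solved by a human.
sample = (
    [Conversation(False, True)] * 6
    + [Conversation(False, False)] * 3
    + [Conversation(True, True)]
)
```

On this sample deflection is 90 percent while resolution is 70 percent, and the three deflected-but-unresolved conversations are exactly the cases a deflection target rewards.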

A metric framework built for this category

Most metric vocabularies come from chatbots at digital-native companies. They assume a chat-handler dynamic and a desk-based user who fills out surveys. Neither holds in the built environment.

What works instead: four layers. Each answers a different question, has a different audience, and runs on a different cadence.

Layer 0 · Brand halo

Brand NPS, dealer NPS, cohort NPS where attribution is possible.

Lagging indicators. Causation is directional only; cohort slicing becomes credible where the agent sits behind authentication.

Audience: CEO / CMO / Board · Cadence: annual

Layer 1 · Field-experience health

Answer rate, source-pinned rate, refusal rate, in-conversation micro-CSAT, revision correctness.

Micro-CSAT is the single thumbs-up/down field techs actually click. Email surveys to installers get zero responses.

Audience: CS Ops · Cadence: weekly
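The Layer 1 rates can be computed from per-turn logs along these lines. The text above names the metrics but does not define them, so the definitions here are assumptions: answer rate and refusal rate are shares of all turns, source-pinned rate is the share of answered turns with at least one citation, and micro-CSAT is thumbs-up over all thumbs clicked. Revision correctness is left out because it needs a ground-truth answer key. The `Turn` type is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Turn:
    refused: bool
    citations: int                    # number of sources cited in the answer
    thumbs_up: Optional[bool] = None  # None means the tech did not click


def layer1_metrics(turns) -> dict:
    total = len(turns)
    answered = [t for t in turns if not t.refused]
    rated = [t for t in turns if t.thumbs_up is not None]
    return {
        "answer_rate": len(answered) / total,
        "refusal_rate": (total - len(answered)) / total,
        # Assumed definition: answered turns that cite at least one source.
        "source_pinned_rate": (
            sum(t.citations > 0 for t in answered) / len(answered) if answered else 0.0
        ),
        # Assumed definition: thumbs-up share among turns that were rated.
        "micro_csat": sum(t.thumbs_up for t in rated) / len(rated) if rated else None,
    }


# Invented week of ten turns.
turns = (
    [Turn(False, 2, True)] * 3   # answered, cited, thumbs up
    + [Turn(False, 1, False)]    # answered, cited, thumbs down
    + [Turn(False, 1)] * 4       # answered, cited, no click
    + [Turn(False, 0)]           # answered without a citation
    + [Turn(True, 0)]            # refused
)
weekly = layer1_metrics(turns)
```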

Layer 2 · Support-team impact

First-touch self-service, topic-by-topic deflection, source coverage, topic discovery.

Topic-by-topic deflection is far more actionable than an aggregate number.

Audience: CS Director · Cadence: monthly
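A small sketch of why the per-topic view is more actionable: grouping the same conversations by topic can expose a weak area that the aggregate hides. The topic labels and counts are invented, and the sketch assumes each conversation already carries a topic tag.

```python
from collections import defaultdict


def deflection_by_topic(rows) -> dict:
    """rows: iterable of (topic, ticket_opened) pairs."""
    totals = defaultdict(int)
    deflected = defaultdict(int)
    for topic, ticket_opened in rows:
        totals[topic] += 1
        if not ticket_opened:
            deflected[topic] += 1
    return {topic: deflected[topic] / totals[topic] for topic in totals}


# Invented month: commissioning queries deflect well, wiring queries do not.
rows = (
    [("commissioning", False)] * 18
    + [("commissioning", True)] * 2
    + [("wiring", False)] * 1
    + [("wiring", True)] * 4
)
by_topic = deflection_by_topic(rows)
aggregate = sum(not opened for _, opened in rows) / len(rows)
```

The aggregate here is 76 percent, which looks healthy, while wiring sits at 20 percent and points to where source coverage needs work.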

Layer 3 · Field outcomes & business value

First-time-fix, warranty avoidance, commissioning-time delta, cross-team insight pickup.

Cross-team insight pickup is the metric that proves the cascade is delivering. (The cascade is covered in 3.4.)

Audience: CFO / CEO · Cadence: quarterly

Drop from a generic metric set

Drop average handle time (no human handler). Drop post-resolution email CSAT (zero response). Drop aggregate deflection as a north star (use topic-by-topic and Layer 0 cohort NPS instead). Drop "automation rate" (it assumes a chat-handler dynamic). Keep resolution, not deflection.

Two signature moves uniquely available here

Refusal rate as a discipline metric

Most scorecards treat refusal as a gap. Where a confident wrong answer causes physical damage, a non-zero refusal rate proves the agent is operating with the rigor the category demands. Publish 8 to 15 percent as your target, track it as discipline, and treat a refusal rate trending toward zero as a red flag.
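A minimal sketch of tracking refusal rate against the published band. The 8 and 15 percent bounds come from the target above; the function names, the band labels, and the rule used for "trending toward zero" (strictly falling week over week and now under the lower bound) are assumptions to adjust.

```python
def refusal_band(refused: int, total: int, low: float = 0.08, high: float = 0.15) -> str:
    """Classify a period's refusal rate against the target band."""
    if total <= 0:
        raise ValueError("total must be positive")
    rate = refused / total
    if rate < low:
        return "below_band"
    if rate > high:
        return "above_band"
    return "in_band"


def trending_toward_zero(weekly_rates, low: float = 0.08) -> bool:
    """Assumed rule: rate fell every week and the latest week is under the band."""
    if len(weekly_rates) < 2:
        return False
    falling = all(b < a for a, b in zip(weekly_rates, weekly_rates[1:]))
    return falling and weekly_rates[-1] < low
```

For example, 12 refusals in 100 queries is in band, while weekly rates of 0.12, 0.09, and 0.05 trip the red flag.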

Cohort NPS where attribution holds

Most manufacturers know their NPS but can't tell which interactions moved it. A managed service with a connected query log and an authenticated session model can segment NPS by AI usage: dealers whose installers use it heavily versus lightly, customers whose query was answered versus who bounced. Open public chat surfaces don't always support this. Where the agent lives behind authentication, it does.
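Where sessions are authenticated, the segmentation can be sketched as follows. NPS uses the standard definition: percent promoters (scores 9 to 10) minus percent detractors (scores 0 to 6). The heavy-versus-light threshold of 10 queries per month and the sample data are hypothetical.

```python
def nps(scores) -> float:
    """Net Promoter Score: % promoters (9-10) minus % detractors (0-6)."""
    if not scores:
        raise ValueError("no scores")
    promoters = sum(s >= 9 for s in scores)
    detractors = sum(s <= 6 for s in scores)
    return 100 * (promoters - detractors) / len(scores)


def nps_by_cohort(responses, heavy_threshold: int = 10) -> dict:
    """responses: iterable of (monthly_agent_queries, nps_score) per dealer."""
    cohorts = {"heavy": [], "light": []}
    for queries, score in responses:
        key = "heavy" if queries >= heavy_threshold else "light"
        cohorts[key].append(score)
    # Skip empty cohorts rather than report a score for no respondents.
    return {name: nps(scores) for name, scores in cohorts.items() if scores}


# Invented survey: (agent queries last month, 0-10 score).
responses = [
    (40, 10), (25, 9), (12, 9), (30, 8), (15, 6),
    (2, 9), (0, 7), (5, 6), (1, 5), (3, 10),
]
cohort_nps = nps_by_cohort(responses)
```

On this sample the heavy-usage cohort scores 40 and the light-usage cohort scores 0. As with Layer 0 generally, a gap like this is directional evidence, not proof of causation.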

Why this matters

The metric set you commit to on day one decides what story you can tell on day 365. Inherit chatbot metrics and you'll defend deflection numbers to a board that wants business outcomes. Build the four-layer framework now and you'll tell a story about brand halo, field outcomes, and cross-team insight pickup.

What good looks like

An evaluation framework calibrated for this category:

Uses the three-tier criteria stack: Entry to qualify, Performance to test real behavior, Trust to disqualify on any fail
Treats resolution as the primary success metric, deflection as downstream
Runs the four-layer framework with the right audience and cadence per layer
Publishes a non-zero refusal-rate target and treats refusal as discipline
Tracks cohort NPS where attribution is possible, and is honest about limits where it isn't
Has dropped inherited metrics that don't fit (AHT, post-resolution email CSAT, aggregate deflection)

Next · Chapter 2.3

The category landscape