02.1 How to evaluate it

What an AI support agent must actually do

Five functions. Every candidate agent either does them or it doesn't. Three are familiar from generic vendors. Two are why generic vendors fail for products like yours.

Most evaluation rubrics for AI customer support come from B2C and SaaS, where the bar is fluent conversation about a billing dispute. That bar is too low for the built environment. The constraints in 1.3 demand stricter functional requirements.

Five functions a 2026 agent must perform
  • Function 1: Direct answer. Not a list of 8 PDFs to scroll. Failure mode: top 8 results.
  • Function 2: Source-pinned. Cites the page it came from. Failure mode: no link = opinion.
  • Function 3: Visual content treated like text. Diagrams, video, schematics. Failure mode: "See the diagram."
  • Function 4: Knows when to escalate. Refuses, hands off with context. Failure mode: confident guess.
  • Function 5: Learns from what's missed. Every gap is a signal. Failure mode: black-hole log.
Function 3 (visual content treated like text) is the most-skipped capability in 2026 vendor pitches.
Function 1

Direct answer

When a tech asks, the agent replies in plain, actionable language with the answer itself. Not eight PDFs to scroll. Not a search-results page. Not "you might want to check the manual."

Failure mode

"Here are the top 8 results for 'wiring P1467-LE.'" That isn't an answer. That's keyword search wearing an AI veneer.

What good looks like

"On the P1467-LE rev D, the auxiliary relay common terminal is screw 7. Jumper to screw 5 for fail-secure." Followed by a citation, then "want me to pull up the wiring diagram?"

Enterprise search platforms fail this because retrieval is the product. Generic AI agents usually pass it, because direct-answer shape is their core value prop.

Function 2

Sourced answers

Meaningful claims tie back to the page, section, or video moment they came from. Not the corpus as a whole. Not a vague "see the installation guide." A clickable link the tech can verify without leaving the conversation.

Failure mode

Confident paragraphs with no links. That's an opinion delivered by a language model, and in a technical field, opinions get equipment damaged.

What good looks like

Substantive claims carry source links the tech can tap. The link opens to the page or moment the answer came from. Trust is verifiable.

Test move

Spot-check citations in a candidate's demo response. If a cited link lands somewhere that doesn't support the claim, that's a hallucinated citation. A small number in a controlled demo is a serious yellow flag; a pattern of them is disqualifying.
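
The spot-check above can be tallied with a few lines of code. This is a minimal sketch, not a prescribed tool: the `CitationCheck` record, the `tally_spot_check` function, and especially the `pattern_threshold` cutoff are assumptions made for illustration. The text says only "a small number" versus "a pattern," so choose a cutoff that fits the size of your demo.

```python
from dataclasses import dataclass


@dataclass
class CitationCheck:
    claim: str       # the claim the agent made
    link: str        # the citation it attached to that claim
    supported: bool  # reviewer's judgment: does the linked page back the claim?


def tally_spot_check(checks: list[CitationCheck], pattern_threshold: int = 3) -> str:
    """Classify a demo by how many cited links fail to support their claim.

    pattern_threshold is an assumed cutoff for what counts as "a pattern".
    """
    unsupported = sum(1 for c in checks if not c.supported)
    if unsupported == 0:
        return "clean"
    if unsupported < pattern_threshold:
        return "yellow flag"
    return "disqualifying"
```

The `supported` field is filled in by a human reviewer who actually opens each link; the code only keeps the count honest.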

Function 3

Visual content treated like text

For a field tech, the answer is frequently a wiring diagram, schematic, torque-spec table, exploded view, or a 30-second clip of a training video. Text-only retrieval is half a product in this category.

A 2026-grade agent ingests, interprets, and retrieves visual content with the same fluency it handles text. When the right answer is a labeled diagram with two specific callouts highlighted, that's what the agent surfaces. Not a paragraph that says "see the wiring diagram."

Failure mode

"You can find the wiring diagram on page 47." Text-only retrieval with a polite link. The tech is back to PDF-scrolling.

What good looks like

The wiring diagram appears inline with the two terminals highlighted. If a video covers the same step, the 30-second segment plays inline with the timestamp deep-linked.

This is the function most often skipped, faked, or undersold in pitches. A demo will show multimodal on a clean stock-image diagram, then crumble on your real schematic, a multi-column scan from 2014. Test on your actual content.

Function 4

Knows when to escalate

If the answer isn't in your documentation, the agent doesn't guess. It refuses, names what it doesn't know, and hands off with full context. The handoff is seamless for the tech: nothing has to be explained twice.

Failure mode

A confident wrong answer generated to fill silence. Or a brick-wall "I can't help with that" with no path to a human. Both produce a tech who gives up on your brand.

What good looks like

"I don't see that procedure in the documentation. Connecting you to a tier-2 engineer with everything you've told me so far." The engineer opens to a 3-sentence summary. Cold handoffs eliminated.

A discipline metric, not a failure metric

The strongest agents have a measurable refusal rate of 8 to 15 percent. Frame it as a feature, not a gap. The day refusal trends to zero is the day you're running a guessing agent. (Metric framework: 2.2.)
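
As a rough illustration of tracking this metric, the sketch below computes a refusal rate and compares it to the 8 to 15 percent band named above. The function names are hypothetical, and what a rate above the band means is not stated in this chapter, so the code reports the position only.

```python
def refusal_rate(refused: int, total: int) -> float:
    """Share of conversations in which the agent declined to answer."""
    if total <= 0:
        raise ValueError("total must be positive")
    return refused / total


def refusal_band(rate: float, low: float = 0.08, high: float = 0.15) -> str:
    """Place a refusal rate relative to the 8-15 percent band.

    "below" is the warning sign described in the text (a guessing agent).
    "above" is reported without interpretation; the text does not define it.
    """
    if rate < low:
        return "below"
    if rate > high:
        return "above"
    return "within"
```

For example, 12 refusals in 100 conversations gives `refusal_band(refusal_rate(12, 100))` of `"within"`, while 2 in 100 gives `"below"`.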

Function 5

Learns from what's missed

Every question the agent couldn't answer is a signal. Every refusal, escalation, and reformulation tells you where documentation is silent, where installers are stuck, where product is confusing, where a new use case is emerging. A 2026-grade agent treats the query log as a first-class data product: surfaced, summarized, and routed to the team that can act. (Full routing: 3.4.)

Failure mode

Unanswered queries disappear into a chat archive nobody reads. The same 30 questions show up every week. The documentation gap compounds invisibly.

What good looks like

A monthly report shows top unanswered queries by topic, the refusal trend, and new use cases, routed to docs, product, training, sales, and field service with the slice relevant to each.
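
One way to picture the report is as a grouping step over the query log. The sketch below is an assumption-laden illustration: the log format (dicts with `topic` and `answered` keys), the `ROUTES` mapping, and the `"unrouted"` fallback are all hypothetical, and the actual routing design is covered in 3.4.

```python
from collections import Counter, defaultdict

# Hypothetical topic-to-team mapping; the real one is yours to define.
ROUTES = {"wiring": "docs", "firmware": "product", "commissioning": "training"}


def monthly_report(log: list[dict], top_n: int = 3) -> dict[str, list[tuple[str, int]]]:
    """Group the top unanswered topics by the team that can act on them."""
    missed = [entry for entry in log if not entry["answered"]]
    by_topic = Counter(entry["topic"] for entry in missed)
    slices: dict[str, list[tuple[str, int]]] = defaultdict(list)
    for topic, count in by_topic.most_common(top_n):
        # Topics with no known owner are flagged rather than dropped.
        slices[ROUTES.get(topic, "unrouted")].append((topic, count))
    return dict(slices)
```

Each team receives only its own slice, and anything landing in `"unrouted"` is itself a signal that a new topic has appeared.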

In practice

Using the list in evaluation

Test all five against your real content with your real query types. Not demo content. Bring twenty real questions from your last quarter of support tickets and one real PDF that's been giving your team trouble. Watch how each candidate handles them.

Score each function as a binary: pass or fail. The rubric is intentionally unweighted, because strength on one function cannot offset a failure on another. A candidate that fails Function 4 is dangerous in this field regardless of how strong the other four are.
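
A written rubric can be as small as the sketch below. The function keys and the `evaluate` helper are hypothetical names chosen for illustration; the veto set follows the checklist later in this chapter, which treats Functions 3 and 4 as veto-level. How to weigh a failure outside the veto set is left to the evaluator.

```python
FUNCTIONS = (
    "direct_answer",       # Function 1
    "sourced_answers",     # Function 2
    "visual_content",      # Function 3
    "escalation",          # Function 4
    "learns_from_misses",  # Function 5
)

# Functions 3 and 4 are treated as veto-level.
VETO = {"visual_content", "escalation"}


def evaluate(scores: dict[str, bool]) -> dict:
    """Record pass/fail per function; no weights, no averaging."""
    missing = [f for f in FUNCTIONS if f not in scores]
    if missing:
        raise ValueError(f"unscored functions: {missing}")
    failed = [f for f in FUNCTIONS if not scores[f]]
    return {"failed": failed, "vetoed": any(f in VETO for f in failed)}
```

Because nothing is averaged, four strong passes cannot hide a veto-level failure: the result simply reports `vetoed` as true.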

What good looks like
A CS director who has internalized the five functions:
  • Has a written rubric scoring candidates pass or fail on each function
  • Tests each against their own content, not vendor demos
  • Treats Functions 3 and 4 as veto-level
  • Spot-checks citations in at least one full demo conversation
  • Can name a candidate they disqualified specifically on a five-function failure