What an AI support agent must actually do
Five functions. Every candidate agent either does them or it doesn't. Three are familiar from generic vendors. Two are why generic vendors fail for products like yours.
Most evaluation rubrics for AI customer support come from B2C and SaaS, where the bar is fluent conversation about a billing dispute. That bar is too low for the built environment. The constraints in 1.3 demand stricter functional requirements.
Function 1: Direct answer
When a tech asks, the agent replies in plain, actionable language with the answer itself. Not eight PDFs to scroll. Not a search-results page. Not "you might want to check the manual."
Fail: "Here are the top 8 results for 'wiring P1467-LE.'" That isn't an answer. That's keyword search wearing an AI veneer.
Pass: "On the P1467-LE rev D, the auxiliary relay common terminal is screw 7. Jumper to screw 5 for fail-secure." The answer is followed by a citation, then an offer: "Want me to pull up the wiring diagram?"
Enterprise search platforms fail this because retrieval is the product. Generic AI agents usually pass it, because direct-answer shape is their core value prop.
Function 2: Sourced answers
Meaningful claims tie back to the page, section, or video moment they came from. Not the corpus as a whole. Not a vague "see the installation guide." A clickable link the tech can verify without leaving the conversation.
Fail: Confident paragraphs with no links. That's an opinion delivered by a language model, and in a technical field, opinions get equipment damaged.
Pass: Substantive claims carry source links the tech can tap. Each link opens to the page or video moment the answer came from. Trust is verifiable.
Spot-check citations in a candidate's demo responses. If a cited link lands somewhere that doesn't support the claim, that's a hallucinated citation. Even a few in a controlled demo are a serious warning sign; a pattern of them is disqualifying.
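To keep the spot-check from becoming a gut call, a reviewer can log each citation they open and tally the results. The sketch below is a minimal illustration under stated assumptions: `CitationCheck`, `classify_citations`, and the three-response cutoff for what counts as a "pattern" are hypothetical choices, not part of any vendor tool or published standard.

```python
from dataclasses import dataclass


@dataclass
class CitationCheck:
    """One spot-check: did the cited link actually support the claim?"""
    response_id: str  # which demo response the claim appeared in
    claim: str        # the claim as the agent stated it
    supported: bool   # reviewer's verdict after opening the cited link


def classify_citations(checks, pattern_threshold=3):
    """Classify a candidate's spot-checked citations.

    Returns "clean", "yellow flag", or "disqualifying". pattern_threshold is a
    hypothetical cutoff: unsupported citations in at least this many distinct
    responses are treated as a pattern rather than a one-off.
    """
    bad_responses = {c.response_id for c in checks if not c.supported}
    if not bad_responses:
        return "clean"
    if len(bad_responses) >= pattern_threshold:
        return "disqualifying"
    return "yellow flag"


checks = [
    CitationCheck("r1", "example claim A", True),
    CitationCheck("r2", "example claim B", False),
]
print(classify_citations(checks))  # yellow flag
```

Counting distinct responses rather than individual bad links keeps one sloppy answer with several bad citations from looking like a systemic problem; adjust the cutoff to your own tolerance.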
Function 3: Visual content treated like text
For a field tech, the answer is frequently a wiring diagram, schematic, torque-spec table, exploded view, or a 30-second clip of a training video. Text-only retrieval is half a product in this category.
A 2026-grade agent ingests, interprets, and retrieves visual content as fluently as it handles text. When the right answer is a labeled diagram with two specific callouts highlighted, that's what the agent surfaces. Not a paragraph that says "see the wiring diagram."
Fail: "You can find the wiring diagram on page 47." Text-only retrieval with a polite link. The tech is back to scrolling PDFs.
Pass: The wiring diagram appears inline with the two terminals highlighted. If a video covers the same step, the relevant 30-second segment plays inline, deep-linked to its timestamp.
This is the function most often skipped, faked, or undersold in pitches. A demo will show multimodal retrieval on a clean stock diagram, then crumble on your real schematic: a multi-column scan from 2014. Test on your actual content.
Function 4: Knows when to escalate
If the answer isn't in your documentation, the agent doesn't guess. It refuses, names what it doesn't know, and hands off with full context. The handoff is invisible to the tech.
Fail: A confident wrong answer generated to fill silence. Or a brick-wall "I can't help with that" with no path to a human. Both produce a tech who gives up on your brand.
Pass: "I don't see that procedure in the documentation. Connecting you to a tier-2 engineer with everything you've told me so far." The engineer opens the case to a three-sentence summary. No cold handoff.
The strongest agents have a measurable refusal rate of 8 to 15 percent. Frame it as a feature, not a gap. The day refusal trends to zero is the day you're running a guessing agent. (Metric framework: 2.2.)
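A minimal sketch of how the refusal rate might be computed from a conversation log and checked against the 8 to 15 percent band, plus a crude test for a rate trending to zero. The outcome labels (`"answered"`, `"refused"`), the function names, and the 2 percent floor are assumptions for illustration; substitute whatever definitions the metric framework in 2.2 uses.

```python
def refusal_rate(outcomes):
    """Fraction of conversations in which the agent refused and handed off.

    outcomes is assumed to be a list of per-conversation labels such as
    "answered" or "refused"; the label names are hypothetical.
    """
    if not outcomes:
        raise ValueError("no conversations to measure")
    return sum(1 for o in outcomes if o == "refused") / len(outcomes)


def assess_refusal_rate(rate, low=0.08, high=0.15):
    """Compare a rate against the 8-15 percent band described above."""
    if rate < low:
        return "below band"  # per the text, a rate near zero suggests guessing
    if rate > high:
        return "above band"
    return "within band"


def trending_to_zero(monthly_rates, floor=0.02):
    """True if rates never rise month over month and the latest is under floor.

    The 2 percent floor is an illustrative assumption, not a published threshold.
    """
    if not monthly_rates:
        return False
    non_increasing = all(b <= a for a, b in zip(monthly_rates, monthly_rates[1:]))
    return non_increasing and monthly_rates[-1] < floor


rate = refusal_rate(["answered"] * 9 + ["refused"])
print(rate, assess_refusal_rate(rate))  # 0.1 within band
```

Tracking the monthly series, not just a single snapshot, is what catches the drift toward zero that the text warns about.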
Function 5: Learns from what's missed
Every question the agent couldn't answer is a signal. Every refusal, escalation, and reformulation tells you where documentation is silent, where installers are stuck, where the product is confusing, or where a new use case is emerging. A 2026-grade agent treats the query log as a first-class data product: surfaced, summarized, and routed to the team that can act on it. (Full routing: 3.4.)
Fail: Unanswered queries disappear into a chat archive nobody reads. The same 30 questions show up every week. The documentation gap compounds invisibly.
Pass: A monthly report shows top unanswered queries by topic, the refusal trend, and new use cases, routed to docs, product, training, sales, and field service with the slice relevant to each.
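The "top unanswered queries, routed by team" slice of that report can be sketched as a group-and-count over the query log. This assumes each unanswered query has already been tagged with a topic; the `ROUTES` table, the topic names, and `monthly_report` itself are hypothetical, and the refusal trend and new-use-case sections are left out.

```python
from collections import Counter, defaultdict

# Hypothetical topic-to-team routing table; replace with your own taxonomy.
ROUTES = {
    "wiring": "docs",
    "firmware": "product",
    "installation": "training",
    "pricing": "sales",
    "site conditions": "field service",
}


def monthly_report(unanswered, top_n=3):
    """Group a month's unanswered queries by owning team.

    unanswered: list of (topic, query_text) pairs the agent refused or escalated.
    Returns {team: [(query_text, count), ...]}, most frequent first, with at most
    top_n queries per team. Unknown topics go to "unrouted" so they stay visible.
    """
    per_team = defaultdict(Counter)
    for topic, query in unanswered:
        per_team[ROUTES.get(topic, "unrouted")][query] += 1
    return {team: counts.most_common(top_n) for team, counts in per_team.items()}


log = [
    ("wiring", "aux relay terminals on rev D"),
    ("wiring", "aux relay terminals on rev D"),
    ("pricing", "volume discount for 50 units"),
    ("billing", "invoice copy"),
]
for team, queries in monthly_report(log).items():
    print(team, queries)
```

Sending unknown topics to an explicit "unrouted" bucket, rather than a default team, is a deliberate choice: it keeps new use cases from being silently filed under the wrong owner.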
Using the list in evaluation
Test all five against your real content with your real query types. Not demo content. Bring twenty real questions from your last quarter of support tickets and one real PDF that's been giving your team trouble. Watch how each candidate handles them.
Score each function as a binary: pass or fail. The rubric is intentionally unweighted. A candidate that fails Function 4 is dangerous in this field regardless of how strong the other four are.
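The rubric is small enough to encode directly, which keeps scoring consistent across evaluators. In this sketch (names and structure are hypothetical), the score is an unweighted count of passes, and a failure on Function 3 or 4 disqualifies the candidate outright, matching the veto-level treatment in the checklist.

```python
FUNCTIONS = {
    1: "Direct answer",
    2: "Sourced answers",
    3: "Visual content treated like text",
    4: "Knows when to escalate",
    5: "Learns from what's missed",
}
VETO = {3, 4}  # failing either disqualifies, regardless of the other scores


def evaluate(candidate, results):
    """Score one candidate from a {function_number: passed} dict of booleans."""
    missing = set(FUNCTIONS) - set(results)
    if missing:
        raise ValueError(f"{candidate}: no result for functions {sorted(missing)}")
    failed = [n for n in sorted(FUNCTIONS) if not results[n]]
    return {
        "candidate": candidate,
        "passes": len(FUNCTIONS) - len(failed),  # unweighted: each pass counts once
        "failed": failed,
        "disqualified": any(n in VETO for n in failed),
    }


print(evaluate("Vendor A", {1: True, 2: True, 3: True, 4: False, 5: True}))
# {'candidate': 'Vendor A', 'passes': 4, 'failed': [4], 'disqualified': True}
```

Vendor A passes four of five functions and is still disqualified, because the one it fails is Function 4. Refusing to score a candidate with a missing result forces every evaluator to test all five.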
An evaluation team that applies this list well:
- Has a written rubric scoring candidates pass or fail on each function
- Tests each candidate against the team's own content, not vendor demos
- Treats Functions 3 and 4 as veto-level
- Spot-checks citations in at least one full demo conversation
- Can name a candidate they disqualified specifically on a five-function failure