We put excellence, value and quality above all - and it shows




A Technology Partnership That Goes Beyond Code

“Arbisoft has been my most trusted technology partner for now over 15 years. Arbisoft has very unique methods of recruiting and training, and the results demonstrate that. They have great teams, great positive attitudes and great communication.”
AI Vendor Eval Suite Coverage Audit Checklist (2026)

Use this checklist before you trust an AI vendor's eval claims
AI vendor eval claims often sound stronger than the artifacts behind them. A vendor can say it tests for hallucinations, uses human review, or maintains compliance controls while showing only demo outputs, generic benchmark results, or screenshots.
Eval-suite coverage tells you what failure modes the vendor can prove it tests, not what the vendor can describe in a sales call.
The practical question is simple: what can the vendor show, rerun, explain, and maintain after the deal is signed?
What eval-suite coverage means in vendor due diligence
An eval suite is the vendor's broader system of test inputs, expected behaviors, scoring rubrics, datasets, metrics, workflows, and release gates. An eval harness is the technical scaffolding that runs those evaluations.
Eval coverage is different. It asks whether the suite actually covers the risks and behaviors that matter for your use case.
High test volume does not mean high coverage. A vendor can run thousands of tests and still miss citation errors, prompt injection, policy-violating outputs, drift, or unsupported claims.

What this checklist audits, and what it does not
This checklist audits whether important evaluation dimensions are present, evidenced, and maintained.
It does not replace a deep review of the vendor's eval harness engineering, retrieval augmented generation architecture, production observability, hallucination measurement, or security and compliance posture. Use it to identify the gaps that deserve deeper review.
A gap is not always a disqualifier. It becomes serious when it maps to a high-risk workflow, a customer-facing use case, or a claim the vendor is using to win the deal.
The artifacts to request before scoring coverage
Ask for native artifacts, not slideware. Useful evidence includes:
- Eval plan or test strategy
- Test case taxonomy
- Golden datasets or representative held-out examples
- Scoring rubrics for automated, large language model, and human review
- Failure logs with severity and remediation history
- Continuous integration and continuous delivery evidence
- Release gate definitions
- Incident review examples
- Drift monitoring summaries
Screenshots are weak evidence. Pipeline logs, exported score reports, test definitions, and live walkthroughs are stronger because they can be inspected for provenance and repeatability.
The AI vendor eval suite coverage audit checklist
Use this checklist to score each coverage dimension as Green, Yellow, or Red. The goal is not a perfect-looking table. The goal is to find the risks that should affect procurement, pilot scope, or contract terms.
Coverage dimension | What to verify | Evidence to request | Red flag |
Groundedness | Outputs stay faithful to approved context | Groundedness rubric, failure logs | Retrieval recall treated as groundedness |
Citation accuracy | Citations support the exact claim | Citation tests, retrieval regressions | Demo citations only |
Regulatory-compliant output | Outputs follow scoped policy rules | Policy test sets, human review records | Compliance claim without output tests |
Adversarial input | Injection, jailbreak, bypass, and exfiltration tests exist | Red-team report, retest cadence | One-time red team only |
Hallucination KPIs | Metrics are defined by task type | KPI definitions, sampling plan, trends | One universal hallucination rate |
Drift and regression | Quality is compared against baselines | CI/CD gates, baseline reports | No release gate |
Production feedback | Incidents and feedback update evals | Dashboards, incident-to-eval examples | Monitoring disconnected from evals |
Security and compliance evidence | Governance artifacts are scoped to the AI workflow | SOC 2 scope, ISO/IEC 42001 scope, NIST AI RMF mapping | Certification without scope review |
How to use the checklist in a vendor review
Run the audit before final vendor selection, then again before contract signature. Include engineering, product, security, procurement, and compliance or legal stakeholders when the workflow is regulated or high-impact.
Use the same artifact requests for each vendor. Do not reward polish over reproducibility. A short eval plan connected to a live release gate is stronger than a polished PDF that cannot be traced to a running test.
If a vendor says artifacts are confidential, ask for a live walkthrough, sandboxed review, or auditor-mediated review. Treat partial evidence as Yellow until it can be verified.
Dimension 1: Groundedness coverage
Groundedness means the output is supported by the context, documents, or approved knowledge supplied to the system. It is not the same as plausibility or general factuality.
Ask whether the vendor separates "the model seems right" from "the provided context supports this claim." For retrieval augmented generation workflows, this matters because retrieving the right document does not prove the generated answer stayed faithful to it.
Good evidence includes a groundedness rubric, failed examples, scoring criteria, and remediation records.
Dimension 2: Citation accuracy coverage
Citation accuracy asks whether the cited source supports the specific claim. A response can be grounded and still cite the wrong document, attach a true source to an unsupported claim, or break after a retrieval update.
Ask for held-out citation test cases, examples of bad citation behavior, and regression results after knowledge base changes. If citation quality is central to the product, use a deeper RAG architecture audit to review retrieval precision, chunking, re-ranking, and source handling.
Dimension 3: Regulatory-compliant output coverage
Regulatory-compliant output coverage is not the same as a compliance certificate. It asks whether the vendor tests actual outputs against workflow-specific policies, prohibited claims, required disclosures, escalation rules, and jurisdictional constraints.
Ask for policy test sets, human review procedures, and examples of policy-violating outputs that were caught and remediated.
Automated filters are useful, but they are not enough for nuanced claims or context-dependent escalation. Legal and compliance teams should review scoped representations independently.
Dimension 4: Adversarial and malicious input coverage
Adversarial coverage tests how the system behaves when users try to bypass controls, inject instructions, exfiltrate data, manipulate tools, or induce prohibited outputs.
Ask for red-team artifacts, not just the statement that red teaming occurred. Useful evidence includes adversarial categories, severity grades, exploitability notes, remediation history, and retest cadence.
One-time testing is a warning sign. Prompt templates, tools, retrieval sources, and model versions change, so adversarial coverage must be rerun after meaningful system changes.
Dimension 5: Hallucination KPI coverage
A hallucination key performance indicator is only useful if the vendor defines what it measures. Factual error, unsupported synthesis, fabricated citation, and contradiction are related, but they are not the same metric.
Ask for metric definitions by task type, sampling strategy, reviewer instructions, threshold rationale, and trend data across model versions. Avoid accepting one aggregate hallucination number across unrelated workflows.
Dimension 6: Drift and regression coverage
AI systems can degrade quietly. Model drift, retrieval drift, data drift, prompt drift, and product behavior drift can all change output quality after launch.
Ask whether the vendor maintains baseline comparisons and regression suites that run before release. A release gate should define what score changes block deployment, trigger review, or require rollback.
Dashboards alone are not enough. The vendor should show that drift findings lead to eval updates, remediation, and release decisions.
Dimension 7: Production feedback and observability coverage
Pre-deployment evals miss failures that only appear with real users. Production feedback, traces, incidents, corrections, and quality labels should flow back into the eval suite.
Ask for monitoring summaries, feedback taxonomies, incident review examples, and evidence that production findings created new test cases or updated thresholds.
This is not a tooling comparison. When the feedback loop is unclear, run a separate production observability audit.
Dimension 8: Security, privacy, and compliance evidence coverage
Security and compliance artifacts are adjacent to eval coverage, not a substitute for it. A vendor can have strong evals and weak controls, or scoped certifications that do not apply to the AI workflow you are buying.
Ask whether SOC 2 scope includes the relevant AI inference and data processing environment. If ISO/IEC 42001 is claimed, verify whether the scope applies to the AI product under review. If NIST AI Risk Management Framework mapping is provided, look for Measure function evidence tied to testing and validation.
Do not accept certification names without scope, date, and applicability.
How to score a vendor's eval coverage without overfitting to the checklist
A checklist should support a decision, not produce a false sense of precision. Score each dimension by evidence quality and risk relevance.
Use four decision postures:
- Proceed when the relevant dimensions are Green or documented Yellow with a credible plan.
- Proceed with remediation obligations when gaps can be closed before launch.
- Pilot with restricted scope when risk is manageable only in a narrower workflow.
- Reject at this stage when a critical dimension is Red for the intended deployment.
Green, yellow, and red evidence
Green evidence is reproducible. It includes live artifacts, runnable tests, failure logs, remediation examples, and release gates tied to actual workflows.
Yellow evidence is partial or unverifiable. It includes process documentation without live artifacts, screenshots without audit trails, benchmark claims without methodology, or verbal descriptions of human review.
Red evidence includes vague claims, demo-only examples, refusal to share artifacts under any review arrangement, outdated certifications presented as current, or claims contradicted by other due diligence findings.
Minimum acceptable baseline by risk profile
Low-risk internal assistants need groundedness, hallucination KPI coverage, and regression coverage as a baseline.
Customer-facing workflows should add citation accuracy, adversarial coverage, and production feedback loops.
Regulated workflows need all eight dimensions addressed, with strong evidence for policy testing, human review, scoped governance artifacts, and release gates.
High-impact decision-support systems should not rely on Yellow evidence for core quality dimensions unless the contract includes explicit gates before expansion.
Common misleading eval claims and how to test them
Vendor claims become useful only when converted into verification questions. The strongest question is usually: "Can you show the artifact that proves this runs, fails, blocks, or improves?"
Claim: "We have an eval harness"
An eval harness is infrastructure. It does not prove meaningful coverage.
Ask for the test case taxonomy, dimensions not covered, recent failing tests, and examples of releases blocked by eval results. A harness that never fails may be too permissive, disconnected from release decisions, or limited to easy cases.
Claim: "Our RAG system prevents hallucinations"
Retrieval augmented generation can reduce some hallucination risks, but it does not remove them. It can still retrieve irrelevant documents, cite unsupported sources, synthesize conflicting information, or ignore context after a prompt or model change.
Ask how the vendor measures groundedness separately from retrieval recall, how citation accuracy is tested, and how hallucination KPIs change across model versions.
Claim: "We are compliant"
Compliance claims require scoped evidence. SOC 2, ISO/IEC 42001, and NIST AI RMF references should be reviewed for applicability to the specific AI workflow, not accepted as broad assurances.
Ask which system components are in scope, when the assessment period occurred, what exceptions exist, and how controls map to the workflow you plan to deploy.
What to do with the audit results
Turn every material finding into an action. An unowned audit note will not protect the deployment.
Proceed when evidence matches the workflow risk. Request remediation when gaps are specific and fixable. Restrict pilots when the vendor shows promise but lacks coverage for the full scope. Reject the vendor when critical claims cannot be verified under any reasonable review path.
Add coverage gaps to the statement of work
Coverage gaps should become testable obligations in the statement of work.
Useful contract inputs include acceptance criteria, release gates, remediation dates, reporting cadence, and pilot exit criteria. For example, a hallucination KPI gap should become a requirement for task-specific metric definitions, sampling rules, and trend reporting before launch.
Avoid vague terms such as "maintain high quality." Require artifacts the buyer can inspect.
Recheck coverage after major model, data, prompt, or product changes
Eval coverage is continuous. Recheck after model updates, retrieval configuration changes, prompt changes, product scope expansion, new jurisdictions, or new user populations.
A credible vendor should connect change management to regression tests, release gates, and updated eval coverage. The audit is not finished when the contract is signed. It becomes part of how the system is governed.















