Adding accuracy to AI visibility - meet the judge

Tom Fry
AI visibility accuracy dashboard showing graded mentions with corrections and missed facts

For the last year, the only AI-visibility question worth asking was simple: are the models talking about us? Get cited, get into the answer, win the prompt - that was the bar. It was the right question for an early market. It is not the right question any more.

The new question is sharper, and harder: when AI is talking about us, is what it is saying actually true?

Today we are launching AI Visibility Accuracy: a layer that sits on top of every AI visibility run in Agentcy and grades each mention of your brand for factual correctness, completeness, positioning, recency, and sentiment. It is built around an LLM judge with deep knowledge of your business, and it tells you not just whether a model mentioned you, but whether the answer it gave was one you would have wanted in front of a buyer.

Why visibility without accuracy is a trap

An AI answer is not a press hit. A journalist who gets a fact wrong files a correction; an AI model that gets a fact wrong delivers it confidently, in a clean paragraph, to a million users. The output reads as authoritative whether or not it is true. That asymmetry - confident tone, untested facts - is what makes inaccurate AI references so dangerous for a brand.

The risks compound across three axes:

  • Factual errors. Wrong numbers, wrong product names, wrong leadership team, wrong customer logos, dead acquisitions still listed as live. Each one becomes a discovery moment for a buyer who is forming an opinion before they have spoken to you.
  • Stale positioning. The model frames you as last year's company. It anchors on the message you stopped using two strategies ago. The pitch you spent the last quarter rolling out is invisible.
  • Brand-safety risk. The model places you next to a competitor's controversy, repeats a critical news cycle that has since been resolved, or attaches you to a category claim you do not want to own. Sentiment that nobody at the company would sign off on, surfaced as if it were your own.

None of these show up if you only measure mentions. They are everywhere if you start measuring quality.

The judge - and the deep knowledge behind it

Accuracy is graded by a dedicated LLM judge that runs after every AI visibility run. For each mention the judge produces:

  • An overall score and grade - Excellent, Good, Fair, Poor, or Critical - anchored on five sub-scores (factual, completeness, positioning, recency, sentiment).
  • Specific corrections - what the AI said, what is actually true, where the gap is.
  • Missed facts - relevant things you would expect to be in the answer that were not. The recent product launch the model never heard about. The customer story that should have been used. The differentiator that anchors your positioning.
  • Confirmations - what the model got right. Useful for tracking which messages have actually landed.
  • A brand-safety flag - explicit risk signal for any answer that places the brand somewhere it should not be.
  • A confidence score - how sure the judge itself is about the judgement.

None of this works without context. A judge that does not know your company will hallucinate corrections as readily as the model it is grading. So the judge does not work in isolation - it is grounded against a structured knowledge base for each client, built and refined inside Agentcy: business description, products, key messages, leadership, recent announcements, customer wins, positioning that matters versus positioning that is incidental. Every grade is checked against that ground truth.

The cost of building that knowledge base used to be the reason brands could not run accuracy at scale. Inside Agentcy, it is part of the platform - generated during onboarding, refined as the client evolves, and re-used by every accuracy run automatically.
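As a rough illustration of what "grounded against a structured knowledge base" can mean in practice, the sketch below flattens client ground truth into a context block a judge prompt could consume. The keys, company details, and helper function are invented for the example; they are not Agentcy internals:

```python
# Hypothetical client knowledge base; keys and values are invented
# for illustration, not Agentcy's internal structure.
knowledge_base = {
    "business_description": "B2B analytics platform for retail supply chains",
    "products": ["ShelfSense", "RouteIQ"],
    "key_messages": ["Real-time inventory visibility across every store"],
    "leadership": ["Jane Doe (CEO)"],
    "recent_announcements": ["ShelfSense 3.0 launch"],
}

def build_judge_context(kb: dict) -> str:
    """Flatten the client ground truth into one text block for the judge prompt."""
    lines = []
    for section, value in kb.items():
        body = value if isinstance(value, str) else "; ".join(value)
        lines.append(f"{section.replace('_', ' ').title()}: {body}")
    return "\n".join(lines)

context = build_judge_context(knowledge_base)
```

The point of the pattern is simply that every correction the judge emits is checked against this block rather than against the judge model's own priors.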

What you see, and what you do with it

Every AI visibility run now produces an accuracy summary alongside the existing visibility metrics. You see the average overall score, the distribution across grades, the brand-safety risk count, the most common corrections across the run, and the most-frequently missed facts. Drill into any run and you can read the full judgement on every mention - the model's answer, what was wrong with it, what is true, what is missing - filterable by model, grade, and confidence.
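The run-level summary described above is, at heart, an aggregation over per-mention judgements. A minimal sketch, using illustrative field names rather than Agentcy's real ones:

```python
from collections import Counter

# Per-mention judgements with illustrative fields (not the real schema).
judgements = [
    {"score": 88, "grade": "Good", "brand_safety_flag": False,
     "corrections": ["pricing model"], "missed_facts": ["recent launch"]},
    {"score": 45, "grade": "Poor", "brand_safety_flag": True,
     "corrections": ["pricing model"], "missed_facts": []},
    {"score": 92, "grade": "Excellent", "brand_safety_flag": False,
     "corrections": [], "missed_facts": ["recent launch"]},
]

def summarise_run(judgements: list) -> dict:
    """Aggregate per-mention judgements into a run-level accuracy summary."""
    corrections = Counter(c for j in judgements for c in j["corrections"])
    missed = Counter(m for j in judgements for m in j["missed_facts"])
    return {
        "average_score": sum(j["score"] for j in judgements) / len(judgements),
        "grade_distribution": Counter(j["grade"] for j in judgements),
        "brand_safety_risks": sum(j["brand_safety_flag"] for j in judgements),
        "top_corrections": corrections.most_common(3),
        "top_missed_facts": missed.most_common(3),
    }

summary = summarise_run(judgements)
```

A correction that recurs across mentions floats to the top of `top_corrections`, which is exactly the signal the next paragraph's content loop acts on.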

The product loop is meant to be tight. A run highlights a recurring correction across half a dozen mentions. You take that signal into a press release, a thought-leadership piece, a content asset that addresses the gap directly. The next run measures whether the message landed - and whether it landed in the answers, not just in the coverage.

The bigger picture

The next phase of AI search is not about being mentioned more - it is about being mentioned well. Brands that ignore accuracy could end up with a customer base that has been quietly mis-pitched at scale. Brands that take it seriously will find the gaps before their buyers do.