Two memories, and only one you can audit
The 2020 RAG paper (Lewis et al., NeurIPS) drew a line that still organizes the whole problem: parametric versus non-parametric memory. Parametric memory is knowledge compressed into weights — fluent, fast, and lossy in the way any lossy compression is lossy. It cannot tell you where a fact came from, because by construction it no longer has a “where”; the fact is smeared across a billion coefficients. You correct it by retraining. Non-parametric memory is an index you read at inference time: every answer has a row it came from, you swap a row by editing the index, and you update the system by re-indexing tonight, not by scheduling a fine-tune next quarter.
For an FAE answering a customer about why a card renegotiated its PCIe link, that distinction is not philosophy. One memory can produce a paragraph that reads correct and points at nothing. The other produces a paragraph attached to the exact thread where a named engineer diagnosed the exact symptom on the exact silicon. Only the second is sendable, because the FAE’s signature goes under it. The paper’s own result was that retrieval made generations “more specific, diverse and factual” while supplying provenance — note the careful phrasing: more factual and attributable, not a hallucination counter dropping to zero. Anyone quoting a percentage there is selling.
The skeptic is half right
The reflexive objection from anyone who has watched these systems is: RAG just staples a link onto a hallucination. That instinct is correct often enough to take seriously — and naming exactly which half is true is the whole game. There are two independent properties a cited answer can have, and the literature keeps them apart for good reason. Factuality is whether a claim is true about the world. Faithfulness — groundedness — is whether the claim is entailed by the context you actually retrieved. They are orthogonal axes, not two names for one thing.
Orthogonal means all four corners exist. Two are uninteresting: an answer that is both factual and faithful is the goal, and one that is neither is an ordinary failure. The mixed corners are the instructive ones. An answer can be faithful but not factual: it parrots a retrieved case that was itself wrong, and reproduces the error honestly. It can be factual but not faithful: it states something true about the hardware that the cited span does not support, because the model knew it parametrically and then decorated the leak with the nearest-looking link. That last quadrant is the one the skeptic smells. It is citation theater: a real claim, a real link, and no actual relationship between them.
The clean way to say it: a citation has exactly two failure modes. The source is wrong, or the source does not say this. Grounding machinery can only fix the second. The first is your team’s knowledge being wrong, which no retrieval architecture was ever going to repair — it is fixed by the same engineer correcting the same record. Honesty about that boundary is what separates a methodology from a demo.
What a citation is actually worth
If a link is not the unit of trust, what is? The Attributed QA work (Bohnet et al., 2022) and the AIS framework give the operational test: a statement is attributable to source S if and only if a reader, shown only S, agrees that “according to S, [statement]” holds. The currency is entailment between the claim and the cited span. A citation is worth precisely that entailment and nothing else — the hyperlink is packaging.
RAGAS (Es et al., 2023) turns that test into something you can run on every answer instead of arguing about it: decompose the generated reply into atomic claims, then check whether each claim can be inferred from the retrieved context. The faithfulness score is the fraction of claims the context actually supports. For an expert the picture is familiar: a citation is the “see ref [3]” comment in a pull request, worthless until someone diffs the code against ref [3]. Atomic-claim entailment is the unit test that asserts the comment still matches the code. AIS is the reviewer asking, plainly, does the source say this. We run that pass before a draft reaches a human, and claims that fail it do not get to keep their citation.
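The claim-level check can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the RAGAS implementation: the claims are assumed to be already decomposed, the `entails` callable is a hypothetical stand-in for whatever judge is used (an NLI model or an LLM prompt), and the toy judge below is plain substring matching so the example runs without dependencies.

```python
from typing import Callable, List, Tuple

# Hypothetical judge signature: does `context` support `claim`?
Entails = Callable[[str, str], bool]


def toy_entails(context: str, claim: str) -> bool:
    """Stand-in judge: case-insensitive substring match.

    A real system would call an NLI model or an LLM here; this exists
    only so the sketch runs on its own.
    """
    return claim.lower() in context.lower()


def faithfulness(claims: List[str], context: str,
                 entails: Entails = toy_entails) -> Tuple[float, List[str]]:
    """Return (fraction of supported claims, list of unsupported claims)."""
    if not claims:
        # Design choice for this sketch: nothing verified scores zero.
        return 0.0, []
    unsupported = [c for c in claims if not entails(context, c)]
    score = (len(claims) - len(unsupported)) / len(claims)
    return score, unsupported


# Invented example data.
context = "The link retrained from x8 to x4 after the riser was reseated."
claims = [
    "the link retrained from x8 to x4",  # present in the context
    "the firmware was out of date",      # absent from the context
]
score, unsupported = faithfulness(claims, context)
print(score, unsupported)
```

Returning the unsupported claims alongside the score matters more than the number itself: those are the claims that lose their citation before a human sees the draft.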
Why “I don’t know” is a return value, not a failure
A system that must always answer will, on the hard cases, answer wrong with a citation attached — the most expensive output it can produce, because it costs the FAE their credibility with a customer who checks. Selective prediction gives the model a third move beyond right and wrong: abstain, and trade coverage for risk. The prerequisite is calibration — the confidence has to track the actual hit rate, or abstention just moves the lie one level up.
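The coverage-for-risk trade can be made concrete with a small calculation. The sketch below assumes each draft answer carries a confidence score and, after the fact, a correctness label; the data and the threshold are invented for illustration, and `risk_coverage` is a hypothetical helper name.

```python
from typing import List, Tuple


def risk_coverage(scored: List[Tuple[float, bool]],
                  threshold: float) -> Tuple[float, float]:
    """Answer only when confidence >= threshold.

    Returns (coverage, risk): coverage is the fraction of questions
    answered, risk is the error rate among the answered ones.
    """
    if not scored:
        raise ValueError("need at least one scored answer")
    answered = [correct for conf, correct in scored if conf >= threshold]
    coverage = len(answered) / len(scored)
    risk = 0.0 if not answered else answered.count(False) / len(answered)
    return coverage, risk


# Invented (confidence, was_correct) pairs, for illustration only.
scored = [(0.95, True), (0.90, True), (0.80, True), (0.60, False),
          (0.55, True), (0.40, False), (0.30, False), (0.20, True)]

always_answer = risk_coverage(scored, threshold=0.0)
selective = risk_coverage(scored, threshold=0.7)
print(always_answer)  # (1.0, 0.375)
print(selective)      # (0.375, 0.0)
```

In this toy data the confidences happen to be well ordered, so raising the threshold buys risk down cleanly. With miscalibrated scores the same threshold would discard correct answers and keep wrong ones, which is why calibration is the prerequisite.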
Treat abstention the way a careful library treats an unsupported call: it returns a not-implemented signal rather than guessing a plausible value and corrupting everything downstream. When no retrieved case entails the answer, the correct output is not a confident paragraph; it is the system declining and routing to a human who knows. “I don’t know” caps the blast radius. In a customer-facing reply that ceiling is the feature: a missing answer is a follow-up, a confident wrong answer is a recall.
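One way to make abstention a first-class return value is to give it its own type, so a caller cannot mistake it for a draft. The sketch below is illustrative only: `Case`, `Draft`, `Abstain`, the `fae-escalation` queue name, and the word-overlap `toy_supports` check are all hypothetical, and a real system would put an entailment judge where the toy check sits.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass(frozen=True)
class Case:
    case_id: str
    text: str


@dataclass(frozen=True)
class Draft:
    text: str
    source_id: str  # the case this draft is attributed to


@dataclass(frozen=True)
class Abstain:
    reason: str
    route_to: str   # hypothetical escalation queue name


def toy_supports(case_text: str, question: str) -> bool:
    """Stand-in check: every word of the question appears in the case."""
    text = case_text.lower()
    return all(word in text for word in question.lower().split())


def answer(question: str, retrieved: List[Case],
           supports: Callable[[str, str], bool] = toy_supports
           ) -> Union[Draft, Abstain]:
    """Draft only from a case that supports an answer; otherwise decline."""
    for case in retrieved:
        if supports(case.text, question):
            return Draft(text=case.text, source_id=case.case_id)
    return Abstain(reason="no retrieved case supports an answer",
                   route_to="fae-escalation")


# Invented example data.
cases = [Case("CASE-1",
              "Link retrain from x8 to x4 traced to a marginal riser.")]
hit = answer("x8 x4 retrain", cases)
miss = answer("fan curve", cases)
print(hit)
print(miss)
```

Because `Abstain` is a distinct type rather than an empty string or a low-confidence paragraph, downstream code has to handle it explicitly, which is the property the not-implemented analogy is pointing at.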
The graph is the easy part
Now the part a horizontal tool cannot retrofit. “PCIe link drop,” “x8 to x4 retrain,” and “lane width degradation” are three vocabularies for one failure. A board codename, its marketing name, and its part number are one object. Get this wrong and retrieval fails silently: it finds nothing and reports nothing, which reads exactly like having no prior case. Furnas et al. measured the underlying problem in 1987: two people choose the same term for the same thing less than a fifth of the time. Hardware is worse, because it carries codenames, marketing names, part numbers, and customer slang for the same die, all live at once.
The failure runs in both directions. Synonymy is many forms for one entity: every alias for that link-retrain symptom. Polysemy is one form for many entities: “Hopper” is an architecture, a person the architecture is named for, and a parts bin on a shop floor. Naive string matching under-merges on synonymy, splitting one entity across its aliases, and over-merges on polysemy, fusing distinct entities that share a name; a graph built straight from extraction inherits both diseases. CESI (Vashishth et al., WWW 2018) named the state plainly: raw extraction yields a graph of strings, redundant and ambiguous, and a usable knowledge base requires a dedicated canonicalization pass to collapse those strings onto entities.
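A toy canonicalization pass shows the two directions side by side. Everything here is an assumption made for illustration: the entity IDs, the alias table, and the context cue words are invented, and this lookup is far simpler than what CESI does. The point is only the shape: aliases collapse onto one ID, and an ambiguous form resolves only when context picks out exactly one sense.

```python
import re
from typing import Dict, Optional, Set

# Synonymy: many surface forms, one entity. IDs are invented.
ALIASES: Dict[str, str] = {
    "pcie link drop": "SYMPTOM:link-width-downgrade",
    "x8 to x4 retrain": "SYMPTOM:link-width-downgrade",
    "lane width degradation": "SYMPTOM:link-width-downgrade",
}

# Polysemy: one surface form, many entities, each with invented cue words.
AMBIGUOUS: Dict[str, Dict[str, Set[str]]] = {
    "hopper": {
        "ARCH:hopper": {"gpu", "architecture"},
        "PERSON:hopper": {"admiral", "compiler"},
        "EQUIPMENT:parts-hopper": {"bin", "feeder"},
    },
}


def canonicalize(mention: str, context: str = "") -> Optional[str]:
    """Map a mention to an entity ID, or None when it cannot be resolved."""
    key = " ".join(mention.lower().split())  # normalize case and spacing
    if key in ALIASES:
        return ALIASES[key]
    if key in AMBIGUOUS:
        words = set(re.findall(r"[a-z0-9]+", context.lower()))
        hits = [eid for eid, cues in AMBIGUOUS[key].items() if words & cues]
        # Resolve only when exactly one sense matches; otherwise defer.
        return hits[0] if len(hits) == 1 else None
    return None


print(canonicalize("X8 to  x4 retrain"))
print(canonicalize("Hopper", "thermal limits on the gpu architecture"))
print(canonicalize("Hopper"))
print(canonicalize("Hopper", "the bin next to the gpu"))
```

Returning `None` for the unresolved cases, rather than the first plausible sense, keeps an ambiguous mention from being silently merged into the wrong entity.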
This is the unglamorous center of the whole thing. Standing up a graph is a weekend; resolving that two strings denote the same piece of silicon is irreducibly domain-specific work. Fellegi and Sunter’s 1969 theory of record linkage builds in a zone that threshold tuning shrinks only at the price of more false links or more missed ones: the possible matches that have to go to a human. The scale it takes to make “the same thing” machine-resolvable in one vertical is visible in SNOMED CT’s 350,000-plus clinical concepts and the Gene Ontology’s tens of thousands of hand-curated terms. A horizontal connector ships the graph and skips the canonicalization, which is why it demos well and retrieves nothing the first time the customer uses their own words.
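The three-way decision in the Fellegi-Sunter model can be sketched as a summed log-likelihood weight compared against two thresholds. The field names, the m and u probabilities, and both thresholds below are invented for illustration; in practice they are estimated from data and set against target error rates.

```python
import math
from typing import Dict

# Invented per-field probabilities, for illustration only.
# m = P(field agrees | the two records are the same entity)
# u = P(field agrees | the two records are different entities)
FIELDS: Dict[str, Dict[str, float]] = {
    "part_number": {"m": 0.95, "u": 0.01},
    "codename":    {"m": 0.80, "u": 0.05},
    "vendor":      {"m": 0.90, "u": 0.30},
}

UPPER = 6.0   # assumed threshold: at or above, link automatically
LOWER = -3.0  # assumed threshold: at or below, reject automatically


def match_weight(agreement: Dict[str, bool]) -> float:
    """Sum log2 likelihood ratios over fields (agree or disagree)."""
    total = 0.0
    for name, p in FIELDS.items():
        if agreement[name]:
            total += math.log2(p["m"] / p["u"])
        else:
            total += math.log2((1 - p["m"]) / (1 - p["u"]))
    return total


def decide(agreement: Dict[str, bool]) -> str:
    """Return 'link', 'non-link', or 'possible-link' (human review)."""
    weight = match_weight(agreement)
    if weight >= UPPER:
        return "link"
    if weight <= LOWER:
        return "non-link"
    return "possible-link"


all_agree = {"part_number": True, "codename": True, "vendor": True}
none_agree = {"part_number": False, "codename": False, "vendor": False}
mixed = {"part_number": False, "codename": True, "vendor": True}

print(decide(all_agree))
print(decide(none_agree))
print(decide(mixed))
```

The `mixed` pair is the hardware case in miniature: the codename and vendor agree but the part number does not, and the weight lands between the thresholds. Moving `UPPER` and `LOWER` together would force a verdict, at the cost of more wrong merges or more missed ones.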
Provenance to the person, and the human at the end
Carry attribution the whole way and one more field survives: not just the case and the source, but the teammate who solved it. That claim is not an IR result and we will not dress it as one — it is the Knowledge-Centered Service argument, solve-once-reuse-many, that the value of a resolved case includes who resolved it, so the next FAE knows whose judgment they are standing on and whom to ask when the new case rhymes but does not match.
So the honest definition, with nothing left out: a cited answer is trustworthy when every claim is entailed by its cited span, and the system abstained on the claims that were not. That is the entire bar. Everything that clears it is a reply your FAE can read, recognize, and send under their own name. Everything that fails it (fluent, linked, unentailed) is theater. The load-bearing step is that the last action is human: the model retrieves, resolves, drafts, and abstains, and a person presses send.