In 2024, researchers at Stanford’s RegLab and Institute for Human-Centered AI (HAI) tested the leading commercial legal-AI research products against their own marketing. Vendors including LexisNexis and Thomson Reuters had claimed that retrieval-augmented generation (RAG), which grounds a language model in a database of real legal documents, let their tools “eliminate” or “avoid” hallucinations, or guaranteed “hallucination-free” legal citations. The Stanford team built a preregistered dataset of over 200 open-ended legal queries and measured how often each system produced false information.
The results undercut the marketing. Lexis+ AI and Thomson Reuters’s Ask Practical Law AI produced incorrect information more than 17% of the time, and Westlaw’s AI-Assisted Research hallucinated more than 34% of the time. RAG did help: the tools meaningfully reduced errors compared with general-purpose models like GPT-4, which the authors’ earlier work found hallucinated on legal queries 58% to 82% of the time. But “helped” was a long way from the absolute claims, and the paper argued that providers had not even precisely defined what they meant by “hallucination.” The study introduced a typology distinguishing genuinely false statements from merely misgrounded ones.
The work landed amid a string of real-world incidents, including the New York lawyers sanctioned in 2023 for filing ChatGPT-invented cases, and a growing public database tracking hallucinated citations in court filings. Chief Justice John Roberts had flagged AI hallucinations in his 2023 year-end report on the judiciary.
Why business readers should care: the paper is a concrete warning that “RAG eliminates hallucinations” is a marketing claim, not an engineering guarantee. Even purpose-built, retrieval-grounded tools in a high-stakes domain were wrong on a meaningful share of queries - which is why human verification of AI output remains a professional obligation, not an optional step.