LongTracer

RAG Hallucination Detection and Multi-Project Tracing

RAG systems still hallucinate. Your LLM confidently states facts not in your documents. LongTracer catches this before it reaches your users - verifying every claim against your source documents using a two-stage STS and NLI pipeline. No LLM dependency. No vector store required. Just strings in, trust score out.

$ pip install longtracer


verify.py

from longtracer import CitationVerifier

verifier = CitationVerifier()

result = verifier.verify_parallel(
    response="The Eiffel Tower is 330m tall and in Berlin.",
    sources=["The Eiffel Tower is in Paris, France. It is 330m tall."],
)

# Results
print(result.trust_score)          # 0.5
print(result.hallucination_count)  # 1
print(result.all_supported)        # False
# "Berlin" contradicts "Paris" in source

Catch hallucinations before they reach users

Claim-level verification with full trace. Works with any RAG framework.

Claim-Level Verification

Verifies every individual claim in the LLM response against source documents. Pinpoints exactly which statements are hallucinated.

Trust Score (0.0–1.0)

Returns a trust score representing the proportion of supported claims. Threshold-based filtering for production quality gates.
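A threshold-based quality gate might look like the sketch below. It is a minimal illustration, not LongTracer's own API: `VerificationResult` is a hypothetical stand-in that only mirrors the three fields printed in verify.py, and the `min_trust` value of 0.8 and the fallback message are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class VerificationResult:
    """Hypothetical stand-in exposing the fields shown in verify.py."""
    trust_score: float
    hallucination_count: int
    all_supported: bool


def passes_gate(result: VerificationResult, min_trust: float = 0.8) -> bool:
    """Return True when the trust score meets the (assumed) threshold."""
    return result.trust_score >= min_trust


def answer_or_fallback(answer: str, result: VerificationResult,
                       min_trust: float = 0.8) -> str:
    """Serve the answer only if it clears the gate; otherwise decline."""
    if passes_gate(result, min_trust):
        return answer
    return "I could not verify this answer against the source documents."


# Values taken from the verify.py example: one of two claims is unsupported.
eiffel = VerificationResult(trust_score=0.5, hallucination_count=1,
                            all_supported=False)
print(answer_or_fallback("The Eiffel Tower is 330m tall and in Berlin.", eiffel))
```

In practice the threshold would be tuned per application: a stricter gate blocks more hallucinations but also rejects more partially correct answers.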

Parallel Pipeline

STS relevance scoring runs alongside LLM generation. Minimal latency impact on your RAG system.
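The general idea of overlapping verification work with generation can be sketched with a thread pool. This is an illustration under stated assumptions, not LongTracer's implementation: `generate_answer` and `prepare_sources` are hypothetical stand-ins, and the sketch assumes the overlapped work is source-side preparation that does not depend on the final response.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def generate_answer(question: str) -> str:
    """Hypothetical stand-in for a slow LLM call."""
    time.sleep(0.2)
    return "The Eiffel Tower is 330m tall."


def prepare_sources(sources: list[str]) -> list[str]:
    """Hypothetical stand-in for source-side work (splitting, embedding)."""
    time.sleep(0.2)
    return [s.strip() for doc in sources for s in doc.split(".") if s.strip()]


def run_overlapped(question: str, sources: list[str]) -> tuple[str, list[str]]:
    """Run generation and source preparation concurrently."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        answer_future = pool.submit(generate_answer, question)
        sources_future = pool.submit(prepare_sources, sources)
        return answer_future.result(), sources_future.result()


start = time.perf_counter()
answer, sentences = run_overlapped(
    "How tall is the Eiffel Tower?",
    ["The Eiffel Tower is in Paris, France. It is 330m tall."],
)
# Both tasks sleep for 0.2 s; run concurrently, the total is close to the
# slower task rather than the sum of the two.
print(f"elapsed: {time.perf_counter() - start:.2f}s")
print(sentences)
```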

No LLM Dependency

Works with any RAG framework. Just strings in, verification out. No API keys, no LLM costs for verification.

Multi-Project Tracing

Track verification results across multiple projects and time periods with pluggable storage backends.
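One way to picture a pluggable storage backend is an interface that any store can satisfy. The sketch below is purely illustrative: `TraceRecord`, `TraceStore`, `InMemoryStore`, and `average_trust` are hypothetical names invented for this example and are not LongTracer's actual classes.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class TraceRecord:
    """Hypothetical record of one verification run."""
    project: str
    trust_score: float


class TraceStore(Protocol):
    """Hypothetical backend interface; any class with these methods fits."""
    def save(self, record: TraceRecord) -> None: ...
    def by_project(self, project: str) -> list[TraceRecord]: ...


@dataclass
class InMemoryStore:
    """Simplest possible backend: keeps records in a list."""
    records: list[TraceRecord] = field(default_factory=list)

    def save(self, record: TraceRecord) -> None:
        self.records.append(record)

    def by_project(self, project: str) -> list[TraceRecord]:
        return [r for r in self.records if r.project == project]


def average_trust(store: TraceStore, project: str) -> float:
    """Mean trust score for one project (0.0 when it has no records)."""
    records = store.by_project(project)
    return sum(r.trust_score for r in records) / len(records) if records else 0.0


store = InMemoryStore()
store.save(TraceRecord("support-bot", 1.0))
store.save(TraceRecord("support-bot", 0.5))
store.save(TraceRecord("docs-search", 0.75))
print(average_trust(store, "support-bot"))  # 0.75
```

Because `average_trust` depends only on the interface, a database-backed store could replace `InMemoryStore` without changing the reporting code.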

LangChain and LlamaIndex Ready

Native integration helpers for LangChain and LlamaIndex. Drop into your existing pipeline in minutes.

Part of the Long Suite

LongTracer is the verification layer of the complete RAG ecosystem

The Long Suite covers the full RAG lifecycle. LongParser prepares your documents. LongTrainer builds the chatbot. LongTracer verifies the answers. LongProbe tests the retrieval. Each tool works independently or together.


Common questions

What is RAG hallucination?

RAG hallucination occurs when an LLM generates a response containing facts that are not supported by, or are contradicted by, the retrieved source documents. It happens because LLMs are trained to generate fluent, confident text, and they sometimes fill gaps in the retrieved context with plausible-sounding but incorrect information. LongTracer detects this by verifying every claim in the response against the source documents.

How does LongTracer detect hallucinations?

LongTracer uses a two-stage pipeline. First, a bi-encoder (STS model) finds the most semantically similar source sentence for each claim in the response. Then, a cross-encoder (NLI model) classifies the relationship between that sentence and the claim as entailment, contradiction, or neutral. This approach is fast, cheap, and doesn't require an LLM - just the response text and source documents as strings.
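The control flow of the two stages can be sketched as follows. This is not LongTracer's code: `jaccard` is a toy token-overlap score standing in for a bi-encoder, `toy_nli` is a hand-labelled lookup standing in for a cross-encoder, and the example sentences are adapted from verify.py so that the toy similarity picks the right evidence.

```python
import re
from typing import Callable

Similarity = Callable[[str, str], float]  # stage 1: (claim, sentence) -> score
Classifier = Callable[[str, str], str]    # stage 2: (premise, claim) -> label


def tokens(text: str) -> set[str]:
    """Lowercase word tokens of a sentence."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Toy similarity: token overlap. A real bi-encoder compares embeddings."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def verify_claims(claims: list[str], source_sentences: list[str],
                  similarity: Similarity, classify: Classifier):
    """Return (claim, evidence, label) for every claim."""
    results = []
    for claim in claims:
        # Stage 1 (STS): pick the most similar source sentence as evidence.
        evidence = max(source_sentences, key=lambda s: similarity(claim, s))
        # Stage 2 (NLI): classify how the evidence relates to the claim.
        results.append((claim, evidence, classify(evidence, claim)))
    return results


sources = ["The Eiffel Tower is in Paris, France.", "The tower is 330m tall."]
claims = ["The Eiffel Tower is 330m tall.", "The Eiffel Tower is in Berlin."]

# Hand-labelled stand-in for an NLI model; unknown pairs default to neutral.
LABELS = {
    (sources[1], claims[0]): "entailment",
    (sources[0], claims[1]): "contradiction",
}


def toy_nli(premise: str, hypothesis: str) -> str:
    return LABELS.get((premise, hypothesis), "neutral")


for claim, evidence, label in verify_claims(claims, sources, jaccard, toy_nli):
    print(f"{label:13} | {claim} <- {evidence}")
```

Passing the two scorers in as functions keeps the pipeline logic independent of any particular model, which is why stand-ins can be swapped in here.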

What is the trust score?

The trust score is a float between 0.0 and 1.0 that represents the proportion of claims in the response that are supported by the source documents. A score of 1.0 means every claim is entailed by a source. A score of 0.5 means half the claims are supported. LongTracer also returns a list of flagged claims with their evidence mapping, so you can see exactly which claims failed.
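The proportion described above can be written out in a few lines. This is a sketch of the arithmetic only, with assumptions stated in the comments: the `summarize` helper and its return shape are hypothetical, neutral claims are treated as unsupported, and an empty response is scored 0.0.

```python
def summarize(verdicts: list[tuple[str, str]]) -> dict:
    """Compute a trust score from (claim, label) pairs.

    Assumptions: only "entailment" counts as supported, so both
    "contradiction" and "neutral" claims are flagged; an empty list
    scores 0.0 because nothing was verified.
    """
    flagged = [claim for claim, label in verdicts if label != "entailment"]
    supported = len(verdicts) - len(flagged)
    score = supported / len(verdicts) if verdicts else 0.0
    return {"trust_score": score, "flagged": flagged}


summary = summarize([
    ("The Eiffel Tower is 330m tall.", "entailment"),
    ("The Eiffel Tower is in Berlin.", "contradiction"),
])
print(summary["trust_score"])  # 0.5
print(summary["flagged"])      # ['The Eiffel Tower is in Berlin.']
```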

How accurate is LongTracer?

LongTracer's accuracy depends on the quality of the NLI model used. With a strong cross-encoder (e.g., a model fine-tuned on NLI benchmarks), it achieves high precision on factual claims. It works best on factual, verifiable statements and is less reliable on subjective or ambiguous claims. The parallel pipeline design minimizes latency impact on your RAG system.

Does LongTracer work with any RAG framework?

Yes. LongTracer has no dependency on any specific RAG framework, vector store, or LLM. It takes plain strings - the LLM response and the source documents - and returns a verification result. It fits into LongTrainer or any custom RAG pipeline, and native integration helpers are available for LangChain and LlamaIndex.

What is the difference between STS and NLI?

STS (Semantic Textual Similarity) uses a bi-encoder to find the most relevant source sentence for each claim - it's fast but only measures similarity, not factual support. NLI (Natural Language Inference) uses a cross-encoder to classify whether the source sentence entails, contradicts, or is neutral to the claim - it's slower but more accurate. LongTracer runs STS first to narrow the candidates, then NLI for precise verification.

Need a RAG system with built-in hallucination detection?

We build production RAG systems with LongTracer integrated for quality monitoring. Schedule a free consultation.
