LongProbe

Sub-second RAG Regression Testing

You refactor chunking, upgrade LangChain, or add a new document, and retrieval silently degrades. LongProbe catches this before your users do. Define Golden Questions once, run `longprobe check` on every commit, and get an exact diff of which chunks were lost. Think pytest-watch for your RAG pipeline.

$ pip install longprobe

Catch retrieval regressions before users do

Fast enough for every commit. Precise enough to show exactly which chunks were lost.

Sub-second Checks

Runs against small golden sets in under a second. Fast enough to run on every commit without slowing your dev loop.

Golden Questions YAML

Define test cases in simple YAML: question, required chunks, match mode, and top-K. Auto-generate from documents with `longprobe generate`.
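As a rough illustration of what one test case carries, the sketch below models a Golden Question in Python as it might look after being loaded from YAML. The field names (`question`, `required_chunks`, `match_mode`, `top_k`), the defaults, and the `parse_golden_question` helper are assumptions made for this example, not LongProbe's documented schema.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical mode names mirroring the three match modes described on this page.
VALID_MODES = {"id", "text", "semantic"}


@dataclass(frozen=True)
class GoldenQuestion:
    question: str
    required_chunks: Tuple[str, ...]
    match_mode: str = "id"   # assumed default
    top_k: int = 5           # assumed default


def parse_golden_question(raw: dict) -> GoldenQuestion:
    """Validate one mapping (e.g. one YAML entry) and build a GoldenQuestion."""
    mode = raw.get("match_mode", "id")
    if mode not in VALID_MODES:
        raise ValueError(f"unknown match_mode: {mode!r}")
    top_k = int(raw.get("top_k", 5))
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    chunks = tuple(raw["required_chunks"])
    if not chunks:
        raise ValueError("at least one required chunk is needed")
    return GoldenQuestion(
        question=raw["question"],
        required_chunks=chunks,
        match_mode=mode,
        top_k=top_k,
    )
```

Validating at load time means a typo in a test case fails loudly before any retrieval call is made.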

3 Match Modes

Exact chunk ID match, text substring match, or semantic similarity match. Choose per question based on how your vector store works.
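To make the three modes concrete, here is a minimal Python sketch of what each check could look like. The function names, the case-sensitive substring test, and the 0.8 similarity threshold are assumptions for illustration. LongProbe's own matching logic may differ.

```python
import math
from typing import Sequence


def id_match(required_id: str, retrieved_ids: Sequence[str]) -> bool:
    """ID mode: the required chunk ID must appear verbatim in the results."""
    return required_id in retrieved_ids


def text_match(required_text: str, retrieved_texts: Sequence[str]) -> bool:
    """Text mode: the required text must be a substring of some retrieved chunk."""
    return any(required_text in text for text in retrieved_texts)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_match(
    required_vec: Sequence[float],
    retrieved_vecs: Sequence[Sequence[float]],
    threshold: float = 0.8,  # assumed cut-off, not a documented default
) -> bool:
    """Semantic mode: some retrieved embedding is close enough to the required one."""
    return any(cosine_similarity(required_vec, v) >= threshold for v in retrieved_vecs)
```

ID matching suits stores with stable chunk IDs. Text and semantic matching are more tolerant when re-chunking changes IDs or boundaries.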

Baseline Tracking & Diff

Save baselines after good runs. Compare future runs to get an exact diff: which chunks were lost, which questions regressed, what improved.
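The save-and-compare cycle can be sketched with Python's standard `sqlite3` module. The table layout, the function names, and the choice to store the found chunk IDs per question as JSON are assumptions for this example, not LongProbe's actual storage format.

```python
import json
import sqlite3
from typing import Dict, List


def save_baseline(conn: sqlite3.Connection, label: str, results: Dict[str, List[str]]) -> None:
    """Store {question: [found chunk ids]} under a label, replacing any earlier copy."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS baselines (label TEXT PRIMARY KEY, payload TEXT NOT NULL)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO baselines (label, payload) VALUES (?, ?)",
        (label, json.dumps(results)),
    )
    conn.commit()


def load_baseline(conn: sqlite3.Connection, label: str) -> Dict[str, List[str]]:
    """Fetch a saved baseline, raising KeyError if the label is unknown."""
    row = conn.execute(
        "SELECT payload FROM baselines WHERE label = ?", (label,)
    ).fetchone()
    if row is None:
        raise KeyError(label)
    return json.loads(row[0])


def diff_runs(baseline: Dict[str, List[str]], current: Dict[str, List[str]]) -> Dict[str, dict]:
    """Per question, list chunks that disappeared and chunks that newly appeared."""
    report = {}
    for question, old_chunks in baseline.items():
        new_chunks = current.get(question, [])
        lost = sorted(set(old_chunks) - set(new_chunks))
        gained = sorted(set(new_chunks) - set(old_chunks))
        if lost or gained:
            report[question] = {"lost": lost, "gained": gained}
    return report
```

For example, a baseline of `{"q1": ["a", "b"]}` compared with a current run of `{"q1": ["a"]}` reports `b` as lost for `q1`.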

pytest Integration

Drop into existing test suites with the pytest plugin. Write `assert report.overall_recall >= 0.85` like any other test.
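The assertion above can be tried without the plugin by using a stand-in report object. Everything in this sketch (`QuestionResult`, `StubReport`, `run_stub_check`) is hypothetical, and it assumes overall recall is the fraction of all required chunks that were retrieved, which may not match how LongProbe aggregates scores.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QuestionResult:
    question: str
    required: int  # number of required chunks for this question
    found: int     # how many of them appeared in the top-K results


@dataclass
class StubReport:
    results: List[QuestionResult]

    @property
    def overall_recall(self) -> float:
        """Found required chunks divided by all required chunks (1.0 if none)."""
        total = sum(r.required for r in self.results)
        if total == 0:
            return 1.0
        return sum(r.found for r in self.results) / total


def run_stub_check() -> StubReport:
    """Stand-in for a real retrieval check, with hard-coded illustrative results."""
    return StubReport([
        QuestionResult("What is the refund window?", required=3, found=3),
        QuestionResult("Who approves expenses?", required=4, found=4),
        QuestionResult("How do I reset my password?", required=3, found=2),
    ])


def test_retrieval_recall():
    report = run_stub_check()
    # Same style of threshold assertion as in the text above.
    assert report.overall_recall >= 0.85
```

Because the check is an ordinary assertion, a recall drop surfaces as a normal pytest failure alongside the rest of the suite.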

CI/CD Ready

GitHub Actions output mode with PR annotations. Fails the pipeline on regression. Zero external services: a SQLite baseline store is included.
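To show roughly how annotations and a failing exit code fit together, the sketch below formats regressions as GitHub Actions `::error` workflow commands and derives an exit status. The helper names and the message wording are assumptions; only the `::error title=...::message` command syntax and its escaping rules come from GitHub Actions itself.

```python
from typing import Dict, List, Tuple


def escape_data(message: str) -> str:
    """Escape characters that workflow commands treat specially in the message."""
    return message.replace("%", "%25").replace("\r", "%0D").replace("\n", "%0A")


def format_annotation(question: str, lost_chunks: List[str]) -> str:
    """Build one ::error workflow command describing a regressed question."""
    message = f"Question {question!r} lost chunks: {', '.join(lost_chunks)}"
    return f"::error title=Retrieval regression::{escape_data(message)}"


def ci_report(
    regressions: Dict[str, List[str]], recall: float, threshold: float
) -> Tuple[List[str], int]:
    """Return annotation lines plus an exit code (1 on regression or low recall)."""
    lines = [format_annotation(q, lost) for q, lost in sorted(regressions.items())]
    exit_code = 1 if regressions or recall < threshold else 0
    return lines, exit_code
```

A CI wrapper would print each returned line to stdout and pass the exit code to `sys.exit`, so that a non-zero status fails the job.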

Supported Adapters

Vector Stores

  • ChromaDB
  • Pinecone
  • Qdrant
  • HTTP API (any endpoint)

Frameworks

  • LangChain (adapter wrapper)
  • LlamaIndex (adapter wrapper)
  • Any custom retriever
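Adapting a custom retriever usually comes down to normalizing its output into one shape. The sketch below is a hypothetical adapter, not LongProbe's adapter interface: `RetrievedChunk`, `Retriever`, and `CallableRetrieverAdapter` are names invented for this example, and it assumes the wrapped search function returns dictionaries with `id` and `text` keys.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol, Sequence


@dataclass(frozen=True)
class RetrievedChunk:
    chunk_id: str
    text: str


class Retriever(Protocol):
    """The single method a test harness would need from any retriever."""

    def retrieve(self, question: str, top_k: int) -> List[RetrievedChunk]: ...


class CallableRetrieverAdapter:
    """Wrap a plain search function so it satisfies the Retriever protocol."""

    def __init__(self, search: Callable[[str], Sequence[dict]]):
        self._search = search

    def retrieve(self, question: str, top_k: int) -> List[RetrievedChunk]:
        # Keep only the first top_k hits and normalize them to RetrievedChunk.
        docs = self._search(question)[:top_k]
        return [RetrievedChunk(str(d["id"]), d["text"]) for d in docs]
```

With this shape, the same matching and recall logic can run against any backend that can be wrapped in a function returning documents.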

Match Modes

  • ID match (exact chunk ID)
  • Text match (substring)
  • Semantic match (cosine similarity)

Part of the Long Suite

LongProbe is the testing layer of the complete RAG ecosystem

The Long Suite covers the full RAG lifecycle. LongParser prepares your documents. LongTrainer builds the chatbot. LongTracer verifies the answers. LongProbe tests the retrieval. Each tool works independently or together.

Common questions

How is LongProbe different from DeepEval or RAGChecker?

LongProbe is a regression testing harness, not a batch evaluation framework. DeepEval and RAGChecker are designed for comprehensive offline evaluation; they are thorough but slow. LongProbe is designed for your dev loop: define a small set of Golden Questions, run `longprobe check` in under a second on every commit, and get an exact diff of which chunks were lost. Think pytest for your RAG pipeline, not a full evaluation suite.

What are Golden Questions?

Golden Questions are your ground-truth test cases. Each one pairs a question with the specific document chunks that must appear in the retrieval results. You define them in a simple YAML file: the question text, the required chunks (by ID, text substring, or semantic similarity), and the top-K to retrieve. The `longprobe generate` command can also auto-generate Golden Questions from your documents.

How does baseline tracking work?

After a successful run, you save the results as a named baseline with `longprobe baseline save --label v1.0`. On future runs, LongProbe compares the current recall scores against the saved baseline and reports an exact diff: which questions regressed, which chunks were lost, and which improved. The baseline is stored in a local SQLite database, so it works without any external service.

Can I use LongProbe in CI/CD?

Yes. LongProbe is CI/CD-ready out of the box. Use `longprobe check --output github` to get GitHub Actions annotations directly in your pull request. The tool exits with a non-zero code when recall drops below your threshold, so it automatically fails the pipeline on regression. A ready-to-use GitHub Actions workflow is in the documentation.

Which vector stores and frameworks does LongProbe support?

LongProbe supports ChromaDB, Pinecone, Qdrant, and HTTP API adapters directly via YAML config. For LangChain and LlamaIndex, it provides programmatic adapter wrappers: wrap your existing retriever in `LangChainRetrieverAdapter` or `LlamaIndexRetrieverAdapter` and pass it to LongProbe. Any retriever that returns documents can be adapted.

How is LongProbe different from LongTracer?

They solve different problems. LongTracer verifies whether the LLM's generated answer is supported by the retrieved documents, so it catches hallucinations in the response. LongProbe tests whether the retrieval step itself is working correctly, so it catches regressions in which chunks are retrieved. LongTracer runs at inference time. LongProbe runs in your dev/CI loop. Together they cover both retrieval quality and answer quality.

Need a production RAG system with built-in quality testing?

We build enterprise RAG systems using the full Long Suite stack. Schedule a free consultation to discuss your requirements.