Sub-second RAG Regression Testing
You refactor chunking, upgrade LangChain, or add a new document, and retrieval silently degrades. LongProbe catches this before your users do. Define Golden Questions once, run `longprobe check` on every commit, and get an exact diff of which chunks were lost. Think `pytest --watch` for your RAG pipeline.
`pip install longprobe`
Fast enough for every commit. Precise enough to show exactly which chunks were lost.
Runs against small golden sets in under a second. Fast enough to run on every commit without slowing your dev loop.
Define test cases in simple YAML: question, required chunks, match mode, and top-K. Auto-generate from documents with `longprobe generate`.
Exact chunk ID match, text substring match, or semantic similarity match. Choose per question based on how your vector store works.
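The three match modes can be pictured with a small sketch. This is not LongProbe's implementation: the `Chunk` class, the `is_match` function, the mode names `"id"`, `"text"`, and `"semantic"`, and the 0.8 similarity threshold are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical chunk record; field names are illustrative only."""
    chunk_id: str
    text: str
    embedding: tuple = ()


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_match(required, retrieved, mode, threshold=0.8):
    """Decide whether a retrieved chunk satisfies a required chunk.

    mode is one of the assumed names "id", "text", or "semantic".
    """
    if mode == "id":
        # Exact chunk ID match: only works if IDs are stable across re-indexing.
        return required.chunk_id == retrieved.chunk_id
    if mode == "text":
        # Substring match: the required text must appear inside the retrieved chunk.
        return required.text in retrieved.text
    if mode == "semantic":
        # Similarity match: embeddings must be close enough (assumed threshold).
        return cosine(required.embedding, retrieved.embedding) >= threshold
    raise ValueError(f"unknown match mode: {mode}")


required = Chunk("doc1#3", "refund window is 30 days", (1.0, 0.0))
retrieved = Chunk("x9", "Our refund window is 30 days from purchase.", (0.9, 0.1))
```

In this example the two chunks have different IDs, so the ID mode rejects the pair while the text and semantic modes accept it. That is the situation where re-chunking changes IDs but the content survives, which is why the mode is chosen per question.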
Save baselines after good runs. Compare future runs to get an exact diff: which chunks were lost, which questions regressed, what improved.
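The baseline comparison can be thought of as a set difference per question. The sketch below assumes a run is stored as a mapping from question to the set of required chunk IDs that were retrieved. The `diff_runs` function and this data shape are hypothetical, not LongProbe's actual storage format.

```python
def diff_runs(baseline, current):
    """Compare two runs and report lost and gained chunks per question.

    Both arguments map a question string to the set of required chunk IDs
    that were retrieved in that run (an assumed shape for illustration).
    Questions with no change are left out of the report.
    """
    report = {}
    for question in sorted(baseline.keys() | current.keys()):
        before = baseline.get(question, set())
        after = current.get(question, set())
        lost = sorted(before - after)    # found in the baseline, missing now
        gained = sorted(after - before)  # missing in the baseline, found now
        if lost or gained:
            report[question] = {"lost": lost, "gained": gained}
    return report


baseline = {
    "What is the refund window?": {"policy#3", "policy#4"},
    "Who approves expenses?": {"hr#1"},
}
current = {
    "What is the refund window?": {"policy#3"},
    "Who approves expenses?": {"hr#1", "hr#2"},
}
changes = diff_runs(baseline, current)
```

Here the first question regressed (it lost `policy#4`) and the second improved (it gained `hr#2`).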
Drop into existing test suites with the pytest plugin. Write `assert report.overall_recall >= 0.85` like any other test.
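To make the recall assertion concrete, here is a stand-in for the report object. Only the attribute name `overall_recall` comes from the text above. The `QuestionResult` and `Report` classes, and the choice to compute recall as found required chunks divided by all required chunks, are assumptions. LongProbe may aggregate differently.

```python
from dataclasses import dataclass


@dataclass
class QuestionResult:
    """Hypothetical per-question result: required chunk IDs and top-K retrieved IDs."""
    required: set
    retrieved: list

    @property
    def found(self):
        # Required chunks that actually appeared in the retrieved list.
        return self.required & set(self.retrieved)


@dataclass
class Report:
    """Hypothetical stand-in for the report returned by a retrieval check."""
    results: list

    @property
    def overall_recall(self):
        # Assumed definition: found required chunks / all required chunks.
        total = sum(len(r.required) for r in self.results)
        found = sum(len(r.found) for r in self.results)
        return found / total if total else 1.0


def test_retrieval_recall():
    report = Report([
        QuestionResult({"policy#3", "policy#4"}, ["policy#3", "policy#4", "faq#1"]),
        QuestionResult({"hr#1"}, ["hr#1", "hr#7"]),
    ])
    assert report.overall_recall >= 0.85
```

Because the check is an ordinary assertion, a drop in recall fails the test run the same way any other failing test would.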
GitHub Actions output mode with PR annotations. Fails the pipeline on regression. Zero external services: a SQLite baseline store is included.
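For readers unfamiliar with how annotations reach a pull request: GitHub Actions picks up specially formatted lines, called workflow commands, from a step's standard output. The sketch below shows that general mechanism and is not LongProbe's actual output. The `annotate` function, the annotation title, and the message wording are hypothetical.

```python
def escape_data(value):
    """Escape a message for a GitHub Actions workflow command."""
    return value.replace("%", "%25").replace("\r", "%0D").replace("\n", "%0A")


def annotate(regressions):
    """Build error annotations and an exit code from a regression mapping.

    regressions maps a question to the list of chunk IDs it lost
    (an assumed shape). Returns (lines, exit_code), where a non-zero
    exit code is what makes the CI step fail.
    """
    lines = []
    for question, lost in sorted(regressions.items()):
        message = f"{question}: lost chunks {', '.join(lost)}"
        lines.append(f"::error title=Retrieval regression::{escape_data(message)}")
    return lines, (1 if lines else 0)


lines, exit_code = annotate({"What is the refund window?": ["policy#4"]})
for line in lines:
    print(line)
```

A CI entry point would print these lines and then exit with `exit_code`, so a regression both annotates the run and fails the step.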
The Long Suite covers the full RAG lifecycle. LongParser prepares your documents. LongTrainer builds the chatbot. LongTracer verifies the answers. LongProbe tests the retrieval. Each tool works independently or together.
We build enterprise RAG systems using the full Long Suite stack. Schedule a free consultation to discuss your requirements.