LongParser

Privacy-First Document Intelligence for Production RAG Pipelines

Most RAG pipelines fail at the data layer. Hallucinations, missed tables, garbled equations, and unverified citations stem from poor document parsing, not from the LLM itself. LongParser solves the input problem: it parses PDF, DOCX, PPTX, XLSX, and CSV files into validated, AI-ready chunks with human-in-the-loop (HITL) review and citation tracking.

$ pip install "longparser[gpu]"


parse.py

from longparser import DocumentPipeline, ProcessingConfig

pipeline = DocumentPipeline(ProcessingConfig())
doc = pipeline.process("report.pdf")

# Results
print(f"Extracted {len(doc.blocks)} blocks")
print(f"Created {len(doc.chunks)} chunks")

# Each chunk includes:
# - text content
# - source citation (page, section)
# - confidence score
# - LangChain/LlamaIndex ready

Fix your RAG pipeline at the data layer

Most RAG failures happen before the LLM even sees the data. LongParser fixes the input.

Multi-Format Extraction

PDF (native and scanned), DOCX, PPTX, XLSX, and CSV. Format-specific parsers that preserve structure, tables, and equations.

6 Hybrid Chunking Strategies

Fixed-size, sentence-based, semantic, recursive, structure-aware, and sliding window. Auto-select or configure per document type.

HITL Review Workflow

Human-in-the-loop review for low-confidence extractions. Critical for regulated industries where document accuracy is non-negotiable.

Citation Validation

Every chunk is traced back to its source location in the original document, so RAG responses can cite the exact page, section, and paragraph.
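As a rough illustration of what carrying a source location alongside each chunk can look like, here is a minimal sketch. The `Citation` and `Chunk` classes and the `format_citation` helper are hypothetical names made up for this example; they are not LongParser's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Hypothetical source location for a chunk."""
    page: int
    section: str
    paragraph: int


@dataclass(frozen=True)
class Chunk:
    """Hypothetical chunk: text plus where it came from."""
    text: str
    citation: Citation


def format_citation(chunk: Chunk) -> str:
    """Render the chunk's source location as a short, human-readable label."""
    c = chunk.citation
    return f"p. {c.page}, section '{c.section}', para. {c.paragraph}"


chunk = Chunk(
    text="Operating margin improved to 18%.",
    citation=Citation(page=12, section="Financial Results", paragraph=3),
)
label = format_citation(chunk)
print(label)
```

Keeping the location in a structured field rather than inside the chunk text means the retriever can return it unchanged for the answer to cite.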

LaTeX and Equation Parsing

Extracts and preserves mathematical equations in LaTeX notation. Essential for scientific, technical, and academic document processing.

Docker-Ready Server

Built-in FastAPI server with Docker support. Deploy as a microservice in your RAG infrastructure. LangChain and LlamaIndex ready.

Pipeline Architecture

Document → Extract → Validate → HITL Review → Chunk → Embed → Index → Chat Engine

Every step is configurable. Skip HITL for automated pipelines, or add custom validation steps.
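One way to picture a configurable pipeline is as an ordered list of step functions, where skipping a step means leaving it out of the list. The sketch below is a generic illustration under that assumption; the step names, the dict-based block shape, and `run_pipeline` are hypothetical and do not reflect how LongParser is configured.

```python
from typing import Callable, Dict, List

# Hypothetical block shape: a plain dict. Real LongParser blocks may differ.
Block = Dict[str, object]
Step = Callable[[List[Block]], List[Block]]


def validate(blocks: List[Block]) -> List[Block]:
    """Drop blocks whose text is empty or whitespace only."""
    return [b for b in blocks if str(b.get("text", "")).strip()]


def hitl_review(blocks: List[Block]) -> List[Block]:
    """Stand-in for a human review step: mark every block as reviewed."""
    return [{**b, "reviewed": True} for b in blocks]


def require_page(blocks: List[Block]) -> List[Block]:
    """Custom validation step: keep only blocks that carry a page number."""
    return [b for b in blocks if "page" in b]


def run_pipeline(blocks: List[Block], steps: List[Step]) -> List[Block]:
    """Apply each step in order, feeding its output to the next step."""
    for step in steps:
        blocks = step(blocks)
    return blocks


blocks = [
    {"text": "Revenue grew 12%.", "page": 3},
    {"text": "   ", "page": 4},
    {"text": "Orphan footnote"},
]

# Automated run: no HITL step, plus one custom validation step.
automated = run_pipeline(blocks, [validate, require_page])

# Reviewed run: default validation followed by the review stand-in.
reviewed = run_pipeline(blocks, [validate, hitl_review])
```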

Part of the Long Suite

LongParser is the ingestion layer of the complete RAG ecosystem

The Long Suite covers the full RAG lifecycle. LongParser prepares your documents. LongTrainer builds the chatbot. LongTracer verifies the answers. LongProbe tests the retrieval. Each tool works independently or together.


Common questions

PyMuPDF and pdfplumber are low-level extraction libraries: they give you raw text but leave chunking, validation, and RAG preparation to you. Unstructured is a general-purpose document processor. LongParser is purpose-built for RAG pipelines: it handles extraction, validation, HITL review, 6 chunking strategies, citation tracking, and LangChain/LlamaIndex integration in a single pipeline. It also supports LaTeX equations and RTL languages, and ships with a Docker-ready server.

LongParser supports 6 hybrid chunking strategies: fixed-size chunking, sentence-based chunking, semantic chunking (embedding-based), recursive character splitting, document-structure-aware chunking (respects headings and sections), and sliding window chunking. You can configure the strategy per document type or use the auto-select mode, which picks a strategy based on document structure.
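To make one of these strategies concrete, here is a minimal, generic sketch of sliding window chunking over whitespace-separated words. The function name and the `size` and `overlap` parameters are assumptions made for illustration; LongParser's own implementation and configuration options may differ.

```python
from typing import List


def sliding_window_chunks(text: str, size: int, overlap: int) -> List[str]:
    """Split text into word windows of `size` words that share `overlap` words."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the window already reached the end of the text
    return chunks


chunks = sliding_window_chunks("a b c d e f g h i j", size=4, overlap=2)
print(chunks)
```

The overlap trades index size for context: a sentence that straddles a window boundary still appears whole in at least one chunk when the overlap is large enough.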

LongParser includes OCR support for scanned PDFs using Tesseract and PaddleOCR. For GPU-accelerated OCR, install with `pip install "longparser[gpu]"`. The pipeline automatically detects whether a PDF is native (text-based) or scanned and applies the appropriate extraction method.
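The page does not say how the native-versus-scanned check works. A common heuristic, shown here purely as an assumption, is to look at how much text the embedded text layer yields per page: scanned pages usually yield little or none. The function name and both thresholds are hypothetical.

```python
from typing import List


def looks_scanned(page_texts: List[str], min_chars_per_page: int = 50) -> bool:
    """Guess whether a PDF is scanned from the text layer of each page.

    `page_texts` holds the text extracted from each page's text layer.
    The document is treated as scanned when more than half of its pages
    yield fewer than `min_chars_per_page` non-whitespace-trimmed characters.
    """
    if not page_texts:
        return True  # nothing extractable, so fall back to OCR
    sparse = sum(1 for t in page_texts if len(t.strip()) < min_chars_per_page)
    return sparse / len(page_texts) > 0.5


native_pages = ["Quarterly report. " * 10, "Revenue by segment. " * 10]
scanned_pages = ["", "  ", "3"]

native_result = looks_scanned(native_pages)
scanned_result = looks_scanned(scanned_pages)
```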

HITL (Human-in-the-Loop) review is a workflow step where extracted document blocks are flagged for human review before being chunked and indexed. LongParser flags blocks with low confidence scores, unusual formatting, or potential extraction errors. This is particularly important for regulated industries (legal, medical, financial) where document accuracy is critical.
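The confidence-based part of that flagging can be sketched as a simple threshold split. The block shape, the `confidence` key, and the 0.8 threshold below are illustrative assumptions, not LongParser's actual fields or defaults.

```python
from typing import Dict, List, Tuple

# Hypothetical block shape: a dict with "text" and a "confidence" in [0, 1].
Block = Dict[str, object]


def split_for_review(
    blocks: List[Block], threshold: float = 0.8
) -> Tuple[List[Block], List[Block]]:
    """Separate blocks into (accepted, flagged) by confidence score."""
    accepted: List[Block] = []
    flagged: List[Block] = []
    for block in blocks:
        if float(block["confidence"]) >= threshold:
            accepted.append(block)
        else:
            flagged.append(block)  # goes to a human reviewer before chunking
    return accepted, flagged


blocks = [
    {"text": "Total liabilities: 4.1M", "confidence": 0.97},
    {"text": "T0tal a55ets: 9.3M", "confidence": 0.41},
    {"text": "Notes to the accounts", "confidence": 0.80},
]
accepted, flagged = split_for_review(blocks)
```

Only the flagged blocks need a reviewer's attention; the accepted ones continue to chunking unchanged.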

LongParser outputs are compatible with LangChain Document objects and LlamaIndex Node objects out of the box. The pipeline includes integration helpers that convert LongParser chunks directly into the format expected by both frameworks, so you can drop it into your existing RAG pipeline without writing conversion code.
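For readers unfamiliar with the target format, a LangChain `Document` is built from a `page_content` string and a `metadata` dict. The sketch below shows the shape of such a mapping using plain dicts only. The input chunk keys and the `to_document_payload` helper are hypothetical; they stand in for whatever LongParser's own integration helpers do.

```python
from typing import Any, Dict, List


def to_document_payload(chunk: Dict[str, Any]) -> Dict[str, Any]:
    """Map a hypothetical chunk dict to LangChain Document keyword arguments."""
    return {
        "page_content": chunk["text"],
        "metadata": {
            "source": chunk["source"],
            "page": chunk["page"],
            "section": chunk["section"],
        },
    }


chunks: List[Dict[str, Any]] = [
    {
        "text": "Operating margin improved to 18%.",
        "source": "report.pdf",
        "page": 12,
        "section": "Financial Results",
    }
]
payloads = [to_document_payload(c) for c in chunks]
```

With `langchain-core` installed, each payload could be passed as `Document(**payload)`, where `Document` comes from `langchain_core.documents`.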

LongParser uses a combination of layout analysis and specialized parsers for tables and equations. Tables are extracted as structured data (preserving rows and columns) and can be serialized as Markdown or JSON for LLM consumption. LaTeX equations are extracted and preserved in their original notation. Both are tracked with source citations so the RAG system can reference the exact location in the original document.
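Serializing a structured table for an LLM can be sketched as follows. This is a generic illustration with hypothetical helper names, assuming a table is already available as a header row plus data rows; it is not LongParser's serializer.

```python
import json
from typing import List


def _escape(cell: str) -> str:
    """Escape pipe characters so they do not break the Markdown table."""
    return cell.replace("|", "\\|")


def table_to_markdown(header: List[str], rows: List[List[str]]) -> str:
    """Render a header and rows as a Markdown pipe table."""
    lines = [
        "| " + " | ".join(_escape(h) for h in header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(_escape(c) for c in row) + " |")
    return "\n".join(lines)


def table_to_json(header: List[str], rows: List[List[str]]) -> str:
    """Render each row as a JSON object keyed by column name."""
    return json.dumps([dict(zip(header, row)) for row in rows])


header = ["Quarter", "Revenue"]
rows = [["Q1", "1.2M"], ["Q2", "1.5M"]]

md = table_to_markdown(header, rows)
js = table_to_json(header, rows)
print(md)
```

Markdown tends to be compact and readable in a prompt, while JSON keeps column names attached to every value, which helps when a table is split across chunks.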

Need a production document AI pipeline built?

We build enterprise document processing pipelines using LongParser. Schedule a free consultation.
