How to add PDF search to a web application

TL;DR: Upload your PDFs to Ragex, which handles parsing, OCR, chunking, and embedding automatically. Poll until processing completes, then search with a single API call. You get accurate results from tables, scanned documents, and complex layouts, with working search in under 5 minutes using five API calls.

Why is PDF search hard to build from scratch?

PDFs were designed for printing, not searching. A PDF file is a collection of positioned glyphs on a canvas — there is no guaranteed reading order, no semantic structure, and no concept of a "paragraph." That creates real problems when you need to search document contents from a web app.

The specific challenges:

Tables — Column alignment is visual, not structural. Naive text extraction turns tabular data into meaningless strings.
Scanned documents — Image-only PDFs contain zero extractable text. You need OCR to convert pixel data into searchable content.
Mixed layouts — Headers, footers, sidebars, multi-column layouts, and embedded images all confuse simple extraction tools.
Scale — A single PDF can be hundreds of pages. Multiply that across thousands of documents and you need chunking, embedding, and indexing infrastructure.
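
To make the table problem concrete, here is a minimal sketch in Python. It assumes a hypothetical extractor that returns each word with its page coordinates (the `words` list and both function names are made up for illustration, not part of any library). Joining words in stored order scrambles the table, while grouping by vertical position recovers the rows:

```python
from collections import defaultdict

# Hypothetical extractor output: (x, y, text) per word, in stored order,
# which is not necessarily reading order.
words = [
    (300, 100, "Price"), (50, 100, "Item"),
    (50, 120, "Widget"), (300, 120, "9.99"),
    (300, 140, "24.50"), (50, 140, "Gadget"),
]

def naive_text(words):
    """Join words in stored order, ignoring their position on the page."""
    return " ".join(text for _, _, text in words)

def rows_from_positions(words, y_tolerance=3):
    """Bucket words by similar y, then sort each row left to right."""
    rows = defaultdict(list)
    for x, y, text in words:
        rows[round(y / y_tolerance)].append((x, text))
    return [[text for _, text in sorted(cells)]
            for _, cells in sorted(rows.items())]

print(naive_text(words))
print(rows_from_positions(words))
```

Real documents are much messier (skewed scans, merged cells, wrapped text), which is why a fixed-tolerance bucket like this one is only a teaching aid.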

Building this pipeline yourself means stitching together a PDF parser, an OCR engine, a text chunker, an embedding model, and a vector database. Each component has its own failure modes. Ragex handles the entire pipeline behind a single endpoint.

How does the upload-to-search workflow work?

The process follows three steps: upload the PDF, wait for async processing, then search.

When you upload a PDF, the API processes it through a pipeline: pending → parsing → chunking → embedding → ready. Parsing extracts text while preserving table structure and running OCR on images. Chunking splits the content into searchable segments. Embedding converts each chunk into a vector for similarity search. The entire flow is automatic — you just poll the document status until it reaches ready.
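
The chunking stage can be pictured with a small sketch. This is a generic fixed-size chunker with overlap, not Ragex's actual strategy (which the pipeline description above does not specify). The function name and the default sizes are assumptions chosen for illustration:

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into chunks of up to chunk_size words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    boundary still appears whole in at least one chunk.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks
```

Each chunk would then be embedded as one vector, so chunk size trades off precision (small chunks) against context (large chunks).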

Documents can be up to 50 MB and 500 pages. PDFs are a Tier 1 file type with advanced parsing, which means tables, images, and scanned content are all handled. The API supports 16 file types total if you need to search beyond PDFs.
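
It can help to check the size limit locally before uploading, so oversized files are rejected without a network round trip. The sketch below is an assumption-laden example: `check_upload_size` is a hypothetical helper, and it treats "50 MB" as 50 × 1024 × 1024 bytes, which may not match how the API counts.

```python
import os

# Assumption: the 50 MB limit is interpreted as 50 MiB.
MAX_UPLOAD_BYTES = 50 * 1024 * 1024

def check_upload_size(path, max_bytes=MAX_UPLOAD_BYTES):
    """Return the file size in bytes, or raise ValueError if it is too large."""
    size = os.path.getsize(path)
    if size > max_bytes:
        raise ValueError(
            f"{path} is {size} bytes, above the {max_bytes}-byte upload limit"
        )
    return size
```

The page limit needs a PDF library to count pages, so it is left out of this check.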

How do you implement this in Python?

Here is a complete example — create a knowledge base, upload a PDF, wait for processing, and search:

from ragex import RagexClient
import time

client = RagexClient(api_key="YOUR_API_KEY")

# Step 1: Create a knowledge base for your PDFs
kb = client.create_knowledge_base(name="Legal Contracts")

# Step 2: Upload a PDF
doc = client.upload_document(kb["id"], "contract-2026.pdf")

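# Step 3: Poll until processing finishes (give up after 5 minutes)
deadline = time.monotonic() + 300
while doc["status"] not in ("ready", "failed"):
    if time.monotonic() > deadline:
        raise TimeoutError(f"Document still processing: {doc['id']}")
    time.sleep(2)
    doc = client.get_document(kb["id"], doc["id"])

if doc["status"] == "failed":
    raise RuntimeError(f"Document processing failed: {doc['id']}")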

# Step 4: Search the PDF content
results = client.search(
    kb["id"],
    query="What are the termination clauses?",
    top_k=5,
)

for result in results["results"]:
    print(f"[{result['score']:.2f}] {result['text'][:200]}")

That is five API calls: sign up, create a knowledge base, upload, check status, and search. From there, you pass the search results as context to your LLM for answer generation — that is retrieval-augmented generation.

How do you connect this to a web app?

Your web app sends search queries to your backend, which proxies them to the API. A typical pattern:

Backend route — Create an endpoint (e.g., /api/search) that accepts a query string from your frontend.
Call the search API — Forward the query to your knowledge base and return the results.
Display results — Render the matched text chunks in your UI, or feed them to an LLM for a synthesized answer.

Never expose your API key to the client. Keep it server-side and proxy all requests through your backend.
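
The proxy pattern can be sketched with nothing but the Python standard library. This is a minimal WSGI app, not a production server. `make_search_app` and the `q` query parameter are assumptions for illustration, and `search_fn` stands in for whatever function wraps the server-side search call:

```python
import json
from urllib.parse import parse_qs

def make_search_app(search_fn):
    """Build a WSGI app that serves /api/search?q=... via search_fn(query)."""
    def app(environ, start_response):
        if environ.get("PATH_INFO") != "/api/search":
            status, body = "404 Not Found", {"error": "not found"}
        else:
            params = parse_qs(environ.get("QUERY_STRING", ""))
            query = params.get("q", [""])[0].strip()
            if not query:
                status, body = "400 Bad Request", {"error": "missing 'q'"}
            else:
                # The API key lives inside search_fn, never in the response.
                status, body = "200 OK", {"results": search_fn(query)}
        payload = json.dumps(body).encode("utf-8")
        start_response(status, [
            ("Content-Type", "application/json"),
            ("Content-Length", str(len(payload))),
        ])
        return [payload]
    return app
```

In the Python example above, `search_fn` could be `lambda q: client.search(kb["id"], query=q, top_k=5)["results"]`. For local testing, the app can be served with `wsgiref.simple_server.make_server`. The same shape carries over to Flask, FastAPI, or any other framework.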

For production use, replace polling with webhooks. Register a webhook URL and the API notifies your server when a document reaches ready or failed status — no polling loop needed.
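
A webhook receiver might look roughly like the sketch below. The payload shape is not documented in this article, so the `document_id` and `status` field names are assumptions, as is the `handle_webhook` function itself. Check the API's webhook reference for the real schema and for how to authenticate incoming requests.

```python
import json

def handle_webhook(raw_body, on_ready, on_failed):
    """Dispatch a document-status notification to the matching callback.

    Assumes a JSON body like {"document_id": "...", "status": "ready"}.
    Both field names are hypothetical.
    """
    event = json.loads(raw_body)
    doc_id = event["document_id"]  # hypothetical field name
    status = event["status"]       # hypothetical field name
    if status == "ready":
        on_ready(doc_id)
    elif status == "failed":
        on_failed(doc_id)
    return status
```

Your backend route would pass the raw request body to this function and return a 2xx response once the callback has run.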

What about search quality and filtering?

The API applies reranking by default, which reorders initial results using a cross-encoder for higher accuracy. You can also filter results by metadata:

results = client.search(
    kb["id"],
    query="payment terms",
    top_k=10,
    filter={"department": {"$eq": "legal"}, "year": {"$gte": 2025}},
)

Attach metadata when uploading documents to enable filtering by department, document type, date, or any custom field relevant to your use case.
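
To show how a filter like the one above is presumably evaluated, here is a small local sketch. It covers only the two operators that appear in the example (`$eq` and `$gte`) and assumes that conditions on different fields are combined with AND. The `matches` function is illustrative, since the real filtering happens server-side:

```python
import operator

# Only the operators shown in the example filter; others would be added
# to this table in the same way.
OPERATORS = {"$eq": operator.eq, "$gte": operator.ge}

def matches(metadata, filter_spec):
    """Return True if the metadata satisfies every condition in filter_spec."""
    for field, conditions in filter_spec.items():
        if field not in metadata:
            return False
        for op, expected in conditions.items():
            if not OPERATORS[op](metadata[field], expected):
                return False
    return True

spec = {"department": {"$eq": "legal"}, "year": {"$gte": 2025}}
print(matches({"department": "legal", "year": 2026}, spec))  # True
print(matches({"department": "legal", "year": 2024}, spec))  # False
```

A document with no `department` or `year` metadata never matches this filter, which is a good reason to attach metadata consistently at upload time.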

FAQ

What file size and page limits apply to PDF uploads?

Each PDF can be up to 50 MB and 500 pages. Documents exceeding these limits need to be split before uploading. For most web applications, these limits cover standard contracts, reports, manuals, and technical documentation without any preprocessing on your end.
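
If you do need to split an oversized document, the first step is deciding where to cut. The helper below is a hypothetical sketch that computes 1-based, inclusive page ranges that each fit within the 500-page limit. The actual splitting would be done with a PDF library of your choice:

```python
def page_ranges(total_pages, max_pages=500):
    """Return (first, last) page ranges, 1-based and inclusive."""
    if total_pages < 0 or max_pages < 1:
        raise ValueError("total_pages must be >= 0 and max_pages >= 1")
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]

print(page_ranges(1200))  # [(1, 500), (501, 1000), (1001, 1200)]
```

Splitting on chapter or section boundaries instead of fixed page counts usually gives better search results, because related content stays in one document.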

How long does PDF processing take?

Processing time depends on document size and complexity. A typical 20-page PDF completes in under a minute. Scanned documents take longer because OCR adds an extra extraction step. You can poll the document status endpoint or register a webhook to get notified when processing finishes.
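
If you stay with polling, a timeout and a growing delay keep long OCR jobs from hammering the status endpoint. The sketch below is generic. `wait_until_done` and its defaults are assumptions, and `fetch_status` is any zero-argument function that returns the current status string:

```python
import time

def wait_until_done(fetch_status, timeout=300.0, initial_delay=1.0,
                    max_delay=15.0, sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until it returns "ready" or "failed".

    The delay doubles after each attempt, capped at max_delay. Raises
    TimeoutError if the next wait would go past the deadline.
    """
    deadline = clock() + timeout
    delay = initial_delay
    while True:
        status = fetch_status()
        if status in ("ready", "failed"):
            return status
        if clock() + delay > deadline:
            raise TimeoutError(f"still '{status}' after {timeout} seconds")
        sleep(delay)
        delay = min(delay * 2, max_delay)
```

With the client from the Python example, this could be called as `wait_until_done(lambda: client.get_document(kb["id"], doc["id"])["status"])`. The `sleep` and `clock` parameters exist only so the loop can be tested without real waiting.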

Does the API handle scanned PDFs and images inside PDFs?

Yes. The API runs OCR automatically on image-only pages and embedded images within PDFs. Tables in scanned documents are extracted with layout-aware parsing that preserves row and column structure, so your search results reflect the actual data rather than garbled text.

How much does it cost to add PDF search?

Plans start at $29 per month. That includes document parsing, chunking, embedding, storage, and search queries. You do not need to pay separately for an OCR service, an embedding model provider, or a vector database — Ragex bundles the full pipeline into one price.

Can I search across multiple PDFs at once?

Yes. All documents in a knowledge base are searchable together. Upload your PDFs to the same knowledge base and every search query runs across all of them. Use metadata filters to narrow results to specific documents, categories, or date ranges when needed.

Last updated: 2026-02-26