How to search across PDF, DOCX, and CSV files with one API

TL;DR: Upload files of any supported type to a Ragex knowledge base. The API parses each format automatically (PDF layout extraction, DOCX text unpacking, CSV row-by-row ingestion), then chunks, embeds, and indexes everything into a unified search space. One query searches across all formats simultaneously.

Why is cross-format search hard to build yourself?

Each file format requires a different parser:

PDFs need layout analysis, table extraction, and OCR for scanned pages
DOCX files need XML parsing to extract text, styles, and tables
CSV files need column-aware ingestion so header-value pairs stay together
PPTX needs slide-by-slide extraction with speaker notes
Images (PNG, JPG) need OCR to convert visual text to searchable text

Building or integrating a parser per format is a multi-week project. Maintaining them as libraries update and edge cases appear (merged cells, rotated PDFs, password-protected DOCX) is an ongoing burden. And after parsing, you still need chunking, embedding, and indexing.

How does Ragex handle this?

The API provides a single upload endpoint that accepts any of 16 supported file types. It detects the format, applies the appropriate parser, chunks the extracted text, generates embeddings, and indexes everything into one searchable knowledge base.

from ragex import RagexClient

client = RagexClient(api_key="YOUR_API_KEY")
kb = client.create_knowledge_base(name="Company Data")

# Upload different formats — same endpoint, same code
client.upload_document(kb["id"], "quarterly-report.pdf")
client.upload_document(kb["id"], "employee-handbook.docx")
client.upload_document(kb["id"], "inventory.csv")
client.upload_document(kb["id"], "sales-deck.pptx")
client.upload_document(kb["id"], "whiteboard-photo.png")

All five files are processed concurrently. Each file becomes searchable as soon as it reaches ready status. A single query searches across all of them:

results = client.search(
    kb["id"],
    query="What was Q3 revenue?",
    top_k=5,
)

for r in results["results"]:
    print(f"{r['document_name']}: {r['text'][:100]}")

The search might return a chunk from the PDF report, a row from the CSV, and a slide from the PPTX — all ranked by relevance using a cross-encoder reranker.

How does CSV search work?

CSV files are ingested row by row. Each row becomes a chunk with column headers preserved in the text, so searching for "inventory for SKU-1234" matches the relevant row even if the query does not use exact column names. Because the search is semantic, it compares the meaning of your query with the header-value text of each row rather than looking for exact keywords.

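For large spreadsheets, this approach often works better than keyword search because it matches meaning. A query like "highest-selling product" can match rows that carry a "units_sold" column, even though the word "highest" does not appear in the CSV. Similarity search generally ranks rows by how close their text is to the query, not by comparing numeric values across rows, so finding the row with the largest "units_sold" value may still require sorting the returned rows yourself.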
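
To make the row-by-row idea concrete, here is a minimal sketch of how a CSV row could be turned into a self-describing text chunk. The serialization format ("header: value" pairs joined by semicolons), the function name rows_to_chunks, and the sample data are assumptions for illustration; the exact text Ragex produces for each row is not documented here.

```python
import csv
import io


def rows_to_chunks(csv_text: str) -> list[str]:
    """Turn each CSV row into one text chunk that repeats the column headers."""
    reader = csv.DictReader(io.StringIO(csv_text))
    chunks = []
    for row in reader:
        # Pair every value with its header so the chunk makes sense on its own.
        parts = [f"{header}: {value}" for header, value in row.items()]
        chunks.append("; ".join(parts))
    return chunks


# Made-up sample data, for illustration only.
sample = (
    "sku,product,units_in_stock\n"
    "SKU-1234,Desk lamp,42\n"
    "SKU-5678,Office chair,7\n"
)

for chunk in rows_to_chunks(sample):
    print(chunk)
```

Keeping the header next to each value is what lets a query such as "inventory for SKU-1234" land on the right row: the chunk contains both the identifier and the words that describe what the numbers mean.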

Can I tell which format a result came from?

Yes. Every search result includes document_name (the original filename with extension) and document_id. You can display this in your UI so users know whether a result came from a PDF, DOCX, or CSV. Page numbers and section headings are included in chunk metadata for formats that support them.
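
As a rough sketch of how a UI might label results by format, the snippet below groups search results by the extension of document_name. The result shape follows the fields mentioned above (document_name, document_id, text); the helper name group_by_format and the sample results are made up for illustration.

```python
from collections import defaultdict
from pathlib import Path


def group_by_format(results: list[dict]) -> dict[str, list[dict]]:
    """Group search results by the file extension of their document_name."""
    grouped = defaultdict(list)
    for r in results:
        # "quarterly-report.pdf" -> "pdf"; names without an extension -> "unknown"
        ext = Path(r["document_name"]).suffix.lower().lstrip(".") or "unknown"
        grouped[ext].append(r)
    return dict(grouped)


# Made-up results in the shape described above, for illustration only.
sample_results = [
    {"document_name": "quarterly-report.pdf", "document_id": "doc_1", "text": "Q3 revenue grew..."},
    {"document_name": "inventory.csv", "document_id": "doc_2", "text": "sku: SKU-1234; ..."},
    {"document_name": "sales-deck.pptx", "document_id": "doc_3", "text": "Q3 highlights..."},
    {"document_name": "quarterly-report.pdf", "document_id": "doc_1", "text": "Revenue by region..."},
]

for fmt, items in group_by_format(sample_results).items():
    print(fmt, len(items))
```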

What about scanned documents and images?

Scanned PDFs and image files (PNG, JPG, WEBP, TIFF) are processed through OCR. Text is extracted from the image, chunked, and embedded just like parsed text from a native PDF. The search quality depends on image clarity — clean screenshots and well-scanned documents work reliably. Heavily distorted or low-resolution images may produce partial or inaccurate text extraction.

FAQ

Do I need to specify the file type when uploading?

No. The API detects the format from the file content and extension. Upload via multipart form data with the original filename and the API applies the correct parser automatically.
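
For intuition, detection from content and extension can be sketched with well-known file signatures. This is not Ragex's implementation, only an illustration of the general technique; the function name detect_format and the fallback rules are assumptions.

```python
from pathlib import Path

# Well-known leading bytes ("magic numbers") for a few binary formats.
MAGIC_PREFIXES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpg"),
    (b"PK\x03\x04", "zip"),  # DOCX, PPTX, and XLSX are ZIP containers
]
ZIP_BASED = {"docx", "pptx", "xlsx"}


def detect_format(filename: str, head: bytes) -> str:
    """Guess a file's format from its first bytes, using the extension as a tiebreaker."""
    ext = Path(filename).suffix.lower().lstrip(".")
    for prefix, kind in MAGIC_PREFIXES:
        if head.startswith(prefix):
            if kind == "zip":
                # The ZIP signature alone cannot tell DOCX from PPTX, so use the extension.
                return ext if ext in ZIP_BASED else "zip"
            return kind
    # Plain-text formats such as CSV have no signature; fall back to the extension.
    return ext or "unknown"


print(detect_format("quarterly-report.pdf", b"%PDF-1.7\n"))
print(detect_format("employee-handbook.docx", b"PK\x03\x04"))
print(detect_format("inventory.csv", b"sku,product\n"))
```

Combining both signals is what makes sending the original filename worthwhile: content bytes catch misnamed binary files, while the extension disambiguates formats that share a container or have no signature at all.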

Can I search only within a specific file type?

Yes. Attach metadata at upload time (e.g., {"file_type": "csv"}) and use metadata filtering in your search query. The filter {"file_type": {"$eq": "csv"}} restricts results to CSV-sourced chunks only.
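
The two dictionaries involved can be built with small helpers, sketched below. The helper names are hypothetical, and how the metadata and filter are passed to the upload and search calls (parameter names, client methods) is not shown in this article, so check the API reference for that part. The matches function only illustrates what the $eq operator means; the real filtering happens on the server.

```python
from pathlib import Path


def file_type_metadata(filename: str) -> dict:
    """Build upload metadata that records the file's extension, e.g. {"file_type": "csv"}."""
    return {"file_type": Path(filename).suffix.lower().lstrip(".")}


def eq_filter(field: str, value: str) -> dict:
    """Build an equality filter, e.g. {"file_type": {"$eq": "csv"}}."""
    return {field: {"$eq": value}}


def matches(metadata: dict, flt: dict) -> bool:
    """Local illustration of $eq semantics: every filtered field must equal its value."""
    return all(metadata.get(field) == cond["$eq"] for field, cond in flt.items())


csv_only = eq_filter("file_type", "csv")
print(file_type_metadata("inventory.csv"))
print(csv_only)
print(matches(file_type_metadata("inventory.csv"), csv_only))
print(matches(file_type_metadata("quarterly-report.pdf"), csv_only))
```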

What happens with very large CSV or XLSX files?

Large files are processed asynchronously like any other document. Processing time scales with file size — a 10,000-row CSV takes longer than a 100-row one but still processes automatically. The API chunks the content into searchable segments, so even massive spreadsheets are queryable after processing.
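
Because processing is asynchronous, client code typically waits for a document to reach ready status before searching it. Below is a generic polling sketch. It takes any callable that returns the current status string, since the client method for fetching a document's status is not shown in this article; the "processing" status name, the timeout values, and the helper name wait_until_ready are assumptions.

```python
import time
from typing import Callable


def wait_until_ready(
    get_status: Callable[[], str],
    timeout_s: float = 300.0,
    interval_s: float = 2.0,
) -> bool:
    """Poll get_status() until it returns "ready" or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if get_status() == "ready":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Simulated status sequence standing in for a real status lookup.
statuses = iter(["processing", "processing", "ready"])
print(wait_until_ready(lambda: next(statuses), timeout_s=5, interval_s=0.01))
```

For very large spreadsheets, a longer timeout and a gentler polling interval are reasonable, since processing time grows with file size.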

Last updated: 2026-03-09