RAG API that handles multiple document formats
Managed RAG APIs can support 16 file types in a single upload endpoint: PDF, DOCX, PPTX, XLSX, images, and text formats. No separate parser is needed for each format.
TL;DR: Ragex supports 16 file types through a single upload endpoint — no separate parsers per format. Nine types (PDF, DOCX, PPTX, XLSX, PNG, JPG, WEBP, TIFF, CSV) get advanced parsing that handles tables, images, and complex layouts. Seven types (TXT, MD, HTML, TSV, JSON, and others) are ingested directly as text. Upload any supported file and search across all of them with one query.
Why is multi-format support hard to build yourself?
Each document format requires a different parser. PDFs need layout analysis and table extraction. DOCX files need XML unpacking. Spreadsheets need cell-by-cell reading. Scanned images need OCR. Building or integrating a parser for each format means managing multiple libraries, handling edge cases per format, and maintaining the stack as libraries update.
Most teams start with PDF support, then discover their users also upload DOCX, PPTX, and images. Adding each format is another integration project. Ragex eliminates this problem by handling all 16 formats behind a single `upload_document` call.
Which file types get advanced parsing?
The nine Tier 1 formats receive full document intelligence:
| Format | What Gets Parsed |
|---|---|
| PDF | Text, tables, headers, page structure, OCR for scanned pages |
| DOCX | Text, styles, tables, embedded images |
| PPTX | Slide text, speaker notes, tables |
| XLSX | Cell values, sheet names, formulas resolved to values |
| PNG, JPG, WEBP, TIFF | OCR text extraction from images |
| CSV | Row-by-row structured data |
The seven Tier 2 formats (TXT, MD, HTML, TSV, JSON, and others) are ingested as plain text without specialized parsing — the content is already machine-readable.
How does a single search work across mixed formats?
After upload, every document — regardless of format — is chunked into text segments and embedded into the same vector space. A search query matches against all chunks across all documents in a knowledge base. You do not need to specify which format to search or run separate queries per format.
```python
from ragex import RagexClient

client = RagexClient(api_key="YOUR_API_KEY")
kb = client.create_knowledge_base(name="Company Docs")

# Upload different formats to the same knowledge base
client.upload_document(kb["id"], "handbook.pdf")
client.upload_document(kb["id"], "policies.docx")
client.upload_document(kb["id"], "financials.xlsx")
client.upload_document(kb["id"], "architecture.png")

# One search query covers all formats
results = client.search(kb["id"], query="What is the travel policy?", top_k=5)
```
Results include the source document name and file type, so you can show users which document a chunk came from.
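A small sketch of what rendering that attribution might look like. The exact response schema is an assumption here: this example treats each result as a plain dict with `text`, `document_name`, `file_type`, and `score` keys, which may not match the SDK's actual field names.

```python
# Sketch: display search results with source attribution.
# ASSUMPTION: each result is a dict with "text", "document_name",
# "file_type", and "score" keys. Check the SDK's actual schema.

def format_results(results):
    """Render each chunk prefixed with the document it came from."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [
        f'[{r["file_type"].upper()}] {r["document_name"]}: {r["text"]}'
        for r in ranked
    ]

sample = [
    {"text": "Travel budget: $2,000 per quarter.",
     "document_name": "financials.xlsx", "file_type": "xlsx", "score": 0.84},
    {"text": "Employees may book economy flights.",
     "document_name": "policies.docx", "file_type": "docx", "score": 0.91},
]

for line in format_results(sample):
    print(line)
```

Sorting by score before display keeps the most relevant chunk on top regardless of which file type it came from.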
What about documents with tables and images?
Tables are extracted row by row with column headers preserved, so chunks containing table data are searchable by column names and cell values. Images within documents (diagrams in DOCX, charts in PPTX) are processed via OCR if they contain text.
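One common way to make table rows searchable by header is to flatten each row into "header: value" pairs before chunking. Ragex's internal representation is not documented here, so the snippet below is an illustrative sketch of the general technique, not the service's actual code:

```python
# Sketch: flatten a parsed table into one text chunk per row,
# keeping column headers next to cell values so a query can
# match on either the header name or the value.

def table_to_chunks(headers, rows):
    return [
        "; ".join(f"{h}: {v}" for h, v in zip(headers, row))
        for row in rows
    ]

headers = ["Region", "Q1 Revenue", "Q2 Revenue"]
rows = [["EMEA", "1.2M", "1.4M"], ["APAC", "0.9M", "1.1M"]]

chunks = table_to_chunks(headers, rows)
print(chunks[0])  # Region: EMEA; Q1 Revenue: 1.2M; Q2 Revenue: 1.4M
```

Because each chunk carries both header and value, a query like "EMEA revenue" can match even though "revenue" only appears in the header row of the original table.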
Standalone image files (PNG, JPG, WEBP, TIFF) are fully OCR-processed. This means scanned receipts, photographed whiteboards, or screenshots of text are all searchable after upload.
FAQ
Can I upload raw text instead of a file?
Yes. The API has a text ingestion endpoint that accepts a string directly. This is useful for content you already have in memory — scraped web pages, generated text, database records. The text goes through the same chunking and embedding pipeline as file uploads.
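The shared downstream pipeline means raw text is split into chunks just like file content. A minimal fixed-size character chunker with overlap, purely illustrative (real chunkers typically split on sentence or token boundaries, and Ragex's actual chunk sizes are not documented here), looks like:

```python
# Sketch: fixed-size character chunking with overlap, the kind of
# step raw text goes through before embedding. Illustrative only.

def chunk_text(text, size=200, overlap=50):
    """Split text into overlapping windows of `size` characters."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "word " * 100  # 500 characters of stand-in content
chunks = chunk_text(doc)
print(len(chunks), len(chunks[0]))  # 3 200
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk.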
What happens if I upload an unsupported file type?
The API returns an error at upload time with a clear message about which formats are supported. You do not get a silent failure or corrupted processing.
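Since the error surfaces at upload time, a cheap client-side pre-check can reject obviously unsupported files before a network round trip. The extension set below is assembled from the formats listed in this article; treat it as a sketch, and keep the server's response as the source of truth:

```python
from pathlib import Path

# Supported extensions, per the format list above. The server-side
# check remains authoritative; this just fails fast on the client.
SUPPORTED = {
    ".pdf", ".docx", ".pptx", ".xlsx",
    ".png", ".jpg", ".webp", ".tiff",
    ".csv", ".txt", ".md", ".html", ".tsv", ".json",
}

def check_uploadable(path):
    """Raise ValueError before upload if the extension is unsupported."""
    ext = Path(path).suffix.lower()
    if ext not in SUPPORTED:
        raise ValueError(
            f"Unsupported file type {ext!r}; supported: {sorted(SUPPORTED)}"
        )
    return path
```

`check_uploadable("notes.md")` returns the path unchanged; `check_uploadable("video.mp4")` raises before any bytes are sent.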
Can I filter search results by file type?
Yes. Attach metadata at upload time (e.g., `{"format": "pdf"}`) and filter on it during search. The API supports operators like `$eq`, `$in`, and `$ne` for metadata filtering, so you can scope a search to only PDFs or exclude spreadsheets.
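The operator semantics can be sketched in a few lines. This illustrates how $eq, $in, and $ne filters typically behave against per-document metadata; it assumes a filter shape like {"format": {"$in": ["pdf", "docx"]}}, so confirm the exact syntax against the Ragex search reference:

```python
# Sketch of common metadata-filter semantics: a document matches if
# every field's operator clauses are all satisfied. Illustrative only.

def matches(metadata, filter_spec):
    for field, clause in filter_spec.items():
        value = metadata.get(field)
        for op, operand in clause.items():
            if op == "$eq" and value != operand:
                return False
            if op == "$ne" and value == operand:
                return False
            if op == "$in" and value not in operand:
                return False
    return True

docs = [{"format": "pdf"}, {"format": "xlsx"}, {"format": "docx"}]

# Scope a search to PDFs and DOCX files only:
only_text_docs = [d for d in docs if matches(d, {"format": {"$in": ["pdf", "docx"]}})]
print(only_text_docs)  # [{'format': 'pdf'}, {'format': 'docx'}]
```

The same shape excludes spreadsheets with `{"format": {"$ne": "xlsx"}}`.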
Last updated: 2026-03-09