What file types can RAG APIs process?

Most managed RAG APIs process 10-20 file types. A typical full-pipeline service handles 16 file extensions: PDF, DOCX, PPTX, XLSX, and images with OCR get full parsing, while TXT, Markdown, HTML, CSV, TSV, and JSON are ingested directly. No parsing code is required on your side.

TL;DR: A typical managed RAG API processes 16 file extensions across two tiers: 9 with advanced parsing (PDF, DOCX, PPTX, XLSX, and 5 image types with OCR) and 7 with direct text ingestion (TXT, Markdown, HTML and HTM, CSV, TSV, and JSON). Most also accept raw text via API.

Which file types get advanced parsing?

Tier 1 formats require document parsing to extract structured content. The API handles tables, embedded images, scanned pages, and complex layouts automatically — you upload the file and get searchable chunks back.

| Format | Extensions | What gets extracted |
|---|---|---|
| PDF | .pdf | Text, tables, images, scanned pages via OCR |
| Word | .docx | Body text, tables, headers, embedded images |
| PowerPoint | .pptx | Slide text, speaker notes, image descriptions |
| Excel | .xlsx | Sheet data converted to structured text and tables |
| Images | .png, .jpg, .jpeg, .webp, .tiff | OCR text extraction and visual content description |

That is 5 format categories covering 9 individual file extensions. These are the most common document types in enterprise knowledge bases — contracts, reports, slide decks, spreadsheets, and scanned paperwork.

Which file types skip parsing entirely?

Tier 2 formats are already text-based, so they go straight to chunking and embedding with no parsing step. This means faster processing and no parsing cost.

| Format | Extensions | How it is processed |
|---|---|---|
| Plain Text | .txt | Direct chunking |
| Markdown | .md | Heading-aware chunking preserves document structure |
| HTML | .html, .htm | Tags stripped, text extracted and chunked |
| CSV/TSV | .csv, .tsv | Rows converted to searchable text segments |
| JSON | .json | Flattened and chunked |

That is 5 format categories covering 7 individual file extensions. They are useful for ingesting structured data exports, documentation, web content, and configuration files without any preprocessing on your side.
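If you want to know which tier a file falls into before uploading it, a small client-side lookup is enough. The sketch below uses the extensions from the two tables above; the function name and the tier labels are illustrative assumptions, not part of any provider's API.

```python
from pathlib import Path

# Extension sets copied from the Tier 1 and Tier 2 tables in this article.
TIER_1_PARSED = {".pdf", ".docx", ".pptx", ".xlsx",
                 ".png", ".jpg", ".jpeg", ".webp", ".tiff"}
TIER_2_DIRECT = {".txt", ".md", ".html", ".htm", ".csv", ".tsv", ".json"}


def classify_file(filename: str) -> str:
    """Return 'tier1', 'tier2', or 'unsupported' based on the file extension.

    Hypothetical helper: the labels are ours, chosen for illustration.
    """
    ext = Path(filename).suffix.lower()  # case-insensitive match on the extension
    if ext in TIER_1_PARSED:
        return "tier1"
    if ext in TIER_2_DIRECT:
        return "tier2"
    return "unsupported"
```

A check like this only looks at the file name, so a mislabeled file can still be rejected by the server.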

What are the file size and batch limits?

Every upload has constraints. Here are the limits for a typical managed RAG API:

| Constraint | Limit |
|---|---|
| Maximum file size | 50 MB |
| Maximum pages per document | 500 |
| Maximum files per batch upload | 20 |
| Text file encoding | UTF-8 |

Documents exceeding these limits need to be split before uploading. For most use cases — technical manuals, legal contracts, quarterly reports — these limits are more than sufficient.
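Checking these limits locally avoids a rejected upload. The following is a minimal sketch under a few assumptions: 50 MB is treated as 50 × 1024 × 1024 bytes (a provider may count decimal megabytes instead), the helper names are hypothetical, and the 500-page limit is not checked because counting pages requires a PDF parser.

```python
MAX_FILE_BYTES = 50 * 1024 * 1024   # assumption: binary megabytes
MAX_FILES_PER_BATCH = 20
TEXT_EXTENSIONS = (".txt", ".md", ".html", ".htm", ".csv", ".tsv", ".json")


def preflight(filename: str, data: bytes) -> list[str]:
    """Return a list of problems found; an empty list means the file passes."""
    problems = []
    if len(data) > MAX_FILE_BYTES:
        problems.append("file exceeds 50 MB")
    if filename.lower().endswith(TEXT_EXTENSIONS):
        try:
            data.decode("utf-8")  # text formats must be valid UTF-8
        except UnicodeDecodeError:
            problems.append("text file is not valid UTF-8")
    return problems


def batches(filenames: list[str], size: int = MAX_FILES_PER_BATCH) -> list[list[str]]:
    """Split a list of files into groups no larger than the batch limit."""
    return [filenames[i:i + size] for i in range(0, len(filenames), size)]
```

With 45 files, `batches` yields three groups of 20, 20, and 5, each of which fits in one batch upload.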

What about raw text that is not in a file?

Most RAG APIs also support raw text ingestion. Instead of uploading a file, you POST a string of text directly to the documents endpoint. The text is stored and processed through the same chunking and embedding pipeline as file uploads.

This is useful when your source data lives in a database, a CMS, or an application rather than in files. You can ingest chat transcripts, support tickets, wiki entries, or any other text content without writing it to a file first.
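A raw text request might look like the sketch below. Everything provider-specific here is an assumption made for illustration: the base URL, the `/documents` path, the `text` and `metadata` field names, and bearer-token authentication. Check your provider's API reference for the real values.

```python
import json
import urllib.request
from typing import Optional

# Hypothetical base URL; substitute your provider's real endpoint.
BASE_URL = "https://api.example.com/v1"


def build_text_ingest_request(
    text: str, api_key: str, metadata: Optional[dict] = None
) -> urllib.request.Request:
    """Build (but do not send) a POST that submits raw text for ingestion."""
    # Assumed payload shape: a text string plus optional metadata.
    payload = {"text": text, "metadata": metadata or {}}
    return urllib.request.Request(
        url=f"{BASE_URL}/documents",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Sending is one more call once the request is built:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Building the request separately from sending it keeps the payload easy to inspect and test without network access.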

How does the API handle complex documents?

The hardest parsing problems are tables, scanned pages, and mixed layouts. Ragex handles each of these:

  • Tables — Layout-aware extraction preserves row and column structure. Table data stays intact as a single chunk rather than getting split into meaningless fragments.
  • Scanned documents — Image-only pages and embedded images go through OCR automatically. You do not need a separate OCR service.
  • Mixed layouts — Multi-column pages, headers, footers, and sidebars are parsed with structural awareness so the extracted text follows a logical reading order.

This matters because a RAG system is only as good as its parsed content. If tables turn into garbled strings or scanned pages return empty text, search quality collapses regardless of how good the embedding or retrieval layer is.
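To make the table behavior concrete, here is one way a chunker could keep tables intact. This is a simplified illustration, not any provider's actual implementation: it assumes the parser has already split the document into blocks labeled "text" or "table", and the `Block` type and `chunk_blocks` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    kind: str      # "text" or "table" (assumed labels from the parsing step)
    content: str


def chunk_blocks(blocks: list[Block], max_chars: int = 1000) -> list[str]:
    """Pack consecutive text blocks together; emit each table as its own chunk.

    For brevity, a single text block longer than max_chars is kept whole.
    """
    chunks: list[str] = []
    buffer = ""
    for block in blocks:
        if block.kind == "table":
            # Flush pending text, then keep the table as one unsplit chunk,
            # even if it is longer than max_chars.
            if buffer:
                chunks.append(buffer)
                buffer = ""
            chunks.append(block.content)
            continue
        candidate = f"{buffer}\n{block.content}" if buffer else block.content
        if len(candidate) > max_chars and buffer:
            chunks.append(buffer)      # current chunk is full; start a new one
            buffer = block.content
        else:
            buffer = candidate
    if buffer:
        chunks.append(buffer)
    return chunks
```

The trade-off is that a large table can exceed the usual chunk size, which is generally preferable to splitting rows away from their header.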

FAQ

Can I upload multiple file types to the same knowledge base?

Yes. All 16 supported file types can coexist in a single knowledge base. Upload PDFs, spreadsheets, images, and text files together — they all get processed into the same searchable index. Use metadata filters to narrow search results to specific document types when needed.

Do Tier 2 files cost less to process than Tier 1?

Tier 2 files skip the external parsing step, which means they process faster and avoid parsing overhead. The cost difference depends on the provider. Plans typically start at $29 per month and meter by pages processed, with Tier 2 files counted by token equivalent rather than literal page count.

What happens if I upload an unsupported file type?

The API rejects the upload at validation time and returns an error before any processing begins. You get an immediate response identifying the unsupported format. Convert the file to a supported type (PDF or plain text are the safest options) and re-upload.

Is there a way to ingest content from a URL instead of a file?

Some managed RAG APIs support URL-based ingestion where you provide a link and the API fetches, parses, and indexes the content. Alternatively, you can fetch the page yourself, extract the HTML or text, and use the raw text ingestion endpoint to send it directly.
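For the do-it-yourself route, the extraction step can be sketched with Python's standard library alone. The example below assumes you have already fetched the page's HTML as a string; the class and function names are hypothetical, and production pages may need a more robust extractor. The resulting text is what you would send to the raw text ingestion endpoint.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style"}   # contents are not visible text
BLOCK_TAGS = {"p", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collect visible text, inserting line breaks after block-level elements."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self._parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "br":
            self._parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self._parts.append(data)

    def text(self) -> str:
        # Collapse runs of whitespace on each line and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self._parts).splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(html: str) -> str:
    """Strip tags from an HTML string and return its visible text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```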


Last updated: 2026-02-26