RAG API for AI assistants
A RAG API for AI assistants grounds answers in your actual documents. It handles parsing, embedding, and retrieval so you can focus on the assistant experience.
AI assistants without access to your documents are limited to their training data. When a user asks about your product's return policy, deployment process, or pricing tiers, a vanilla LLM either hallucinates or deflects. A RAG API for AI assistants grounds responses in your actual content — upload documents, search on each user query, and pass relevant context to the LLM.
Why AI assistants need retrieval
The gap between what users expect and what a base LLM can answer is the #1 complaint in assistant deployments. Users assume the assistant knows everything about the product, company, or domain. Without retrieval, it does not.
Consider a developer building an internal knowledge base assistant for their company. Employees ask questions like "What's the PTO policy for contractors?" or "How do I file an expense report?" — questions that an LLM cannot answer without seeing the actual HR documents. Retrieval-augmented generation bridges this gap by searching the company's documents on every query and feeding the results to the LLM as context.
The alternative, fine-tuning the model on your documents, is expensive ($500-5,000+ per training run), goes stale when documents change, and still hallucinates when asked about topics outside the training data. RAG keeps the knowledge current because search results always reflect the latest uploaded documents.
How Ragex powers assistant features
Ragex handles the retrieval side of the assistant architecture. On each user message:
- Search — the API finds relevant text chunks from your knowledge base
- Context — you inject those chunks into the LLM's prompt as context
- Generate — the LLM produces a grounded answer based on the context
from ragex import RagexClient
from openai import OpenAI

rag = RagexClient(api_key="YOUR_RAGEX_API_KEY")
llm = OpenAI(api_key="YOUR_OPENAI_KEY")

def assistant_respond(user_message: str, kb_id: str) -> str:
    # Step 1: Search documents
    results = rag.search(kb_id, query=user_message, top_k=5)
    context = "\n\n".join(r["text"] for r in results["results"])
    # Step 2: Generate grounded answer
    response = llm.chat.completions.create(
        model="gpt-5-mini",
        messages=[
            {"role": "system", "content": f"Answer from this context only. If the context doesn't cover the question, say so.\n\n{context}"},
            {"role": "user", "content": user_message}
        ],
    )
    return response.choices[0].message.content
The managed API handles 16 file types automatically — PDFs, DOCX, spreadsheets, images with OCR, and text formats. You do not build separate parsers for each format. Reranking is enabled by default, so the top results passed to the LLM are the most relevant, not just the closest vector matches.
Multi-tenant assistants for SaaS products
SaaS companies building embedded AI assistants face a specific challenge: each customer's assistant needs to answer from that customer's documents only. Data isolation is critical.
Two patterns work with Ragex:
Separate knowledge bases per tenant. Create a knowledge base for each customer and route queries to the right one:
# Each customer has their own knowledge base
customer_kb_map = {
    "acme_corp": "kb_acme123",
    "startup_co": "kb_start456",
}

def handle_query(customer_id: str, question: str) -> str:
    kb_id = customer_kb_map[customer_id]
    return assistant_respond(question, kb_id)
Metadata filtering in a shared knowledge base. Tag documents with a customer ID at upload and filter at search time. This uses fewer API resources but requires careful metadata management.
Teams building customer support assistants typically start with the separate-KB approach for simpler isolation guarantees.
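For the shared-knowledge-base pattern, one possible safeguard is a client-side check that runs after the search and drops any chunk not tagged with the requesting customer's ID. The sketch below is illustrative only: the exact filter parameter on the search endpoint is not shown in this article, so the function wraps any search callable, and the `metadata` and `customer_id` keys on each result are assumed names rather than confirmed ones.

```python
from typing import Any, Callable

# Hypothetical result shape: {"text": str, "metadata": {"customer_id": str}}.
# The "metadata" and "customer_id" keys are assumptions for illustration.
SearchFn = Callable[[str], list[dict[str, Any]]]


def search_for_tenant(
    search_fn: SearchFn, customer_id: str, question: str
) -> list[dict[str, Any]]:
    """Run a search, then keep only chunks tagged with this customer's ID."""
    results = search_fn(question)
    return [
        r for r in results
        if r.get("metadata", {}).get("customer_id") == customer_id
    ]
```

A guard like this does not replace filtering at search time. It only ensures that a missing or mistyped tag results in a dropped chunk rather than a cross-tenant leak.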
What you eliminate vs what you build
Ragex handles the infrastructure that most teams spend weeks building:
Handled by the API:
- Document parsing for 16 file types (no separate parser per format)
- Text chunking with automatic segmentation
- Embedding and vector storage (no database to provision)
- Semantic search with cross-encoder reranking
- Document lifecycle management (upload, update, delete)
You build:
- The assistant's conversation logic and personality
- Prompt engineering for your specific domain
- User interface and conversation history
- Integration with your LLM provider (OpenAI, Anthropic, or open-source models)
- Authentication and rate limiting for your users
This separation lets you iterate on the assistant experience without touching the retrieval infrastructure. When you change the prompt or switch LLM providers, the search pipeline stays the same. Teams in regulated industries like healthcare benefit from this architecture because the retrieval layer handles compliance-sensitive document processing independently from the LLM integration.
Scaling from prototype to production assistant
The code above is production-ready, not a prototype hack. Upload more documents and the knowledge base grows automatically — no database resizing or index tuning. The managed API scales with your usage.
Plans start at $29/mo (Starter), with Pro at $79/mo and Scale at $199/mo. Compare this to building a DIY RAG pipeline — a vector database plus embedding API plus document parser typically runs $200-500/mo before you write a line of assistant logic.
For teams considering alternatives, see how Ragex compares to vector databases like Pinecone.
If you prefer a framework-based approach, explore integrations with LangChain or LlamaIndex.
FAQ
Can the assistant cite which document an answer came from?
Yes. Each search result includes the source document name, page number, section heading, and relevance score. Include this metadata in the LLM's prompt so it can reference specific sources in its answer. Users trust assistants more when they can verify the source.
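As a rough sketch of how that metadata might be folded into the prompt, the helper below numbers each chunk and prefixes it with its source. The key names `document_name` and `page` are assumptions made for illustration; check the actual search response for the real field names.

```python
from typing import Any


def format_context_with_sources(results: list[dict[str, Any]]) -> str:
    """Number each chunk and prefix it with its source so the LLM can cite it."""
    blocks = []
    for i, r in enumerate(results, start=1):
        # "document_name" and "page" are assumed key names, not confirmed ones.
        source = r.get("document_name", "unknown document")
        page = r.get("page")
        label = f"[{i}] {source}" + (f", page {page}" if page is not None else "")
        blocks.append(f"{label}\n{r['text']}")
    return "\n\n".join(blocks)
```

With numbered labels in the context, the system prompt can ask the LLM to cite sources by number, for example "[1]".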
How do I handle questions the documents do not cover?
Set a score threshold on the search endpoint (e.g., score_threshold=0.3). If no results meet the threshold, tell the user the assistant does not have that information rather than letting the LLM guess. This is better than hallucinating — users prefer an honest "I don't know" to a confident wrong answer.
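The application-side half of this pattern could look something like the sketch below, which applies the same cutoff on the client and returns a fixed fallback message when nothing qualifies. The `score` key, the `FALLBACK` wording, and the function name are all hypothetical. If the threshold is already set on the search endpoint, only the empty-result branch is needed.

```python
from typing import Any

FALLBACK = "I don't have that information in the documents I can search."


def context_or_fallback(
    results: list[dict[str, Any]], threshold: float = 0.3
) -> tuple[bool, str]:
    """Return (True, context) if any chunk clears the threshold, else (False, fallback)."""
    # "score" is an assumed key name for the relevance score on each result.
    kept = [r for r in results if r.get("score", 0.0) >= threshold]
    if not kept:
        return False, FALLBACK
    return True, "\n\n".join(r["text"] for r in kept)
```

When the first element is False, the caller can return the fallback text directly and skip the LLM call.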
Does the assistant get smarter as I add more documents?
The assistant's retrieval coverage expands with every document you upload. Adding more relevant content means more user questions get accurate answers. The LLM itself does not change, but the context it receives improves. This is one of RAG's key advantages over fine-tuning — you expand knowledge by uploading files, not retraining models.
Can I add conversation memory alongside document search?
Yes. Manage conversation history in your application and include recent messages in the prompt alongside search results. Ragex handles document search; you handle conversation state. This is the standard pattern for chatbot implementations that need both document grounding and conversational context.
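One way this could be assembled is sketched below. It builds a chat-style message list from the retrieved context, the most recent turns, and the new user message. The function name and the `max_history` parameter are hypothetical, and the history is assumed to be a list of `{"role": ..., "content": ...}` dictionaries stored by your application.

```python
def build_messages(
    history: list[dict[str, str]],
    context: str,
    user_message: str,
    max_history: int = 6,
) -> list[dict[str, str]]:
    """Combine a system prompt with retrieved context, recent turns, and the new message."""
    system = {
        "role": "system",
        "content": (
            "Answer from this context only. If the context doesn't cover "
            f"the question, say so.\n\n{context}"
        ),
    }
    # Guard against max_history == 0, since history[-0:] would return everything.
    recent = history[-max_history:] if max_history > 0 else []
    return [system, *recent, {"role": "user", "content": user_message}]
```

Capping the history keeps the prompt size predictable. The right cap depends on your model's context window and how long your retrieved chunks are.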
Last updated: 2026-03-09