RAG API with LlamaIndex
LlamaIndex gives you query engines, chat engines, and response synthesizers. But wiring up the retrieval side — document parsers, embedding models, vector stores, rerankers — takes days of configuration. Ragex with LlamaIndex replaces that entire retrieval stack with a single API call. Write one custom retriever class and start querying in minutes.
Why LlamaIndex Developers Need a Managed Retrieval API
A standard LlamaIndex RAG pipeline requires you to configure at least four or five separate components before you write your first query. You pick a document loader like SimpleDirectoryReader or a format-specific reader for PDFs. You configure a node parser — SentenceSplitter or TokenTextSplitter — and tune chunk sizes and overlap. You choose an embedding model, set up a vector store like Pinecone or Chroma or Qdrant, and optionally add a node postprocessor for reranking.
Each of those components comes from a different vendor with its own API keys, configuration surface, and failure modes. When retrieval quality drops, you debug across the full stack. Did the parser miss a table? Is the chunk size too large? Is the embedding model underperforming on your domain? Most teams spend days tuning these knobs before they even get to building the query engine — the part of LlamaIndex that actually matters for their application.
If you have built customer support systems or chatbots with LlamaIndex before, you have felt this pain firsthand. The framework is excellent at orchestrating retrieval-augmented generation, but the retrieval pipeline itself is your problem to build and maintain.
How the Integration Works
The architecture is straightforward. Ragex handles everything on the retrieval side: document parsing (16 file types including PDFs with OCR, DOCX, images, and spreadsheets), chunking, embedding, indexing, and reranking. Your LlamaIndex application handles everything on the synthesis side: query engines, response synthesizers, prompt templates, chat engines, and agents.
The bridge between the two is a custom retriever class. LlamaIndex's BaseRetriever interface defines a _retrieve() method that takes a query and returns a list of NodeWithScore objects. Your custom retriever implements that method by calling the Ragex /search endpoint and converting the results into the format LlamaIndex expects.
This means every LlamaIndex component that accepts a retriever — RetrieverQueryEngine, CitationQueryEngine, CondenseQuestionChatEngine, ContextChatEngine — works with no modifications. You swap out the retrieval backend while keeping all your orchestration logic intact. The same pattern works whether you are building document search or an internal knowledge base.
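The adapter idea itself is plain duck typing and does not depend on LlamaIndex. Here is a stripped-down sketch with stand-in classes (every name below is illustrative, not the real LlamaIndex or Ragex API):

```python
from dataclasses import dataclass, field


@dataclass
class NodeWithScore:  # stand-in for LlamaIndex's NodeWithScore
    text: str
    score: float
    metadata: dict = field(default_factory=dict)


class BaseRetriever:  # minimal stand-in for the retriever interface
    def retrieve(self, query: str) -> list[NodeWithScore]:
        raise NotImplementedError


class FakeRagexRetriever(BaseRetriever):
    """Pretends to call the /search endpoint; returns a canned result."""

    def retrieve(self, query: str) -> list[NodeWithScore]:
        return [NodeWithScore(text=f"chunk matching {query!r}", score=0.92)]


class SimpleQueryEngine:
    """Any engine written against the retriever interface works unchanged."""

    def __init__(self, retriever: BaseRetriever):
        self.retriever = retriever

    def query(self, question: str) -> str:
        # The engine never knows where the chunks came from.
        nodes = self.retriever.retrieve(question)
        return f"answer synthesized from {len(nodes)} node(s)"


engine = SimpleQueryEngine(FakeRagexRetriever())
print(engine.query("refund policy"))
```

Swapping the retrieval backend means swapping one class; everything downstream of the interface stays untouched, which is exactly what the real BaseRetriever subclass below exploits.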
Complete Working Example
Here is the full, runnable implementation. This code defines a RagexRetriever that extends BaseRetriever, calls the managed search API, and plugs into a standard RetrieverQueryEngine with an OpenAI LLM for response synthesis.
import httpx
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.schema import NodeWithScore, TextNode, QueryBundle
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.response_synthesizers import get_response_synthesizer
from llama_index.llms.openai import OpenAI


class RagexRetriever(BaseRetriever):
    """LlamaIndex retriever backed by Ragex."""

    def __init__(
        self,
        api_key: str,
        kb_id: str,
        base_url: str = "https://api.useragex.com/api/v1",
        top_k: int = 5,
        rerank: bool = True,
    ):
        super().__init__()
        self._client = httpx.Client(
            base_url=base_url,
            headers={"Authorization": f"Bearer {api_key}"},
            timeout=10.0,
        )
        self._kb_id = kb_id
        self._top_k = top_k
        self._rerank = rerank

    def _retrieve(self, query_bundle: QueryBundle) -> list[NodeWithScore]:
        """Call Ragex /search and return LlamaIndex NodeWithScore objects."""
        response = self._client.post(
            f"/knowledge-bases/{self._kb_id}/search",
            json={
                "query": query_bundle.query_str,
                "top_k": self._top_k,
                "rerank": self._rerank,
            },
        )
        response.raise_for_status()
        results = response.json()["results"]
        return [
            NodeWithScore(
                node=TextNode(
                    text=r["text"],
                    metadata={
                        "document_name": r["document_name"],
                        "chunk_index": r["metadata"]["chunk_index"],
                        "start_page": r["metadata"].get("start_page"),
                        "section_heading": r["metadata"].get("section_heading"),
                    },
                ),
                score=r["score"],
            )
            for r in results
        ]


# Usage: plug into a LlamaIndex query engine
retriever = RagexRetriever(
    api_key="rag_live_abc123",
    kb_id="kb_x1y2z3w4v5",
    top_k=5,
    rerank=True,
)

llm = OpenAI(model="gpt-5-mini")
synth = get_response_synthesizer(llm=llm)
query_engine = RetrieverQueryEngine(retriever=retriever, response_synthesizer=synth)

# Query your documents — retrieval handled by Ragex, synthesis by LlamaIndex
response = query_engine.query("What is our refund policy?")
print(response.response)
print(f"Sources: {[n.node.metadata['document_name'] for n in response.source_nodes]}")
Upload your documents through the API first. The service parses, chunks, embeds, and indexes them automatically. Then the retriever above handles search at query time. If you have used the LangChain integration, the pattern is similar — the difference is that LlamaIndex uses NodeWithScore objects instead of LangChain Document objects.
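If you want to sanity-check the JSON-to-node mapping before wiring up credentials, the same conversion logic can be exercised over a canned payload. The response shape below is inferred from the retriever above; the values are made up, and plain dicts stand in for LlamaIndex's node types so this runs without LlamaIndex installed:

```python
# Canned /search response (shape inferred from the retriever above,
# values illustrative) and the mapping step the retriever performs.
sample_response = {
    "results": [
        {
            "text": "Refunds are accepted within 30 days.",
            "score": 0.91,
            "document_name": "policies.pdf",
            "metadata": {"chunk_index": 4, "start_page": 2, "section_heading": "Refunds"},
        }
    ]
}


def to_nodes(payload: dict) -> list[dict]:
    # Mirrors the NodeWithScore construction, using plain dicts.
    return [
        {
            "text": r["text"],
            "score": r["score"],
            "metadata": {
                "document_name": r["document_name"],
                "chunk_index": r["metadata"]["chunk_index"],
                "start_page": r["metadata"].get("start_page"),
                "section_heading": r["metadata"].get("section_heading"),
            },
        }
        for r in payload["results"]
    ]


nodes = to_nodes(sample_response)
print(nodes[0]["metadata"]["document_name"])  # policies.pdf
```

Note the `.get()` calls for `start_page` and `section_heading`: those fields can be absent (for example, in documents without page boundaries), so the mapping must tolerate missing keys.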
What Changes in Your Codebase
When you switch to a managed retrieval API, a significant portion of your LlamaIndex code goes away while the parts you care about stay exactly the same.
Code you delete:
- Document loader setup — SimpleDirectoryReader, PDFReader, and any custom readers
- Node parser configuration — chunk size, overlap, heading-aware splitting logic
- Embedding model selection, hosting, and API key management
- Vector store provisioning and maintenance for Pinecone, Chroma, Qdrant, or other backends
- Node postprocessor setup for reranking
- Index construction and rebuild logic
Code you keep:
- LlamaIndex query engine orchestration
- Response synthesizers and prompt templates
- Chat engine and conversation memory
- Agent and tool abstractions
- Your choice of LLM — GPT-5-mini, Claude, Llama, Mistral, or any other model
- Custom output parsing and structured extraction
The net result: your codebase gets smaller. The retrieval complexity lives behind an API, and you focus on the query engine and synthesis logic that defines your application. Compare this to the typical DIY approach described in our Pinecone comparison — building the pipeline yourself means maintaining all of those components indefinitely.
Common Patterns
Chat Engine with Conversation Memory
Pass the RagexRetriever to a LlamaIndex chat engine for multi-turn conversations. The chat engine handles question condensation and memory while the retriever fetches relevant context from the managed API on every turn.
from llama_index.core.chat_engine import CondenseQuestionChatEngine

chat_engine = CondenseQuestionChatEngine.from_defaults(
    query_engine=query_engine,
    llm=llm,
)

response = chat_engine.chat("What's our return window?")
print(response.response)

follow_up = chat_engine.chat("Does that apply to international orders?")
print(follow_up.response)
This pattern is particularly useful when building chatbot applications where users ask follow-up questions that reference earlier parts of the conversation. The API is stateless, so each search call is independent — conversation state lives entirely in LlamaIndex's chat engine.
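The division of state can be made concrete with a small sketch. The classes below are hypothetical stand-ins, not the real chat engine or API: the search backend treats every call as independent, while the client side holds the transcript and condenses follow-ups into standalone questions.

```python
# Where state lives: the retrieval API sees one independent query per
# turn; the client keeps the conversation history. (All names are
# hypothetical; LlamaIndex's chat engine does the real condensation.)
class StatelessSearch:
    def search(self, query: str) -> str:
        return f"context for {query!r}"  # each call stands alone


class ChatSession:
    def __init__(self, backend: StatelessSearch):
        self.backend = backend
        self.history: list[str] = []  # conversation state lives client-side

    def condense(self, question: str) -> str:
        # Real chat engines use the LLM to rewrite a follow-up into a
        # standalone question; prepending the last turn stands in for that.
        return " ".join(self.history[-1:] + [question]) if self.history else question

    def chat(self, question: str) -> str:
        standalone = self.condense(question)
        context = self.backend.search(standalone)
        self.history.append(question)
        return context


session = ChatSession(StatelessSearch())
session.chat("What's our return window?")
print(session.chat("Does that apply to international orders?"))
```

The second search query carries the earlier turn's wording even though the backend never saw the first call, which is the property that makes a stateless retrieval API safe to use behind a multi-turn chat engine.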
Citation Query Engine
Use CitationQueryEngine to get inline citations that trace answers back to specific source documents. The metadata from the search results — document name, page number, section heading — flows through automatically because the retriever populates TextNode.metadata.
from llama_index.core.query_engine import CitationQueryEngine

citation_engine = CitationQueryEngine.from_args(
    retriever=retriever,
    llm=llm,
)

response = citation_engine.query("What are the shipping costs?")
print(response.response)

for source in response.source_nodes:
    print(f"  [{source.node.metadata['document_name']}] {source.node.get_text()[:100]}...")
Citation support matters for healthcare applications and other domains where users need to verify the source of every answer. It also works well for document search tools where pointing back to the original file builds user trust.
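To show what flows through, here is the citation formatting sketched by hand over the same metadata fields the retriever populates. Plain dicts stand in for NodeWithScore, and the document names and texts are illustrative:

```python
# Building citation strings from retrieved node metadata (field names
# match the retriever's TextNode.metadata; values are made up).
nodes = [
    {"text": "Standard shipping is $5.",
     "metadata": {"document_name": "shipping.pdf", "start_page": 3}},
    {"text": "Express shipping is $15.",
     "metadata": {"document_name": "shipping.pdf", "start_page": 4}},
]


def cite(node: dict) -> str:
    m = node["metadata"]
    # Page numbers are optional, e.g. for documents without pagination.
    page = f", p. {m['start_page']}" if m.get("start_page") is not None else ""
    return f"[{m['document_name']}{page}]"


for n in nodes:
    print(f"{n['text']} {cite(n)}")  # e.g. "Standard shipping is $5. [shipping.pdf, p. 3]"
```

CitationQueryEngine does this numbering and attribution for you; the point of the sketch is that it only works because the retriever put `document_name` and `start_page` into each node's metadata in the first place.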
Getting Started
Prerequisites: Python 3.10+, a Ragex API key, and an OpenAI API key (or any LLM provider).
Step 1. Install the required packages:
pip install llama-index httpx
Step 2. Upload your documents to the API. You can do this through the dashboard or programmatically:
curl -X POST https://api.useragex.com/api/v1/knowledge-bases \
  -H "Authorization: Bearer $RAGEX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name": "my-docs"}'

curl -X POST https://api.useragex.com/api/v1/knowledge-bases/YOUR_KB_ID/documents \
  -H "Authorization: Bearer $RAGEX_API_KEY" \
  -F "file=@your-document.pdf"
Step 3. Copy the RagexRetriever class from the code sample above into your project. Set your API key and knowledge base ID.
Step 4. Wire it into your query engine:
retriever = RagexRetriever(api_key="your_key", kb_id="your_kb_id")
query_engine = RetrieverQueryEngine(retriever=retriever, response_synthesizer=synth)
response = query_engine.query("Your first question here")
The typical setup time is under 5 minutes from sign-up to first query. Compare that to the 2-5 days most teams spend configuring a full LlamaIndex retrieval pipeline with loaders, parsers, vector stores, and embedding models.
For teams evaluating different managed retrieval options, our Pinecone alternatives guide covers how the managed API approach compares to other solutions in the space. And if you are building with the Vercel AI SDK on the frontend, the Vercel AI SDK integration shows how to connect your LlamaIndex backend to a streaming React UI.
Pricing starts at $29/month for the Starter plan (500 pages, 5,000 queries), which includes enough capacity for most development and early production workloads. The Pro plan at $79/month (2,000 pages, 15,000 queries), Business plan at $229/month (6,500 pages, 50,000 queries), and Scale plan at $499/month (15,000 pages, 120,000 queries) add higher limits for growing applications.
FAQ
Can I use this as a LlamaIndex custom retriever?
Yes. Subclass BaseRetriever and implement _retrieve() to call the search endpoint, converting each result into a NodeWithScore with a TextNode. The code sample above shows the full implementation. The retriever plugs directly into RetrieverQueryEngine, CitationQueryEngine, or any LlamaIndex component that accepts a retriever.
Does this work with LlamaIndex chat engines?
Yes. Pass the RagexRetriever to CondenseQuestionChatEngine or ContextChatEngine. The chat engine calls _retrieve() on each turn, which hits the search API. Conversation memory and question condensation are handled by LlamaIndex as usual. The API is stateless REST, so each search call is independent.
Do I still need LlamaIndex's document loaders and node parsers?
No. The API handles document loading, parsing (including OCR for scanned PDFs and layout-aware table extraction), chunking, and embedding. You upload files directly instead of processing them through LlamaIndex's ingestion pipeline. This eliminates SimpleDirectoryReader, all format-specific readers, and node parser configuration entirely.
How does retrieval quality compare to a self-managed LlamaIndex pipeline?
The managed API applies automatic reranking and uses production-grade embedding and retrieval components that are continuously updated. Most self-managed pipelines use mid-tier embedding models without reranking, which tends to produce lower-quality results. With the managed approach, when better models become available, your retrieval quality improves without any code change on your end. You can read more about the managed versus self-hosted tradeoffs in our comparison guide.
What LLMs can I use with this integration?
Any LLM that works with LlamaIndex. The retriever handles search only — it returns NodeWithScore objects that feed into whatever response synthesizer and LLM you choose. GPT-5-mini, Claude, Llama, Mistral, and any other model supported by LlamaIndex all work. Your LLM choice is completely independent of the retrieval backend, which is one reason this approach pairs well with customer support systems and internal knowledge bases that may have different LLM requirements.
Last updated: 2026-02-20