RAG API with Vercel AI SDK
Ragex with Vercel AI SDK replaces the entire retrieval pipeline behind your streaming app with a single REST endpoint. Upload your documents, call /search, and feed the results to streamText. No vector database to provision, no embedding model to choose, no document parser to maintain.
Why Vercel AI SDK Developers Need a Managed RAG API
The Vercel AI SDK handles the hard parts of building AI-powered interfaces: server-sent events, React hooks like useChat, and provider-agnostic model switching across OpenAI, Anthropic, Google, and more. But it has no built-in retrieval layer. If your app needs to answer questions grounded in your own documents, the retrieval pipeline is entirely on you.
A typical DIY RAG setup for a Next.js app requires five or more components from different vendors:
- Document parsing -- extracting text from PDFs, DOCX files, spreadsheets, and images. Each format needs its own parser, and scanned documents need OCR.
- Chunking -- splitting documents into retrieval-friendly segments. Chunk size, overlap, heading preservation, and table handling all affect result quality.
- Embedding -- converting text chunks into vectors using an embedding model API.
- Vector storage -- hosting and indexing vectors in a database that handles similarity search at scale.
- Reranking -- running a cross-encoder model to re-score results after initial vector search, improving relevance.
Each component comes with its own API keys, rate limits, and failure modes. Most teams spend 2-3 weeks integrating these before they can stream a single grounded answer. When results are poor, you debug across five services instead of one. This overhead is the same challenge teams face when building chatbot applications or document search tools from scratch.
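The chunking step alone hides real tradeoffs. As a rough illustration of what you would otherwise write and tune yourself, here is a naive fixed-size chunker -- a sketch only, with illustrative size and overlap values, not Ragex's actual strategy:

```typescript
// Naive fixed-size chunker with overlap -- one of the five DIY components
// listed above. Real pipelines also handle headings, tables, and sentence
// boundaries; the chunkSize and overlap values here are illustrative.
function chunkText(text: string, chunkSize = 500, overlap = 50): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
    // Step forward by less than a full chunk so adjacent chunks overlap
    start += chunkSize - overlap;
  }
  return chunks;
}
```

Even this toy version forces decisions (size, overlap, what to do at document boundaries) that affect retrieval quality downstream.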
How It Works
The architecture splits cleanly: Ragex handles everything below the generation layer, and the Vercel AI SDK handles everything above it.
```
User question → Your API route → RAG API /search → Retrieved chunks
                                                         ↓
       ← Streamed response ← streamText() ← Chunks as context
```
Here is the flow in detail:
- A user sends a message through your useChat hook on the frontend.
- Your Next.js API route receives the message.
- Your route calls Ragex's /search endpoint with the user's question. The API returns the top matching document chunks, ranked by relevance with reranking enabled by default.
- Your route formats those chunks into a context string and passes it as part of the system prompt to the Vercel AI SDK's streamText function.
- streamText calls your chosen LLM provider (OpenAI, Anthropic, Google, or any other supported provider) and streams the response back to the client.
Ragex is stateless and provider-agnostic. It returns JSON search results. You control what happens with those results -- which LLM receives them, what system prompt wraps them, and how the response renders. This same retrieval pattern powers customer support applications and internal knowledge base tools built on Ragex.
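The examples in this guide consume three fields from each search result: content, file_name, and score. A minimal sketch of that assumed response shape, plus the context-formatting step as a standalone helper (the interface is inferred from the fields used below, not an official type):

```typescript
// Assumed shape of one /search result, inferred from the fields the
// examples in this guide consume -- not an official Ragex type.
interface SearchResult {
  content: string;
  file_name: string;
  score: number;
}

// Turn ranked results into a numbered context block for the system prompt.
function formatContext(results: SearchResult[]): string {
  return results
    .map(
      (r, i) =>
        `[${i + 1}] (${r.file_name}, score: ${r.score.toFixed(2)})\n${r.content}`
    )
    .join("\n\n");
}
```

Numbering the chunks lets the model cite sources as [1], [2], and so on, which pairs well with the citation pattern described later.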
Complete Working Example
This Next.js App Router API route receives a chat message, retrieves relevant context from Ragex, and streams a grounded answer using the Vercel AI SDK. Copy this into your project and set two environment variables to get started.
```typescript
import { streamText } from 'ai';
import { openai } from '@ai-sdk/openai';

// Next.js App Router API route: app/api/chat/route.ts
export async function POST(req: Request) {
  const { messages } = await req.json();
  const userMessage = messages[messages.length - 1].content;

  // Step 1: Search your knowledge base via Ragex
  const searchResponse = await fetch(
    `https://api.useragex.com/api/v1/knowledge-bases/${process.env.RAGEX_KB_ID}/search`,
    {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${process.env.RAGEX_API_KEY}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        query: userMessage,
        top_k: 5,
        rerank: true,
      }),
    }
  );

  if (!searchResponse.ok) {
    throw new Error(`Ragex search failed: ${searchResponse.status}`);
  }

  const { results } = await searchResponse.json();

  // Step 2: Format retrieved chunks as numbered context
  const context = results
    .map(
      (r: { content: string; file_name: string; score: number }, i: number) =>
        `[${i + 1}] (${r.file_name}, score: ${r.score.toFixed(2)})\n${r.content}`
    )
    .join('\n\n');

  // Step 3: Stream the response using the Vercel AI SDK
  const result = streamText({
    model: openai('gpt-5-mini'),
    system: `You are a helpful assistant. Answer the user's question using ONLY the context below. If the context doesn't contain the answer, say so.\n\nContext:\n${context}`,
    messages,
  });

  return result.toDataStreamResponse();
}
```
On the frontend, connect to this route with the Vercel AI SDK's useChat hook:
```tsx
'use client';

import { useChat } from 'ai/react';

export default function ChatPage() {
  const { messages, input, handleInputChange, handleSubmit } = useChat({
    api: '/api/chat',
  });

  return (
    <div>
      {messages.map((m) => (
        <div key={m.id}>
          <strong>{m.role}:</strong> {m.content}
        </div>
      ))}
      <form onSubmit={handleSubmit}>
        <input value={input} onChange={handleInputChange} />
        <button type="submit">Send</button>
      </form>
    </div>
  );
}
```
That is the entire integration. One fetch call for retrieval, one streamText call for generation. Everything else -- streaming, UI state, message history -- is handled by the Vercel AI SDK the way you already use it.
What Changes in Your Codebase
When you adopt Ragex, most of your existing Vercel AI SDK code stays exactly the same. Here is what changes and what stays.
Code you delete:
- Embedding model selection, configuration, and API calls
- Vector database provisioning, connection management, and scaling logic
- Document parsing pipelines for each file format (PDF, DOCX, PPTX, XLSX, images)
- Reranker configuration and inference code
- Chunk strategy tuning -- size, overlap, heading awareness
Code you keep:
- Vercel AI SDK streaming UI (useChat, streamText, streamUI)
- Your choice of LLM provider (OpenAI, Anthropic, Google, or any supported model)
- Your Next.js app structure and deployment on Vercel
- Your prompt engineering and system instructions
- Your frontend components and user experience
Code you add:
- One fetch call to the /search endpoint in your API route (roughly 15 lines)
- Two environment variables: RAGEX_API_KEY and RAGEX_KB_ID
The net effect is less code, fewer dependencies, and fewer moving parts. When better retrieval models or strategies become available, your search quality improves without a code change. Teams building healthcare AI applications and other regulated use cases benefit especially from this separation -- the retrieval infrastructure stays managed and up to date while application logic remains under your control.
Common Patterns
Chatbot With Document Context
The most common pattern is a chatbot that answers questions from uploaded documents. The code sample above implements this directly. Users ask questions, the API retrieves relevant chunks from your knowledge base, and the LLM generates grounded answers. This is the same architecture behind RAG-powered chatbots -- the Vercel AI SDK just adds streaming and React integration on top.
Multi-Source Knowledge Base
For applications that pull from multiple document collections -- say product docs, support tickets, and engineering specs -- call /search against each knowledge base and merge the results before passing context to streamText. This pattern is especially useful for internal knowledge base tools where different teams own different document sets.
```typescript
const [docsResults, supportResults] = await Promise.all([
  searchKnowledgeBase(process.env.DOCS_KB_ID!, userMessage),
  searchKnowledgeBase(process.env.SUPPORT_KB_ID!, userMessage),
]);

// Merge both result sets and keep the highest-scoring chunks overall
const combinedContext = [...docsResults, ...supportResults]
  .sort((a, b) => b.score - a.score)
  .slice(0, 8)
  .map((r, i) => `[${i + 1}] (${r.file_name})\n${r.content}`)
  .join('\n\n');
```
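The snippet above assumes a searchKnowledgeBase helper. A minimal sketch of one, wrapping the same /search call as the complete example earlier (the buildSearchRequest split is just for testability; the endpoint, request body, and result fields match that example):

```typescript
// Assumed result shape, matching the fields used throughout this guide.
interface SearchResult {
  content: string;
  file_name: string;
  score: number;
}

// Build the /search request against a given knowledge base. Separated from
// the fetch so the request construction can be unit-tested without a network.
function buildSearchRequest(kbId: string, query: string, apiKey: string) {
  return {
    url: `https://api.useragex.com/api/v1/knowledge-bases/${kbId}/search`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ query, top_k: 5, rerank: true }),
    },
  };
}

// Hypothetical helper assumed by the multi-source snippet above.
async function searchKnowledgeBase(
  kbId: string,
  query: string
): Promise<SearchResult[]> {
  const { url, init } = buildSearchRequest(kbId, query, process.env.RAGEX_API_KEY!);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`Ragex search failed: ${res.status}`);
  const { results } = await res.json();
  return results;
}
```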
Streaming With Source Citations
Add source attribution to streamed responses by passing file metadata alongside the content. The retrieval results include file_name and score fields that you can surface in your UI after the stream completes, giving users transparency into which documents informed the answer.
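One lightweight way to do this is to deduplicate file names from the retrieval results before the stream starts, then render that list under the answer. A sketch, using the same result fields as the examples above:

```typescript
// Same assumed result shape used throughout this guide.
interface SearchResult {
  content: string;
  file_name: string;
  score: number;
}

// Collect unique source file names in rank order, so the UI can show
// "Sources: x.pdf, y.pdf" under the streamed answer.
function extractSources(results: SearchResult[]): string[] {
  const seen = new Set<string>();
  const sources: string[] = [];
  for (const r of results) {
    if (!seen.has(r.file_name)) {
      seen.add(r.file_name);
      sources.push(r.file_name);
    }
  }
  return sources;
}
```

How you deliver the list to the client (a separate endpoint, a response header, or stream metadata) is up to your app; the function itself is pure and framework-agnostic.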
Getting Started
Getting from zero to streaming grounded answers takes five steps and under an hour. You need a Next.js project with the Vercel AI SDK installed, a Ragex API key from the dashboard, and an LLM provider key (OpenAI, Anthropic, or any other supported provider). Everything below is copy-pasteable.
Step 1: Install dependencies
```shell
npm install ai @ai-sdk/openai
```
Step 2: Set environment variables
```shell
# .env.local
RAGEX_API_KEY=your_api_key_here
RAGEX_KB_ID=your_knowledge_base_id
OPENAI_API_KEY=your_openai_key
```
Step 3: Create a knowledge base and upload documents
Use the Ragex API to create a knowledge base and upload your files. The API accepts 16 file types including PDF, DOCX, PPTX, XLSX, images, and plain text formats. Documents are parsed, chunked, and indexed automatically. Processing takes seconds for text files and under a minute for longer PDFs.
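A hedged sketch of the upload call. The base URL follows the /search example in this guide and the path follows the documents endpoint mentioned in the FAQ; the "file" form-field name is an assumption -- confirm both against the Ragex API reference:

```typescript
// Build the upload URL for a knowledge base. Base path assumed to match
// the /search endpoint used elsewhere in this guide.
function uploadUrl(kbId: string): string {
  return `https://api.useragex.com/api/v1/knowledge-bases/${kbId}/documents`;
}

// Upload one document for parsing, chunking, and indexing. The "file"
// form-field name is an assumption, not confirmed by the Ragex docs.
async function uploadDocument(kbId: string, file: Blob, fileName: string) {
  const form = new FormData();
  form.append("file", file, fileName);
  const res = await fetch(uploadUrl(kbId), {
    method: "POST",
    headers: { Authorization: `Bearer ${process.env.RAGEX_API_KEY}` },
    body: form,
  });
  if (!res.ok) throw new Error(`Upload failed: ${res.status}`);
  return res.json();
}
```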
Step 4: Add the API route
Copy the complete working example from above into app/api/chat/route.ts.
Step 5: Add the chat UI
Copy the useChat frontend example into a page component. Deploy to Vercel and your app is streaming grounded answers from your documents.
If you are comparing managed retrieval options, see how Ragex stacks up in our comparison with Pinecone or browse Pinecone alternatives. For Python-based retrieval pipelines, check the LangChain integration or the LlamaIndex integration.
Pricing starts at $29/mo for the Starter plan (500 pages, 5,000 queries), which covers most prototypes and early-stage products; the FAQ below breaks down the Pro, Business, and Scale tiers and compares them against DIY costs.
FAQ
Does Ragex add noticeable latency to my streamed responses?
The retrieval step completes quickly -- fast enough that it adds only a small fraction to the total response time. Since LLM generation via streamText typically takes 1-4 seconds to produce a full response, the retrieval overhead is minimal. For latency-critical paths like autocomplete suggestions, you can disable reranking by setting rerank to false in your search request, which makes retrieval even faster.
Can I use Ragex with providers other than OpenAI in the Vercel AI SDK?
Yes. Ragex is provider-agnostic -- it handles retrieval only and returns JSON results. You can pass those results as context to any LLM provider supported by the Vercel AI SDK, including Anthropic (Claude), Google (Gemini), Mistral, Cohere, or any OpenAI-compatible endpoint. Swap the model parameter in streamText -- for example, use anthropic('claude-sonnet-4-20250514') instead of openai('gpt-5-mini') -- and the rest of the code stays the same. This is the same flexibility you get with the LangChain integration or the LlamaIndex integration.
How do I upload and update documents in the knowledge base?
Use the POST /api/v1/knowledge-bases/:kb_id/documents endpoint (same base path as the search endpoint) to upload files. The API accepts 16 file types including PDF, DOCX, PPTX, XLSX, images, and plain text formats. Documents are parsed, chunked, and indexed automatically. Processing takes about 4 seconds for text files and under 60 seconds for a 10-page PDF. To update a document, re-upload it to the same knowledge base -- the API replaces the old version and re-indexes automatically.
What does this cost compared to building my own retrieval pipeline?
The Starter plan is $29/mo and includes 500 pages and 5,000 queries, which covers most prototypes and early-stage products. Pro at $79/mo handles 2,000 pages and 15,000 queries. Business at $229/mo supports 6,500 pages and 50,000 queries. Scale at $499/mo supports 15,000 pages and 120,000 queries. A DIY pipeline typically costs more when you factor in a hosted vector database ($25-70/mo), embedding API usage ($10-50/mo), and the engineering time to build and maintain the pipeline across multiple vendors.
Last updated: 2026-02-20