How We Build RAG
Retrieval Augmented Generation - giving AI context to answer accurately
RAG (Retrieval Augmented Generation) gives LLMs access to information they weren't trained on. Instead of hoping the model knows something, we find relevant context and include it in the prompt.
Core principle: Don't ask the LLM to remember — give it the answer and ask it to use it.
Why RAG?
| Without RAG | With RAG |
|---|---|
| LLM makes up facts | LLM cites sources |
| Can't use private data | Searches your data |
| Stuck at its knowledge cutoff | As current as your data |
| Hallucinations common | Grounded responses |
The RAG Pipeline
RAG has four distinct stages:
- Document Processing — Split documents into searchable chunks (chunking)
- Embedding — Convert chunks to vectors for similarity search
- Retrieval — Find relevant chunks for a query
- Generation — Answer the question using retrieved context
Each stage has its own decisions and trade-offs.
Stage 1: Document Processing
The Chunking Decision
Documents must be split into searchable pieces. Chunk size affects retrieval quality.
| Chunk Size | Pros | Cons |
|---|---|---|
| Small (200-500 chars) | Precise retrieval | May lack context |
| Medium (500-1000 chars) | Good balance of precision and context; the standard choice | Compromises slightly on both |
| Large (1000-2000 chars) | More context | Less precise matching |
Rule of thumb: Start with 500-1000 characters and 100-200 characters of overlap between chunks.
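A minimal fixed-size chunker along these lines, using the sizes above as defaults (adjust for your content):

```ts
// Split text into fixed-size chunks with overlap so that content near
// chunk boundaries appears in at least two chunks.
function chunkText(text: string, chunkSize = 800, overlap = 150): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    const end = Math.min(start + chunkSize, text.length);
    chunks.push(text.slice(start, end));
    if (end === text.length) break;
    // Step back by the overlap so the next chunk repeats the tail of this one.
    start = end - overlap;
  }
  return chunks;
}
```

Paragraph-, sentence-, and section-based splitting (below) follow the same shape, just with different split points.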
Chunking Strategies
| Strategy | Best For | Description |
|---|---|---|
| Fixed size | General text | Split every N characters |
| Paragraph-based | Articles, blogs | Split on natural breaks |
| Sentence-based | Dense content | Split every 2-5 sentences |
| Semantic | Technical docs | Split by section/heading |
Metadata Matters
Always store metadata with chunks:
- Source: Where did this come from?
- Section: What part of the document?
- Date: When was it created/updated?
- Any filtering attributes: Category, author, etc.
Without metadata, you can't cite sources or filter results.
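One possible shape for a stored chunk, written as a TypeScript type; the field names are illustrative, not a fixed schema:

```ts
// Illustrative chunk record; adapt the fields to your own documents.
interface Chunk {
  id: string;            // unique ID
  content: string;       // the chunk text
  source: string;        // document name or URL
  section?: string;      // heading or section title
  updatedAt: string;     // ISO date of creation / last update
  category?: string;     // any filtering attributes you need
  author?: string;
  embedding?: number[];  // filled in at the embedding stage
}
```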
Stage 2: Embedding
Convert text chunks to vectors that capture semantic meaning.
Embedding Model Selection
| Model | Dimensions | Best For |
|---|---|---|
| text-embedding-3-small | 1536 | General use, cost-effective |
| text-embedding-3-large | 3072 | Higher quality, more cost |
| Multilingual models | Varies | Non-English content |
Considerations:
- Match the model to your content's language (e.g., Vietnamese content needs a multilingual model)
- Higher dimensions = better quality but more storage and slower search
- Batch embedding requests (50-100 items) to reduce latency
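A sketch of batched embedding with the OpenAI SDK, using the batch size and model suggested above:

```ts
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Embed texts in batches to cut down on round trips.
async function embedAll(texts: string[], batchSize = 100): Promise<number[][]> {
  const vectors: number[][] = [];
  for (let i = 0; i < texts.length; i += batchSize) {
    const batch = texts.slice(i, i + batchSize);
    const res = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input: batch,
    });
    // The API returns one embedding per input, in order.
    vectors.push(...res.data.map((d) => d.embedding));
  }
  return vectors;
}
```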
Storage: Vector Databases
We use Supabase with pgvector for simplicity. Other options:
| Option | Best For |
|---|---|
| Supabase + pgvector | Simple setup, integrated with our stack |
| Pinecone | Large scale, managed service |
| Weaviate | Complex filtering needs |
| FAISS | Local/embedded use cases |
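Writing chunks into Supabase might look like the sketch below; the `chunks` table and its columns are assumptions that need to match your own pgvector schema.

```ts
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_KEY!);

// Insert chunk rows along with their embeddings and metadata.
// Assumes a `chunks` table with a pgvector `embedding` column.
async function storeChunks(
  chunks: { content: string; source: string; embedding: number[] }[]
) {
  const { error } = await supabase.from("chunks").insert(chunks);
  if (error) throw error;
}
```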
Stage 3: Retrieval
Finding the right chunks for a query.
Similarity Search
The basic approach: embed the query, find chunks with highest cosine similarity.
Key parameters:
- Limit: How many chunks to retrieve (typically 3-10)
- Threshold: Minimum similarity score (typically 0.7-0.8)
| Threshold | Effect |
|---|---|
| 0.8+ | High precision, few results, may miss relevant content |
| 0.7 | Balanced (default) |
| 0.6 | High recall, more noise, catches edge cases |
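With pgvector behind Supabase, retrieval usually goes through a SQL function exposed over RPC. In the sketch below, `match_chunks` is an assumed user-defined function (modeled on the Supabase pgvector examples) that takes a query embedding, a threshold, and a limit; it is not a built-in.

```ts
import OpenAI from "openai";
import { createClient } from "@supabase/supabase-js";

const openai = new OpenAI();
const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_KEY!);

// Embed the query, then ask Postgres for the nearest chunks.
// `match_chunks` is a user-defined SQL function created alongside the
// pgvector setup (cosine similarity with a threshold and a limit).
async function retrieve(query: string, limit = 5, threshold = 0.7) {
  const embedded = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: query,
  });
  const { data, error } = await supabase.rpc("match_chunks", {
    query_embedding: embedded.data[0].embedding,
    match_threshold: threshold,
    match_count: limit,
  });
  if (error) throw error;
  return data as { id: string; content: string; source: string; similarity: number }[];
}
```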
Hybrid Search
Combine vector search with keyword search for better results.
Why hybrid?
- Vector search: Finds semantically similar content
- Keyword search: Finds exact matches (names, codes, specific terms)
Together they cover both meaning and specifics.
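One hedged sketch of hybrid search with supabase-js: vector hits from the assumed `match_chunks` RPC merged with full-text hits from `textSearch`, deduplicated by id. Reranking (for example reciprocal rank fusion) is left out to keep it short.

```ts
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_KEY!);

// Run vector search and keyword (full-text) search in parallel, then merge
// and dedupe by id. Assumes the `match_chunks` RPC from the retrieval sketch
// and a full-text index on `chunks.content`.
async function hybridSearch(query: string, queryEmbedding: number[]) {
  const [vector, keyword] = await Promise.all([
    supabase.rpc("match_chunks", {
      query_embedding: queryEmbedding,
      match_threshold: 0.7,
      match_count: 10,
    }),
    supabase
      .from("chunks")
      .select("id, content, source")
      .textSearch("content", query, { type: "websearch" })
      .limit(10),
  ]);
  const merged = [...(vector.data ?? []), ...(keyword.data ?? [])];
  const seen = new Set<string>();
  return merged.filter((row) => {
    if (seen.has(row.id)) return false;
    seen.add(row.id);
    return true;
  });
}
```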
Retrieval Strategies
| Strategy | When to Use |
|---|---|
| Simple similarity | Single, straightforward queries |
| Hybrid (vector + keyword) | Queries with specific terms or names |
| Query expansion | Vague or short queries |
| Multi-query | Complex questions needing multiple perspectives |
Stage 4: Generation
Using retrieved context to generate answers.
Context Assembly
How you present context to the LLM matters.
Include:
- Clear separation between chunks
- Source attribution for each chunk
- Instructions to use only the provided context
- Guidance on what to do when context lacks the answer
Avoid:
- Dumping raw chunks without structure
- Omitting instructions on how to use the context
- Expecting the LLM to figure out source attribution
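A sketch of one way to assemble that context block; the separators, numbering, and wording are illustrative:

```ts
// Build a context block with numbered, source-attributed chunks
// plus explicit instructions on how to use it.
function buildPrompt(
  question: string,
  chunks: { content: string; source: string }[]
): string {
  const context = chunks
    .map((c, i) => `[${i + 1}] (source: ${c.source})\n${c.content}`)
    .join("\n\n---\n\n");
  return [
    "Answer the question using ONLY the context below.",
    "Cite sources by their [number].",
    "If the context does not contain the answer, say you don't know.",
    "",
    `Context:\n${context}`,
    "",
    `Question: ${question}`,
  ].join("\n");
}
```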
Handling "No Answer"
What happens when retrieved context doesn't contain the answer?
Good behavior: "I don't have information about that in my knowledge base."
Bad behavior: Making up an answer based on training data.
The system prompt must explicitly instruct the LLM to admit when it doesn't know.
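A sketch of the generation call with the Anthropic SDK; the model name and prompt wording are placeholders, not our exact configuration:

```ts
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Generate an answer from the assembled prompt. The system prompt makes
// "I don't know" an explicit, allowed outcome.
async function answer(prompt: string): Promise<string> {
  const msg = await anthropic.messages.create({
    model: "claude-3-5-sonnet-latest", // placeholder; use whichever model you run
    max_tokens: 1024,
    system:
      "You answer strictly from the provided context. If the context does not " +
      "contain the answer, reply that you don't have that information in your knowledge base.",
    messages: [{ role: "user", content: prompt }],
  });
  const block = msg.content[0];
  return block.type === "text" ? block.text : "";
}
```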
Advanced Patterns
Query Expansion
For vague queries, generate alternative phrasings before searching.
Example:
- Original: "How do I deploy?"
- Expanded: "How do I deploy?", "deployment process", "releasing to production", "CI/CD pipeline"
Search with all variants, combine and dedupe results.
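A minimal sketch of query expansion, assuming a `retrieve` function like the one in the Retrieval stage is passed in; the prompt wording and variant count are illustrative:

```ts
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

// Ask a model for alternative phrasings, retrieve with each variant,
// and merge the results so every chunk appears only once.
async function expandedRetrieve(
  query: string,
  retrieve: (q: string) => Promise<{ id: string; content: string }[]>
) {
  const msg = await anthropic.messages.create({
    model: "claude-3-5-sonnet-latest", // placeholder model name
    max_tokens: 256,
    messages: [
      {
        role: "user",
        content: `Rewrite this search query 3 different ways, one per line:\n${query}`,
      },
    ],
  });
  const first = msg.content[0];
  const variants = first.type === "text" ? first.text.split("\n").filter(Boolean) : [];
  const results = await Promise.all([query, ...variants].map((q) => retrieve(q)));
  const seen = new Set<string>();
  return results.flat().filter((chunk) => {
    if (seen.has(chunk.id)) return false;
    seen.add(chunk.id);
    return true;
  });
}
```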
Conversational RAG
In multi-turn conversations, the current query may reference previous context.
Problem: "What about the pricing?" — pricing of what?
Solution: Reformulate queries to be standalone before searching. Use conversation history to resolve references.
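A short sketch of that reformulation step; again, the model name and prompt are placeholders:

```ts
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

// Turn a follow-up like "What about the pricing?" into a standalone
// search query by letting the model resolve references against history.
async function reformulate(history: string[], question: string): Promise<string> {
  const msg = await anthropic.messages.create({
    model: "claude-3-5-sonnet-latest", // placeholder model name
    max_tokens: 128,
    messages: [
      {
        role: "user",
        content:
          `Conversation so far:\n${history.join("\n")}\n\n` +
          `Rewrite this follow-up as a standalone search query:\n${question}`,
      },
    ],
  });
  const block = msg.content[0];
  return block.type === "text" ? block.text.trim() : question;
}
```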
Chunk Overlap
Overlap between chunks prevents cutting sentences mid-thought.
- No overlap: information at chunk boundaries gets lost.
- With overlap: boundary content appears in multiple chunks, so it can still be found.
Common Mistakes
Chunks too large or too small
Signs: Irrelevant results, missing context, token limits exceeded.
Fix: Start with 500-1000 characters. Test with your actual content.
No metadata
Signs: Can't cite sources, can't filter by date or category.
Fix: Always store source, date, and relevant filtering attributes.
Ignoring similarity threshold
Signs: Including irrelevant low-similarity results in context.
Fix: Set appropriate threshold (0.7-0.8) and test with real queries.
Not handling "no results"
Signs: LLM hallucinates when nothing is found.
Fix: Explicit handling when retrieval returns empty or low-quality results.
Stuffing too much context
Signs: Token limits exceeded, important context buried.
Fix: Limit to 5-10 relevant chunks. Quality over quantity.
Evaluation Checklist
Your RAG is working if:
- Retrieved chunks are relevant to queries
- Responses cite sources accurately
- LLM says "don't know" when context lacks answer
- Different queries return different contexts
- Latency is acceptable (< 3s total)
Your RAG needs work if:
- Irrelevant chunks appear in results
- LLM ignores context and makes things up
- Same context returned for different queries
- Very slow retrieval (> 5s)
- No way to update or delete documents
Quick Reference
Pipeline Steps
- Chunk — Split documents (500-1000 chars, with overlap)
- Embed — Convert to vectors (OpenAI text-embedding-3-small)
- Store — Save with metadata (Supabase + pgvector)
- Search — Find similar chunks (cosine similarity > 0.7)
- Generate — Answer with context (Claude with source citation)
Key Decisions
| Decision | Recommendation |
|---|---|
| Chunk size | 500-1000 characters |
| Chunk overlap | 100-200 characters |
| Embedding model | text-embedding-3-small for most cases |
| Retrieval limit | 5-10 chunks |
| Similarity threshold | 0.7-0.8 |
What to Store with Each Chunk
- Unique ID
- Content text
- Source document name/URL
- Section or heading
- Creation/update date
- Embedding vector
- Any filtering attributes (category, author, etc.)