How We Build RAG

Retrieval Augmented Generation - giving AI context to answer accurately

RAG (Retrieval Augmented Generation) gives LLMs access to information they weren't trained on. Instead of hoping the model knows something, we find relevant context and include it in the prompt.

Core principle: Don't ask the LLM to remember — give it the answer and ask it to use it.

Why RAG?

  Without RAG                 With RAG
  LLM makes up facts          LLM cites sources
  Can't use private data      Searches your data
  Knowledge cutoff limits     Always current
  Hallucinations common       Grounded responses

The RAG Pipeline

RAG has four distinct stages:

  1. Document Processing — Split documents into searchable chunks (chunking)
  2. Embedding — Convert chunks to vectors for similarity search
  3. Retrieval — Find relevant chunks for a query
  4. Generation — Answer the question using retrieved context

Each stage has its own decisions and trade-offs.

Stage 1: Document Processing

The Chunking Decision

Documents must be split into searchable pieces. Chunk size affects retrieval quality.

  Chunk Size                Pros                Cons
  Small (200-500 chars)     Precise retrieval   May lack context
  Medium (500-1000 chars)   Good balance        Few; the standard choice
  Large (1000-2000 chars)   More context        Less precise matching

Rule of thumb: Start with 500-1000 characters with 100-200 character overlap between chunks.
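
As a concrete illustration, here is a minimal fixed-size chunker with overlap in TypeScript; the defaults follow the rule of thumb above, and the function name is illustrative:

```ts
// Split text into fixed-size chunks with overlap so sentences that straddle a
// boundary appear in two adjacent chunks. Sizes are characters, not tokens.
function chunkText(text: string, chunkSize = 800, overlap = 150): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    const end = Math.min(start + chunkSize, text.length);
    chunks.push(text.slice(start, end));
    if (end === text.length) break;
    // Step forward by chunkSize minus overlap so boundary text is repeated.
    start = end - overlap;
  }
  return chunks;
}
```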

Chunking Strategies

  Strategy          Best For          Description
  Fixed size        General text      Split every N characters
  Paragraph-based   Articles, blogs   Split on natural breaks
  Sentence-based    Dense content     Split every 2-5 sentences
  Semantic          Technical docs    Split by section/heading

Metadata Matters

Always store metadata with chunks:

  • Source: Where did this come from?
  • Section: What part of the document?
  • Date: When was it created/updated?
  • Any filtering attributes: Category, author, etc.

Without metadata, you can't cite sources or filter results.
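
One way to model a chunk record with this metadata, assuming a TypeScript pipeline; the field names are illustrative, not a fixed schema:

```ts
// Shape of a stored chunk: content plus the metadata needed for citation and filtering.
interface Chunk {
  id: string;            // unique ID
  content: string;       // the chunk text
  source: string;        // document name or URL
  section?: string;      // heading or section within the document
  updatedAt: string;     // ISO date of creation or last update
  category?: string;     // extra filtering attributes as needed (author, tags, ...)
  embedding?: number[];  // filled in at the embedding stage
}
```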

Stage 2: Embedding

Convert text chunks to vectors that capture semantic meaning.

Embedding Model Selection

  Model                    Dimensions   Best For
  text-embedding-3-small   1536         General use, cost-effective
  text-embedding-3-large   3072         Higher quality, higher cost
  Multilingual models      Varies       Non-English content

Considerations:

  • Match model to your language (Vietnamese content needs multilingual support)
  • Higher dimensions = better quality but more storage and slower search
  • Batch embedding requests (50-100 items) to reduce latency
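
A minimal sketch of the batching point above, using the official openai npm package and assuming OPENAI_API_KEY is set in the environment; the default batch size is an assumption within the 50-100 guideline:

```ts
import OpenAI from "openai";

const openai = new OpenAI();

// Embed chunk texts in batches to cut down on round trips.
async function embedChunks(texts: string[], batchSize = 100): Promise<number[][]> {
  const vectors: number[][] = [];
  for (let i = 0; i < texts.length; i += batchSize) {
    const batch = texts.slice(i, i + batchSize);
    const response = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input: batch,
    });
    // The API returns one embedding per input, in the same order.
    vectors.push(...response.data.map((d) => d.embedding));
  }
  return vectors;
}
```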

Storage: Vector Databases

We use Supabase with pgvector for simplicity. Other options:

  Option                Best For
  Supabase + pgvector   Simple setup, integrated with our stack
  Pinecone              Large scale, managed service
  Weaviate              Complex filtering needs
  FAISS                 Local/embedded use cases
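
A sketch of the storage step with supabase-js, reusing the Chunk shape from the metadata sketch; it assumes a chunks table with a pgvector embedding column already exists, and the table and column names are assumptions:

```ts
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_KEY!);

// Insert chunk rows with their metadata and embedding vectors.
async function storeChunks(chunks: Chunk[]): Promise<void> {
  const { error } = await supabase.from("chunks").insert(
    chunks.map((c) => ({
      id: c.id,
      content: c.content,
      source: c.source,
      section: c.section,
      updated_at: c.updatedAt,
      embedding: c.embedding, // pgvector column; supabase-js serializes the float array
    }))
  );
  if (error) throw error;
}
```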

Stage 3: Retrieval

Finding the right chunks for a query.

The basic approach: embed the query, find chunks with highest cosine similarity.

Key parameters:

  • Limit: How many chunks to retrieve (typically 3-10)
  • Threshold: Minimum similarity score (typically 0.7-0.8)

  Threshold   Effect
  0.8+        High precision, few results, may miss relevant content
  0.7         Balanced (default)
  0.6         High recall, more noise, catches edge cases
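
A sketch of this basic approach, assuming similarity search is exposed as a Postgres function called through Supabase RPC; match_chunks and its parameter names are hypothetical, and embedChunks is the batching sketch from the embedding stage:

```ts
// Embed the query, then let Postgres rank stored chunks by cosine similarity.
async function searchChunks(query: string, limit = 5, threshold = 0.7) {
  const [queryEmbedding] = await embedChunks([query]);
  const { data, error } = await supabase.rpc("match_chunks", {
    query_embedding: queryEmbedding,
    match_threshold: threshold,
    match_count: limit,
  });
  if (error) throw error;
  return data ?? []; // rows with content, metadata, and a similarity score
}
```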

Hybrid search: combine vector search with keyword search for better results.

Why hybrid?

  • Vector search: Finds semantically similar content
  • Keyword search: Finds exact matches (names, codes, specific terms)

Together they cover both meaning and specifics.
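
A sketch of combining the two, reusing searchChunks from the retrieval sketch; the keyword side uses supabase-js textSearch (which assumes a full-text index on the content column), and the merge is a plain union rather than a tuned rank fusion:

```ts
// Postgres full-text search over the chunk content.
async function keywordSearchChunks(query: string, limit = 8) {
  const { data, error } = await supabase
    .from("chunks")
    .select("id, content, source")
    .textSearch("content", query, { type: "websearch" })
    .limit(limit);
  if (error) throw error;
  return data ?? [];
}

// Run both searches, then dedupe by chunk ID, preferring vector-ranked hits.
async function hybridSearch(query: string, limit = 8) {
  const [vectorHits, keywordHits] = await Promise.all([
    searchChunks(query, limit),
    keywordSearchChunks(query, limit),
  ]);
  const seen = new Set<string>();
  const merged = [];
  for (const hit of [...vectorHits, ...keywordHits]) {
    if (!seen.has(hit.id)) {
      seen.add(hit.id);
      merged.push(hit);
    }
  }
  return merged.slice(0, limit);
}
```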

Retrieval Strategies

  Strategy                    When to Use
  Simple similarity           Single, straightforward queries
  Hybrid (vector + keyword)   Queries with specific terms or names
  Query expansion             Vague or short queries
  Multi-query                 Complex questions needing multiple perspectives

Stage 4: Generation

Using retrieved context to generate answers.

Context Assembly

How you present context to the LLM matters.

Include:

  • Clear separation between chunks
  • Source attribution for each chunk
  • Instructions to use only the provided context
  • Guidance on what to do when context lacks the answer

Avoid:

  • Dumping raw chunks without structure
  • No instructions on how to use context
  • Expecting the LLM to figure out source attribution
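
A sketch of context assembly along these lines; the template wording is an assumption, not a fixed format:

```ts
// Number each chunk, attach its source, and wrap everything in explicit instructions.
function buildPrompt(question: string, chunks: { content: string; source: string }[]): string {
  const context = chunks
    .map((c, i) => `[${i + 1}] (source: ${c.source})\n${c.content}`)
    .join("\n\n---\n\n");

  return [
    "Answer the question using ONLY the context below.",
    "Cite sources by their [number].",
    "If the context does not contain the answer, say you don't know.",
    "",
    "Context:",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}
```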

Handling "No Answer"

What happens when retrieved context doesn't contain the answer?

Good behavior: "I don't have information about that in my knowledge base."

Bad behavior: Making up an answer based on training data.

The system prompt must explicitly instruct the LLM to admit when it doesn't know.
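
A sketch of that guard in code: check retrieval quality before the model is ever called. The fallback wording matches the example above; callClaude is a hypothetical wrapper around the LLM call:

```ts
declare function callClaude(prompt: string): Promise<string>; // hypothetical LLM wrapper

async function answerQuestion(question: string): Promise<string> {
  const hits = await searchChunks(question, 5, 0.7);
  if (hits.length === 0) {
    // Nothing relevant above the similarity threshold: refuse rather than guess.
    return "I don't have information about that in my knowledge base.";
  }
  // buildPrompt is the context-assembly sketch above; its instructions tell the
  // model to admit when the provided context is insufficient.
  return callClaude(buildPrompt(question, hits));
}
```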

Advanced Patterns

Query Expansion

For vague queries, generate alternative phrasings before searching.

Example:

  • Original: "How do I deploy?"
  • Expanded: "How do I deploy?", "deployment process", "releasing to production", "CI/CD pipeline"

Search with all variants, combine and dedupe results.
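
A sketch of the expansion step with the Anthropic SDK, since generation runs on Claude in this stack; the prompt wording, model name, and response parsing are assumptions:

```ts
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

// Ask the model for alternative phrasings of a search query, one per line.
async function expandQuery(query: string): Promise<string[]> {
  const msg = await anthropic.messages.create({
    model: "claude-sonnet-4-5", // placeholder model name
    max_tokens: 200,
    messages: [{
      role: "user",
      content: `Rewrite this search query three different ways, one per line, with no extra text:\n${query}`,
    }],
  });
  const text = msg.content[0].type === "text" ? msg.content[0].text : "";
  // Always keep the original query alongside the generated variants.
  return [query, ...text.split("\n").map((s) => s.trim()).filter(Boolean)];
}
```

Search with each variant (for example via searchChunks above), concatenate the hits, and dedupe by chunk ID before assembling context, just as in the hybrid sketch.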

Conversational RAG

In multi-turn conversations, the current query may reference previous context.

Problem: "What about the pricing?" — pricing of what?

Solution: Reformulate queries to be standalone before searching. Use conversation history to resolve references.
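
A sketch of that reformulation step, reusing the anthropic client from the query-expansion sketch; the prompt wording and model name are assumptions:

```ts
// Rewrite the latest user message as a standalone search query, using the
// conversation history to resolve references like "it" or "the pricing".
async function reformulateQuery(
  history: { role: "user" | "assistant"; content: string }[],
  latest: string
): Promise<string> {
  const transcript = history.map((m) => `${m.role}: ${m.content}`).join("\n");
  const msg = await anthropic.messages.create({
    model: "claude-sonnet-4-5", // placeholder model name
    max_tokens: 100,
    messages: [{
      role: "user",
      content:
        `Conversation so far:\n${transcript}\n\n` +
        `Rewrite the latest user message as a standalone search query, resolving any references. ` +
        `Reply with the query only.\n\nLatest message: ${latest}`,
    }],
  });
  return msg.content[0].type === "text" ? msg.content[0].text.trim() : latest;
}
```

The reformulated query is what gets embedded and searched; the original message is still what the user sees answered.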

Chunk Overlap

Overlap between chunks prevents cutting sentences mid-thought.

  • No overlap: information at chunk boundaries gets lost.
  • With overlap: boundary content appears in multiple chunks, so it can still be found.

Common Mistakes

Chunks too large or too small

Signs: Irrelevant results, missing context, token limits exceeded.

Fix: Start with 500-1000 characters. Test with your actual content.

No metadata

Signs: Can't cite sources, can't filter by date or category.

Fix: Always store source, date, and relevant filtering attributes.

Ignoring similarity threshold

Signs: Including irrelevant low-similarity results in context.

Fix: Set appropriate threshold (0.7-0.8) and test with real queries.

Not handling "no results"

Signs: LLM hallucinates when nothing is found.

Fix: Explicit handling when retrieval returns empty or low-quality results.

Stuffing too much context

Signs: Token limits exceeded, important context buried.

Fix: Limit to 5-10 relevant chunks. Quality over quantity.

Evaluation Checklist

Your RAG is working if:

  • Retrieved chunks are relevant to queries
  • Responses cite sources accurately
  • LLM says "don't know" when context lacks answer
  • Different queries return different contexts
  • Latency is acceptable (< 3s total)

Your RAG needs work if:

  • Irrelevant chunks appear in results
  • LLM ignores context and makes things up
  • Same context returned for different queries
  • Very slow retrieval (> 5s)
  • No way to update or delete documents

Quick Reference

Pipeline Steps

  1. Chunk — Split documents (500-1000 chars, with overlap)
  2. Embed — Convert to vectors (OpenAI text-embedding-3-small)
  3. Store — Save with metadata (Supabase + pgvector)
  4. Search — Find similar chunks (cosine similarity > 0.7)
  5. Generate — Answer with context (Claude with source citation)

Key Decisions

  Decision               Recommendation
  Chunk size             500-1000 characters
  Chunk overlap          100-200 characters
  Embedding model        text-embedding-3-small for most cases
  Retrieval limit        5-10 chunks
  Similarity threshold   0.7-0.8

What to Store with Each Chunk

  • Unique ID
  • Content text
  • Source document name/URL
  • Section or heading
  • Creation/update date
  • Embedding vector
  • Any filtering attributes (category, author, etc.)