How We Build RAG

Retrieval Augmented Generation - giving AI context to answer accurately

RAG (Retrieval Augmented Generation) gives LLMs access to information they weren't trained on. Instead of hoping the model knows something, we find relevant context and include it in the prompt.

Core principle: Don't ask the LLM to remember — give it the answer and ask it to use it.

Why RAG?

  Without RAG                 With RAG
  LLM makes up facts          LLM cites sources
  Can't use private data      Searches your data
  Knowledge cutoff limits     Always current
  Hallucinations common       Grounded responses

The RAG Pipeline

RAG has four distinct stages:

  1. Document Processing — Split documents into searchable chunks (chunking)
  2. Embedding — Convert chunks to vectors for similarity search
  3. Retrieval — Find relevant chunks for a query
  4. Generation — Answer the question using retrieved context

Each stage has its own decisions and trade-offs.

Stage 1: Document Processing

The Chunking Decision

Documents must be split into searchable pieces. Chunk size affects retrieval quality.

  Chunk Size                Pros                Cons
  Small (200-500 chars)     Precise retrieval   May lack context
  Medium (500-1000 chars)   Good balance        Few; the standard choice
  Large (1000-2000 chars)   More context        Less precise matching

Rule of thumb: Start with 500-1000 characters with 100-200 character overlap between chunks.
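
As a concrete illustration, here is a minimal fixed-size chunker with overlap in TypeScript; the defaults follow the rule of thumb above, and the function name is illustrative:

```ts
// Split text into fixed-size chunks with overlap so sentences that straddle a
// boundary appear in two adjacent chunks. Sizes are characters, not tokens.
function chunkText(text: string, chunkSize = 800, overlap = 150): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    const end = Math.min(start + chunkSize, text.length);
    chunks.push(text.slice(start, end));
    if (end === text.length) break;
    // Step forward by chunkSize minus overlap so boundary text is repeated.
    start = end - overlap;
  }
  return chunks;
}
```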

Chunking Strategies

  Strategy          Best For          Description
  Fixed size        General text      Split every N characters
  Paragraph-based   Articles, blogs   Split on natural breaks
  Sentence-based    Dense content     Split every 2-5 sentences
  Semantic          Technical docs    Split by section/heading

Metadata Matters

Always store metadata with chunks:

  • Source: Where did this come from?
  • Section: What part of the document?
  • Date: When was it created/updated?
  • Any filtering attributes: Category, author, etc.

Without metadata, you can't cite sources or filter results.
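
One way to model a chunk record with this metadata, assuming a TypeScript pipeline; the field names are illustrative, not a fixed schema:

```ts
// Shape of a stored chunk: content plus the metadata needed for citation and filtering.
interface Chunk {
  id: string;            // unique ID
  content: string;       // the chunk text
  source: string;        // document name or URL
  section?: string;      // heading or section within the document
  updatedAt: string;     // ISO date of creation or last update
  category?: string;     // extra filtering attributes as needed (author, tags, ...)
  embedding?: number[];  // filled in at the embedding stage
}
```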

Stage 2: Embedding

Convert text chunks to vectors that capture semantic meaning.

Embedding Model Selection

  Model                    Dimensions   Best For
  text-embedding-3-small   1536         General use, cost-effective
  text-embedding-3-large   3072         Higher quality, higher cost
  Multilingual models      Varies       Non-English content

Considerations:

  • Match model to your language (Vietnamese content needs multilingual support)
  • Higher dimensions = better quality but more storage and slower search
  • Batch embedding requests (50-100 items) to reduce latency
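
A minimal sketch of the batching point above, using the official openai npm package and assuming OPENAI_API_KEY is set in the environment; the default batch size is an assumption within the 50-100 guideline:

```ts
import OpenAI from "openai";

const openai = new OpenAI();

// Embed chunk texts in batches to cut down on round trips.
async function embedChunks(texts: string[], batchSize = 100): Promise<number[][]> {
  const vectors: number[][] = [];
  for (let i = 0; i < texts.length; i += batchSize) {
    const batch = texts.slice(i, i + batchSize);
    const response = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input: batch,
    });
    // The API returns one embedding per input, in the same order.
    vectors.push(...response.data.map((d) => d.embedding));
  }
  return vectors;
}
```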

Storage: Vector Databases

We use Supabase with pgvector for simplicity. Other options:

  Option                Best For
  Supabase + pgvector   Simple setup, integrated with our stack
  Pinecone              Large scale, managed service
  Weaviate              Complex filtering needs
  FAISS                 Local/embedded use cases
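
A sketch of the storage step with supabase-js, reusing the Chunk shape from the metadata sketch; it assumes a chunks table with a pgvector embedding column already exists, and the table and column names are assumptions:

```ts
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_KEY!);

// Insert chunk rows with their metadata and embedding vectors.
async function storeChunks(chunks: Chunk[]): Promise<void> {
  const { error } = await supabase.from("chunks").insert(
    chunks.map((c) => ({
      id: c.id,
      content: c.content,
      source: c.source,
      section: c.section,
      updated_at: c.updatedAt,
      embedding: c.embedding, // pgvector column; supabase-js serializes the float array
    }))
  );
  if (error) throw error;
}
```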

Stage 3: Retrieval

Finding the right chunks for a query.

The basic approach: embed the query, find chunks with highest cosine similarity.

Key parameters:

  • Limit: How many chunks to retrieve (typically 3-10)
  • Threshold: Minimum similarity score (typically 0.7-0.8)

  Threshold   Effect
  0.8+        High precision, few results, may miss relevant content
  0.7         Balanced (default)
  0.6         High recall, more noise, catches edge cases
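
A sketch of this basic approach, assuming similarity search is exposed as a Postgres function called through Supabase RPC; match_chunks and its parameter names are hypothetical, and embedChunks is the batching sketch from the embedding stage:

```ts
// Embed the query, then let Postgres rank stored chunks by cosine similarity.
async function searchChunks(query: string, limit = 5, threshold = 0.7) {
  const [queryEmbedding] = await embedChunks([query]);
  const { data, error } = await supabase.rpc("match_chunks", {
    query_embedding: queryEmbedding,
    match_threshold: threshold,
    match_count: limit,
  });
  if (error) throw error;
  return data ?? []; // rows with content, metadata, and a similarity score
}
```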

Hybrid search: combine vector search with keyword search for better results.

Why hybrid?

  • Vector search: Finds semantically similar content
  • Keyword search: Finds exact matches (names, codes, specific terms)

Together they cover both meaning and specifics.
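
A sketch of combining the two, reusing searchChunks from the retrieval sketch; the keyword side uses supabase-js textSearch (which assumes a full-text index on the content column), and the merge is a plain union rather than a tuned rank fusion:

```ts
// Postgres full-text search over the chunk content.
async function keywordSearchChunks(query: string, limit = 8) {
  const { data, error } = await supabase
    .from("chunks")
    .select("id, content, source")
    .textSearch("content", query, { type: "websearch" })
    .limit(limit);
  if (error) throw error;
  return data ?? [];
}

// Run both searches, then dedupe by chunk ID, preferring vector-ranked hits.
async function hybridSearch(query: string, limit = 8) {
  const [vectorHits, keywordHits] = await Promise.all([
    searchChunks(query, limit),
    keywordSearchChunks(query, limit),
  ]);
  const seen = new Set<string>();
  const merged = [];
  for (const hit of [...vectorHits, ...keywordHits]) {
    if (!seen.has(hit.id)) {
      seen.add(hit.id);
      merged.push(hit);
    }
  }
  return merged.slice(0, limit);
}
```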

Retrieval Strategies

  Strategy                    When to Use
  Simple similarity           Single, straightforward queries
  Hybrid (vector + keyword)   Queries with specific terms or names
  Query expansion             Vague or short queries
  Multi-query                 Complex questions needing multiple perspectives

Stage 4: Generation

Using retrieved context to generate answers.

Context Assembly

How you present context to the LLM matters.

Include:

  • Clear separation between chunks
  • Source attribution for each chunk
  • Instructions to use only the provided context
  • Guidance on what to do when context lacks the answer

Avoid:

  • Dumping raw chunks without structure
  • No instructions on how to use context
  • Expecting the LLM to figure out source attribution
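
A sketch of context assembly along these lines; the template wording is an assumption, not a fixed format:

```ts
// Number each chunk, attach its source, and wrap everything in explicit instructions.
function buildPrompt(question: string, chunks: { content: string; source: string }[]): string {
  const context = chunks
    .map((c, i) => `[${i + 1}] (source: ${c.source})\n${c.content}`)
    .join("\n\n---\n\n");

  return [
    "Answer the question using ONLY the context below.",
    "Cite sources by their [number].",
    "If the context does not contain the answer, say you don't know.",
    "",
    "Context:",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}
```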

Handling "No Answer"

What happens when retrieved context doesn't contain the answer?

Good behavior: "I don't have information about that in my knowledge base."

Bad behavior: Making up an answer based on training data.

The system prompt must explicitly instruct the LLM to admit when it doesn't know.
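
A sketch of that guard in code: check retrieval quality before the model is ever called. The fallback wording matches the example above; callClaude is a hypothetical wrapper around the LLM call:

```ts
declare function callClaude(prompt: string): Promise<string>; // hypothetical LLM wrapper

async function answerQuestion(question: string): Promise<string> {
  const hits = await searchChunks(question, 5, 0.7);
  if (hits.length === 0) {
    // Nothing relevant above the similarity threshold: refuse rather than guess.
    return "I don't have information about that in my knowledge base.";
  }
  // buildPrompt is the context-assembly sketch above; its instructions tell the
  // model to admit when the provided context is insufficient.
  return callClaude(buildPrompt(question, hits));
}
```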

Advanced Patterns

Query Expansion

For vague queries, generate alternative phrasings before searching.

Example:

  • Original: "How do I deploy?"
  • Expanded: "How do I deploy?", "deployment process", "releasing to production", "CI/CD pipeline"

Search with all variants, combine and dedupe results.
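
A sketch of the expansion step with the Anthropic SDK, since generation runs on Claude in this stack; the prompt wording, model name, and response parsing are assumptions:

```ts
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

// Ask the model for alternative phrasings of a search query, one per line.
async function expandQuery(query: string): Promise<string[]> {
  const msg = await anthropic.messages.create({
    model: "claude-sonnet-4-5", // placeholder model name
    max_tokens: 200,
    messages: [{
      role: "user",
      content: `Rewrite this search query three different ways, one per line, with no extra text:\n${query}`,
    }],
  });
  const text = msg.content[0].type === "text" ? msg.content[0].text : "";
  // Always keep the original query alongside the generated variants.
  return [query, ...text.split("\n").map((s) => s.trim()).filter(Boolean)];
}
```

Search with each variant (for example via searchChunks above), concatenate the hits, and dedupe by chunk ID before assembling context, just as in the hybrid sketch.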

Conversational RAG

In multi-turn conversations, the current query may reference previous context.

Problem: "What about the pricing?" — pricing of what?

Solution: Reformulate queries to be standalone before searching. Use conversation history to resolve references.
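
A sketch of that reformulation step, reusing the anthropic client from the query-expansion sketch; the prompt wording and model name are assumptions:

```ts
// Rewrite the latest user message as a standalone search query, using the
// conversation history to resolve references like "it" or "the pricing".
async function reformulateQuery(
  history: { role: "user" | "assistant"; content: string }[],
  latest: string
): Promise<string> {
  const transcript = history.map((m) => `${m.role}: ${m.content}`).join("\n");
  const msg = await anthropic.messages.create({
    model: "claude-sonnet-4-5", // placeholder model name
    max_tokens: 100,
    messages: [{
      role: "user",
      content:
        `Conversation so far:\n${transcript}\n\n` +
        `Rewrite the latest user message as a standalone search query, resolving any references. ` +
        `Reply with the query only.\n\nLatest message: ${latest}`,
    }],
  });
  return msg.content[0].type === "text" ? msg.content[0].text.trim() : latest;
}
```

The reformulated query is what gets embedded and searched; the original message is still what the user sees answered.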

Chunk Overlap

Overlap between chunks prevents cutting sentences mid-thought.

  • No overlap: information at chunk boundaries gets lost.
  • With overlap: boundary content appears in multiple chunks, so it can still be found.

Common Mistakes

Chunks too large or too small

Signs: Irrelevant results, missing context, token limits exceeded.

Fix: Start with 500-1000 characters. Test with your actual content.

No metadata

Signs: Can't cite sources, can't filter by date or category.

Fix: Always store source, date, and relevant filtering attributes.

Ignoring similarity threshold

Signs: Including irrelevant low-similarity results in context.

Fix: Set appropriate threshold (0.7-0.8) and test with real queries.

Not handling "no results"

Signs: LLM hallucinates when nothing is found.

Fix: Explicit handling when retrieval returns empty or low-quality results.

Stuffing too much context

Signs: Token limits exceeded, important context buried.

Fix: Limit to 5-10 relevant chunks. Quality over quantity.

Evaluation Checklist

Your RAG is working if:

  • Retrieved chunks are relevant to queries
  • Responses cite sources accurately
  • LLM says "don't know" when context lacks answer
  • Different queries return different contexts
  • Latency is acceptable (< 3s total)

Your RAG needs work if:

  • Irrelevant chunks appear in results
  • LLM ignores context and makes things up
  • Same context returned for different queries
  • Very slow retrieval (> 5s)
  • No way to update or delete documents

Quick Reference

Pipeline Steps

  1. Chunk — Split documents (500-1000 chars, with overlap)
  2. Embed — Convert to vectors (OpenAI text-embedding-3-small)
  3. Store — Save with metadata (Supabase + pgvector)
  4. Search — Find similar chunks (cosine similarity > 0.7)
  5. Generate — Answer with context (Claude with source citation)

Key Decisions

  Decision               Recommendation
  Chunk size             500-1000 characters
  Chunk overlap          100-200 characters
  Embedding model        text-embedding-3-small for most cases
  Retrieval limit        5-10 chunks
  Similarity threshold   0.7-0.8

What to Store with Each Chunk

  • Unique ID
  • Content text
  • Source document name/URL
  • Section or heading
  • Creation/update date
  • Embedding vector
  • Any filtering attributes (category, author, etc.)