How We Handle Embeddings
Principles for working with vector embeddings effectively
Embeddings are the foundation of modern semantic AI—they transform text into vectors where meaning becomes distance. But embeddings are also large (hundreds to thousands of dimensions), expensive to compute, and easy to misuse. Using cosine similarity when you need Euclidean distance, or skipping dimensionality reduction before clustering, produces subtly wrong results. This document establishes how we work with embeddings correctly and efficiently.
Core Question: "Am I using the right embedding, the right distance metric, and the right dimensionality for this task?"
Principles
Principle 1: Match Embedding Model to Task
The mistake: Using one embedding model for everything. General-purpose embeddings for domain-specific tasks. Large models for simple classification.
The principle: Choose embedding models based on task requirements. Consider: semantic similarity vs. search retrieval, domain (general vs. specialized), language support, and dimension size vs. accuracy tradeoffs. Document why you chose a specific model.
Why it matters: A code embedding model outperforms a general model on code. A multilingual model handles Vietnamese better than an English-only model. Wrong model choice means garbage in, garbage out—regardless of downstream algorithm quality.
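One lightweight way to honor this principle is a single registry that records which model each task uses and why. The sketch below is illustrative only; the model names, task labels, and rationale strings are placeholders, not recommendations.

```python
# Illustrative registry of embedding model choices, one entry per task.
# Model names and rationales are examples, not prescriptions.
EMBEDDING_MODELS = {
    "semantic_search": {
        "model": "text-embedding-3-large",  # asymmetric retrieval, 3,072 dims
        "why": "Best retrieval accuracy in our evaluation; cost acceptable.",
    },
    "multilingual_clustering": {
        "model": "multilingual-e5-base",    # 768 dims, broad language coverage
        "why": "Inputs include Vietnamese; English-only models degrade badly.",
    },
}

def model_for(task: str) -> str:
    """Fail loudly if a task has no documented embedding model."""
    entry = EMBEDDING_MODELS.get(task)
    if entry is None:
        raise KeyError(f"No embedding model documented for task '{task}'")
    return entry["model"]
```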
Principle 2: Reduce Dimensions Purposefully
The mistake: Using 3,072-dimensional embeddings directly for clustering (curse of dimensionality), or reducing to 2D for everything (losing information).
The principle: Use different dimensionality for different purposes. Intermediate dimensions (roughly 30-50) for clustering algorithms like HDBSCAN. Low dimensions (2-3) for visualization only. Preserve original embeddings for re-processing with different parameters.
Why it matters: Clustering in very high dimensions fails due to distance concentration—all points appear equidistant. But aggressive reduction loses semantic distinctions. The right dimension depends on the task.
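As a concrete illustration, here is a minimal sketch of the two-track reduction described above, assuming the umap-learn and hdbscan packages; the file name, component counts, and cluster size are placeholders to adapt.

```python
# `embeddings` is an (n_items, n_dims) float array of original vectors.
import numpy as np
import umap
import hdbscan

embeddings = np.load("embeddings_full.npy")  # keep the originals on disk

# Intermediate dimensionality for clustering: enough structure, far less
# distance concentration than the raw high-dimensional space.
cluster_space = umap.UMAP(
    n_components=50, metric="cosine", random_state=42
).fit_transform(embeddings)
labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(cluster_space)

# 2-D projection is for plotting only; never feed it back into analysis.
plot_space = umap.UMAP(
    n_components=2, metric="cosine", random_state=42
).fit_transform(embeddings)
```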
Principle 3: Choose Distance Metrics Deliberately
The mistake: Defaulting to Euclidean distance for everything, including normalized embeddings where it's suboptimal.
The principle: Use cosine similarity/distance for semantic similarity when embeddings are normalized. Use Euclidean distance after dimensionality reduction (UMAP, PCA). Match the metric to what the embedding model was trained for.
Why it matters: Many embedding models produce unit-normalized vectors where cosine similarity is the intended comparison method. On normalized vectors, Euclidean distance ranks neighbors the same way as cosine (the two are monotonically related), but on unnormalized vectors the metrics can disagree, and a metric that doesn't match the model's training quietly degrades similarity quality.
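A small numpy check makes the relationship between the two metrics concrete; the vector dimension and random vectors here are arbitrary.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# For unit-normalized vectors, Euclidean and cosine distance are monotonically
# related: ||a - b||^2 = 2 - 2 * cos(a, b).
a = np.random.randn(768); a /= np.linalg.norm(a)
b = np.random.randn(768); b /= np.linalg.norm(b)

cos = cosine_similarity(a, b)
euclid = np.linalg.norm(a - b)
assert np.isclose(euclid**2, 2 - 2 * cos)
```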
Principle 4: Batch for Efficiency, Preserve for Flexibility
The mistake: Computing embeddings one at a time (slow), or discarding embeddings after immediate use (wasteful).
The principle: Always batch embedding requests (50-100 items per request). Store computed embeddings with their source text. Enable re-clustering or re-analysis without re-computing embeddings.
Why it matters: Embedding computation is the slowest and most expensive pipeline step. Batching can cut total latency by roughly 10x. Preservation enables parameter experimentation without repeating the expensive step.
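A minimal sketch of the batching-plus-preservation pattern. The embed_batch callable and the batch size of 100 are assumptions standing in for whatever client and limits your provider actually exposes.

```python
from typing import Callable
import numpy as np

def embed_all(
    texts: list[str],
    embed_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 100,
) -> list[dict]:
    """Embed texts in batches and keep each vector next to its source text."""
    records = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start : start + batch_size]
        vectors = embed_batch(batch)  # one API call per batch, not per item
        for text, vector in zip(batch, vectors):
            records.append({"text": text, "embedding": np.asarray(vector)})
    return records
```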
Principle 5: Handle Edge Cases Explicitly
The mistake: Assuming all text produces valid embeddings. Empty strings, very long text, or special characters cause silent failures or truncation.
The principle: Validate text before embedding. Handle empty strings (skip or use placeholder). Manage text length (truncate with warning or chunk and aggregate). Document behavior for edge cases.
Why it matters: Edge cases that produce null embeddings or truncated representations corrupt downstream analysis. A single null vector in a clustering operation can crash or skew results.
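A sketch of the validation step, using a character-count cutoff as a crude stand-in for a real token limit; the MAX_CHARS value and logging behavior are placeholders to replace with your model's tokenizer and your pipeline's error handling.

```python
import logging

MAX_CHARS = 8_000  # rough proxy; use the model's tokenizer in practice
logger = logging.getLogger(__name__)

def prepare_texts(texts: list[str]) -> list[tuple[int, str]]:
    """Return (original_index, cleaned_text), skipping items we cannot embed."""
    prepared = []
    for i, text in enumerate(texts):
        if not text or not text.strip():
            logger.warning("Skipping empty text at index %d", i)
            continue
        if len(text) > MAX_CHARS:
            logger.warning("Truncating text at index %d (%d chars)", i, len(text))
            text = text[:MAX_CHARS]
        prepared.append((i, text))
    return prepared
```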
Decision Framework
When should I reduce dimensionality?
Always reduce before:
- Clustering with density-based algorithms (HDBSCAN, DBSCAN)
- K-means or hierarchical clustering on >1000 dimensions
- Visualization (always reduce to 2-3D)
Keep full dimensions for:
- Nearest neighbor search with optimized libraries (FAISS, Annoy); see the sketch after this list
- When using algorithms designed for high dimensions
- Storage when you may need to re-reduce later
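For the full-dimension search case, a sketch with FAISS might look like the following. It assumes the faiss package and uses exact inner-product search over L2-normalized vectors, which is equivalent to cosine ranking; the file name, query choice, and k value are placeholders.

```python
import numpy as np
import faiss

embeddings = np.load("embeddings_full.npy").astype("float32")
faiss.normalize_L2(embeddings)  # in-place L2 normalization

# Exact inner-product index at full dimensionality.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

query = embeddings[:1].copy()                 # any (1, d) float32 query vector
scores, neighbor_ids = index.search(query, 5) # top-5 nearest neighbors
```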
What target dimension should I use?
| Purpose | Recommended Dimensions | Reasoning |
|---|---|---|
| Visualization | 2-3 | Human perception limit |
| Clustering | 30-50 | Balance detail vs. curse of dimensionality |
| Search/retrieval | Full or 256-512 | Preserve semantic precision |
| Storage (compressed) | 128-256 | Reduce size while keeping utility |
When should I use UMAP vs. PCA vs. t-SNE?
UMAP when:
- You need to preserve both local and global structure
- Target dimension is 2-50
- Speed matters for large datasets
- Results will be used for clustering
PCA when:
- You need deterministic, reproducible results
- Speed is critical and approximation is acceptable
- Linear relationships are sufficient
- You need interpretable components
t-SNE when:
- Visualization is the only goal
- Local structure matters more than global
- Dataset is small (fewer than 10,000 items)
- Cluster separation visibility is priority
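In code, the practical difference is small. The sketch below contrasts a deterministic PCA reduction, with its interpretable explained-variance readout, against a UMAP reduction intended for clustering; it assumes scikit-learn and umap-learn, and the component counts and file name are examples.

```python
import numpy as np
from sklearn.decomposition import PCA
import umap

embeddings = np.load("embeddings_full.npy")

# PCA: linear, fast, reproducible, with interpretable variance per component.
pca = PCA(n_components=50, random_state=42)
pca_space = pca.fit_transform(embeddings)
print("variance explained:", pca.explained_variance_ratio_.sum())

# UMAP: nonlinear, preserves neighborhood structure better for clustering.
umap_space = umap.UMAP(
    n_components=50, metric="cosine", random_state=42
).fit_transform(embeddings)
```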
Common Mistakes
Mistake 1: Clustering in original high dimensions
Signs: Clustering produces one giant cluster and many outliers. Cluster quality is poor regardless of parameters.
Fix: Reduce to 30-50 dimensions before clustering. Use UMAP with a metric matching your similarity measure.
Mistake 2: Using 2D embeddings for analysis
Signs: Clusters that look good visually have mixed content. Semantic neighbors in 2D aren't actually related.
Fix: 2D is for visualization only. Use higher dimensions (30-50) for actual clustering, then project to 2D for display.
Mistake 3: Ignoring embedding model's intended use
Signs: Symmetric similarity model used for search. Retrieval model used for classification.
Fix: Read model documentation. Some models are trained for query-document retrieval (asymmetric), others for sentence similarity (symmetric).
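As one concrete example of an asymmetric convention: E5-family models document that queries and passages must be prefixed differently. The sketch below assumes the sentence-transformers package and an E5 model; other models have their own conventions, so always check the model card.

```python
from sentence_transformers import SentenceTransformer

# E5 models expect "query: " and "passage: " prefixes on their inputs.
model = SentenceTransformer("intfloat/multilingual-e5-base")

query_vec = model.encode(["query: how do I rotate an API key?"])
passage_vecs = model.encode([
    "passage: To rotate a key, open Settings > API and click Regenerate.",
    "passage: Our office is closed on public holidays.",
])
```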
Mistake 4: Not handling text length limits
Signs: Long documents produce identical embeddings. Semantic meaning is lost for detailed content.
Fix: Know your model's token limit (usually 512-8192 tokens). Chunk long text and aggregate embeddings, or select representative portions.
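A sketch of the chunk-and-aggregate option. Splitting on whitespace is a rough proxy for token counting, embed_batch is a placeholder for your embedding client, and mean-pooling is only the simplest possible aggregation.

```python
import numpy as np

def embed_long_text(text: str, embed_batch, chunk_words: int = 300) -> np.ndarray:
    """Embed a long document by chunking, then mean-pooling the chunk vectors."""
    words = text.split()
    chunks = [
        " ".join(words[i : i + chunk_words])
        for i in range(0, len(words), chunk_words)
    ]
    if not chunks:
        raise ValueError("Cannot embed empty text")
    chunk_vectors = np.asarray(embed_batch(chunks))
    # Mean-pooling blurs detail but keeps the document comparable to
    # single-chunk items in the same vector space.
    return chunk_vectors.mean(axis=0)
```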
Mistake 5: Recomputing embeddings unnecessarily
Signs: Same data processed multiple times with different parameters. Pipeline restarts compute embeddings again.
Fix: Cache embeddings with source text hash. Check cache before computing. Separate embedding step from downstream processing.
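A sketch of content-addressed caching keyed on both the text and the model name, so a model change never reuses stale vectors. The cache directory and on-disk format are placeholders; embed_one stands in for your embedding client.

```python
import hashlib
from pathlib import Path
import numpy as np

CACHE_DIR = Path("embedding_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_embed(text: str, model_name: str, embed_one) -> np.ndarray:
    """Return a cached embedding if present, otherwise compute and store it."""
    key = hashlib.sha256(f"{model_name}\x00{text}".encode("utf-8")).hexdigest()
    path = CACHE_DIR / f"{key}.npy"
    if path.exists():
        return np.load(path)
    vector = np.asarray(embed_one(text))
    np.save(path, vector)
    return vector
```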
Evaluation Checklists
Your embedding usage is working if:
- You can explain why you chose this embedding model
- Dimensionality reduction matches the downstream task
- Distance metric matches the embedding model's training
- Embedding computation is batched and parallelized
- Edge cases (empty, long text) are handled explicitly
Your embedding usage needs work if:
- Changing embedding model requires code changes throughout
- Clusters don't match semantic expectations
- You're not sure what distance metric is being used
- Re-running with different parameters recomputes embeddings
- Some inputs silently produce null or truncated embeddings
Quick Reference
```
Text Input
     |
     v
+-------------------+
|  Embedding Model  |   text-embedding-3-large: 3,072 dims
|    (API call)     |   all-MiniLM: 384 dims
+-------------------+   multilingual-e5: 768 dims
     |
     v
+-------------------+
|  Full Embeddings  |   Store for re-processing
|    (high-dim)     |   Use cosine distance
+-------------------+
     |
     +---------+---------+
     |         |         |
     v         v         v
   30-50D      2D        3D
   Cluster     Plot      Plot
   UMAP        UMAP      UMAP
```
| Concern | Guidance |
|---|---|
| Model selection | Match to task (search vs. similarity), language, domain |
| Dimensions | Full for storage, 30-50 for clustering, 2-3 for visualization |
| Distance metric | Cosine for raw embeddings, Euclidean after UMAP |
| Batching | 50-100 items per request, parallelize batches |
| Caching | Store embeddings with text hash, reuse across runs |