How We Handle Embeddings

Principles for working with vector embeddings effectively

Embeddings are the foundation of modern semantic AI—they transform text into vectors where meaning becomes distance. But embeddings are also large (thousands of dimensions), expensive to compute, and easy to misuse. Using cosine similarity when you need Euclidean distance, or skipping dimensionality reduction before clustering, produces subtly wrong results. This document establishes how we work with embeddings correctly and efficiently.

Core Question: "Am I using the right embedding, the right distance metric, and the right dimensionality for this task?"

Principles

Principle 1: Match Embedding Model to Task

The mistake: Using one embedding model for everything. General-purpose embeddings for domain-specific tasks. Large models for simple classification.

The principle: Choose embedding models based on task requirements. Consider: semantic similarity vs. search retrieval, domain (general vs. specialized), language support, and dimension size vs. accuracy tradeoffs. Document why you chose a specific model.

Why it matters: A code embedding model outperforms a general model on code. A multilingual model handles Vietnamese better than an English-only model. Wrong model choice means garbage in, garbage out—regardless of downstream algorithm quality.
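
One lightweight way to honor "document why you chose a specific model" is a small registry that keeps the rationale next to the choice. A sketch in Python, using the model names from the quick reference below as illustrative examples only:

```python
# Illustrative task-to-model registry: model names and dimensions are taken from
# the quick reference diagram below; the "why" field records the rationale so
# swapping models later is an auditable decision rather than a guess.
EMBEDDING_MODELS = {
    "semantic_search": {
        "model": "text-embedding-3-large",
        "dims": 3072,
        "why": "Retrieval-focused task; keep the larger, higher-dimensional model.",
    },
    "multilingual_clustering": {
        "model": "multilingual-e5",
        "dims": 768,
        "why": "Multilingual model handles non-English input an English-only model would not.",
    },
    "simple_classification": {
        "model": "all-MiniLM",
        "dims": 384,
        "why": "Simple task; a small, fast model is enough.",
    },
}
```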

Principle 2: Reduce Dimensions Purposefully

The mistake: Using 3,072-dimensional embeddings directly for clustering (curse of dimensionality), or reducing to 2D for everything (losing information).

The principle: Use different dimensionality for different purposes. Intermediate dimensions (30-50) for clustering algorithms like HDBSCAN. Low dimensions (2-3) for visualization only. Preserve original embeddings for re-processing with different parameters.

Why it matters: Clustering in very high dimensions fails due to distance concentration—all points appear equidistant. But aggressive reduction loses semantic distinctions. The right dimension depends on the task.
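
A minimal sketch of keeping one space per purpose, assuming the numpy and umap-learn packages; the file name and parameter values are placeholders:

```python
import numpy as np
import umap  # pip install umap-learn

# Full-dimensional embeddings stay on disk so they can be re-reduced later.
embeddings = np.load("embeddings.npy").astype(np.float32)

# Intermediate space (30-50 dims) for clustering: keeps structure while
# avoiding the distance concentration seen in the original high dimensions.
cluster_space = umap.UMAP(
    n_components=50, metric="cosine", random_state=42
).fit_transform(embeddings)

# 2-D space for plotting only; never measure similarity or cluster here.
plot_space = umap.UMAP(
    n_components=2, metric="cosine", random_state=42
).fit_transform(embeddings)
```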

Principle 3: Choose Distance Metrics Deliberately

The mistake: Defaulting to Euclidean distance for everything, including normalized embeddings where it's suboptimal.

The principle: Use cosine similarity/distance for semantic similarity when embeddings are normalized. Use Euclidean distance after dimensionality reduction (UMAP, PCA). Match the metric to what the embedding model was trained for.

Why it matters: Many embedding models produce normalized vectors where cosine similarity is the intended comparison. On unit-length vectors, Euclidean distance ranks neighbors exactly as cosine does (the two are monotonically related), but that equivalence breaks for unnormalized or reduced vectors, so an unexamined default can quietly change what "similar" means.
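
A quick numeric check of that relationship, using only NumPy: for unit-length vectors, squared Euclidean distance is exactly twice the cosine distance, so the two rank neighbors identically.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 768))
a /= np.linalg.norm(a)   # normalize to unit length
b /= np.linalg.norm(b)

cosine_distance = 1.0 - a @ b             # 1 - cos(theta)
squared_euclidean = np.sum((a - b) ** 2)  # ||a - b||^2

# For unit vectors, ||a - b||^2 == 2 * (1 - cos(theta)): same neighbor ordering.
assert np.isclose(squared_euclidean, 2.0 * cosine_distance)
```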

Principle 4: Batch for Efficiency, Preserve for Flexibility

The mistake: Computing embeddings one at a time (slow), or discarding embeddings after immediate use (wasteful).

The principle: Always batch embedding requests (50-100 items per request). Store computed embeddings with their source text. Enable re-clustering or re-analysis without re-computing embeddings.

Why it matters: Embedding computation is typically the slowest and most expensive step in the pipeline. Batching can cut wall-clock time by roughly an order of magnitude compared with one request per item. Preserving embeddings enables parameter experimentation without repeating the expensive step.
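
A batching sketch, assuming the official openai Python client and the text-embedding-3-large model named in the quick reference below; the batch size follows the 50-100 guideline, and any batch-capable embedding API fits the same shape:

```python
from openai import OpenAI  # assumes the official openai client; any batch-capable embedding API works similarly

client = OpenAI()
BATCH_SIZE = 100  # 50-100 items per request, per the guideline above

def embed_texts(texts: list[str], model: str = "text-embedding-3-large") -> list[list[float]]:
    """Embed texts in batches instead of one request per item."""
    vectors: list[list[float]] = []
    for start in range(0, len(texts), BATCH_SIZE):
        batch = texts[start : start + BATCH_SIZE]
        response = client.embeddings.create(model=model, input=batch)
        vectors.extend(item.embedding for item in response.data)  # one vector per input, in order
    return vectors
```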

Principle 5: Handle Edge Cases Explicitly

The mistake: Assuming all text produces valid embeddings. Empty strings, very long text, or special characters cause silent failures or truncation.

The principle: Validate text before embedding. Handle empty strings (skip or use placeholder). Manage text length (truncate with warning or chunk and aggregate). Document behavior for edge cases.

Why it matters: Edge cases that produce null embeddings or truncated representations corrupt downstream analysis. A single null vector in a clustering operation can crash or skew results.
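
A minimal validation sketch; MAX_CHARS is a crude stand-in for a real token limit, and the skip-versus-placeholder choice is whatever your pipeline documents:

```python
import logging

MAX_CHARS = 8000  # stand-in for the model's token limit; use a real tokenizer in practice

def prepare_for_embedding(text: str) -> str | None:
    """Return text ready to embed, or None if it should be skipped."""
    if not text or not text.strip():
        return None  # empty or whitespace-only: skip (or substitute a documented placeholder)
    if len(text) > MAX_CHARS:
        logging.warning("truncating input from %d to %d characters", len(text), MAX_CHARS)
        return text[:MAX_CHARS]
    return text
```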

Decision Framework

When should I reduce dimensionality?

Always reduce before:

  • Clustering with density-based algorithms (HDBSCAN, DBSCAN)
  • K-means or hierarchical clustering on >1000 dimensions
  • Visualization (always reduce to 2-3D)

Keep full dimensions for:

  • Nearest neighbor search with optimized libraries (FAISS, Annoy); see the sketch after this list
  • When using algorithms designed for high dimensions
  • Storage when you may need to re-reduce later
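
As a sketch of the full-dimension nearest-neighbor case, assuming the faiss-cpu package and cosine similarity implemented as inner product over normalized vectors:

```python
import numpy as np
import faiss  # pip install faiss-cpu

embeddings = np.load("embeddings.npy").astype(np.float32)  # full-dimensional vectors

faiss.normalize_L2(embeddings)                  # normalize so inner product == cosine similarity
index = faiss.IndexFlatIP(embeddings.shape[1])  # exact inner-product index over full dimensions
index.add(embeddings)

query = embeddings[:1].copy()                   # any (n, dims) float32 array of query vectors
scores, neighbor_ids = index.search(query, 5)   # top-5 nearest neighbors per query
```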

What target dimension should I use?

| Purpose | Recommended dimensions | Reasoning |
| --- | --- | --- |
| Visualization | 2-3 | Human perception limit |
| Clustering | 30-50 | Balance detail vs. curse of dimensionality |
| Search/retrieval | Full or 256-512 | Preserve semantic precision |
| Storage (compressed) | 128-256 | Reduce size while keeping utility |

When should I use UMAP vs. PCA vs. t-SNE?

UMAP when:

  • You need to preserve both local and global structure
  • Target dimension is 2-50
  • Speed matters for large datasets
  • Results will be used for clustering

PCA when:

  • You need deterministic, reproducible results
  • Speed is critical and approximation is acceptable
  • Linear relationships are sufficient
  • You need interpretable components

t-SNE when:

  • Visualization is the only goal
  • Local structure matters more than global
  • Dataset is small (fewer than 10,000 items)
  • Cluster separation visibility is priority
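
A side-by-side sketch of the three reducers on the same data, assuming numpy, scikit-learn, and umap-learn; parameter values are illustrative defaults, not tuned recommendations:

```python
import numpy as np
import umap                            # pip install umap-learn
from sklearn.decomposition import PCA  # pip install scikit-learn
from sklearn.manifold import TSNE

embeddings = np.load("embeddings.npy").astype(np.float32)

# UMAP: preserves local and global structure; suitable for 2-50 dims and clustering.
umap_50d = umap.UMAP(n_components=50, metric="cosine", random_state=42).fit_transform(embeddings)

# PCA: deterministic and fast; linear projections with interpretable components.
pca_50d = PCA(n_components=50, random_state=42).fit_transform(embeddings)

# t-SNE: visualization only; favors local neighborhoods over global distances.
tsne_2d = TSNE(n_components=2, random_state=42).fit_transform(embeddings)
```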

Common Mistakes

Mistake 1: Clustering in original high dimensions

Signs: Clustering produces one giant cluster and many outliers. Cluster quality is poor regardless of parameters. Fix: Reduce to 30-50 dimensions before clustering. Use UMAP with a metric that matches your similarity measure (typically cosine for raw embeddings).

Mistake 2: Using 2D embeddings for analysis

Signs: Clusters that look good visually have mixed content. Semantic neighbors in 2D aren't actually related. Fix: 2D is for visualization only. Use higher dimensions (30-50) for actual clustering, then project to 2D for display.

Mistake 3: Ignoring embedding model's intended use

Signs: Symmetric similarity model used for search. Retrieval model used for classification. Fix: Read model documentation. Some models are trained for query-document retrieval (asymmetric), others for sentence similarity (symmetric).

Mistake 4: Not handling text length limits

Signs: Long documents produce identical embeddings. Semantic meaning is lost for detailed content. Fix: Know your model's token limit (usually 512-8192 tokens). Chunk long text and aggregate embeddings, or select representative portions.
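
One way to chunk and aggregate, sketched with naive character-based chunking and the hypothetical embed_texts helper from the batching sketch above; a real pipeline would chunk by tokens:

```python
import numpy as np

CHUNK_CHARS = 2000  # crude character-based chunk size; chunk by tokens in a real pipeline

def embed_long_text(text: str) -> np.ndarray:
    """Chunk over-long text, embed each chunk, and mean-pool into one vector."""
    chunks = [text[i : i + CHUNK_CHARS] for i in range(0, len(text), CHUNK_CHARS)]
    chunk_vectors = np.array(embed_texts(chunks))  # hypothetical batching helper sketched earlier
    pooled = chunk_vectors.mean(axis=0)
    return pooled / np.linalg.norm(pooled)         # re-normalize after averaging
```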

Mistake 5: Recomputing embeddings unnecessarily

Signs: Same data processed multiple times with different parameters. Pipeline restarts compute embeddings again. Fix: Cache embeddings with source text hash. Check cache before computing. Separate embedding step from downstream processing.
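
A minimal cache keyed by a hash of model name plus source text, assuming a local SQLite file; the schema and helper names are illustrative:

```python
import hashlib
import json
import sqlite3

db = sqlite3.connect("embedding_cache.db")
db.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, vector TEXT)")

def cache_key(text: str, model: str) -> str:
    # Key on model + text so changing models never reuses stale vectors.
    return hashlib.sha256(f"{model}\x00{text}".encode("utf-8")).hexdigest()

def get_or_embed(text: str, model: str, embed_fn) -> list[float]:
    """Return a cached embedding if present; otherwise compute, store, and return it."""
    key = cache_key(text, model)
    row = db.execute("SELECT vector FROM cache WHERE key = ?", (key,)).fetchone()
    if row is not None:
        return json.loads(row[0])
    vector = embed_fn(text)  # caller-supplied embedding function
    db.execute("INSERT INTO cache (key, vector) VALUES (?, ?)", (key, json.dumps(vector)))
    db.commit()
    return vector
```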

Evaluation Checklists

Your embedding usage is working if:

  • You can explain why you chose this embedding model
  • Dimensionality reduction matches the downstream task
  • Distance metric matches the embedding model's training
  • Embedding computation is batched and parallelized
  • Edge cases (empty, long text) are handled explicitly

Your embedding usage needs work if:

  • Changing embedding model requires code changes throughout
  • Clusters don't match semantic expectations
  • You're not sure what distance metric is being used
  • Re-running with different parameters recomputes embeddings
  • Some inputs silently produce null or truncated embeddings

Quick Reference

Text Input
    |
    v
+-------------------+
| Embedding Model   |   text-embedding-3-large: 3,072 dims
| (API call)        |   all-MiniLM: 384 dims
+-------------------+   multilingual-e5: 768 dims
    |
    v
+-------------------+
| Full Embeddings   |   Store for re-processing
| (high-dim)        |   Use cosine distance
+-------------------+
    |
    +------+-------+
    |      |       |
    v      v       v
  30-50D  2D     3D
 Cluster  Plot   Plot
  UMAP    UMAP   UMAP

| Concern | Guidance |
| --- | --- |
| Model selection | Match to task (search vs. similarity), language, domain |
| Dimensions | Full for storage, 30-50 for clustering, 2-3 for visualization |
| Distance metric | Cosine for raw embeddings, Euclidean after UMAP |
| Batching | 50-100 items per request, parallelize batches |
| Caching | Store embeddings with text hash, reuse across runs |