How We Handle Embeddings

Principles for working with vector embeddings effectively

Embeddings are the foundation of modern semantic AI—they transform text into vectors where meaning becomes distance. But embeddings are also large (thousands of dimensions), expensive to compute, and easy to misuse. Using cosine similarity when you need Euclidean distance, or skipping dimensionality reduction before clustering, produces subtly wrong results. This document establishes how we work with embeddings correctly and efficiently.

Core Question: "Am I using the right embedding, the right distance metric, and the right dimensionality for this task?"

Principles

Principle 1: Match Embedding Model to Task

The mistake: Using one embedding model for everything. General-purpose embeddings for domain-specific tasks. Large models for simple classification.

The principle: Choose embedding models based on task requirements. Consider: semantic similarity vs. search retrieval, domain (general vs. specialized), language support, and dimension size vs. accuracy tradeoffs. Document why you chose a specific model.

Why it matters: A code embedding model outperforms a general model on code. A multilingual model handles Vietnamese better than an English-only model. Wrong model choice means garbage in, garbage out—regardless of downstream algorithm quality.
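
One lightweight way to honor "document why you chose a specific model" is a small registry that keeps the rationale next to the choice. A sketch in Python, using the model names from the quick reference below as illustrative examples only:

```python
# Illustrative task-to-model registry: model names and dimensions are taken from
# the quick reference diagram below; the "why" field records the rationale so
# swapping models later is an auditable decision rather than a guess.
EMBEDDING_MODELS = {
    "semantic_search": {
        "model": "text-embedding-3-large",
        "dims": 3072,
        "why": "Retrieval-focused task; keep the larger, higher-dimensional model.",
    },
    "multilingual_clustering": {
        "model": "multilingual-e5",
        "dims": 768,
        "why": "Multilingual model handles non-English input an English-only model would not.",
    },
    "simple_classification": {
        "model": "all-MiniLM",
        "dims": 384,
        "why": "Simple task; a small, fast model is enough.",
    },
}
```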

Principle 2: Reduce Dimensions Purposefully

The mistake: Using 3,072-dimensional embeddings directly for clustering (curse of dimensionality), or reducing to 2D for everything (losing information).

The principle: Use different dimensionality for different purposes. Intermediate dimensions (30-50) for clustering algorithms like HDBSCAN. Low dimensions (2-3) for visualization only. Preserve original embeddings for re-processing with different parameters.

Why it matters: Clustering in very high dimensions fails due to distance concentration—all points appear equidistant. But aggressive reduction loses semantic distinctions. The right dimension depends on the task.
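
A minimal sketch of keeping one space per purpose, assuming the numpy and umap-learn packages; the file name and parameter values are placeholders:

```python
import numpy as np
import umap  # pip install umap-learn

# Full-dimensional embeddings stay on disk so they can be re-reduced later.
embeddings = np.load("embeddings.npy").astype(np.float32)

# Intermediate space (30-50 dims) for clustering: keeps structure while
# avoiding the distance concentration seen in the original high dimensions.
cluster_space = umap.UMAP(
    n_components=50, metric="cosine", random_state=42
).fit_transform(embeddings)

# 2-D space for plotting only; never measure similarity or cluster here.
plot_space = umap.UMAP(
    n_components=2, metric="cosine", random_state=42
).fit_transform(embeddings)
```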

Principle 3: Choose Distance Metrics Deliberately

The mistake: Defaulting to Euclidean distance for everything, including normalized embeddings where it's suboptimal.

The principle: Use cosine similarity/distance for semantic similarity when embeddings are normalized. Use Euclidean distance after dimensionality reduction (UMAP, PCA). Match the metric to what the embedding model was trained for.

Why it matters: Many embedding models produce normalized vectors where cosine similarity is the intended comparison. On unit-length vectors, Euclidean distance ranks neighbors exactly as cosine does (the two are monotonically related), but that equivalence breaks for unnormalized or reduced vectors, so an unexamined default can quietly change what "similar" means.
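
A quick numeric check of that relationship, using only NumPy: for unit-length vectors, squared Euclidean distance is exactly twice the cosine distance, so the two rank neighbors identically.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 768))
a /= np.linalg.norm(a)   # normalize to unit length
b /= np.linalg.norm(b)

cosine_distance = 1.0 - a @ b             # 1 - cos(theta)
squared_euclidean = np.sum((a - b) ** 2)  # ||a - b||^2

# For unit vectors, ||a - b||^2 == 2 * (1 - cos(theta)): same neighbor ordering.
assert np.isclose(squared_euclidean, 2.0 * cosine_distance)
```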

Principle 4: Batch for Efficiency, Preserve for Flexibility

The mistake: Computing embeddings one at a time (slow), or discarding embeddings after immediate use (wasteful).

The principle: Always batch embedding requests (50-100 items per request). Store computed embeddings with their source text. Enable re-clustering or re-analysis without re-computing embeddings.

Why it matters: Embedding computation is typically the slowest and most expensive step in the pipeline. Batching can cut wall-clock time by roughly an order of magnitude compared with one request per item. Preserving embeddings enables parameter experimentation without repeating the expensive step.
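
A batching sketch, assuming the official openai Python client and the text-embedding-3-large model named in the quick reference below; the batch size follows the 50-100 guideline, and any batch-capable embedding API fits the same shape:

```python
from openai import OpenAI  # assumes the official openai client; any batch-capable embedding API works similarly

client = OpenAI()
BATCH_SIZE = 100  # 50-100 items per request, per the guideline above

def embed_texts(texts: list[str], model: str = "text-embedding-3-large") -> list[list[float]]:
    """Embed texts in batches instead of one request per item."""
    vectors: list[list[float]] = []
    for start in range(0, len(texts), BATCH_SIZE):
        batch = texts[start : start + BATCH_SIZE]
        response = client.embeddings.create(model=model, input=batch)
        vectors.extend(item.embedding for item in response.data)  # one vector per input, in order
    return vectors
```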

Principle 5: Handle Edge Cases Explicitly

The mistake: Assuming all text produces valid embeddings. Empty strings, very long text, or special characters cause silent failures or truncation.

The principle: Validate text before embedding. Handle empty strings (skip or use placeholder). Manage text length (truncate with warning or chunk and aggregate). Document behavior for edge cases.

Why it matters: Edge cases that produce null embeddings or truncated representations corrupt downstream analysis. A single null vector in a clustering operation can crash or skew results.
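
A minimal validation sketch; MAX_CHARS is a crude stand-in for a real token limit, and the skip-versus-placeholder choice is whatever your pipeline documents:

```python
import logging

MAX_CHARS = 8000  # stand-in for the model's token limit; use a real tokenizer in practice

def prepare_for_embedding(text: str) -> str | None:
    """Return text ready to embed, or None if it should be skipped."""
    if not text or not text.strip():
        return None  # empty or whitespace-only: skip (or substitute a documented placeholder)
    if len(text) > MAX_CHARS:
        logging.warning("truncating input from %d to %d characters", len(text), MAX_CHARS)
        return text[:MAX_CHARS]
    return text
```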

Decision Framework

When should I reduce dimensionality?

Always reduce before:

  • Clustering with density-based algorithms (HDBSCAN, DBSCAN)
  • K-means or hierarchical clustering on >1000 dimensions
  • Visualization (always reduce to 2-3D)

Keep full dimensions for:

  • Nearest neighbor search with optimized libraries (FAISS, Annoy); see the sketch after this list
  • When using algorithms designed for high dimensions
  • Storage when you may need to re-reduce later
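
As a sketch of the full-dimension nearest-neighbor case, assuming the faiss-cpu package and cosine similarity implemented as inner product over normalized vectors:

```python
import numpy as np
import faiss  # pip install faiss-cpu

embeddings = np.load("embeddings.npy").astype(np.float32)  # full-dimensional vectors

faiss.normalize_L2(embeddings)                  # normalize so inner product == cosine similarity
index = faiss.IndexFlatIP(embeddings.shape[1])  # exact inner-product index over full dimensions
index.add(embeddings)

query = embeddings[:1].copy()                   # any (n, dims) float32 array of query vectors
scores, neighbor_ids = index.search(query, 5)   # top-5 nearest neighbors per query
```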

What target dimension should I use?

| Purpose | Recommended dimensions | Reasoning |
| --- | --- | --- |
| Visualization | 2-3 | Human perception limit |
| Clustering | 30-50 | Balance detail vs. curse of dimensionality |
| Search/retrieval | Full or 256-512 | Preserve semantic precision |
| Storage (compressed) | 128-256 | Reduce size while keeping utility |

When should I use UMAP vs. PCA vs. t-SNE?

UMAP when:

  • You need to preserve both local and global structure
  • Target dimension is 2-50
  • Speed matters for large datasets
  • Results will be used for clustering

PCA when:

  • You need deterministic, reproducible results
  • Speed is critical and approximation is acceptable
  • Linear relationships are sufficient
  • You need interpretable components

t-SNE when:

  • Visualization is the only goal
  • Local structure matters more than global
  • Dataset is small (fewer than 10,000 items)
  • Cluster separation visibility is priority
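
A side-by-side sketch of the three reducers on the same data, assuming numpy, scikit-learn, and umap-learn; parameter values are illustrative defaults, not tuned recommendations:

```python
import numpy as np
import umap                            # pip install umap-learn
from sklearn.decomposition import PCA  # pip install scikit-learn
from sklearn.manifold import TSNE

embeddings = np.load("embeddings.npy").astype(np.float32)

# UMAP: preserves local and global structure; suitable for 2-50 dims and clustering.
umap_50d = umap.UMAP(n_components=50, metric="cosine", random_state=42).fit_transform(embeddings)

# PCA: deterministic and fast; linear projections with interpretable components.
pca_50d = PCA(n_components=50, random_state=42).fit_transform(embeddings)

# t-SNE: visualization only; favors local neighborhoods over global distances.
tsne_2d = TSNE(n_components=2, random_state=42).fit_transform(embeddings)
```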

Common Mistakes

Mistake 1: Clustering in original high dimensions

Signs: Clustering produces one giant cluster and many outliers. Cluster quality is poor regardless of parameters. Fix: Reduce to 30-50 dimensions before clustering. Use UMAP with a metric that matches your similarity measure (typically cosine for raw embeddings).

Mistake 2: Using 2D embeddings for analysis

Signs: Clusters that look good visually have mixed content. Semantic neighbors in 2D aren't actually related. Fix: 2D is for visualization only. Use higher dimensions (30-50) for actual clustering, then project to 2D for display.

Mistake 3: Ignoring embedding model's intended use

Signs: Symmetric similarity model used for search. Retrieval model used for classification. Fix: Read model documentation. Some models are trained for query-document retrieval (asymmetric), others for sentence similarity (symmetric).

Mistake 4: Not handling text length limits

Signs: Long documents produce identical embeddings. Semantic meaning is lost for detailed content. Fix: Know your model's token limit (usually 512-8192 tokens). Chunk long text and aggregate embeddings, or select representative portions.
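
One way to chunk and aggregate, sketched with naive character-based chunking and the hypothetical embed_texts helper from the batching sketch above; a real pipeline would chunk by tokens:

```python
import numpy as np

CHUNK_CHARS = 2000  # crude character-based chunk size; chunk by tokens in a real pipeline

def embed_long_text(text: str) -> np.ndarray:
    """Chunk over-long text, embed each chunk, and mean-pool into one vector."""
    chunks = [text[i : i + CHUNK_CHARS] for i in range(0, len(text), CHUNK_CHARS)]
    chunk_vectors = np.array(embed_texts(chunks))  # hypothetical batching helper sketched earlier
    pooled = chunk_vectors.mean(axis=0)
    return pooled / np.linalg.norm(pooled)         # re-normalize after averaging
```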

Mistake 5: Recomputing embeddings unnecessarily

Signs: Same data processed multiple times with different parameters. Pipeline restarts compute embeddings again. Fix: Cache embeddings with source text hash. Check cache before computing. Separate embedding step from downstream processing.
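
A minimal cache keyed by a hash of model name plus source text, assuming a local SQLite file; the schema and helper names are illustrative:

```python
import hashlib
import json
import sqlite3

db = sqlite3.connect("embedding_cache.db")
db.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, vector TEXT)")

def cache_key(text: str, model: str) -> str:
    # Key on model + text so changing models never reuses stale vectors.
    return hashlib.sha256(f"{model}\x00{text}".encode("utf-8")).hexdigest()

def get_or_embed(text: str, model: str, embed_fn) -> list[float]:
    """Return a cached embedding if present; otherwise compute, store, and return it."""
    key = cache_key(text, model)
    row = db.execute("SELECT vector FROM cache WHERE key = ?", (key,)).fetchone()
    if row is not None:
        return json.loads(row[0])
    vector = embed_fn(text)  # caller-supplied embedding function
    db.execute("INSERT INTO cache (key, vector) VALUES (?, ?)", (key, json.dumps(vector)))
    db.commit()
    return vector
```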

Evaluation Checklists

Your embedding usage is working if:

  • You can explain why you chose this embedding model
  • Dimensionality reduction matches the downstream task
  • Distance metric matches the embedding model's training
  • Embedding computation is batched and parallelized
  • Edge cases (empty, long text) are handled explicitly

Your embedding usage needs work if:

  • Changing embedding model requires code changes throughout
  • Clusters don't match semantic expectations
  • You're not sure what distance metric is being used
  • Re-running with different parameters recomputes embeddings
  • Some inputs silently produce null or truncated embeddings

Quick Reference

Text Input
    |
    v
+-------------------+
| Embedding Model   |   text-embedding-3-large: 3,072 dims
| (API call)        |   all-MiniLM: 384 dims
+-------------------+   multilingual-e5: 768 dims
    |
    v
+-------------------+
| Full Embeddings   |   Store for re-processing
| (high-dim)        |   Use cosine distance
+-------------------+
    |
    +------+-------+
    |      |       |
    v      v       v
  30-50D  2D     3D
 Cluster  Plot   Plot
  UMAP    UMAP   UMAP

| Concern | Guidance |
| --- | --- |
| Model selection | Match to task (search vs. similarity), language, domain |
| Dimensions | Full for storage, 30-50 for clustering, 2-3 for visualization |
| Distance metric | Cosine for raw embeddings, Euclidean after UMAP |
| Batching | 50-100 items per request, parallelize batches |
| Caching | Store embeddings with text hash, reuse across runs |