How We Build ML Pipelines

Principles for orchestrating multi-stage machine learning workflows

ML pipelines are not single operations—they are chains of dependent stages where each step transforms data for the next. Preprocessing feeds embeddings, embeddings feed clustering, clustering feeds labeling. When any stage fails, you need to know which one, why, and what to do about it. This document establishes how we build pipelines that are observable, recoverable, and maintainable.

Core Question: "If this pipeline fails at step 3 of 5, can I diagnose the problem and resume without starting over?"

Principles

Principle 1: Make Every Stage Observable

The mistake: Treating the pipeline as a black box. Users see "Processing..." for minutes with no indication of progress or what's happening.

The principle: Every stage should emit progress events: what stage is running, what percentage is complete, and how long it has taken. Use structured events (not just log lines) that the UI can consume.

Why it matters: Long-running ML operations feel broken without feedback. Progress visibility builds trust, helps users estimate wait times, and makes debugging possible when things go wrong.
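
For instance, a minimal Python sketch of a structured progress event a stage could emit; the `ProgressEvent` fields and the `embed_texts` stage are illustrative assumptions, not a prescribed API:

```python
import time
from dataclasses import dataclass
from typing import Iterator

@dataclass
class ProgressEvent:
    """Structured progress event a UI can consume directly."""
    stage: str        # e.g. "embedding"
    percent: float    # 0.0 - 100.0
    elapsed_s: float  # seconds since the stage started

def embed_texts(texts: list[str]) -> Iterator[ProgressEvent]:
    """Hypothetical stage that reports progress as it works."""
    start = time.monotonic()
    for i, text in enumerate(texts, start=1):
        _ = text.lower()  # placeholder for the real embedding call
        yield ProgressEvent(
            stage="embedding",
            percent=100.0 * i / len(texts),
            elapsed_s=time.monotonic() - start,
        )
```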

Principle 2: Define Clear Stage Boundaries

The mistake: Monolithic functions that do preprocessing, embedding, clustering, and labeling in one 500-line block. Changes to one stage risk breaking others.

The principle: Each stage is a separate function with defined inputs and outputs. Stages communicate through explicit data structures, not shared mutable state. One stage failing should not corrupt another's data.

Why it matters: Clear boundaries enable independent testing, easier debugging, and the ability to swap implementations (e.g., different embedding models) without rewriting the pipeline.
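
As a sketch, stages can communicate through explicit dataclasses rather than shared mutable state; the names below (`PreprocessResult`, `EmbeddingResult`) are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class PreprocessResult:
    original: list[str]    # untouched inputs (see Principle 5)
    normalized: list[str]  # what the next stage consumes

@dataclass
class EmbeddingResult:
    vectors: list[list[float]]

def preprocess(texts: list[str]) -> PreprocessResult:
    return PreprocessResult(original=texts,
                            normalized=[t.strip().lower() for t in texts])

def embed(pre: PreprocessResult) -> EmbeddingResult:
    # Stand-in for a real model call; it only reads its explicit input.
    return EmbeddingResult(vectors=[[float(len(t))] for t in pre.normalized])

# The pipeline becomes function composition over explicit values:
result = embed(preprocess(["Keyword One", "keyword two"]))
```

Because each stage only sees its declared input, swapping the embedding implementation never touches preprocessing or clustering code.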

Principle 3: Fail Fast, Fail Clearly

The mistake: Catching all exceptions with a generic "Something went wrong" message. Errors in stage 2 manifest as confusing failures in stage 4.

The principle: Validate inputs at the start of each stage. Fail immediately with specific error messages when preconditions aren't met. Include stage name, expected vs. actual state, and recovery suggestions in errors.

Why it matters: Specific errors are fixable. "Embedding failed: API returned 401 Unauthorized" is actionable. "Processing failed" is not.
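
A minimal sketch of failing fast with a stage-aware error; the `StageError` type and `cluster` stage are illustrative:

```python
class StageError(RuntimeError):
    """Error carrying the stage name, a specific message, and a recovery hint."""
    def __init__(self, stage: str, message: str, hint: str):
        super().__init__(f"[{stage}] {message} (hint: {hint})")

def cluster(vectors: list[list[float]], min_items: int = 2) -> list[int]:
    # Validate preconditions before doing any work.
    if len(vectors) < min_items:
        raise StageError(
            stage="clustering",
            message=f"expected at least {min_items} vectors, got {len(vectors)}",
            hint="check that the embedding stage produced output",
        )
    dims = {len(v) for v in vectors}
    if len(dims) != 1:
        raise StageError("clustering", f"inconsistent vector dimensions {dims}",
                         "re-run the embedding stage with a single model")
    return [0] * len(vectors)  # placeholder cluster assignment
```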

Principle 4: Design for Partial Recovery

The mistake: Any failure means starting over from the beginning. A labeling API timeout throws away 5 minutes of embedding and clustering work.

The principle: Cache or checkpoint intermediate results where practical. Design stages to be resumable. When possible, allow restarting from the last successful stage, not from scratch.

Why it matters: ML operations are expensive (time, compute, API costs). Partial recovery respects those investments and improves user experience during transient failures.
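
One lightweight way to checkpoint, sketched below with a JSON file cache; the `run_with_checkpoint` helper and the cache directory name are illustrative assumptions:

```python
import json
from pathlib import Path
from typing import Callable

def run_with_checkpoint(name: str, compute: Callable[[], dict],
                        cache_dir: Path = Path(".pipeline_cache")) -> dict:
    """Run a stage, or reuse its cached output from a previous run."""
    cache_dir.mkdir(exist_ok=True)
    path = cache_dir / f"{name}.json"
    if path.exists():
        return json.loads(path.read_text())  # resume: skip recomputation
    result = compute()
    path.write_text(json.dumps(result))      # checkpoint for the next run
    return result

# Usage: wrap expensive stages, leave cheap ones unwrapped.
embeddings = run_with_checkpoint(
    "embeddings", lambda: {"vectors": [[0.1, 0.2], [0.3, 0.4]]})
```

A real pipeline would likely key each checkpoint on a hash of the stage's inputs and parameters so stale results are not reused.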

Principle 5: Keep Transformations Reversible

The mistake: Destructive transformations that lose original data. After clustering, you can no longer access the original keywords or raw embeddings.

The principle: Preserve original inputs alongside transformed outputs. Store both the preprocessed text and the original text. Keep high-dimensional embeddings even after dimensionality reduction.

Why it matters: Reversibility enables re-processing with different parameters, debugging unexpected results, and features like "re-cluster with different settings" without re-uploading data.
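
For example, a record that carries originals and transformed outputs side by side; the `ClusteredKeyword` structure and `recluster` helper are illustrative:

```python
from dataclasses import dataclass

@dataclass
class ClusteredKeyword:
    original: str           # raw input, never modified
    preprocessed: str       # normalized form used downstream
    embedding: list[float]  # full high-dimensional vector, kept after reduction
    reduced: list[float]    # low-dimensional projection used for plotting
    cluster: int

def recluster(items: list[ClusteredKeyword], n_clusters: int) -> list[int]:
    # Placeholder: a real implementation would cluster item.embedding again,
    # which is only possible because the full vectors were preserved.
    return [i % n_clusters for i, _ in enumerate(items)]
```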

Decision Framework

When should I use generators (streaming) vs. batch returns?

Use generators/streaming when:

  • Pipeline takes more than 5 seconds
  • You want to show progress during execution
  • Memory constraints prevent holding all results at once
  • Client needs incremental updates

Use batch returns when:

  • Pipeline completes in under 2 seconds
  • All results are needed before any can be used
  • Simplicity is more important than progress feedback
  • Results are small enough to fit in memory
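
A small sketch of how one stage can offer both forms, with the batch version as a thin wrapper over the streaming one (function names are illustrative):

```python
from typing import Iterator

def label_clusters_stream(clusters: list[str]) -> Iterator[str]:
    """Streaming form: yields each label as soon as it is ready."""
    for name in clusters:
        yield f"label for {name}"  # placeholder for a slow labeling call

def label_clusters(clusters: list[str]) -> list[str]:
    """Batch form: for short pipelines where simplicity beats progress feedback."""
    return list(label_clusters_stream(clusters))
```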

When should I checkpoint intermediate results?

Checkpoint when:

  • Stage takes more than 30 seconds
  • Stage involves external API calls (embedding, labeling)
  • Stage results are expensive to recompute
  • Users may want to re-run later stages with different parameters

Skip checkpointing when:

  • Stage is fast (< 5 seconds)
  • Storage costs exceed recomputation costs
  • Data sensitivity prevents storing intermediates
  • Pipeline is always run end-to-end

When should I parallelize a stage?

Parallelize when:

  • Stage processes independent items (embeddings, labels)
  • Work is I/O-bound (API calls, file reads)
  • Speedup is significant (> 3x improvement)
  • Error handling per item is manageable

Keep sequential when:

  • Items depend on previous results
  • Work is CPU-bound on single-threaded libraries
  • Debugging is more important than speed
  • Rate limits make parallelization counterproductive
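
For I/O-bound work such as embedding API calls, a bounded thread pool is often enough; a minimal sketch, with `fetch_embedding` standing in for the real call and `max_workers` capped to respect rate limits:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_embedding(text: str) -> list[float]:
    return [float(len(text))]  # placeholder for an I/O-bound API call

def embed_parallel(texts: list[str], max_workers: int = 8) -> list[list[float]]:
    """Threads suit I/O-bound stages; order of results matches the inputs."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_embedding, texts))
```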

Common Mistakes

Mistake 1: Coupling visualization to computation

Signs: Can't get clustering results without generating plots. Visualization errors crash the pipeline.
Fix: Separate computation stages from presentation stages. Return data structures, generate visualizations as a final optional step.
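
A sketch of that separation, with hypothetical function names; the computation stage returns plain data, and plotting is an optional step layered on top:

```python
def compute_clusters(vectors: list[list[float]]) -> dict:
    """Computation stage: returns data, performs no plotting side effects."""
    return {"assignments": [0] * len(vectors)}  # placeholder clustering

def plot_clusters(result: dict) -> None:
    """Optional presentation step; a failure here cannot corrupt the results."""
    print(f"{len(result['assignments'])} points to plot")  # stand-in for real plotting
```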

Mistake 2: Hardcoding stage order

Signs: Adding a new preprocessing step requires modifying five files. Can't skip stages during development.
Fix: Define the pipeline as a sequence of configurable stages. Allow stages to be enabled/disabled or reordered via configuration.
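
One way to express this, sketched below: the pipeline is an ordered list of named stage functions over a shared context dict, so stages can be skipped or reordered without touching the others (structure and names are illustrative):

```python
from typing import Callable

# The pipeline is data: a named, ordered list of stage functions.
STAGES: list[tuple[str, Callable[[dict], dict]]] = [
    ("preprocess", lambda ctx: {**ctx, "clean": [t.lower() for t in ctx["raw"]]}),
    ("embed", lambda ctx: {**ctx, "vectors": [[float(len(t))] for t in ctx["clean"]]}),
    ("cluster", lambda ctx: {**ctx, "labels": [0] * len(ctx["vectors"])}),
]

def run(ctx: dict, skip: frozenset[str] = frozenset()) -> dict:
    for name, stage in STAGES:
        if name not in skip:  # stages can be disabled without code changes
            ctx = stage(ctx)
    return ctx

result = run({"raw": ["Alpha", "Beta"]}, skip=frozenset({"cluster"}))
```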

Mistake 3: Swallowing errors in parallel execution

Signs: Some items silently fail in batch processing. Results have fewer items than inputs with no explanation.
Fix: Collect errors during parallel execution. Report which items failed and why. Decide explicitly whether partial results are acceptable.
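
A sketch of collecting per-item errors instead of swallowing them; `label_one` stands in for the real per-item work:

```python
from concurrent.futures import ThreadPoolExecutor

def label_one(item: str) -> str:
    if not item:
        raise ValueError("empty item")
    return item.upper()  # placeholder for the real labeling call

def label_all(items: list[str]) -> tuple[dict[str, str], dict[str, str]]:
    """Return successes and failures separately so nothing disappears silently."""
    successes: dict[str, str] = {}
    failures: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(label_one, item): item for item in items}
        for future, item in futures.items():
            try:
                successes[item] = future.result()
            except Exception as exc:  # record which item failed and why
                failures[item] = str(exc)
    return successes, failures
```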

Mistake 4: No memory management for large datasets

Signs: Pipeline crashes with OOM on large inputs. Works fine on 100 items, fails on 10,000.
Fix: Process in chunks. Use generators instead of lists. Release references to intermediate data after each stage. Profile memory usage.
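
A minimal chunking sketch; the `chunked` helper and the batch size of 500 are illustrative:

```python
from typing import Iterable, Iterator

def chunked(items: Iterable[str], size: int = 500) -> Iterator[list[str]]:
    """Yield fixed-size chunks so only one batch is held in memory at a time."""
    batch: list[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def embed_all(items: Iterable[str]) -> Iterator[list[float]]:
    for batch in chunked(items):
        for text in batch:
            yield [float(len(text))]  # placeholder embedding; nothing accumulates
```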

Mistake 5: Testing only the happy path

Signs: Pipeline works perfectly on sample data. Crashes on empty input, single item, or malformed data.
Fix: Test edge cases explicitly: empty input, single item, duplicate items, maximum size, malformed data, API failures.
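
A sketch of edge-case coverage with pytest (assuming pytest is in the test dependencies), using a toy `preprocess` stage:

```python
import pytest

def preprocess(texts: list[str]) -> list[str]:
    return [t.strip().lower() for t in texts]

@pytest.mark.parametrize("texts, expected", [
    ([], []),                              # empty input
    (["Solo"], ["solo"]),                  # single item
    (["dup", "dup"], ["dup", "dup"]),      # duplicate items
    (["  Mixed Case  "], ["mixed case"]),  # messy input
])
def test_preprocess_edge_cases(texts, expected):
    assert preprocess(texts) == expected
```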

Evaluation Checklists

Your pipeline is working if:

  • Users see which stage is running and progress within that stage
  • Errors identify the failing stage and provide actionable messages
  • Each stage can be tested independently with mock inputs
  • Re-running with different parameters doesn't require re-uploading data
  • Pipeline handles 1 item and 10,000 items without code changes

Your pipeline needs work if:

  • "Processing..." is the only feedback for minutes at a time
  • You debug by adding print statements throughout the code
  • Changing the embedding model requires modifying clustering code
  • Partial failures lose all progress
  • Memory usage grows unbounded with input size

Quick Reference

+-------------------+     +-------------------+     +-------------------+
|   Preprocessing   | --> |    Embeddings     | --> |    Clustering     |
| - Validate input  |     | - Batch requests  |     | - Reduce dims     |
| - Normalize text  |     | - Parallel fetch  |     | - Find clusters   |
| - Emit progress   |     | - Emit progress   |     | - Handle outliers |
+-------------------+     +-------------------+     +-------------------+
         |                         |                         |
         v                         v                         v
   Original data            High-dim vectors          Cluster assignments
   preserved                preserved                 + original data

Stage Concern   Approach
Progress        Emit structured events with stage name, percentage, elapsed time
Errors          Fail fast with stage name, specific message, recovery hint
Boundaries      Function per stage, explicit input/output types
Recovery        Cache expensive stages, allow resuming from checkpoint
Memory          Process in chunks, release intermediate data, use generators