How We Manage Clustering Quality

Principles for producing coherent, useful clusters from noisy data

Clustering algorithms don't produce perfect results. They produce mathematical groupings that may or may not align with human expectations. Real-world data has outliers that don't fit anywhere, clusters that are too large to be useful, and tiny clusters that should merge with neighbors. This document establishes how we post-process clustering results to produce genuinely useful output.

Core Question: "Does every item end up in a cluster that makes semantic sense?"

Principles

Principle 1: Outliers Are a Feature, Not a Bug

The mistake: Treating outliers as failures to be eliminated. Force-assigning every point to a cluster regardless of fit.

The principle: Distinguish between "true outliers" (genuinely unique items) and "orphans" (items that belong somewhere but weren't assigned). Reassign orphans to their nearest semantically valid cluster using cosine similarity. Keep true outliers separate or in an explicit "Other" category.

Why it matters: Force-assigning a unique item to an unrelated cluster degrades cluster coherence and confuses users. But leaving recoverable orphans unassigned wastes data.
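The orphan-reassignment step can be sketched in plain numpy. This is a minimal illustration, not the production implementation: the function name and the 0.7 default are assumptions chosen to match the thresholds used elsewhere in this document.

```python
import numpy as np

def reassign_orphans(orphans, centroids, threshold=0.7):
    """Assign each orphan to its most similar cluster centroid, but only
    when cosine similarity clears the threshold; otherwise return -1
    (an explicit "Other" bucket). Illustrative sketch, not production code."""
    # Normalize rows so plain dot products become cosine similarities.
    o = orphans / np.linalg.norm(orphans, axis=1, keepdims=True)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sims = o @ c.T                        # shape: (n_orphans, n_clusters)
    best = sims.argmax(axis=1)
    return np.where(sims.max(axis=1) >= threshold, best, -1)
```

An orphan pointing almost exactly at a centroid gets reassigned; one pointing away from every centroid stays a true outlier (-1).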

Principle 2: Control Cluster Size Distribution

The mistake: Accepting whatever size distribution the algorithm produces. One cluster with 500 items, others with 3 items each.

The principle: Define acceptable size bounds (minimum and maximum). Subdivide oversized clusters iteratively. Merge undersized clusters into nearest neighbors. Make these bounds configurable based on use case.

Why it matters: A 500-item cluster isn't useful for analysis—it's too broad. A 2-item cluster often isn't meaningful—it's noise. Controlled sizes produce actionable clusters.
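The merge half of size correction might look like the following single-pass sketch (function name and the Euclidean centroid distance are assumptions; cosine distance works equally well on normalized embeddings):

```python
import numpy as np

def merge_small_clusters(labels, embeddings, min_size=3):
    """Fold every cluster below min_size into the nearest surviving
    cluster, measured by distance between centroids. One pass only;
    Principle 4 explains why this must be iterated."""
    labels = labels.copy()
    ids, counts = np.unique(labels, return_counts=True)
    centroids = {i: embeddings[labels == i].mean(axis=0) for i in ids}
    small = set(ids[counts < min_size])
    large = [i for i in ids if i not in small]
    for s in small:
        if not large:                      # nothing big enough to merge into
            break
        dists = [np.linalg.norm(centroids[s] - centroids[g]) for g in large]
        labels[labels == s] = large[int(np.argmin(dists))]
    return labels
```

Subdivision is the mirror operation: run a 2-way split (e.g., bisecting k-means) on any cluster above the maximum and recurse.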

Principle 3: Validate Assignments with Similarity

The mistake: Reassigning outliers purely by distance without checking semantic fit. Putting "apple" in a cluster about "automotive parts" because it was mathematically closest.

The principle: When reassigning points, verify similarity exceeds a threshold. Use k-NN voting among neighbors, not just nearest-neighbor assignment. Only reassign if the point genuinely fits.

Why it matters: Nearest in vector space doesn't always mean semantically related, especially after dimensionality reduction. Similarity thresholds prevent nonsensical assignments.
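A k-NN voting check, as described above, can be sketched like this (function name, k, and the vote quorum are illustrative defaults):

```python
import numpy as np
from collections import Counter

def knn_vote(point, embeddings, labels, k=3, min_votes=2):
    """Reassign an outlier only when at least min_votes of its k nearest
    already-assigned neighbors agree on one cluster; otherwise return -1
    and keep the point separate."""
    dists = np.linalg.norm(embeddings - point, axis=1)
    nearest_labels = labels[np.argsort(dists)[:k]]
    cluster, votes = Counter(nearest_labels.tolist()).most_common(1)[0]
    return cluster if votes >= min_votes else -1
```

Combining this with the similarity threshold from Principle 1 gives both checks the document recommends: consensus among neighbors and adequate semantic fit.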

Principle 4: Iterate Until Stable

The mistake: Running post-processing once and accepting the result. After merging small clusters, some new clusters are now too large. After subdividing, some are now too small.

The principle: Run size correction iteratively until sizes stabilize or a maximum iteration count is reached. Log each iteration's changes for debugging.

Why it matters: Post-processing operations interact. Subdividing creates small clusters that need merging. Merging creates large clusters that need subdivision. One pass isn't enough.
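The stabilization loop is a small fixed-point iteration. In this sketch the two one-pass operations are injected as parameters (their exact signatures are an assumption; any callables taking labels and embeddings and returning new labels fit):

```python
import numpy as np

def correct_sizes(labels, embeddings, subdivide, merge, max_iter=10):
    """Alternate one-pass subdivision and merging until the assignment
    stops changing or the iteration cap is hit. `subdivide` and `merge`
    are the single-pass operations sketched under Principles 2-3."""
    for _ in range(max_iter):
        before = labels.copy()
        labels = merge(subdivide(labels, embeddings), embeddings)
        if np.array_equal(labels, before):  # fixed point: sizes are stable
            break
    return labels
```

Logging the diff between `before` and `labels` on each pass gives the per-iteration debugging trail the principle calls for.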

Principle 5: Preserve Cluster Semantics Through Labels

The mistake: Labeling clusters with generic names ("Cluster 1") or labels that don't reflect content.

The principle: Generate descriptive labels from cluster content. Sample representative items and use an LLM to create concise, meaningful labels. Validate labels against cluster contents.

Why it matters: Labels are how users understand clusters. Good labels make clusters actionable. Bad labels make the entire clustering exercise feel useless.
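The sampling step that precedes labeling can be sketched as picking items closest to the centroid (the LLM call itself is out of scope here; the function name is illustrative):

```python
import numpy as np

def representative_items(items, embeddings, n=5):
    """Return the n items nearest the cluster centroid; these are the
    samples a labeling prompt would be built from. Centroid-proximity
    sampling is one reasonable choice among several."""
    centroid = embeddings.mean(axis=0)
    order = np.argsort(np.linalg.norm(embeddings - centroid, axis=1))
    return [items[i] for i in order[:n]]
```

Centroid-nearest items are the most "typical" members, which keeps generated labels anchored to the cluster's core rather than its fringe.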

Decision Framework

When should I reassign outliers vs. keep them separate?

Reassign when:

  • Point has high similarity to a cluster centroid (>0.7 cosine similarity)
  • Multiple nearest neighbors belong to the same cluster (k-NN consensus)
  • Outlier appears to be borderline, not truly unique
  • User prefers complete assignment over precision

Keep separate when:

  • Similarity to all clusters is low (below 0.5 threshold)
  • Nearest neighbors are split across multiple clusters (no consensus)
  • Point is genuinely unique or ambiguous
  • User prefers precision over complete assignment

When should I subdivide a large cluster?

Subdivide when:

  • Cluster size exceeds maximum threshold (e.g., 50+ items)
  • Cluster contains semantically distinct sub-groups
  • Hierarchical structure would benefit user understanding
  • Original granularity setting requested fine-grained clustering

Keep unified when:

  • Items are genuinely similar (high intra-cluster similarity)
  • Subdivision produces arbitrary-seeming sub-clusters
  • User explicitly requested broad, high-level clustering
  • Cluster represents a legitimate large category

When should I merge small clusters?

Merge when:

  • Cluster size is below minimum threshold (e.g., 3 items)
  • High similarity exists between clusters (centroids close)
  • Merged cluster would still be coherent
  • Small cluster appears to be noise or edge cases

Keep separate when:

  • Small cluster represents distinct, meaningful category
  • Merging would combine unrelated items
  • User explicitly requested granular clustering
  • Precision matters more than coverage

Common Mistakes

Mistake 1: Fixed thresholds for all datasets

Signs: Works great on 100 items, produces terrible results on 10,000. Parameters that worked for English fail on Vietnamese. Fix: Scale thresholds with dataset size. Test on representative data samples. Make key thresholds configurable.
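One way to scale bounds with dataset size is to derive them from an expected cluster count. Everything in this sketch is an assumption to tune per use case: the sqrt-based cluster-count guess and the 0.2x / 3x multipliers are illustrative, not prescribed by this document.

```python
import math

def size_bounds(n_items, target_clusters=None):
    """Hypothetical heuristic: derive (min_size, max_size) from dataset
    size instead of hard-coding them. sqrt(n/2) is a common rule-of-thumb
    cluster-count estimate; the multipliers are arbitrary starting points."""
    k = target_clusters or max(2, round(math.sqrt(n_items / 2)))
    avg = n_items / k
    return max(2, round(0.2 * avg)), round(3 * avg)
```

The point is the shape, not the constants: bounds computed from n grow with the dataset, so the same config behaves sanely at 100 items and at 10,000.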

Mistake 2: Subdivision that ignores semantics

Signs: Large cluster is split randomly. Sub-clusters don't have meaningful distinctions. Fix: Use hierarchical clustering or semantic-aware subdivision. Split along natural semantic boundaries, not arbitrary halves.

Mistake 3: Aggressive outlier reassignment

Signs: Every point is assigned somewhere. Clusters contain items that clearly don't belong. Fix: Add similarity threshold for reassignment. Accept that some points are genuine outliers. Surface "Other" category explicitly.

Mistake 4: No iteration limit

Signs: Post-processing runs indefinitely on edge cases. System hangs on certain inputs. Fix: Set maximum iterations (10 is usually sufficient). Log when limit is reached. Accept "good enough" results.

Mistake 5: Labels generated before cleanup

Signs: Labels describe content that was later moved. Merged clusters have outdated labels. Fix: Generate labels as the final step, after all post-processing is complete. Labels should reflect final cluster membership.

Evaluation Checklists

Your clustering quality is working if:

  • Clusters are within acceptable size bounds (e.g., 5-50 items)
  • Random samples from each cluster are semantically coherent
  • Outliers are either meaningfully reassigned or explicitly separated
  • Labels accurately describe cluster contents
  • Re-running with same config produces consistent results

Your clustering quality needs work if:

  • Some clusters are vastly larger than others
  • Clusters contain obviously unrelated items
  • Many items end up as unassigned outliers
  • Labels don't match what's in the cluster
  • Results feel random or unstable

Quick Reference

Raw Clustering Output
         |
         v
+-------------------+
| Outlier Handling  |
| - k-NN voting     |
| - Similarity check|
| - Reassign or keep|
+-------------------+
         |
         v
+-------------------+
| Size Correction   |   <----- Iterate until stable
| - Subdivide large |          (max 10 iterations)
| - Merge small     |
+-------------------+
         |
         v
+-------------------+
| Final Cleanup     |
| - Renumber IDs    |
| - Sort by size    |
+-------------------+
         |
         v
+-------------------+
| Label Generation  |
| - Sample items    |
| - LLM labeling    |
| - Validate fit    |
+-------------------+
Issue                  | Detection                  | Resolution
-----------------------+----------------------------+--------------------------------------------
Oversized clusters     | Size > max_threshold       | Recursive bisection, hierarchical split
Undersized clusters    | Size < min_threshold       | Merge to nearest by centroid distance
Outliers (recoverable) | High similarity to cluster | k-NN voting + threshold reassignment
Outliers (genuine)     | Low similarity to all      | Keep in "Other" category
Poor labels            | Manual inspection          | Regenerate with more samples, better prompt