How We Configure ML Algorithms

Principles for exposing algorithm parameters to users without overwhelming them

ML algorithms have dozens of parameters. HDBSCAN alone has min_cluster_size, min_samples, cluster_selection_epsilon, cluster_selection_method, and more. Exposing all of these to users creates confusion. Hiding all of them removes control. This document establishes how we create configuration layers that give users meaningful control without requiring them to understand algorithm internals.

Core Question: "Can a non-ML-expert get good results, while an expert can still fine-tune?"

Principles

Principle 1: Create Two Configuration Layers

The mistake: Exposing raw algorithm parameters (min_cluster_size=15, n_neighbors=30) directly to users, or hiding all configuration entirely.

The principle: Build two layers: a user-facing layer with intuitive concepts ("How many clusters do you want?" on a 1-10 scale), and an internal layer that translates these to algorithm parameters. Users interact with the first; code uses the second.

Why it matters: Users think in terms of outcomes ("fewer, larger clusters" vs. "many specific clusters"), not algorithm mechanics. Translation functions bridge this gap while keeping algorithm tuning centralized and testable.
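
A minimal sketch of the two layers; the field names, scales, and formulas here are illustrative, not our tuned values:

```python
from dataclasses import dataclass

@dataclass
class UserConfig:
    """User-facing layer: outcome-oriented controls on simple scales."""
    granularity: int = 5  # 1 = fewer, larger clusters; 10 = many specific clusters
    coherence: int = 5    # 1 = looser grouping; 10 = stricter grouping

def to_algorithm_params(cfg: UserConfig) -> dict:
    """Internal layer: translate user intent into algorithm parameters."""
    return {
        # More granularity -> fewer neighbors -> more local structure -> more clusters
        "n_neighbors": round(50 - 40 * (cfg.granularity - 1) / 9),  # 1-10 -> 50-10
        # More coherence -> more conservative density estimate -> tighter clusters
        "min_samples": round(2 + 8 * (cfg.coherence - 1) / 9),      # 1-10 -> 2-10
    }
```

Users only ever see UserConfig; everything downstream consumes the translated dict.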

Principle 2: Provide Meaningful Presets

The mistake: Offering only "Default" or forcing users to configure everything manually before they can start.

The principle: Define 3-5 presets that represent common use cases with descriptive names. "Quick Overview" for broad clustering, "Deep Analysis" for granular results, "High Precision" for strict matching. Each preset is a complete, tested configuration.

Why it matters: Most users don't know what parameters they want—they know what outcome they want. Presets give them a starting point that works, with the option to customize later.
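
In code, this can be as simple as a dictionary of complete configurations plus an override hook (the numbers here are placeholders, not validated values):

```python
# Each preset is a complete user-layer configuration, tested end to end.
PRESETS = {
    "Quick Overview": {"granularity": 3, "coherence": 4},  # broad clustering
    "Deep Analysis":  {"granularity": 8, "coherence": 6},  # granular results
    "High Precision": {"granularity": 6, "coherence": 9},  # strict matching
}

def resolve_config(preset_name: str, **overrides) -> dict:
    """Start from a preset, then layer any user customizations on top."""
    config = dict(PRESETS[preset_name])
    config.update(overrides)
    return config
```

resolve_config("Quick Overview", coherence=7) keeps the tested baseline while changing exactly one thing.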

Principle 3: Make Parameters Visibly Different

The mistake: Parameters that produce nearly identical results. Moving a slider from 5 to 6 changes nothing observable.

The principle: Tune your translation functions so that each step on a user-facing scale produces noticeably different results. If two settings look the same, they shouldn't be separate options.

Why it matters: Users lose trust when controls don't seem to do anything. Every parameter point should justify its existence with observable impact.
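
One cheap guardrail, reusing the UserConfig and to_algorithm_params sketch from Principle 1: assert that adjacent slider positions never translate to identical parameters. This is a necessary condition, not a sufficient one; whether results look different still has to be checked on real data.

```python
def test_adjacent_steps_differ():
    # A dead step (two positions, identical parameters) means the scale
    # should be coarsened or the translation retuned.
    for g in range(1, 10):
        low = to_algorithm_params(UserConfig(granularity=g))
        high = to_algorithm_params(UserConfig(granularity=g + 1))
        assert low != high, f"granularity {g} and {g + 1} are indistinguishable"
```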

Principle 4: Document the Translation

The mistake: Magic numbers scattered throughout the code. Nobody knows why min_cluster_size is calculated as max(5, n // 200).

The principle: Centralize parameter translation in one function with clear comments explaining each formula. Document the relationship between user values and algorithm parameters. Include the reasoning, not just the math.

Why it matters: Translation logic is where domain knowledge lives. When you need to tune behavior or debug unexpected results, this documentation is essential.
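
For example, the max(5, n // 200) formula above stops being a magic number once the translation function spells out why each constant exists (the reasoning shown here is illustrative):

```python
def derive_min_cluster_size(n_points: int) -> int:
    """Translate corpus size into a minimum cluster size.

    The reasoning, not just the math:
    - Floor of 5: below ~5 points, a "cluster" is usually co-occurring
      noise, so we never accept anything smaller.
    - n // 200: clusters should hold at least ~0.5% of the data, keeping
      the cluster count readable as the corpus grows.
    """
    return max(5, n_points // 200)
```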

Principle 5: Validate Early, Explain Clearly

The mistake: Accepting any configuration, then failing deep in the algorithm with cryptic errors. Or silently clamping values without telling the user.

The principle: Validate configurations at the API boundary. When values are adjusted (clamped, rounded), tell the user. When combinations are invalid, explain why and suggest alternatives.

Why it matters: Users should understand what configuration is actually running. Silent adjustments cause confusion when results don't match expectations.
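
A sketch of boundary validation with loud clamping and an explained rejection; the method/granularity constraint shown is hypothetical:

```python
import warnings

def validate_config(config: dict) -> dict:
    """Validate at the API boundary; adjust loudly, reject with guidance."""
    validated = dict(config)
    granularity = validated.get("granularity", 5)

    if not 1 <= granularity <= 10:
        granularity = min(max(granularity, 1), 10)
        warnings.warn(
            f"granularity={config['granularity']} is out of range; clamped to "
            f"{granularity}. Valid values run from 1 (fewer clusters) to 10 (more)."
        )
    validated["granularity"] = granularity

    # Hypothetical invalid combination: explain why, suggest an alternative.
    if validated.get("method") == "fine" and granularity < 3:
        raise ValueError(
            "method='fine' needs granularity >= 3 to resolve leaf clusters; "
            "raise granularity or switch to method='balanced'."
        )
    return validated
```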

Decision Framework

When should I expose a parameter vs. hide it?

Expose when:

  • Parameter directly affects user-visible outcomes
  • Users have legitimate reasons to want different values
  • The parameter can be explained in non-technical terms
  • Changing it produces meaningfully different results

Hide when:

  • Parameter is an implementation detail (thread counts, buffer sizes)
  • Optimal value can be calculated from other parameters
  • Wrong values cause failures rather than different results
  • Parameter requires deep algorithm knowledge to set correctly

When should I use a slider vs. discrete options?

Use sliders when:

  • Parameter is continuous or has many valid values
  • Users benefit from fine-grained control
  • The scale has intuitive meaning (less/more, fewer/many)

Use discrete options when:

  • Only 2-4 meaningful choices exist
  • Options represent distinct modes, not degrees
  • Each option needs explanation (radio buttons with descriptions)

When should I create a new preset vs. extend configuration?

Create a preset when:

  • A common use case requires multiple parameter changes
  • The combination represents a coherent strategy
  • Users would otherwise need to discover the combination themselves

Extend configuration when:

  • Users need fine-grained control for edge cases
  • The parameter is independent of existing presets
  • Power users have requested the control

Common Mistakes

Mistake 1: Translation logic scattered across codebase

Signs: Three different files calculate cluster sizes. Changes require updating multiple locations.

Fix: Single translation function that takes user config and returns algorithm params. All algorithm calls use this translated config.

Mistake 2: Parameters that interact unpredictably

Signs: Setting granularity=9 and coherence=9 produces worse results than either alone. Users can't understand the relationship.

Fix: Test parameter combinations systematically. Document interactions. Consider constraints that prevent problematic combinations.
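
A sketch of such a sweep; run_pipeline and score stand in for your clustering entry point and quality metric (hypothetical names):

```python
from itertools import product

def sweep_interactions(run_pipeline, score, sample):
    """Score every granularity/coherence pair on a held-out sample so
    interaction effects surface before users hit them."""
    grid = {}
    for granularity, coherence in product(range(1, 11), repeat=2):
        result = run_pipeline(sample, granularity=granularity, coherence=coherence)
        grid[(granularity, coherence)] = score(result)
    # Flag cells that score worse than both of their single-axis neighbors.
    return grid
```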

Mistake 3: Presets that don't work for common cases

Signs: Users always have to customize after selecting a preset. "Recommended" preset produces poor results on typical data.

Fix: Test presets on real-world data samples. Gather feedback on which presets are selected and how often they're modified.

Mistake 4: Scale doesn't match mental model

Signs: Users expect "10" to mean "maximum clusters" but it means "highest quality." Confusion about what direction to move sliders.

Fix: Label scales with outcome descriptions, not just numbers. "Fewer clusters" to "More clusters" is clearer than "1" to "10."

Mistake 5: No way to see what's actually running

Signs: Users configure parameters but have no visibility into the translated values. Can't reproduce results or understand behavior.

Fix: Provide a "Show advanced" option that reveals the actual algorithm parameters being used. Log configurations for debugging.
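
A sketch of the logging half; translate and cluster stand in for the translation function and the clustering entry point (hypothetical names):

```python
import json
import logging

logger = logging.getLogger("clustering")

def run_with_visibility(user_config: dict, translate, cluster):
    """Log both configuration layers so any run can be reproduced."""
    algorithm_params = translate(user_config)
    logger.info("user_config=%s", json.dumps(user_config, sort_keys=True))
    logger.info("algorithm_params=%s", json.dumps(algorithm_params, sort_keys=True))
    return cluster(**algorithm_params)
```

The same translated dict can back the "Show advanced" view.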

Evaluation Checklists

Your configuration system is working if:

  • New users get good results with presets, no customization
  • Each slider position produces visibly different outcomes
  • Power users can access algorithm-level parameters when needed
  • Configuration can be saved, shared, and reproduced
  • Invalid configurations are caught with helpful messages

Your configuration system needs work if:

  • Users ask "what does this slider do?" frequently
  • Presets are rarely used or immediately customized
  • Some parameter combinations produce errors deep in processing
  • You can't explain why the translation formulas are what they are
  • Results vary unexpectedly with the same configuration

Quick Reference

+---------------------------+      +---------------------------+
|    User Configuration     |      |   Algorithm Parameters    |
+---------------------------+      +---------------------------+
| granularity: 1-10         | ---> | UMAP n_neighbors: 10-50   |
| coherence: 1-10           | ---> | HDBSCAN min_samples: 2-10 |
| min_keywords: 3-30        | ---> | min_cluster_size: 5-30    |
| method: "balanced"/"fine" | ---> | selection: "eom"/"leaf"   |
+---------------------------+      +---------------------------+
            |                                   |
            v                                   v
    User understands              Algorithm executes correctly

Layer             | Purpose                              | Audience
------------------|--------------------------------------|------------------------
Presets           | Quick start, common use cases        | All users
User Config       | Intuitive controls, outcome-focused  | Most users
Algorithm Params  | Direct control, full flexibility     | Power users, debugging
Internal Defaults | Sensible fallbacks, safety bounds    | System