How We Integrate LLM APIs
Principles for reliable, efficient, and cost-effective LLM API integration
LLM APIs are fundamentally different from traditional APIs. They are slow, expensive, rate-limited, and occasionally unreliable. A naive integration—one request per item, no error handling, blocking calls—will result in poor user experience, inflated costs, and cascading failures. This document establishes how we build LLM integrations that are fast, resilient, and economical.
Core Question: "If this API call fails or takes 30 seconds, what happens to the user?"
Principles
Principle 1: Batch and Parallelize by Default
The mistake: Making sequential API calls, one item at a time, waiting for each to complete before starting the next.
The principle: Group related items into batches and execute batches in parallel using thread pools or async workers. Most LLM APIs accept batch inputs—use them.
Why it matters: A 100-item job at 500ms per call takes 50 seconds sequentially. Grouped into batches of 20 and spread across 5 parallel workers, each worker makes a single batch call instead of 20 sequential ones, and the job completes in roughly 5 seconds. Users don't wait, and you stay within rate limits.
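A minimal sketch of this pattern in Python, assuming the OpenAI Python SDK (v1+) and its list-accepting embeddings endpoint; the same shape works with any provider that supports batch inputs. The batch size and worker count are illustrative and should be tuned against your rate limits.

```python
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
BATCH_SIZE = 20
MAX_WORKERS = 5

def embed_batch(batch: list[str]) -> list[list[float]]:
    # One API call per batch: the embeddings endpoint accepts a list of inputs.
    response = client.embeddings.create(model="text-embedding-3-small", input=batch)
    return [item.embedding for item in response.data]

def embed_all(items: list[str]) -> list[list[float]]:
    # Split the work into fixed-size batches, then run the batches in parallel.
    batches = [items[i:i + BATCH_SIZE] for i in range(0, len(items), BATCH_SIZE)]
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        results = list(executor.map(embed_batch, batches))  # preserves input order
    return [embedding for batch in results for embedding in batch]
```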
Principle 2: Always Provide Fallbacks
The mistake: Treating LLM output as guaranteed. When the API fails or returns malformed data, the entire operation crashes.
The principle: Define a sensible fallback for every LLM call. If label generation fails, use "Cluster N". If embedding fails for one item, retry or exclude it—don't abort the batch.
Why it matters: LLM APIs have outages, rate limits, and content filters. Fallbacks ensure your application degrades gracefully rather than failing completely.
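One way this can look in practice, using the cluster-labeling case above. `generate_label` is a hypothetical wrapper around your LLM call; the length check and fallback format are illustrative.

```python
import logging

logger = logging.getLogger(__name__)

def label_cluster(cluster_id: int, sample_texts: list[str]) -> str:
    """Return an LLM-generated label, or a safe fallback if anything goes wrong."""
    try:
        label = generate_label(sample_texts)   # hypothetical LLM call
        if not label or len(label) > 60:       # defensive: reject empty or runaway output
            raise ValueError("unusable label")
        return label.strip()
    except Exception:
        # API outage, rate limit, content filter, malformed output: all of these
        # degrade to a generic but usable label instead of aborting the batch.
        logger.warning("label generation failed for cluster %d", cluster_id, exc_info=True)
        return f"Cluster {cluster_id}"
```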
Principle 3: Separate API Key Concerns
The mistake: Hardcoding API keys, or requiring users to configure keys before any functionality works.
The principle: Support multiple key sources with clear precedence: user-provided key > environment variable > error with helpful message. Validate keys early and surface clear errors.
Why it matters: Different deployment contexts need different key strategies. Development uses env vars, multi-tenant apps use user keys, and managed deployments use secrets. Support all without code changes.
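A sketch of that precedence chain, assuming an OpenAI-style key and environment variable; swap in your provider's names and key format.

```python
import os

def resolve_api_key(user_key: str | None = None) -> str:
    """Resolve an API key by precedence: explicit user key, then environment variable."""
    key = user_key or os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "No API key found. Pass one explicitly or set the "
            "OPENAI_API_KEY environment variable."
        )
    if not key.startswith("sk-"):  # cheap sanity check before any network call
        raise RuntimeError("API key looks malformed (expected it to start with 'sk-').")
    return key
```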
Principle 4: Design Prompts for Consistency
The mistake: Writing prompts that produce variable-length, unpredictable outputs that are difficult to parse or display.
The principle: Constrain LLM outputs explicitly. Specify format, length limits, and structure. Use examples in prompts when format matters. Parse defensively.
Why it matters: LLMs are creative by default. Without constraints, you get inconsistent outputs that break UI layouts, exceed storage limits, or require complex post-processing.
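A sketch of a constrained prompt plus a defensive parser; the four-word limit, example label, and fallback behavior are illustrative.

```python
LABEL_PROMPT = """You are labeling a cluster of short texts.
Return ONLY a label of at most 4 words, in Title Case, with no punctuation
and no explanation.

Example output: Billing And Refund Questions

Texts:
{texts}
"""

def parse_label(raw: str, fallback: str) -> str:
    # The model may still add quotes, extra lines, or trailing periods: strip them.
    lines = raw.strip().strip('"').splitlines()
    if not lines:
        return fallback
    words = lines[0].rstrip(".").split()
    if not words or len(words) > 6:   # tolerate slight overrun, reject runaway output
        return fallback
    return " ".join(words)
```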
Principle 5: Track Costs Explicitly
The mistake: Ignoring token usage until the bill arrives. No visibility into which operations consume the most resources.
The principle: Log token counts for every API call. Calculate estimated costs. Set alerts for unusual usage. Make cost a first-class metric alongside latency and errors.
Why it matters: LLM costs scale with usage. A bug that retries infinitely or a feature that over-fetches can cost thousands before anyone notices.
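A minimal sketch, assuming the OpenAI SDK's `usage` field on chat responses; the per-token prices are placeholders, so check your model's current rates.

```python
import logging
import time

logger = logging.getLogger("llm.usage")

# Illustrative prices in USD per 1M tokens; look up your model's real rates.
PRICE_PER_M_INPUT = 0.50
PRICE_PER_M_OUTPUT = 1.50

def chat_with_cost_logging(client, model: str, messages: list[dict]) -> str:
    # `client` is an OpenAI client instance; other SDKs expose similar usage metadata.
    start = time.monotonic()
    response = client.chat.completions.create(model=model, messages=messages)
    latency = time.monotonic() - start
    usage = response.usage
    cost = (usage.prompt_tokens * PRICE_PER_M_INPUT
            + usage.completion_tokens * PRICE_PER_M_OUTPUT) / 1_000_000
    # Log metadata only: model, tokens, latency, estimated cost. Never the content.
    logger.info(
        "model=%s prompt_tokens=%d completion_tokens=%d latency=%.2fs est_cost=$%.6f",
        model, usage.prompt_tokens, usage.completion_tokens, latency, cost,
    )
    return response.choices[0].message.content
```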
Decision Framework
When should I use batching vs. individual calls?
Use batching when:
- Processing more than 5 items
- Items are independent (no sequential dependency)
- The API supports batch inputs
- Latency matters to the user
Use individual calls when:
- Processing 1-2 items
- Each item depends on the previous result
- You need fine-grained error handling per item
- Debugging a specific failure
When should I use synchronous vs. streaming responses?
Use synchronous when:
- Response is short (< 100 tokens)
- You need the complete response before proceeding
- You're batching multiple calls
Use streaming when:
- Response is long (> 500 tokens)
- User is waiting and needs feedback
- You can progressively render results (see the streaming sketch after this list)
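A streaming sketch using the OpenAI Python SDK (v1+); most providers expose a similar chunk iterator. The `print` call stands in for whatever progressive rendering your UI does.

```python
from openai import OpenAI

client = OpenAI()

def stream_answer(prompt: str) -> str:
    # stream=True returns an iterator of chunks instead of one blocking response,
    # so the user sees tokens as they arrive rather than a long spinner.
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    parts: list[str] = []
    for chunk in stream:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)  # stand-in for progressive UI rendering
            parts.append(delta)
    return "".join(parts)
```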
When should I retry vs. fail fast?
Retry when:
- Error is transient (rate limit, timeout, 5xx); see the retry sketch after these lists
- Operation is idempotent
- You have retry budget remaining
Fail fast when:
- Error is permanent (invalid key, malformed request, 4xx)
- User is waiting interactively
- Retry budget exhausted
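A sketch of the retry/fail-fast decision with exponential backoff and jitter. `ApiError` is an illustrative stand-in for your SDK's error type (for the OpenAI SDK, something like `APIStatusError`); the status-code set and attempt budget are assumptions to adjust.

```python
import random
import time

class ApiError(Exception):
    """Illustrative stand-in for your SDK's HTTP error type."""
    def __init__(self, status_code: int):
        super().__init__(f"API error {status_code}")
        self.status_code = status_code

RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}  # transient: timeouts, rate limits, 5xx
MAX_ATTEMPTS = 4

def call_with_retry(make_request):
    """Retry transient failures with exponential backoff and jitter; fail fast otherwise."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return make_request()
        except ApiError as err:
            if err.status_code not in RETRYABLE_STATUS or attempt == MAX_ATTEMPTS:
                raise  # permanent error (4xx) or retry budget exhausted: fail fast
            # 1s, 2s, 4s... plus jitter so concurrent clients don't retry in lockstep.
            time.sleep(2 ** (attempt - 1) + random.random())
```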
Common Mistakes
Mistake 1: No timeout configuration
Signs: Requests hang indefinitely during API outages. Users see infinite loading spinners. Fix: Set explicit timeouts (30-60s for most LLM calls). Implement circuit breakers for repeated failures.
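A sketch of both fixes: an explicit client timeout plus a very small circuit breaker. The thresholds are illustrative, and the breaker simply blocks calls for a cool-off period after repeated failures.

```python
import time

from openai import OpenAI

# Explicit per-request timeout (seconds); most SDKs accept an equivalent setting.
client = OpenAI(timeout=45.0)

class CircuitBreaker:
    """Stop calling the API for a cool-off period after repeated failures."""

    def __init__(self, failure_threshold: int = 5, cooloff_seconds: float = 60.0):
        self.failure_threshold = failure_threshold
        self.cooloff_seconds = cooloff_seconds
        self.failures = 0
        self.opened_at: float | None = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at > self.cooloff_seconds:
            # Cool-off elapsed: close the breaker and let traffic probe the API again.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```

Callers check `breaker.allow()` before each request and return the fallback immediately when it is open, then report the outcome with `breaker.record(...)`.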
Mistake 2: Logging full prompts and responses
Signs: Log files grow to gigabytes. Sensitive user data appears in logs. Storage costs spike. Fix: Log metadata (tokens, latency, model, status), not content. Sample full prompts and responses for debugging rather than storing every call.
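A sketch of metadata-only logging with a small sampled escape hatch for debugging; the sample rate and field names are illustrative.

```python
import logging
import random

logger = logging.getLogger("llm")
FULL_LOG_SAMPLE_RATE = 0.01   # keep full prompts/responses for roughly 1% of calls

def log_call(model: str, status: str, prompt_tokens: int, completion_tokens: int,
             latency_s: float, prompt: str, response: str) -> None:
    # Always log lightweight metadata...
    logger.info(
        "model=%s status=%s prompt_tokens=%d completion_tokens=%d latency=%.2fs",
        model, status, prompt_tokens, completion_tokens, latency_s,
    )
    # ...and only a small random sample of full content, at debug level.
    if random.random() < FULL_LOG_SAMPLE_RATE:
        logger.debug("sampled call: prompt=%r response=%r", prompt, response)
```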
Mistake 3: Ignoring rate limits until they hit
Signs: Sudden 429 errors during peak usage. Operations fail mid-batch. Fix: Implement client-side rate limiting. Use exponential backoff. Queue requests during high load.
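A sketch of a simple client-side limiter that spaces requests out instead of discovering 429s the hard way; real deployments often track tokens per minute as well as request counts.

```python
import threading
import time

class RateLimiter:
    """Serialize request pacing so calls never exceed a requests-per-minute budget."""

    def __init__(self, max_requests_per_minute: int):
        self.min_interval = 60.0 / max_requests_per_minute
        self.lock = threading.Lock()
        self.next_allowed = 0.0

    def acquire(self) -> None:
        # Reserve the next slot under the lock, then sleep outside it if needed.
        with self.lock:
            now = time.monotonic()
            wait = max(0.0, self.next_allowed - now)
            self.next_allowed = max(now, self.next_allowed) + self.min_interval
        if wait:
            time.sleep(wait)

# Usage: limiter = RateLimiter(max_requests_per_minute=60); call limiter.acquire()
# before each API request, including inside parallel workers.
```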
Mistake 4: Tight coupling to one provider
Signs: Switching from OpenAI to Anthropic requires rewriting half the codebase. Fix: Abstract the LLM interface. Isolate provider-specific code. Use adapter patterns for different APIs.
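A sketch of the adapter approach, assuming the OpenAI and Anthropic Python SDKs; the rest of the codebase depends only on `LLMClient`, so switching providers means constructing a different adapter, not touching call sites.

```python
from abc import ABC, abstractmethod

class LLMClient(ABC):
    """Provider-agnostic interface the rest of the codebase depends on."""

    @abstractmethod
    def complete(self, prompt: str, max_tokens: int = 256) -> str: ...

class OpenAIAdapter(LLMClient):
    def __init__(self, client, model: str):
        self.client, self.model = client, model

    def complete(self, prompt: str, max_tokens: int = 256) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
        )
        return response.choices[0].message.content

class AnthropicAdapter(LLMClient):
    def __init__(self, client, model: str):
        self.client, self.model = client, model

    def complete(self, prompt: str, max_tokens: int = 256) -> str:
        response = self.client.messages.create(
            model=self.model,
            max_tokens=max_tokens,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text
```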
Mistake 5: Treating all models the same
Signs: Using GPT-4 for simple classification. Using GPT-3.5 for complex reasoning. Costs too high or quality too low. Fix: Match model capability to task complexity. Use smaller models for structured extraction, larger models for nuanced generation.
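A sketch of task-to-model routing; the task names and model choices are illustrative, and the model identifiers should be whatever your provider currently offers.

```python
# Illustrative routing table: match capability to task instead of one model for everything.
MODEL_BY_TASK = {
    "classification": "gpt-3.5-turbo",   # smaller model: labels, structured extraction
    "extraction":     "gpt-3.5-turbo",
    "summarization":  "gpt-4",
    "reasoning":      "gpt-4",           # larger model: nuanced generation, multi-step reasoning
}

def model_for(task: str) -> str:
    # Default to the cheap model; promote to the expensive one only when the task demands it.
    return MODEL_BY_TASK.get(task, "gpt-3.5-turbo")
```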
Evaluation Checklists
Your LLM integration is working if:
- Batch operations complete in reasonable time (< 30s for most jobs)
- Individual failures don't crash batch operations
- API key issues surface clear, actionable errors
- You can answer "how much did this feature cost last month?"
- Switching providers is a configuration change, not a rewrite
Your LLM integration needs work if:
- Users see raw API errors ("rate_limit_exceeded", "context_length")
- One slow call blocks the entire application
- You've been surprised by API costs
- Outages in the LLM provider cause complete feature failure
- Prompt changes require code deployments
Quick Reference
| Concern | Approach |
|---|---|
| Multiple items | Batch + parallelize (ThreadPoolExecutor, async) |
| API failures | Retry transient errors, fallback for permanent |
| API keys | User key > env var > clear error |
| Prompt output | Constrain format, length, structure explicitly |
| Cost tracking | Log tokens per call, alert on anomalies |
| Timeouts | 30-60s default, circuit breaker for repeated failures |