How We Integrate LLM APIs
Principles for reliable, efficient, and cost-effective LLM API integration
LLM APIs are fundamentally different from traditional APIs. They are slow, expensive, rate-limited, and occasionally unreliable. A naive integration—one request per item, no error handling, blocking calls—will result in poor user experience, inflated costs, and cascading failures. This document establishes how we build LLM integrations that are fast, resilient, and economical.
Core Question: "If this API call fails or takes 30 seconds, what happens to the user?"
Principles
Principle 1: Batch and Parallelize by Default
The mistake: Making sequential API calls, one item at a time, waiting for each to complete before starting the next.
The principle: Group related items into batches and execute batches in parallel using thread pools or async workers. Most LLM APIs accept batch inputs—use them.
Why it matters: A 100-item job at 500ms per call takes 50 seconds sequentially. Grouped into batches of 20 and spread across 5 parallel workers, each worker makes a single batch call instead of 20 sequential ones, and the job completes in roughly 5 seconds. Users don't wait, and you stay within rate limits.
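A minimal sketch of this pattern in Python, assuming the OpenAI Python SDK (v1+) and its list-accepting embeddings endpoint; the same shape works with any provider that supports batch inputs. The batch size and worker count are illustrative and should be tuned against your rate limits.

```python
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
BATCH_SIZE = 20
MAX_WORKERS = 5

def embed_batch(batch: list[str]) -> list[list[float]]:
    # One API call per batch: the embeddings endpoint accepts a list of inputs.
    response = client.embeddings.create(model="text-embedding-3-small", input=batch)
    return [item.embedding for item in response.data]

def embed_all(items: list[str]) -> list[list[float]]:
    # Split the work into fixed-size batches, then run the batches in parallel.
    batches = [items[i:i + BATCH_SIZE] for i in range(0, len(items), BATCH_SIZE)]
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        results = list(executor.map(embed_batch, batches))  # preserves input order
    return [embedding for batch in results for embedding in batch]
```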
Principle 2: Always Provide Fallbacks
The mistake: Treating LLM output as guaranteed. When the API fails or returns malformed data, the entire operation crashes.
The principle: Define a sensible fallback for every LLM call. If label generation fails, use "Cluster N". If embedding fails for one item, retry or exclude it—don't abort the batch.
Why it matters: LLM APIs have outages, rate limits, and content filters. Fallbacks ensure your application degrades gracefully rather than failing completely.
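One way this can look in practice, using the cluster-labeling case above. `generate_label` is a hypothetical wrapper around your LLM call; the length check and fallback format are illustrative.

```python
import logging

logger = logging.getLogger(__name__)

def label_cluster(cluster_id: int, sample_texts: list[str]) -> str:
    """Return an LLM-generated label, or a safe fallback if anything goes wrong."""
    try:
        label = generate_label(sample_texts)   # hypothetical LLM call
        if not label or len(label) > 60:       # defensive: reject empty or runaway output
            raise ValueError("unusable label")
        return label.strip()
    except Exception:
        # API outage, rate limit, content filter, malformed output: all of these
        # degrade to a generic but usable label instead of aborting the batch.
        logger.warning("label generation failed for cluster %d", cluster_id, exc_info=True)
        return f"Cluster {cluster_id}"
```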
Principle 3: Separate API Key Concerns
The mistake: Hardcoding API keys, or requiring users to configure keys before any functionality works.
The principle: Support multiple key sources with clear precedence: user-provided key > environment variable > error with helpful message. Validate keys early and surface clear errors.
Why it matters: Different deployment contexts need different key strategies. Development uses env vars, multi-tenant apps use user keys, and managed deployments use secrets. Support all without code changes.
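A sketch of that precedence chain, assuming an OpenAI-style key and environment variable; swap in your provider's names and key format.

```python
import os

def resolve_api_key(user_key: str | None = None) -> str:
    """Resolve an API key by precedence: explicit user key, then environment variable."""
    key = user_key or os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "No API key found. Pass one explicitly or set the "
            "OPENAI_API_KEY environment variable."
        )
    if not key.startswith("sk-"):  # cheap sanity check before any network call
        raise RuntimeError("API key looks malformed (expected it to start with 'sk-').")
    return key
```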
Principle 4: Design Prompts for Consistency
The mistake: Writing prompts that produce variable-length, unpredictable outputs that are difficult to parse or display.
The principle: Constrain LLM outputs explicitly. Specify format, length limits, and structure. Use examples in prompts when format matters. Parse defensively.
Why it matters: LLMs are creative by default. Without constraints, you get inconsistent outputs that break UI layouts, exceed storage limits, or require complex post-processing.
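A sketch of a constrained prompt plus a defensive parser; the four-word limit, example label, and fallback behavior are illustrative.

```python
LABEL_PROMPT = """You are labeling a cluster of short texts.
Return ONLY a label of at most 4 words, in Title Case, with no punctuation
and no explanation.

Example output: Billing And Refund Questions

Texts:
{texts}
"""

def parse_label(raw: str, fallback: str) -> str:
    # The model may still add quotes, extra lines, or trailing periods: strip them.
    lines = raw.strip().strip('"').splitlines()
    if not lines:
        return fallback
    words = lines[0].rstrip(".").split()
    if not words or len(words) > 6:   # tolerate slight overrun, reject runaway output
        return fallback
    return " ".join(words)
```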
Principle 5: Track Costs Explicitly
The mistake: Ignoring token usage until the bill arrives. No visibility into which operations consume the most resources.
The principle: Log token counts for every API call. Calculate estimated costs. Set alerts for unusual usage. Make cost a first-class metric alongside latency and errors.
Why it matters: LLM costs scale with usage. A bug that retries infinitely or a feature that over-fetches can cost thousands before anyone notices.
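A minimal sketch, assuming the OpenAI SDK's `usage` field on chat responses; the per-token prices are placeholders, so check your model's current rates.

```python
import logging
import time

logger = logging.getLogger("llm.usage")

# Illustrative prices in USD per 1M tokens; look up your model's real rates.
PRICE_PER_M_INPUT = 0.50
PRICE_PER_M_OUTPUT = 1.50

def chat_with_cost_logging(client, model: str, messages: list[dict]) -> str:
    # `client` is an OpenAI client instance; other SDKs expose similar usage metadata.
    start = time.monotonic()
    response = client.chat.completions.create(model=model, messages=messages)
    latency = time.monotonic() - start
    usage = response.usage
    cost = (usage.prompt_tokens * PRICE_PER_M_INPUT
            + usage.completion_tokens * PRICE_PER_M_OUTPUT) / 1_000_000
    # Log metadata only: model, tokens, latency, estimated cost. Never the content.
    logger.info(
        "model=%s prompt_tokens=%d completion_tokens=%d latency=%.2fs est_cost=$%.6f",
        model, usage.prompt_tokens, usage.completion_tokens, latency, cost,
    )
    return response.choices[0].message.content
```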
Decision Framework
When should I use batching vs. individual calls?
Use batching when:
- Processing more than 5 items
- Items are independent (no sequential dependency)
- The API supports batch inputs
- Latency matters to the user
Use individual calls when:
- Processing 1-2 items
- Each item depends on the previous result
- You need fine-grained error handling per item
- Debugging a specific failure
When should I use synchronous vs. streaming responses?
Use synchronous when:
- Response is short (< 100 tokens)
- You need the complete response before proceeding
- You're batching multiple calls
Use streaming when:
- Response is long (> 500 tokens)
- User is waiting and needs feedback
- You can progressively render results (see the streaming sketch after this list)
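A streaming sketch using the OpenAI Python SDK (v1+); most providers expose a similar chunk iterator. The `print` call stands in for whatever progressive rendering your UI does.

```python
from openai import OpenAI

client = OpenAI()

def stream_answer(prompt: str) -> str:
    # stream=True returns an iterator of chunks instead of one blocking response,
    # so the user sees tokens as they arrive rather than a long spinner.
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    parts: list[str] = []
    for chunk in stream:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)  # stand-in for progressive UI rendering
            parts.append(delta)
    return "".join(parts)
```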
When should I retry vs. fail fast?
Retry when:
- Error is transient (rate limit, timeout, 5xx); see the retry sketch after these lists
- Operation is idempotent
- You have retry budget remaining
Fail fast when:
- Error is permanent (invalid key, malformed request, 4xx)
- User is waiting interactively
- Retry budget exhausted
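A sketch of the retry/fail-fast decision with exponential backoff and jitter. `ApiError` is an illustrative stand-in for your SDK's error type (for the OpenAI SDK, something like `APIStatusError`); the status-code set and attempt budget are assumptions to adjust.

```python
import random
import time

class ApiError(Exception):
    """Illustrative stand-in for your SDK's HTTP error type."""
    def __init__(self, status_code: int):
        super().__init__(f"API error {status_code}")
        self.status_code = status_code

RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}  # transient: timeouts, rate limits, 5xx
MAX_ATTEMPTS = 4

def call_with_retry(make_request):
    """Retry transient failures with exponential backoff and jitter; fail fast otherwise."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return make_request()
        except ApiError as err:
            if err.status_code not in RETRYABLE_STATUS or attempt == MAX_ATTEMPTS:
                raise  # permanent error (4xx) or retry budget exhausted: fail fast
            # 1s, 2s, 4s... plus jitter so concurrent clients don't retry in lockstep.
            time.sleep(2 ** (attempt - 1) + random.random())
```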
Common Mistakes
Mistake 1: No timeout configuration
Signs: Requests hang indefinitely during API outages. Users see infinite loading spinners. Fix: Set explicit timeouts (30-60s for most LLM calls). Implement circuit breakers for repeated failures.
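A sketch of both fixes: an explicit client timeout plus a very small circuit breaker. The thresholds are illustrative, and the breaker simply blocks calls for a cool-off period after repeated failures.

```python
import time

from openai import OpenAI

# Explicit per-request timeout (seconds); most SDKs accept an equivalent setting.
client = OpenAI(timeout=45.0)

class CircuitBreaker:
    """Stop calling the API for a cool-off period after repeated failures."""

    def __init__(self, failure_threshold: int = 5, cooloff_seconds: float = 60.0):
        self.failure_threshold = failure_threshold
        self.cooloff_seconds = cooloff_seconds
        self.failures = 0
        self.opened_at: float | None = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at > self.cooloff_seconds:
            # Cool-off elapsed: close the breaker and let traffic probe the API again.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```

Callers check `breaker.allow()` before each request and return the fallback immediately when it is open, then report the outcome with `breaker.record(...)`.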
Mistake 2: Logging full prompts and responses
Signs: Log files grow to gigabytes. Sensitive user data appears in logs. Storage costs spike. Fix: Log metadata (tokens, latency, model, status), not content. Sample full prompts and responses for debugging rather than storing every call.
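A sketch of metadata-only logging with a small sampled escape hatch for debugging; the sample rate and field names are illustrative.

```python
import logging
import random

logger = logging.getLogger("llm")
FULL_LOG_SAMPLE_RATE = 0.01   # keep full prompts/responses for roughly 1% of calls

def log_call(model: str, status: str, prompt_tokens: int, completion_tokens: int,
             latency_s: float, prompt: str, response: str) -> None:
    # Always log lightweight metadata...
    logger.info(
        "model=%s status=%s prompt_tokens=%d completion_tokens=%d latency=%.2fs",
        model, status, prompt_tokens, completion_tokens, latency_s,
    )
    # ...and only a small random sample of full content, at debug level.
    if random.random() < FULL_LOG_SAMPLE_RATE:
        logger.debug("sampled call: prompt=%r response=%r", prompt, response)
```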
Mistake 3: Ignoring rate limits until they hit
Signs: Sudden 429 errors during peak usage. Operations fail mid-batch. Fix: Implement client-side rate limiting. Use exponential backoff. Queue requests during high load.
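A sketch of a simple client-side limiter that spaces requests out instead of discovering 429s the hard way; real deployments often track tokens per minute as well as request counts.

```python
import threading
import time

class RateLimiter:
    """Serialize request pacing so calls never exceed a requests-per-minute budget."""

    def __init__(self, max_requests_per_minute: int):
        self.min_interval = 60.0 / max_requests_per_minute
        self.lock = threading.Lock()
        self.next_allowed = 0.0

    def acquire(self) -> None:
        # Reserve the next slot under the lock, then sleep outside it if needed.
        with self.lock:
            now = time.monotonic()
            wait = max(0.0, self.next_allowed - now)
            self.next_allowed = max(now, self.next_allowed) + self.min_interval
        if wait:
            time.sleep(wait)

# Usage: limiter = RateLimiter(max_requests_per_minute=60); call limiter.acquire()
# before each API request, including inside parallel workers.
```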
Mistake 4: Tight coupling to one provider
Signs: Switching from OpenAI to Anthropic requires rewriting half the codebase. Fix: Abstract the LLM interface. Isolate provider-specific code. Use adapter patterns for different APIs.
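A sketch of the adapter approach, assuming the OpenAI and Anthropic Python SDKs; the rest of the codebase depends only on `LLMClient`, so switching providers means constructing a different adapter, not touching call sites.

```python
from abc import ABC, abstractmethod

class LLMClient(ABC):
    """Provider-agnostic interface the rest of the codebase depends on."""

    @abstractmethod
    def complete(self, prompt: str, max_tokens: int = 256) -> str: ...

class OpenAIAdapter(LLMClient):
    def __init__(self, client, model: str):
        self.client, self.model = client, model

    def complete(self, prompt: str, max_tokens: int = 256) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
        )
        return response.choices[0].message.content

class AnthropicAdapter(LLMClient):
    def __init__(self, client, model: str):
        self.client, self.model = client, model

    def complete(self, prompt: str, max_tokens: int = 256) -> str:
        response = self.client.messages.create(
            model=self.model,
            max_tokens=max_tokens,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text
```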
Mistake 5: Treating all models the same
Signs: Using GPT-4 for simple classification. Using GPT-3.5 for complex reasoning. Costs too high or quality too low. Fix: Match model capability to task complexity. Use smaller models for structured extraction, larger models for nuanced generation.
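A sketch of task-to-model routing; the task names and model choices are illustrative, and the model identifiers should be whatever your provider currently offers.

```python
# Illustrative routing table: match capability to task instead of one model for everything.
MODEL_BY_TASK = {
    "classification": "gpt-3.5-turbo",   # smaller model: labels, structured extraction
    "extraction":     "gpt-3.5-turbo",
    "summarization":  "gpt-4",
    "reasoning":      "gpt-4",           # larger model: nuanced generation, multi-step reasoning
}

def model_for(task: str) -> str:
    # Default to the cheap model; promote to the expensive one only when the task demands it.
    return MODEL_BY_TASK.get(task, "gpt-3.5-turbo")
```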
Evaluation Checklists
Your LLM integration is working if:
- Batch operations complete in reasonable time (< 30s for most jobs)
- Individual failures don't crash batch operations
- API key issues surface clear, actionable errors
- You can answer "how much did this feature cost last month?"
- Switching providers is a configuration change, not a rewrite
Your LLM integration needs work if:
- Users see raw API errors ("rate_limit_exceeded", "context_length")
- One slow call blocks the entire application
- You've been surprised by API costs
- Outages in the LLM provider cause complete feature failure
- Prompt changes require code deployments
Quick Reference
| Concern | Approach |
|---|---|
| Multiple items | Batch + parallelize (ThreadPoolExecutor, async) |
| API failures | Retry transient errors, fallback for permanent |
| API keys | User key > env var > clear error |
| Prompt output | Constrain format, length, structure explicitly |
| Cost tracking | Log tokens per call, alert on anomalies |
| Timeouts | 30-60s default, circuit breaker for repeated failures |