How We Implement Streaming

Server-Sent Events, real-time responses, and progressive rendering

Users hate waiting. Streaming shows AI responses as they're generated, making interactions feel instant. This guide covers the principles behind effective streaming.

Core principle: Show progress immediately. Users should never stare at a blank screen.

Why Streaming Matters

| Without Streaming | With Streaming |
|---|---|
| Wait 10 seconds for full response | See first words in 100ms |
| No idea if it's working | Visible progress |
| Feels slow even if fast | Feels instant even if same duration |

The psychological impact is significant: users perceive streamed responses as faster even when total time is identical.

Streaming Technologies

Server-Sent Events (SSE)

One-way stream from server to client. Our default choice for LLM responses.

Pros: Simple, built into browsers, reconnects automatically, works through most proxies

Cons: Server to client only, limited to text data

WebSockets

Two-way real-time communication.

Pros: Bidirectional, supports binary data, lower latency for back-and-forth

Cons: More complex to implement, requires manual reconnection logic

Choosing Between Them

| Use Case | Recommended | Why |
|---|---|---|
| LLM response streaming | SSE | One-way is sufficient, simpler |
| Chat with typing indicators | WebSocket | Need bidirectional updates |
| Progress updates | SSE | Server pushing to client |
| Real-time collaboration | WebSocket | Multiple parties sending data |
| File upload progress | WebSocket | Client sending, server acknowledging |

Rule of thumb: If only the server needs to push data, use SSE. If both sides need to send data in real-time, use WebSocket.

SSE Design Principles

Message Format

SSE uses a simple text-based format. Each message has:

  • A data: prefix
  • JSON payload (for structured data)
  • Double newline to separate messages
  • A completion signal (we use [DONE])
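Putting these pieces together, one message can be produced with a small helper. This is a sketch; `format_sse` is an illustrative name, not a library function:

```python
import json

def format_sse(payload) -> str:
    """Format one SSE message: a 'data:' line terminated by a blank line."""
    if payload == "[DONE]":
        data = "[DONE]"               # completion sentinel, sent as plain text
    else:
        data = json.dumps(payload)    # structured payloads go out as JSON
    return f"data: {data}\n\n"
```

For example, `format_sse({"type": "text", "content": "Hi"})` produces a `data:` line with the JSON payload, followed by the blank line that terminates the message.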

Event Types for Agents

When agents use tools, communicate different event types:

| Event Type | Purpose | What to Show |
|---|---|---|
| text | Streamed content | Append to response |
| tool_start | Tool execution beginning | "Searching..." indicator |
| tool_end | Tool execution complete | Hide indicator, maybe show result |
| error | Something went wrong | Error message |
| done | Stream complete | Remove streaming indicator |
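A client-side dispatcher for these event types might look like the following sketch. The payload field names (`content`, `tool`, `message`) are assumptions about the message schema, and UI state is modeled as a plain dict for illustration:

```python
def handle_event(event: dict, ui: dict) -> None:
    """Dispatch one parsed agent event to UI state."""
    kind = event.get("type")
    if kind == "text":
        ui["response"] += event["content"]        # append streamed content
    elif kind == "tool_start":
        ui["indicator"] = f"{event['tool']}..."   # e.g. "Searching..."
    elif kind == "tool_end":
        ui["indicator"] = None                    # hide the indicator
    elif kind == "error":
        ui["error"] = event["message"]            # show error message
    elif kind == "done":
        ui["streaming"] = False                   # remove streaming indicator
```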

Headers

Three headers are essential for SSE:

  1. Content-Type: text/event-stream — tells the browser it's SSE
  2. Cache-Control: no-cache — prevents caching of the stream
  3. Connection: keep-alive — keeps the connection open
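As a sketch, the three headers grouped for reuse. How you attach them to a response is framework-specific; the header names and values themselves are standard for SSE:

```python
SSE_HEADERS = {
    "Content-Type": "text/event-stream",  # tells the browser it's SSE
    "Cache-Control": "no-cache",          # prevents caching of the stream
    "Connection": "keep-alive",           # keeps the connection open
}
```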

Handling Edge Cases

Connection Drops

Connections will drop. Plan for it.

With the EventSource API: Reconnection is automatic. The browser handles it.

With fetch(): You need manual retry logic with exponential backoff.

Principle: Never assume a stream will complete successfully. Always have fallback behavior.
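The manual retry logic can be sketched generically. Here `connect` is a stand-in for whatever function opens one streaming attempt and raises on a dropped connection; the retry count and base delay are illustrative defaults:

```python
import random
import time

def stream_with_retry(connect, max_retries: int = 5, base_delay: float = 0.5):
    """Run one streaming attempt via connect(), retrying on connection
    failure with exponential backoff plus a little jitter."""
    for attempt in range(max_retries):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the UI
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```

Jitter prevents many clients from retrying in lockstep after a shared outage.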

Cancellation

Users should be able to cancel streaming requests.

Why it matters:

  • User realizes they asked the wrong question
  • Response is going in the wrong direction
  • User wants to try a different approach

Implementation principle: Use AbortController on the client side. The server should detect client disconnection and stop processing.
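On the server side, one common pattern in Python web frameworks: a client disconnect closes the response generator, which raises GeneratorExit inside it, so cleanup in a finally block runs and generation stops. A minimal sketch:

```python
def token_stream(tokens):
    """Server-side SSE generator. If the client disconnects, most Python
    frameworks close this generator, so the finally block runs."""
    try:
        for token in tokens:
            yield f"data: {token}\n\n"
        yield "data: [DONE]\n\n"
    finally:
        # Runs on normal completion AND on client disconnect:
        # cancel the upstream model call, release resources, log, etc.
        pass
```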

Rate Limiting

Multiple concurrent streams can overwhelm your server or hit API limits.

Approach:

  • Limit concurrent streams per user (2-3 is reasonable)
  • Return 429 Too Many Requests if limit exceeded
  • Client should queue or fail gracefully
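A minimal per-user concurrency limiter, as a sketch (the limit of 3 is illustrative; in a multi-process deployment the counts would need shared storage):

```python
from collections import defaultdict

class StreamLimiter:
    """Track concurrent streams per user; reject requests over the limit."""
    def __init__(self, limit: int = 3):
        self.limit = limit
        self.active = defaultdict(int)

    def try_acquire(self, user_id: str) -> bool:
        if self.active[user_id] >= self.limit:
            return False  # caller should respond 429 Too Many Requests
        self.active[user_id] += 1
        return True

    def release(self, user_id: str) -> None:
        """Call when a stream ends, including on error or disconnect."""
        self.active[user_id] = max(0, self.active[user_id] - 1)
```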

Partial Chunk Handling

SSE data can arrive in incomplete chunks, especially with slow connections.

The problem: A JSON message might be split across multiple network packets.

The solution: Buffer incoming data. Only parse when you have complete SSE messages (ending with \n\n).
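The buffering can be sketched as a small accumulator that emits only the payloads of complete messages:

```python
class SSEBuffer:
    """Accumulate raw network chunks; emit only complete message payloads."""
    def __init__(self):
        self.buf = ""

    def feed(self, chunk: str) -> list:
        """Add a chunk; return payloads of any now-complete messages."""
        self.buf += chunk
        out = []
        while "\n\n" in self.buf:           # a complete message is buffered
            message, self.buf = self.buf.split("\n\n", 1)
            if message.startswith("data: "):
                out.append(message[len("data: "):])
        return out
```

A JSON payload split across two packets stays in the buffer until its terminating blank line arrives, so the parser never sees half a message.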

Frontend Considerations

Streaming State

Track three states:

  1. Idle: No request in progress
  2. Streaming: Receiving data
  3. Complete: Stream finished (success or error)
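These states can be made explicit with a small state machine; the transition table below is one reasonable choice (submit, done/error, reset), not the only one:

```python
from enum import Enum, auto

class StreamState(Enum):
    IDLE = auto()        # no request in progress
    STREAMING = auto()   # receiving data
    COMPLETE = auto()    # stream finished (success or error)

TRANSITIONS = {
    StreamState.IDLE: {StreamState.STREAMING},       # on submit
    StreamState.STREAMING: {StreamState.COMPLETE},   # on done or error
    StreamState.COMPLETE: {StreamState.IDLE},        # on reset
}

def transition(current: StreamState, target: StreamState) -> StreamState:
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```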

Visual Feedback

| State | What to Show |
|---|---|
| Streaming started | Cursor or typing indicator |
| Receiving text | Append text progressively |
| Tool in progress | Tool-specific indicator ("Searching...", "Analyzing...") |
| Error | Clear error message with retry option |
| Complete | Remove indicators, enable input |

Memory Considerations

Long streams can accumulate significant DOM content.

Watch for:

  • Too many re-renders as text appends
  • Growing memory from stored messages
  • Performance degradation on long conversations

Mitigations:

  • Batch DOM updates
  • Virtualize long message lists
  • Truncate or summarize old messages
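Batching can be sketched as a coalescing buffer that renders once per N chunks instead of once per chunk. (Real frontends often batch by time instead, e.g. once per animation frame; the count-based version below keeps the sketch simple.)

```python
class ChunkBatcher:
    """Coalesce incoming text chunks; flush to the UI once per batch."""
    def __init__(self, render, batch_size: int = 8):
        self.render = render          # callback that actually updates the UI
        self.batch_size = batch_size
        self.pending = []

    def add(self, chunk: str) -> None:
        self.pending.append(chunk)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Call once more when the stream ends to render any remainder."""
        if self.pending:
            self.render("".join(self.pending))
            self.pending = []
```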

Common Mistakes

Not handling partial chunks

Signs: Garbled text, JSON parse errors, missing content.

Fix: Buffer chunks, only parse complete SSE messages.

Blocking UI during stream

Signs: Can't cancel, can't navigate away, frozen interface.

Fix: Keep UI responsive. Streaming should be non-blocking.

Memory leaks

Signs: Memory grows with each stream, performance degrades over time.

Fix: Clean up event listeners. Close streams properly. Don't hold references to completed streams.

No loading state

Signs: User doesn't know request is in progress, may submit again.

Fix: Show streaming indicator immediately on submit. Disable input while streaming.

Ignoring errors

Signs: Stream fails silently, user stares at incomplete response.

Fix: Handle errors explicitly. Show clear error messages. Offer retry.

Evaluation Checklist

Your streaming is working if:

  • First content appears within 500ms of request
  • UI remains responsive during streaming
  • User can cancel in-progress streams
  • Connection drops are handled gracefully (auto-retry or clear error)
  • Tool usage is visible to users (not a black box)
  • Errors display with helpful messages

Your streaming needs work if:

  • Long delay before first content appears
  • UI freezes during streaming
  • No way to cancel
  • Connection drops cause crashes or hang
  • Users don't know something is happening
  • Errors fail silently

Quick Reference

Technology Selection

| Need | Use |
|---|---|
| LLM responses | SSE |
| Real-time bidirectional | WebSocket |
| Simple progress updates | SSE |
| Complex multi-party updates | WebSocket |

Event Type Pattern

For agent responses, use consistent event types:

  • text — content to display
  • tool_start — beginning tool execution
  • tool_end — tool execution complete
  • error — error occurred
  • done — stream complete

Performance Targets

| Metric | Target |
|---|---|
| Time to first byte | < 500ms |
| UI responsiveness | Never block |
| Reconnection | Automatic or < 3s |
| Error visibility | Immediate and clear |