Users hate waiting 10 seconds for a blank screen to fill with text. Streaming shows tokens as they're generated, so the response feels instant even when the full generation takes 15 seconds.
Server-Sent Events (SSE): simplest approach
// server.js (Express)
import express from 'express';
import OpenAI from 'openai';

const app = express();
app.use(express.json()); // parse JSON bodies so req.body.messages is defined
const client = new OpenAI(); // or OpenRouter, Ollama, etc.

app.post('/api/chat', async (req, res) => {
  // SSE headers: keep the connection open and disable caching
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  const stream = await client.chat.completions.create({
    model: 'gpt-5.4-mini',
    messages: req.body.messages,
    stream: true,
  });

  // Forward each non-empty token as one SSE event
  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content;
    if (content) {
      res.write(`data: ${JSON.stringify({ content })}\n\n`);
    }
  }

  // Sentinel event so the client knows the stream is complete
  res.write('data: [DONE]\n\n');
  res.end();
});

app.listen(3000);
Client side:
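const response = await fetch('/api/chat', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ messages }),
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = ''; // holds a partial event when a network chunk ends mid-message
let finished = false;

while (!finished) {
  const { done, value } = await reader.read();
  if (done) break;
  // stream: true keeps multi-byte characters intact across chunk boundaries
  buffer += decoder.decode(value, { stream: true });
  const events = buffer.split('\n\n');
  buffer = events.pop(); // the last piece may be incomplete; keep it for the next read
  for (const event of events) {
    if (!event.startsWith('data: ')) continue;
    const data = event.slice(6);
    if (data === '[DONE]') {
      finished = true; // a bare break would only exit this inner loop
      break;
    }
    const { content } = JSON.parse(data);
    appendToChat(content); // Update UI
  }
}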
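Network reads do not respect SSE event boundaries, so a single `data:` line can arrive split across two chunks. One way to make that easy to unit-test is to pull the parsing into a small function. The sketch below is illustrative only: `createSSEParser` is a hypothetical helper, not part of any library, and it assumes the server emits events in the `data: ...\n\n` shape used above.

```javascript
// Hypothetical helper (not a library API): accepts raw text in any split
// and calls onData once per complete "data: ..." line.
function createSSEParser(onData) {
  let buffer = '';
  return function push(textChunk) {
    buffer += textChunk;
    const events = buffer.split('\n\n');
    buffer = events.pop(); // trailing piece may be an incomplete event
    for (const event of events) {
      for (const line of event.split('\n')) {
        if (line.startsWith('data: ')) onData(line.slice(6));
      }
    }
  };
}

// Usage: the first event is deliberately split across two chunks.
const received = [];
const push = createSSEParser((data) => received.push(data));
push('data: {"content":"Hel');
push('lo"}\n\ndata: {"content":" world"}\n\n');
push('data: [DONE]\n\n');
console.log(received);
```

In the read loop, each decoded chunk would be passed to `push`, and the callback would decide whether to parse JSON or stop on `[DONE]`.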
With Ollama (local streaming)
import ollama from 'ollama';

app.post('/api/chat', async (req, res) => {
  res.setHeader('Content-Type', 'text/event-stream');

  const stream = await ollama.chat({
    model: 'qwen3:8b',
    messages: req.body.messages,
    stream: true,
  });

  for await (const chunk of stream) {
    res.write(`data: ${JSON.stringify({ content: chunk.message.content })}\n\n`);
  }

  res.write('data: [DONE]\n\n');
  res.end();
});
With Vercel AI SDK (Next.js)
The cleanest implementation for Next.js apps:
// app/api/chat/route.ts
import { streamText } from 'ai';
import { openai } from '@ai-sdk/openai';

export async function POST(req: Request) {
  const { messages } = await req.json();

  const result = streamText({
    model: openai('gpt-5.4-mini'),
    messages,
  });

  return result.toDataStreamResponse();
}
// app/page.tsx
'use client';
import { useChat } from 'ai/react';

export default function Chat() {
  const { messages, input, handleInputChange, handleSubmit } = useChat();

  return (
    <div>
      {messages.map(m => <div key={m.id}>{m.content}</div>)}
      <form onSubmit={handleSubmit}>
        <input value={input} onChange={handleInputChange} />
      </form>
    </div>
  );
}
The Vercel AI SDK handles streaming, parsing, and state management. The two short files above are enough for a working streaming chat app.
WebSocket approach (bidirectional)
For real-time applications where the client also sends data during generation:
import { WebSocketServer } from 'ws';
import OpenAI from 'openai';

const wss = new WebSocketServer({ port: 8080 });
const client = new OpenAI();

wss.on('connection', (ws) => {
  ws.on('message', async (data) => {
    // ws delivers messages as Buffers; convert to a string before parsing
    const { messages } = JSON.parse(data.toString());

    const stream = await client.chat.completions.create({
      model: 'gpt-5.4-mini',
      messages,
      stream: true,
    });

    for await (const chunk of stream) {
      const content = chunk.choices[0]?.delta?.content;
      if (content) {
        ws.send(JSON.stringify({ type: 'token', content }));
      }
    }

    ws.send(JSON.stringify({ type: 'done' }));
  });
});
Use WebSockets when you need to cancel mid-generation, send additional context during streaming, or keep a general bidirectional channel open.
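Cancelling mid-generation can be sketched with an `AbortController`: the message handler aborts the controller when the client asks to stop, and the forwarding loop checks the signal before each send. The names here are assumptions for illustration: `forwardUntilCancelled`, the `{ type: 'cancelled' }` message, and the fake token source are hypothetical, and the token source stands in for a real model stream so the example runs without an API key.

```javascript
// Hypothetical helper: forwards tokens until the signal is aborted.
// `tokens` is any async iterable; `send` delivers one message to the client.
async function forwardUntilCancelled(tokens, signal, send) {
  for await (const content of tokens) {
    if (signal.aborted) {
      send({ type: 'cancelled' });
      return; // leaving the loop also closes the underlying iterator
    }
    send({ type: 'token', content });
  }
  send({ type: 'done' });
}

// Stand-in for a model stream.
async function* fakeTokens() {
  for (const t of ['one', 'two', 'three', 'four']) {
    yield t;
  }
}

// Usage: abort after the second token, as a client "cancel" message handler might.
const controller = new AbortController();
const sent = [];
const finished = forwardUntilCancelled(fakeTokens(), controller.signal, (msg) => {
  sent.push(msg);
  if (sent.length === 2) controller.abort();
});
finished.then(() => console.log(sent));
```

In a real server, `send` would wrap `ws.send(JSON.stringify(msg))`, and a second message type from the client would call `controller.abort()`. Stopping the loop only stops forwarding; whether the upstream request is also cancelled depends on the SDK, so check its documentation for abort support.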
Error handling for streams
try {
  for await (const chunk of stream) {
    // ... process chunk
  }
} catch (error) {
  if (error.code === 'ECONNRESET') {
    // Client disconnected: clean up
    return;
  }
  if (error.status === 429) {
    // Rate limited: retry or fallback
    res.write(`data: ${JSON.stringify({ error: 'Rate limited, retrying...' })}\n\n`);
  }
  // Log for debugging
  console.error('Stream error:', error);
}
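The "retry" branch for a 429 could look something like the sketch below, which retries with exponential backoff. It is a minimal illustration under stated assumptions: `withRetry` and `flakyStart` are hypothetical names, the delays are arbitrary, and it assumes the error carries a `status` property as in the handler above. Retrying is only safe before any tokens have been written to the client.

```javascript
// Hypothetical helper: retries `startStream` when it fails with HTTP 429.
async function withRetry(startStream, { retries = 3, baseDelayMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await startStream();
    } catch (error) {
      // Give up on non-rate-limit errors or when retries are exhausted
      if (error.status !== 429 || attempt >= retries) throw error;
      const delay = baseDelayMs * 2 ** attempt; // doubles on each attempt
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Usage with a stand-in that is rate limited twice, then succeeds.
let calls = 0;
async function flakyStart() {
  calls++;
  if (calls < 3) {
    throw Object.assign(new Error('Too Many Requests'), { status: 429 });
  }
  return 'stream-handle';
}

const retried = withRetry(flakyStart, { baseDelayMs: 10 });
retried.then((result) => console.log(result, 'after', calls, 'calls'));
```

In the Express handler, `startStream` would be a function that calls `client.chat.completions.create(...)` and returns the stream.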
Performance tips
- Use edge functions for lower latency (Vercel Edge, Cloudflare Workers)
- Buffer small chunks: don't send every single token, batch 3-5 tokens
- Set timeouts: kill streams that run longer than expected
- Monitor token usage: track costs per stream for budget management
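The batching tip can be sketched as an async generator that wraps any token stream and yields joined groups. This is an illustrative sketch, not a library API: `batchTokens` and `demoTokens` are hypothetical names, and the demo source stands in for a real model stream.

```javascript
// Hypothetical helper: groups tokens from any async iterable into batches of `size`.
async function* batchTokens(tokens, size = 4) {
  let pending = [];
  for await (const token of tokens) {
    pending.push(token);
    if (pending.length >= size) {
      yield pending.join('');
      pending = [];
    }
  }
  if (pending.length > 0) yield pending.join(''); // flush the remainder
}

// Stand-in for a model stream.
async function* demoTokens() {
  for (const t of ['Str', 'eam', 'ing', ' is', ' fa', 'st', '!']) yield t;
}

// Usage: collect batches of three tokens.
async function collectBatches() {
  const batches = [];
  for await (const batch of batchTokens(demoTokens(), 3)) batches.push(batch);
  return batches;
}

const batched = collectBatches();
batched.then((batches) => console.log(batches));
```

On the server, each yielded batch would become one `res.write` or `ws.send` call. Pure count-based batching delays the first visible text slightly, so a time-based flush may be worth adding if first-token latency matters.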