
Streaming AI Responses in Node.js: SSE, WebSockets, and Edge (2026)


Users hate waiting 10 seconds for a blank screen to fill with text. Streaming shows tokens as they're generated, so the response feels instant even when the full generation takes 15 seconds.

Server-Sent Events (SSE) β€” simplest approach

// server.js (Express)
import express from 'express';
import OpenAI from 'openai';

const app = express();
app.use(express.json()); // parse JSON bodies so req.body.messages is populated
const client = new OpenAI(); // or OpenRouter, Ollama, etc.

app.post('/api/chat', async (req, res) => {
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  const stream = await client.chat.completions.create({
    model: 'gpt-5.4-mini',
    messages: req.body.messages,
    stream: true,
  });

  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content;
    if (content) {
      res.write(`data: ${JSON.stringify({ content })}\n\n`);
    }
  }

  res.write('data: [DONE]\n\n');
  res.end();
});

app.listen(3000);
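The `data: ...` lines written by the handler above follow the SSE wire format: each event is one or more `field: value` lines, terminated by a blank line. As a minimal sketch, a small helper can centralize that framing. The name `formatSSE` is hypothetical (it is not part of Express or the OpenAI SDK). It also covers plain-text payloads containing newlines, which have to be split across several `data:` lines.

```javascript
// Hypothetical helper: builds one SSE event from a payload.
// Strings are sent as-is; anything else is JSON-encoded.
function formatSSE(payload, eventName) {
  const text = typeof payload === 'string' ? payload : JSON.stringify(payload);
  let frame = '';
  if (eventName) frame += `event: ${eventName}\n`;
  // A raw newline would end the field early, so each line gets its own "data:" prefix.
  for (const line of text.split(/\r\n|\r|\n/)) {
    frame += `data: ${line}\n`;
  }
  return frame + '\n'; // the blank line terminates the event
}
```

With this in place, `res.write(formatSSE({ content }))` writes the same bytes as the template string in the handler, and `res.write(formatSSE('[DONE]'))` sends the end marker.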

Client side:

const response = await fetch('/api/chat', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ messages }),
});

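const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';
let finished = false;

while (!finished) {
  const { done, value } = await reader.read();
  if (done) break;

  // A network chunk can end mid-event (or mid-character), so accumulate
  // text and only parse events that are terminated by a blank line.
  buffer += decoder.decode(value, { stream: true });
  const events = buffer.split('\n\n');
  buffer = events.pop(); // keep the incomplete tail for the next read

  for (const event of events) {
    if (!event.startsWith('data: ')) continue;
    const data = event.slice(6);
    if (data === '[DONE]') {
      finished = true; // also stops the outer read loop
      break;
    }
    const { content } = JSON.parse(data);
    appendToChat(content); // Update UI
  }
}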

With Ollama (local streaming)

import ollama from 'ollama';

app.post('/api/chat', async (req, res) => {
  res.setHeader('Content-Type', 'text/event-stream');

  const stream = await ollama.chat({
    model: 'qwen3:8b',
    messages: req.body.messages,
    stream: true,
  });

  for await (const chunk of stream) {
    res.write(`data: ${JSON.stringify({ content: chunk.message.content })}\n\n`);
  }

  res.write('data: [DONE]\n\n');
  res.end();
});

With Vercel AI SDK (Next.js)

The cleanest implementation for Next.js apps:

// app/api/chat/route.ts
import { streamText } from 'ai';
import { openai } from '@ai-sdk/openai';

export async function POST(req: Request) {
  const { messages } = await req.json();
  
  const result = streamText({
    model: openai('gpt-5.4-mini'),
    messages,
  });

  return result.toDataStreamResponse();
}

// app/page.tsx
'use client';
import { useChat } from 'ai/react';

export default function Chat() {
  const { messages, input, handleInputChange, handleSubmit } = useChat();
  
  return (
    <div>
      {messages.map(m => <div key={m.id}>{m.content}</div>)}
      <form onSubmit={handleSubmit}>
        <input value={input} onChange={handleInputChange} />
      </form>
    </div>
  );
}

The Vercel AI SDK handles streaming, parsing, and state management. The two files above are enough for a complete streaming chat app.

WebSocket approach (bidirectional)

For real-time applications where the client also sends data during generation:

import { WebSocketServer } from 'ws';
import OpenAI from 'openai';

const wss = new WebSocketServer({ port: 8080 });
const client = new OpenAI();

wss.on('connection', (ws) => {
  ws.on('message', async (data) => {
    const { messages } = JSON.parse(data);
    
    const stream = await client.chat.completions.create({
      model: 'gpt-5.4-mini',
      messages,
      stream: true,
    });

    for await (const chunk of stream) {
      const content = chunk.choices[0]?.delta?.content;
      if (content) {
        ws.send(JSON.stringify({ type: 'token', content }));
      }
    }
    ws.send(JSON.stringify({ type: 'done' }));
  });
});

Use WebSockets when you need to cancel mid-generation, send additional context during streaming, or communicate in both directions over one connection.
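Cancelling mid-generation is the most common of these needs, so here is a minimal sketch of one way to do it. Everything named here is an assumption for illustration: the `chat`, `cancel`, `cancelled`, and `error` message types, the `createSession` helper, and the `generate(messages, signal)` async generator that yields content strings. Each connection keeps one `AbortController` for the generation in flight.

```javascript
// Hypothetical per-connection message handler with cancellation.
// `send` delivers an object to the client; `generate` yields content strings.
function createSession(send, generate) {
  let active = null; // AbortController for the generation in flight, if any

  return async function onMessage(raw) {
    const msg = JSON.parse(raw);

    if (msg.type === 'cancel') {
      active?.abort();
      return;
    }
    if (msg.type !== 'chat') return;

    active?.abort(); // a new request supersedes the previous one
    const current = new AbortController();
    active = current;

    try {
      for await (const content of generate(msg.messages, current.signal)) {
        if (current.signal.aborted) break; // stop forwarding tokens
        send({ type: 'token', content });
      }
      send({ type: current.signal.aborted ? 'cancelled' : 'done' });
    } catch (err) {
      // An aborted upstream request typically surfaces as a thrown error.
      send(current.signal.aborted
        ? { type: 'cancelled' }
        : { type: 'error', message: String(err) });
    } finally {
      if (active === current) active = null;
    }
  };
}
```

Per connection, this could be wired up as `ws.on('message', createSession((msg) => ws.send(JSON.stringify(msg)), generate))`. A real `generate` would forward the signal to the model client so the upstream request stops too; the OpenAI SDK accepts a `signal` in its per-request options.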

Error handling for streams

try {
  for await (const chunk of stream) {
    // ... process chunk
  }
} catch (error) {
  if (error.code === 'ECONNRESET') {
    // Client disconnected: nothing left to send, just clean up
    return;
  }
  if (error.status === 429) {
    // Rate limited: tell the client, then retry or fall back
    res.write(`data: ${JSON.stringify({ error: 'Rate limited, retrying...' })}\n\n`);
  }
  // Log for debugging
  console.error('Stream error:', error);
}
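The "retry or fallback" branch for 429 responses can be sketched as a small wrapper with exponential backoff. The helper name `withRateLimitRetry` and its `retries` and `baseDelayMs` options are hypothetical. The sketch assumes the error exposes a numeric `status`, as in the handler above.

```javascript
// Hypothetical helper: retries `start` (a function that opens the stream)
// when it fails with HTTP 429, waiting longer after each attempt.
async function withRateLimitRetry(start, { retries = 3, baseDelayMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await start();
    } catch (error) {
      const retryable = error?.status === 429 && attempt < retries;
      if (!retryable) throw error; // other errors, or retries exhausted
      const delayMs = baseDelayMs * 2 ** attempt; // 500, 1000, 2000, ...
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

A call might look like `const stream = await withRateLimitRetry(() => client.chat.completions.create({ ... }))`. The wrapper only covers opening the stream: once tokens have been written to the client, retrying would duplicate output, so mid-stream failures are better reported than retried. The OpenAI SDK also has its own `maxRetries` client option, which may already be enough for simple cases.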

Performance tips

  1. Use edge functions for lower latency (Vercel Edge, Cloudflare Workers)
  2. Buffer small chunks: don't send every single token, batch 3-5 tokens
  3. Set timeouts: kill streams that run longer than expected
  4. Monitor token usage: track costs per stream for budget management
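As an illustration of tip 2, token batching can be sketched as an async generator that wraps any token stream. The name `batchTokens` and its `size` parameter are hypothetical, and the input is assumed to be an async iterable of content strings.

```javascript
// Hypothetical helper: groups tokens from any async iterable into
// concatenated batches of up to `size` tokens.
async function* batchTokens(tokens, size = 4) {
  let batch = [];
  for await (const token of tokens) {
    batch.push(token);
    if (batch.length >= size) {
      yield batch.join('');
      batch = [];
    }
  }
  if (batch.length > 0) yield batch.join(''); // flush the remainder
}
```

On the server, the write loop would then iterate over `batchTokens(contentStream)` and write one event per batch. Batching purely by count holds tokens back until the batch fills or the stream ends, so with slow models a time-based flush may be worth adding.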
