Jun 27, 2026 · 4 min read

How Rate Limiting Actually Works (Token Bucket, Sliding Window, and More)

Your API gets 10,000 requests per second from a single IP. Your database melts. You add rate limiting. But which algorithm? “100 requests per minute” sounds simple until you realize there are at least four different ways to count, and each behaves differently at the edges.

Why naive rate limiting breaks

The simplest approach: count requests per minute. Reset the counter every 60 seconds.

Minute 1: 0:00-0:59 → 100 requests allowed
Minute 2: 1:00-1:59 → 100 requests allowed

The problem: a user sends 100 requests at 0:59 and 100 requests at 1:00. That’s 200 requests in 2 seconds — double your intended limit. The counter reset at the minute boundary created a burst window.

This is the “fixed window” problem, and it’s why real rate limiters use smarter algorithms.
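To make the boundary burst concrete, here is a minimal sketch of a fixed-window counter. The class name FixedWindowLimiter and the explicit `now` argument (milliseconds) are assumptions made for illustration, so the timeline above can be replayed without waiting on a real clock.

```javascript
// Fixed-window counter: one counter per window, reset at each boundary.
class FixedWindowLimiter {
  constructor(limit, windowMs) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.windowStart = 0;
    this.count = 0;
  }

  allow(now = Date.now()) {
    // Align the window to a multiple of windowMs (e.g. the start of the minute).
    const currentWindow = Math.floor(now / this.windowMs) * this.windowMs;
    if (currentWindow !== this.windowStart) {
      this.windowStart = currentWindow;
      this.count = 0; // the reset that opens the burst window
    }
    if (this.count < this.limit) {
      this.count += 1;
      return true;
    }
    return false;
  }
}

// Replay the scenario: 100 requests at 0:59, then 100 more at 1:00.
const limiter = new FixedWindowLimiter(100, 60000);
let allowed = 0;
for (let i = 0; i < 100; i++) if (limiter.allow(59000)) allowed++;
for (let i = 0; i < 100; i++) if (limiter.allow(60000)) allowed++;
console.log(allowed); // 200
```

Every request is accepted even though the limit is nominally 100 per minute, because the two batches fall on opposite sides of the reset.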

Token bucket

Imagine a bucket that holds tokens. Every request costs one token. Tokens are added at a fixed rate (e.g., 10 per second). If the bucket is empty, the request is rejected.

Bucket capacity: 100 tokens
Refill rate: 10 tokens/second

Request arrives:
  - Tokens available? → Allow, remove 1 token
  - Bucket empty? → Reject (429 Too Many Requests)

Why it works: It allows bursts (up to the bucket capacity) while enforcing a long-term average rate. A user can send 100 requests instantly if they haven’t made requests recently, but they can’t sustain more than 10/second.

Used by: AWS API Gateway, Stripe, most cloud providers.

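class TokenBucket {
  constructor(capacity, refillRate) {
    this.capacity = capacity;     // maximum burst size
    this.tokens = capacity;       // start full so a new client can burst right away
    this.refillRate = refillRate; // tokens per second
    this.lastRefill = Date.now();
  }

  // Returns true and spends one token if the request is allowed.
  allow() {
    this.refill();
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }

  // Lazily add the tokens earned since the last call, capped at capacity.
  // No timer is needed: the refill is computed from elapsed time on each request.
  refill() {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000; // seconds
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillRate);
    this.lastRefill = now;
  }
}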

Sliding window log

Keep a log of every request timestamp. When a new request arrives, count how many requests happened in the last 60 seconds. If that count has already reached the limit, reject.

Window: last 60 seconds
Limit: 100 requests

Request at 1:30:45 → count requests from 1:29:45 to 1:30:45
  - Count is 99 → Allow
  - Count is 100 → Reject

Why it works: No boundary problems. The window slides with each request, so there’s no “reset” moment to exploit.

Downside: Memory. You’re storing one timestamp per accepted request, so each client can hold up to “limit” entries at a time. With many clients or a high limit, this gets expensive.

Used by: GitHub API, smaller-scale APIs where precision matters.
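A minimal sketch of the log-based approach, assuming a single client and an in-memory array. The class name SlidingWindowLog is hypothetical, and `now` is passed in as milliseconds so the behavior is reproducible. This version records only accepted requests; logging rejected ones as well is a stricter variant.

```javascript
class SlidingWindowLog {
  constructor(limit, windowMs) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.timestamps = []; // accepted request times, oldest first
  }

  allow(now = Date.now()) {
    // Drop timestamps that have slid out of the trailing window.
    const cutoff = now - this.windowMs;
    while (this.timestamps.length > 0 && this.timestamps[0] <= cutoff) {
      this.timestamps.shift();
    }
    if (this.timestamps.length < this.limit) {
      this.timestamps.push(now);
      return true;
    }
    return false;
  }
}

// Usage: 2 requests per second.
const log = new SlidingWindowLog(2, 1000);
console.log(log.allow(0));    // true
console.log(log.allow(500));  // true
console.log(log.allow(900));  // false (two requests in the last second)
console.log(log.allow(1000)); // true  (the request at t=0 has expired)
```

Because the window is measured back from each request, there is no boundary at which the count drops to zero all at once.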

Sliding window counter

A hybrid approach. Divide time into fixed windows but weight the previous window based on how far into the current window you are.

Previous window (0:00-0:59): 80 requests
Current window (1:00-1:59): 30 requests so far
Current time: 1:15 (25% into current window)

Weighted count = (80 × 0.75) + 30 = 90
Limit: 100 → Allow

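Why it works: Approximates the sliding window without storing individual timestamps. It needs only two counters per client (previous window and current window) instead of a full log. The estimate assumes requests in the previous window were spread evenly, which is why the result is approximate.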

Used by: Redis-based rate limiters, Cloudflare.
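The weighted count above can be sketched as follows. The class name SlidingWindowCounter and its field names are assumptions for illustration, and the state is held in memory for one client; a Redis-backed version would keep the same two counters under per-client keys.

```javascript
class SlidingWindowCounter {
  constructor(limit, windowMs) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.currentStart = 0;
    this.currentCount = 0;
    this.previousCount = 0;
  }

  // Estimated number of requests in the trailing window ending at `now`.
  estimate(now) {
    this.roll(now);
    const elapsedFraction = (now - this.currentStart) / this.windowMs;
    return this.previousCount * (1 - elapsedFraction) + this.currentCount;
  }

  allow(now = Date.now()) {
    if (this.estimate(now) < this.limit) {
      this.currentCount += 1;
      return true;
    }
    return false;
  }

  // Advance to the fixed window containing `now`.
  roll(now) {
    const start = Math.floor(now / this.windowMs) * this.windowMs;
    if (start === this.currentStart) return;
    // One window forward: current becomes previous. Further than that: both are stale.
    const movedOneWindow = start - this.currentStart === this.windowMs;
    this.previousCount = movedOneWindow ? this.currentCount : 0;
    this.currentCount = 0;
    this.currentStart = start;
  }
}

// Reproduce the worked example by setting the state directly:
// previous window had 80 requests, current has 30, and we are at 1:15.
const counter = new SlidingWindowCounter(100, 60000);
counter.currentStart = 60000;
counter.previousCount = 80;
counter.currentCount = 30;
console.log(counter.estimate(75000)); // 90
console.log(counter.allow(75000));    // true
```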

Leaky bucket

Requests enter a queue (the bucket). The queue is processed at a fixed rate. If the queue is full, new requests are dropped.

Queue capacity: 50
Processing rate: 10 requests/second

Request arrives:
  - Queue not full? → Add to queue, process in order
  - Queue full? → Reject

Why it works: Guarantees a smooth, constant output rate. No bursts. Useful when your backend can only handle a fixed throughput.

Downside: Adds latency. Requests wait in the queue instead of being processed immediately.

Used by: Network traffic shaping, NGINX’s limit_req module.
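A minimal sketch of the queue-based variant described here. The names LeakyBucket, offer, and drain are hypothetical, and time is passed in explicitly (milliseconds) so the output rate can be checked deterministically; a real service would call drain from a timer or worker loop.

```javascript
class LeakyBucket {
  constructor(capacity, ratePerSecond) {
    this.capacity = capacity;
    this.intervalMs = 1000 / ratePerSecond; // spacing between processed requests
    this.queue = [];
    this.nextLeakAt = 0; // earliest time the next request may be processed
  }

  // Try to enqueue a request; returns false when the queue is full.
  offer(request, now = Date.now()) {
    if (this.queue.length >= this.capacity) return false;
    if (this.queue.length === 0) {
      // Idle time earns no credit: the next slot is never earlier than now.
      this.nextLeakAt = Math.max(this.nextLeakAt, now);
    }
    this.queue.push(request);
    return true;
  }

  // Return the queued requests that are due by `now`, in FIFO order.
  drain(now = Date.now()) {
    const due = [];
    while (this.queue.length > 0 && this.nextLeakAt <= now) {
      due.push(this.queue.shift());
      this.nextLeakAt += this.intervalMs;
    }
    return due;
  }
}

// Usage: queue of 2, processed at 10 requests/second (one every 100 ms).
const bucket = new LeakyBucket(2, 10);
console.log(bucket.offer("a", 0)); // true
console.log(bucket.offer("b", 0)); // true
console.log(bucket.offer("c", 0)); // false (queue full)
console.log(bucket.drain(0));      // [ 'a' ]
console.log(bucket.drain(50));     // []
console.log(bucket.drain(100));    // [ 'b' ]
```

Even though "a" and "b" arrive together, they leave 100 ms apart, which is the smoothing (and the added latency) described above.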

Which one should you use?

Algorithm	Allows bursts	Memory	Precision	Best for
Token bucket	✅ Yes	Low	Good	Most APIs
Sliding window log	❌ No	High	Exact	Small-scale, precision needed
Sliding window counter	⚠️ Small	Low	Approximate	High-scale APIs
Leaky bucket	❌ No	Medium	Good	Smooth output rate needed

Default choice: Token bucket. It handles bursts gracefully, uses minimal memory, and is what most cloud providers implement.

Implementation in practice

Most developers don’t implement rate limiting from scratch. Use:

Express: express-rate-limit (fixed window) or rate-limiter-flexible (multiple algorithms)
NGINX: limit_req (leaky bucket)
Redis: ioredis + sliding window counter pattern
Cloud: API Gateway rate limiting (AWS, GCP, Azure)
Cloudflare: Built-in rate limiting rules

The headers your API should return

X-RateLimit-Limit: 100        # max requests per window
X-RateLimit-Remaining: 42     # requests left
X-RateLimit-Reset: 1679616000 # when the window resets (Unix timestamp)
Retry-After: 30               # seconds to wait (on 429 response)

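These headers let clients implement backoff without guessing. The X-RateLimit-* names are a widely used convention rather than a formal standard, so exact names and semantics vary between APIs; Retry-After is a standard HTTP header.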
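As a sketch of how a server might produce these values, the helper below builds a header object from a limiter's state. The function name rateLimitHeaders and the shape of its argument (limit, remaining, resetAtMs, allowed) are assumptions for illustration, not part of any particular framework.

```javascript
// Build rate-limit response headers from limiter state.
// `resetAtMs` is the time the window resets, in epoch milliseconds.
function rateLimitHeaders({ limit, remaining, resetAtMs, allowed }, nowMs = Date.now()) {
  const headers = {
    "X-RateLimit-Limit": String(limit),
    "X-RateLimit-Remaining": String(Math.max(0, remaining)),
    "X-RateLimit-Reset": String(Math.ceil(resetAtMs / 1000)), // Unix seconds
  };
  if (!allowed) {
    // Sent with a 429: whole seconds until the client may retry.
    const waitSeconds = Math.ceil((resetAtMs - nowMs) / 1000);
    headers["Retry-After"] = String(Math.max(0, waitSeconds));
  }
  return headers;
}

// Usage: a rejected request 30 seconds before the window resets.
const resetAtMs = 1679616000 * 1000;
const rejected = rateLimitHeaders(
  { limit: 100, remaining: 0, resetAtMs, allowed: false },
  resetAtMs - 30000
);
console.log(rejected["Retry-After"]); // "30"
```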

The one-sentence summary

Rate limiting controls request flow using algorithms that trade off between burst tolerance, memory usage, and precision — token bucket is the right default for most APIs.
