How Rate Limiting Actually Works (Token Bucket, Sliding Window, and More)
Your API gets 10,000 requests per second from a single IP. Your database melts. You add rate limiting. But which algorithm? "100 requests per minute" sounds simple until you realize there are at least four different ways to count, and each behaves differently at the edges.
Why naive rate limiting breaks
The simplest approach: count requests per minute. Reset the counter every 60 seconds.
Minute 1: 0:00-0:59 → 100 requests allowed
Minute 2: 1:00-1:59 → 100 requests allowed
The problem: a user sends 100 requests at 0:59 and 100 requests at 1:00. That's 200 requests in 2 seconds, double your intended limit. The counter reset at the minute boundary created a burst window.
This is the "fixed window" problem, and it's why real rate limiters use smarter algorithms.
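The boundary burst is easy to reproduce. The following is a minimal sketch of a fixed-window counter, not a production limiter; the class name `FixedWindowLimiter`, the helper `countAllowedAcrossBoundary`, and the injectable `now` timestamp (in milliseconds) are assumptions made so the demo runs deterministically.

```javascript
// Fixed-window counter: one counter that resets whenever a new window starts.
class FixedWindowLimiter {
  constructor(limit, windowMs) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.windowStart = 0; // start time of the window the counter belongs to
    this.count = 0;
  }

  // `now` is a millisecond timestamp; it is a parameter only so the demo can fake time.
  allow(now = Date.now()) {
    const currentWindow = Math.floor(now / this.windowMs) * this.windowMs;
    if (currentWindow !== this.windowStart) {
      this.windowStart = currentWindow; // crossed a boundary: reset the counter
      this.count = 0;
    }
    if (this.count < this.limit) {
      this.count += 1;
      return true;
    }
    return false;
  }
}

// Replay the scenario: 100 requests at 0:59, then 100 more at 1:00.
function countAllowedAcrossBoundary() {
  const limiter = new FixedWindowLimiter(100, 60_000);
  let allowed = 0;
  for (let i = 0; i < 100; i++) if (limiter.allow(59_000)) allowed++;
  for (let i = 0; i < 100; i++) if (limiter.allow(60_000)) allowed++;
  return allowed;
}

console.log(countAllowedAcrossBoundary()); // 200
```

All 200 requests pass even though the limit is 100 per minute, because the two bursts land in different windows.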
Token bucket
Imagine a bucket that holds tokens. Every request costs one token. Tokens are added at a fixed rate (e.g., 10 per second) until the bucket is full; extra tokens are discarded. If the bucket is empty, the request is rejected.
Bucket capacity: 100 tokens
Refill rate: 10 tokens/second
Request arrives:
- Tokens available? → Allow, remove 1 token
- Bucket empty? → Reject (429 Too Many Requests)
Why it works: It allows bursts (up to the bucket capacity) while enforcing a long-term average rate. A user can send 100 requests instantly if they haven't made requests recently, but they can't sustain more than 10/second.
Used by: AWS API Gateway, Stripe, most cloud providers.
class TokenBucket {
  constructor(capacity, refillRate) {
    this.capacity = capacity;     // maximum burst size
    this.tokens = capacity;       // start with a full bucket
    this.refillRate = refillRate; // tokens per second
    this.lastRefill = Date.now();
  }

  // Returns true and spends one token if the request may proceed.
  allow() {
    this.refill();
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }

  // Credits tokens for the time elapsed since the last call, capped at capacity.
  refill() {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000; // seconds
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillRate);
    this.lastRefill = now;
  }
}
Sliding window log
Keep a log of every request timestamp. When a new request arrives, count how many requests happened in the last 60 seconds. If that count has already reached the limit, reject.
Window: last 60 seconds
Limit: 100 requests
Request at 1:30:45 → count requests from 1:29:45 to 1:30:45
- Count is 99 → Allow
- Count is 100 → Reject
Why it works: No boundary problems. The window slides with each request, so there's no "reset" moment to exploit.
Downside: Memory. You're storing every request timestamp. At high volume, this gets expensive.
Used by: GitHub API, smaller-scale APIs where precision matters.
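A minimal in-memory sketch of the sliding window log follows. The names `SlidingWindowLog` and `replayBoundaryBurst` are hypothetical, `now` is an injectable millisecond timestamp, and the sketch assumes only accepted requests are logged (some implementations also log rejected ones, which penalizes clients that keep hammering).

```javascript
class SlidingWindowLog {
  constructor(limit, windowMs) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.timestamps = []; // times of accepted requests, oldest first
  }

  allow(now = Date.now()) {
    const cutoff = now - this.windowMs;
    // Drop entries that have slid out of the window.
    while (this.timestamps.length > 0 && this.timestamps[0] <= cutoff) {
      this.timestamps.shift();
    }
    if (this.timestamps.length < this.limit) {
      this.timestamps.push(now);
      return true;
    }
    return false;
  }
}

// Same burst that beats a fixed window: 100 requests at 0:59, 100 more at 1:00.
function replayBoundaryBurst() {
  const limiter = new SlidingWindowLog(100, 60_000);
  let allowed = 0;
  for (let i = 0; i < 100; i++) if (limiter.allow(59_000)) allowed++;
  for (let i = 0; i < 100; i++) if (limiter.allow(60_000)) allowed++;
  return allowed;
}

console.log(replayBoundaryBurst()); // 100
```

The second burst is rejected in full because the first 100 timestamps are still inside the 60-second window. The memory cost is visible too: the array holds one entry per accepted request, per client.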
Sliding window counter
A hybrid approach. Divide time into fixed windows but weight the previous window based on how far into the current window you are.
Previous window (0:00-0:59): 80 requests
Current window (1:00-1:59): 30 requests so far
Current time: 1:15 (25% into current window)
Weighted count = (80 × 0.75) + 30 = 90
Limit: 100 → Allow
Why it works: Approximates the sliding window without storing individual timestamps. Uses only two counters per client (previous window and current window) instead of a full log.
Used by: Redis-based rate limiters, Cloudflare.
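The weighted count above can be sketched in memory as follows. This is an illustration under assumptions, not how any particular product implements it: the class name `SlidingWindowCounter`, its `estimate` and `roll` methods, and the injectable `now` timestamp (milliseconds) are all hypothetical.

```javascript
class SlidingWindowCounter {
  constructor(limit, windowMs) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.currentStart = 0;  // start time of the current fixed window
    this.currentCount = 0;  // requests accepted in the current window
    this.previousCount = 0; // requests accepted in the window before it
  }

  // Weighted estimate of how many requests fell in the last `windowMs`.
  estimate(now = Date.now()) {
    this.roll(now);
    const elapsedFraction = (now - this.currentStart) / this.windowMs;
    return this.previousCount * (1 - elapsedFraction) + this.currentCount;
  }

  allow(now = Date.now()) {
    if (this.estimate(now) < this.limit) {
      this.currentCount += 1;
      return true;
    }
    return false;
  }

  // Advance to the fixed window containing `now`, if it has changed.
  roll(now) {
    const windowStart = Math.floor(now / this.windowMs) * this.windowMs;
    if (windowStart === this.currentStart) return;
    // Exactly one window later: current becomes previous. Any longer gap: both are stale.
    const adjacent = windowStart - this.currentStart === this.windowMs;
    this.previousCount = adjacent ? this.currentCount : 0;
    this.currentCount = 0;
    this.currentStart = windowStart;
  }
}

// Reproduce the worked example: 80 requests in 0:00-0:59, 30 so far in 1:00-1:59, time 1:15.
const counter = new SlidingWindowCounter(100, 60_000);
for (let i = 0; i < 80; i++) counter.allow(30_000);
for (let i = 0; i < 30; i++) counter.allow(75_000);
console.log(counter.estimate(75_000)); // 90
```

The estimate assumes the previous window's requests were spread evenly across it, which is where the approximation comes from.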
Leaky bucket
Requests enter a queue (the bucket). The queue is processed at a fixed rate. If the queue is full, new requests are dropped.
Queue capacity: 50
Processing rate: 10 requests/second
Request arrives:
- Queue not full? → Add to queue, process in order
- Queue full? → Reject
Why it works: Guarantees a smooth, constant output rate. No bursts. Useful when your backend can only handle a fixed throughput.
Downside: Adds latency. Requests wait in the queue instead of being processed immediately.
Used by: Network traffic shaping, NGINX's limit_req module.
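A rough sketch of the queue-based leaky bucket described above. The names `LeakyBucketQueue`, `offer`, `leak`, `startDraining`, and the `handle` callback are hypothetical; the sketch assumes the processing rate is set by a timer that releases one request per tick.

```javascript
class LeakyBucketQueue {
  constructor(capacity) {
    this.capacity = capacity;
    this.queue = []; // pending requests, oldest first
  }

  // Returns false when the queue is full and the request is dropped.
  offer(request) {
    if (this.queue.length >= this.capacity) return false;
    this.queue.push(request);
    return true;
  }

  // Releases the oldest pending request, or undefined if the queue is empty.
  leak() {
    return this.queue.shift();
  }
}

// Drains at a constant rate by releasing one request per timer tick.
// `handle` is whatever actually processes a request (an assumption of this sketch).
// Returns the timer so the caller can stop it with clearInterval.
function startDraining(bucket, ratePerSecond, handle) {
  return setInterval(() => {
    const request = bucket.leak();
    if (request !== undefined) handle(request);
  }, 1000 / ratePerSecond);
}

// 60 requests arrive at once against a queue of 50: the last 10 are dropped.
const bucket = new LeakyBucketQueue(50);
let accepted = 0;
for (let id = 0; id < 60; id++) if (bucket.offer(id)) accepted++;
console.log(accepted); // 50
```

With `startDraining(bucket, 10, handle)`, the 50 queued requests would be handed to `handle` over roughly five seconds, which is the added latency the downside above refers to.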
Which one should you use?
| Algorithm | Allows bursts | Memory | Precision | Best for |
|---|---|---|---|---|
| Token bucket | ✅ Yes | Low | Good | Most APIs |
| Sliding window log | ❌ No | High | Exact | Small-scale, precision needed |
| Sliding window counter | ⚠️ Small | Low | Approximate | High-scale APIs |
| Leaky bucket | ❌ No | Medium | Good | Smooth output rate needed |
Default choice: Token bucket. It handles bursts gracefully, uses minimal memory, and is what most cloud providers implement.
Implementation in practice
Most developers don't implement rate limiting from scratch. Use:
- Express: express-rate-limit (fixed window) or rate-limiter-flexible (multiple algorithms)
- NGINX: limit_req (leaky bucket)
- Redis: ioredis + sliding window counter pattern
- Cloud: API Gateway rate limiting (AWS, GCP, Azure)
- Cloudflare: Built-in rate limiting rules
The headers your API should return
X-RateLimit-Limit: 100 # max requests per window
X-RateLimit-Remaining: 42 # requests left
X-RateLimit-Reset: 1679616000 # when the window resets (Unix timestamp)
Retry-After: 30 # seconds to wait (on 429 response)
These headers let clients implement backoff without guessing.
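One way to produce these headers from limiter state is sketched below. The function name `rateLimitHeaders` and the fields `limit`, `remaining`, `resetAtMs`, and `allowed` are assumptions for illustration; a real limiter would expose its state in its own shape.

```javascript
// Builds the response headers for one request.
// `resetAtMs` is when the window resets, in milliseconds since the Unix epoch.
function rateLimitHeaders({ limit, remaining, resetAtMs, allowed }, nowMs = Date.now()) {
  const headers = {
    'X-RateLimit-Limit': String(limit),
    'X-RateLimit-Remaining': String(Math.max(0, remaining)),
    'X-RateLimit-Reset': String(Math.ceil(resetAtMs / 1000)), // Unix seconds
  };
  if (!allowed) {
    // Sent only with a 429: whole seconds until the client should retry.
    const waitSeconds = Math.max(0, Math.ceil((resetAtMs - nowMs) / 1000));
    headers['Retry-After'] = String(waitSeconds);
  }
  return headers;
}

// A rejected request 30 seconds before the window resets.
const rejected = rateLimitHeaders(
  { limit: 100, remaining: 0, resetAtMs: 1679616000000, allowed: false },
  1679615970000
);
console.log(rejected['Retry-After']); // "30"
```

A client that receives a 429 can then sleep for `Retry-After` seconds instead of retrying in a tight loop.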
The one-sentence summary
Rate limiting controls request flow using algorithms that trade off between burst tolerance, memory usage, and precision; token bucket is the right default for most APIs.