Rate Limiting In Production: A Token Bucket In 30 Lines Of Redis
The fixed-window rate limiter that most tutorials show you breaks at scale. Token bucket is the algorithm many production APIs use because it allows bursts without exceeding the average rate. Here is a 30-line Lua-on-Redis implementation, the failure modes to test for, and the headers you should be returning to clients.
A misconfigured client hammers your API with 600 requests per second. Your servers degrade, healthy traffic suffers, the on-call gets paged. The fix is rate limiting, but the version most tutorials teach (a counter in Redis with INCR+EXPIRE, which is a fixed window even when it is labeled “sliding”) has two failure modes at scale: it allows double the limit at the boundary between windows, and it does not let well-behaved clients use saved-up capacity for legitimate bursts.
The algorithm many production APIs use is token bucket. Stripe has written publicly about its token-bucket limiters and AWS API Gateway documents token-bucket throttling; the model is also common in CDN and gateway limiters. It handles bursts gracefully, has clean semantics, and fits in 30 lines of Lua running atomically on Redis. This post covers that implementation, the headers you should return alongside it, and three pitfalls to avoid.
How token bucket works
You have a bucket that holds up to capacity tokens. Tokens are added at a constant rate (refill_rate per second, up to capacity). Every request consumes a token. If the bucket is empty, the request is rejected (or queued). Done.
capacity    = 10
refill_rate = 5/sec

Time:    0s   1s   2s   3s               4s   5s
Action:  --   --   --   7 reqs (drain)   --   --
Bucket:  10   10   10   3                8    10
The algorithm has three properties that matter:
- Bursts are allowed up to bucket capacity. A client that has been quiet can use saved-up tokens.
- Sustained rate is bounded by refill rate. Over the long run, no client exceeds refill_rate requests per second.
- No window boundary. Fixed-window limiters can admit up to 2× the limit across a window boundary; token bucket has no windows, so it has no boundary to exploit.
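The window-boundary problem is easy to reproduce with a few lines of arithmetic. The following is a minimal in-memory sketch, not the Redis implementation: `FixedWindowCounter` and `admittedAcrossBoundary` are hypothetical names, and the limit of 10 requests per 1-second window is an assumption chosen for illustration.

```typescript
// Hypothetical in-memory fixed-window counter, for illustration only.
class FixedWindowCounter {
  private windowStart = -1;
  private count = 0;
  private readonly limit: number;
  private readonly windowSeconds: number;

  constructor(limit: number, windowSeconds: number) {
    this.limit = limit;
    this.windowSeconds = windowSeconds;
  }

  // `now` is in seconds; returns true if the request is admitted.
  allow(now: number): boolean {
    const start = Math.floor(now / this.windowSeconds) * this.windowSeconds;
    if (start !== this.windowStart) {
      // A new window begins: the counter resets to zero.
      this.windowStart = start;
      this.count = 0;
    }
    if (this.count >= this.limit) return false;
    this.count += 1;
    return true;
  }
}

// Send 10 requests just before the boundary and 10 just after it.
function admittedAcrossBoundary(): number {
  const limiter = new FixedWindowCounter(10, 1);
  let admitted = 0;
  for (let i = 0; i < 10; i++) if (limiter.allow(0.9)) admitted++;
  for (let i = 0; i < 10; i++) if (limiter.allow(1.1)) admitted++;
  return admitted; // 20 requests admitted within 0.2 seconds
}
```

Both batches fall in different windows, so each gets a fresh counter and the client sends twice the nominal limit in a fraction of a second.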
You pick capacity and refill_rate based on your tolerance for bursts vs your worst-case sustained rate. A common pattern: capacity = 10× refill_rate (allow ten seconds of burst).
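To make the mechanics concrete, here is a minimal in-memory sketch of the same idea with an injected clock. It is not the Redis version: `TokenBucket` and `replayTimeline` are hypothetical names, and the schedule replayed is the example configuration (capacity 10, refill 5/sec, seven requests at t = 3s).

```typescript
// In-memory token bucket with an injected clock (seconds), for illustration only.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;
  private readonly capacity: number;
  private readonly refillRate: number;

  constructor(capacity: number, refillRate: number, now: number) {
    this.capacity = capacity;
    this.refillRate = refillRate;
    this.tokens = capacity; // a new bucket starts full
    this.lastRefill = now;
  }

  // Add tokens for the time elapsed since the last update, capped at capacity.
  private refill(now: number): void {
    const elapsed = Math.max(0, now - this.lastRefill);
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillRate);
    this.lastRefill = now;
  }

  // Returns true and debits `cost` tokens if enough are available.
  tryConsume(now: number, cost = 1): boolean {
    this.refill(now);
    if (this.tokens < cost) return false;
    this.tokens -= cost;
    return true;
  }

  available(now: number): number {
    this.refill(now);
    return this.tokens;
  }
}

// Replays the example schedule and records the bucket level at each second.
function replayTimeline(): number[] {
  const bucket = new TokenBucket(10, 5, 0);
  const levels: number[] = [];
  for (let t = 0; t <= 5; t++) {
    if (t === 3) {
      for (let i = 0; i < 7; i++) bucket.tryConsume(t);
    }
    levels.push(bucket.available(t));
  }
  return levels; // [10, 10, 10, 3, 8, 10]
}
```

The level at t = 5s is 10 rather than 13 because refill is capped at capacity; that cap is what bounds the size of any burst.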
Why Redis Lua
The naive implementation takes several Redis operations: read the token count and timestamp, then conditionally decrement and write both back. Interleave two of those from different processes and you get a race: both read 1 token, both write 0, and both let the request through, so two requests are admitted on a single token.
Redis Lua scripts run atomically on the server. The whole token-bucket update happens as one operation in a single network round trip, with no race. It is also fast: expect roughly 50µs per check and on the order of 50,000 checks per second per Redis instance, though the exact figures depend on your hardware and network.
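The race is easy to reproduce without Redis. The sketch below uses a hypothetical `FakeStore` whose async `get` and `set` stand in for separate Redis round trips; `naiveConsume` and `raceDemo` are illustrative names, not part of any library.

```typescript
// Hypothetical async store standing in for separate Redis GET and SET calls.
class FakeStore {
  private value: number;

  constructor(initial: number) {
    this.value = initial;
  }

  async get(): Promise<number> {
    return this.value;
  }

  async set(v: number): Promise<void> {
    this.value = v;
  }
}

// Naive limiter: read, decide, then write back in a separate step.
async function naiveConsume(store: FakeStore): Promise<boolean> {
  const tokens = await store.get(); // step 1: read
  if (tokens < 1) return false;
  await store.set(tokens - 1); // step 2: write, after another caller may have read
  return true;
}

// Two concurrent requests compete for exactly one token.
async function raceDemo(): Promise<{ admitted: number; tokensLeft: number }> {
  const store = new FakeStore(1);
  const results = await Promise.all([naiveConsume(store), naiveConsume(store)]);
  return {
    admitted: results.filter(Boolean).length,
    tokensLeft: await store.get(),
  };
}
```

Both callers read the value 1 before either writes, so both are admitted and the stored count ends at 0. Moving the read, the decision, and the write into one atomic script removes the gap between the steps.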
The 30-line Lua script
-- token_bucket.lua
-- KEYS[1] = bucket key (e.g., 'rate:user:42')
-- ARGV[1] = capacity (max tokens)
-- ARGV[2] = refill rate (tokens per second)
-- ARGV[3] = now (seconds, float)
-- ARGV[4] = cost (tokens to consume; usually 1)
-- Returns: { allowed (1/0), remaining tokens (rounded down), retry_after_seconds (rounded up) }
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local cost = tonumber(ARGV[4])

local data = redis.call('HMGET', KEYS[1], 'tokens', 'last_refill')
local tokens = tonumber(data[1]) or capacity   -- a new bucket starts full
local last_refill = tonumber(data[2]) or now

-- Refill since last update. math.max guards against clock skew between callers.
local elapsed = math.max(0, now - last_refill)
tokens = math.min(capacity, tokens + elapsed * refill_rate)

local allowed = 0
local retry_after = 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
else
  retry_after = (cost - tokens) / refill_rate
end

-- HSET accepts multiple field/value pairs since Redis 4.0; HMSET is deprecated.
redis.call('HSET', KEYS[1], 'tokens', tokens, 'last_refill', now)
-- TTL = 2× the time to fully refill, so old buckets eventually expire.
redis.call('EXPIRE', KEYS[1], math.ceil(2 * capacity / refill_rate))

-- Redis truncates Lua numbers to integers in replies, so round explicitly:
-- remaining down, retry_after up (a 0.2 s wait must not be reported as 0).
return { allowed, math.floor(tokens), math.ceil(retry_after) }
The whole algorithm: read state, compute refill since last call, attempt to consume, write state, return. Atomic because Lua.
Wiring it up in Node.js
import { createClient } from 'redis';
import { readFileSync } from 'fs';

const redis = createClient({ url: process.env.REDIS_URL });
await redis.connect();

const TOKEN_BUCKET_LUA = readFileSync('./token_bucket.lua', 'utf8');
const sha = await redis.scriptLoad(TOKEN_BUCKET_LUA);

export interface RateLimitResult {
  allowed: boolean;
  remaining: number;
  retryAfter: number;
}

export async function rateLimit(
  key: string,
  capacity: number,
  refillRate: number,
  cost = 1,
): Promise<RateLimitResult> {
  const now = Date.now() / 1000;
  const result = (await redis.evalSha(sha, {
    keys: [key],
    arguments: [String(capacity), String(refillRate), String(now), String(cost)],
  })) as [number, number, number];
  return {
    allowed: result[0] === 1,
    remaining: result[1],
    retryAfter: result[2],
  };
}
scriptLoad registers the script with Redis once and gives you an SHA. Subsequent calls use evalSha which sends only the SHA, not the script body — saves bandwidth and parse time. If Redis restarts and forgets the script, fall back to eval:
try {
  return await redis.evalSha(sha, ...);
} catch (e: any) {
  if (e.message.includes('NOSCRIPT')) {
    return await redis.eval(TOKEN_BUCKET_LUA, ...);
  }
  throw e;
}
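One way to keep that fallback out of every call site is a small wrapper. The sketch below is an assumption-laden illustration: `ScriptClient` is a minimal hypothetical interface that mirrors only the `evalSha` and `eval` calls used above, and `evalWithFallback` is an illustrative helper name.

```typescript
interface ScriptOptions {
  keys: string[];
  arguments: string[];
}

// Minimal shape of the two client methods used here (an assumption, kept
// deliberately narrow so a fake client can be substituted in tests).
interface ScriptClient {
  evalSha(sha: string, options: ScriptOptions): Promise<unknown>;
  eval(script: string, options: ScriptOptions): Promise<unknown>;
}

async function evalWithFallback(
  client: ScriptClient,
  sha: string,
  script: string,
  options: ScriptOptions,
): Promise<unknown> {
  try {
    return await client.evalSha(sha, options);
  } catch (e) {
    const message = e instanceof Error ? e.message : String(e);
    if (message.includes('NOSCRIPT')) {
      // Redis no longer has the script cached (restart, SCRIPT FLUSH):
      // send the full body instead.
      return await client.eval(script, options);
    }
    throw e; // any other error is not ours to swallow
  }
}
```

Because EVAL also places the script in the server's script cache, later `evalSha` calls with the same SHA should succeed again without another fallback.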
The Express middleware
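import type { Request, Response, NextFunction } from 'express';

export function rateLimitMiddleware(
  keyFn: (req: Request) => string,
  capacity: number,
  refillRate: number,
) {
  return async (req: Request, res: Response, next: NextFunction) => {
    try {
      const result = await rateLimit(keyFn(req), capacity, refillRate);
      const nowSeconds = Date.now() / 1000;
      // Reset = the moment the bucket is full again.
      const secondsToFull = (capacity - result.remaining) / refillRate;
      res.setHeader('X-RateLimit-Limit', capacity);
      res.setHeader('X-RateLimit-Remaining', Math.floor(result.remaining));
      res.setHeader('X-RateLimit-Reset', Math.ceil(nowSeconds + secondsToFull));
      if (!result.allowed) {
        // Never advertise a zero-second wait on a rejection.
        res.setHeader('Retry-After', Math.max(1, Math.ceil(result.retryAfter)));
        return res.status(429).json({
          error: 'rate_limited',
          retryAfter: result.retryAfter,
        });
      }
      next();
    } catch (err) {
      // Express 4 does not catch rejected promises from async handlers.
      next(err);
    }
  };
}

// Usage: capacity 100, refill rate 10/s. Assumes an auth middleware has set req.user.
app.use('/api', rateLimitMiddleware((req) => `rate:${req.user.id}`, 100, 10));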
This config: 100-token capacity, 10/sec refill. That gives a sustained 10 RPS and bursts of up to 100 requests at once; a fully drained bucket takes 10 seconds to refill.
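When choosing the two numbers, it helps to compute the worst case a client can achieve in a given time span: a full bucket plus everything that refills during the span. A small sketch of that arithmetic, assuming every request costs one token (`maxAdmitted` is a hypothetical helper name):

```typescript
// Upper bound on requests admitted in any span of `windowSeconds`,
// assuming cost = 1 per request and a bucket that starts full.
function maxAdmitted(capacity: number, refillRate: number, windowSeconds: number): number {
  return Math.floor(capacity + refillRate * windowSeconds);
}

const instantBurst = maxAdmitted(100, 10, 0);   // 100: the burst itself
const tenSeconds = maxAdmitted(100, 10, 10);    // 200: burst plus 10 s of refill
const oneMinute = maxAdmitted(100, 10, 60);     // 700: the burst matters less over time
```

The longer the span, the closer the average gets to the refill rate, which is why capacity controls burst tolerance and refill rate controls sustained load.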
The headers you must return
When a client gets rate-limited, the response should tell them everything they need to back off:
- X-RateLimit-Limit: capacity.
- X-RateLimit-Remaining: tokens left.
- X-RateLimit-Reset: Unix timestamp when the bucket fully refills.
- Retry-After: seconds until enough tokens are available for the next request (in 429 responses).
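The four headers can be derived from the limiter result in one place. The following is a sketch under stated assumptions: `LimitState` and `buildRateLimitHeaders` are hypothetical names, the reset time is computed as the moment the bucket would be full again, and `Retry-After` is clamped to at least one second on rejections.

```typescript
interface LimitState {
  capacity: number;
  refillRate: number; // tokens per second
  remaining: number;  // may be fractional
  retryAfter: number; // seconds; 0 when the request was allowed
  allowed: boolean;
}

function buildRateLimitHeaders(state: LimitState, nowSeconds: number): Record<string, string> {
  // Time until the bucket is full again at the current refill rate.
  const secondsToFull = (state.capacity - state.remaining) / state.refillRate;
  const headers: Record<string, string> = {
    'X-RateLimit-Limit': String(state.capacity),
    'X-RateLimit-Remaining': String(Math.max(0, Math.floor(state.remaining))),
    'X-RateLimit-Reset': String(Math.ceil(nowSeconds + secondsToFull)),
  };
  if (!state.allowed) {
    // Whole seconds, never zero, so clients do not retry immediately.
    headers['Retry-After'] = String(Math.max(1, Math.ceil(state.retryAfter)));
  }
  return headers;
}
```

Keeping this as a pure function makes the header semantics testable without Redis or an HTTP server.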
Some teams use RateLimit-Limit (no X- prefix) per the draft IETF standard. Either works. Pick one and document it. The worst pattern is “we rate-limit but do not tell the client”: clients then back off arbitrarily, guess at an exponential backoff schedule, or (worse) treat 429 as a transient error and retry instantly.
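On the client side, honoring these headers can be a few lines. The sketch below is illustrative rather than prescriptive: `retryDelayMs` is a hypothetical helper, it only understands the delay-in-seconds form of `Retry-After`, and the 500 ms base and 30-second cap for the fallback backoff are assumptions.

```typescript
// Decide how long a client should wait before retrying a 429 response.
function retryDelayMs(
  retryAfterHeader: string | null,
  attempt: number,
  random: () => number = Math.random,
): number {
  if (retryAfterHeader !== null && retryAfterHeader.trim() !== '') {
    const seconds = Number(retryAfterHeader);
    if (Number.isFinite(seconds) && seconds >= 0) {
      // The server told us exactly how long to wait: trust it.
      return Math.ceil(seconds * 1000);
    }
  }
  // No usable header: exponential backoff with full jitter, capped at 30 s.
  const ceilingMs = Math.min(30_000, 500 * 2 ** attempt);
  return Math.floor(random() * ceilingMs);
}
```

A caller would read `Retry-After` from the 429 response, sleep for the returned number of milliseconds, and retry. The `random` parameter exists only so the jitter can be made deterministic in tests.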
Three pitfalls
1. Wrong key shape. Rate limiting per IP is the default, and it is wrong for almost everyone. NAT, corporate VPNs, mobile networks — many users share an IP. Use a stable user identifier when you have one (user_id, api_key, account ID), and only fall back to IP for unauthenticated endpoints.
For granular control, layer multiple limits:
// Per-user limit
await rateLimit(`rate:user:${userId}`, 1000, 100);
// Per-API-key limit (lower for cheap keys)
await rateLimit(`rate:key:${apiKey}`, 100, 10);
// Per-IP limit (defense-in-depth)
await rateLimit(`rate:ip:${ip}`, 200, 20);
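The first to fail wins: reject the request as soon as any check comes back with allowed = false. Note that a request rejected by a later check has already spent tokens in the earlier buckets. Cost: three Redis round trips; pipeline them if it matters.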
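A sketch of how the layered checks might be combined into one decision follows. The names `LimitSpec`, `CheckFn`, and `checkAll` are hypothetical; the check function is injected so the example does not depend on a live Redis, and in practice it would be the rateLimit function from earlier.

```typescript
interface LimitResult {
  allowed: boolean;
  remaining: number;
  retryAfter: number;
}

interface LimitSpec {
  key: string;
  capacity: number;
  refillRate: number;
}

type CheckFn = (key: string, capacity: number, refillRate: number) => Promise<LimitResult>;

// Runs every layer concurrently and returns the most restrictive outcome.
async function checkAll(check: CheckFn, specs: LimitSpec[]): Promise<LimitResult> {
  const results = await Promise.all(
    specs.map((s) => check(s.key, s.capacity, s.refillRate)),
  );
  const rejected = results.filter((r) => !r.allowed);
  if (rejected.length > 0) {
    // Blocked: the client has to wait for the slowest bucket to recover.
    return {
      allowed: false,
      remaining: 0,
      retryAfter: Math.max(...rejected.map((r) => r.retryAfter)),
    };
  }
  return {
    allowed: true,
    remaining: Math.min(...results.map((r) => r.remaining)),
    retryAfter: 0,
  };
}
```

Issuing the checks concurrently lets a client that pipelines commands send them together instead of waiting on each reply. The trade-off is that every layer is debited even when another layer rejects the request; if that matters, the alternative is a single Lua script that checks all keys before consuming from any of them.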
2. Cost-per-endpoint. Not all endpoints are equal. A GET /me is cheap; a POST /reports/generate is expensive. The token-bucket script accepts a cost argument — pass cost=10 for the expensive endpoints and cost=1 for the cheap ones. Now the bucket reflects work, not request count.
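One simple way to assign costs is a lookup table keyed by method and path. The table contents, the default of 1, and the names `ROUTE_COSTS` and `costFor` are assumptions for illustration; the two routes are the ones mentioned above.

```typescript
// Hypothetical cost table: tokens charged per "METHOD path" route.
const ROUTE_COSTS: Record<string, number> = {
  'GET /me': 1,
  'POST /reports/generate': 10,
};

const DEFAULT_COST = 1;

function costFor(method: string, path: string): number {
  const key = `${method.toUpperCase()} ${path}`;
  return ROUTE_COSTS[key] ?? DEFAULT_COST;
}

// A cost above the bucket capacity can never be satisfied, so check at startup.
function costsFitCapacity(capacity: number): boolean {
  return Object.values(ROUTE_COSTS).every((cost) => cost <= capacity);
}
```

The result of `costFor` is what gets passed as the fourth argument to the rate-limit call. The startup check matters because a request whose cost exceeds capacity would be rejected forever, no matter how long the client waits.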
3. Failure mode when Redis is down. If your Redis goes away, every rate-limit check fails. You have two policies:
- Fail open (allow all traffic when Redis is unreachable). Risk: a Redis outage leaves the API with no protection at all, possibly at the moment it is under the heaviest load.
- Fail closed (reject all traffic when Redis is unreachable). Risk: a Redis outage takes the API down.
Most teams pick “fail open with a circuit breaker” — first few errors fail open, but once Redis errors exceed a threshold, switch to a degraded local mode (in-process token buckets per server). This is a careful design choice; document the policy.
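A minimal sketch of that policy is below. Everything in it is an assumption made for illustration: the `ResilientLimiter` class, its option names, and the thresholds are hypothetical, and the remote check is injected so the example runs without Redis.

```typescript
interface Decision {
  allowed: boolean;
}

// Remote check, e.g. a wrapper around the Redis-backed limiter.
// It is expected to throw when Redis is unreachable.
type RemoteCheck = (key: string) => Promise<Decision>;

interface BreakerOptions {
  capacity: number;         // local fallback bucket size
  refillRate: number;       // local fallback tokens per second
  failureThreshold: number; // consecutive errors before switching to local mode
  cooldownMs: number;       // how long to stay local before trying Redis again
  now?: () => number;       // clock in milliseconds, injectable for tests
}

class ResilientLimiter {
  private failures = 0;
  private localUntil = 0;
  private readonly buckets = new Map<string, { tokens: number; last: number }>();
  private readonly remote: RemoteCheck;
  private readonly opts: BreakerOptions;
  private readonly now: () => number;

  constructor(remote: RemoteCheck, opts: BreakerOptions) {
    this.remote = remote;
    this.opts = opts;
    this.now = opts.now ?? (() => Date.now());
  }

  async check(key: string): Promise<Decision> {
    if (this.now() < this.localUntil) return this.localCheck(key);
    try {
      const decision = await this.remote(key);
      this.failures = 0;
      return decision;
    } catch {
      this.failures += 1;
      if (this.failures < this.opts.failureThreshold) {
        return { allowed: true }; // first few errors: fail open
      }
      // Too many consecutive errors: stop calling Redis for a while.
      this.failures = 0;
      this.localUntil = this.now() + this.opts.cooldownMs;
      return this.localCheck(key);
    }
  }

  // Degraded mode: an in-process token bucket per key, local to this server.
  private localCheck(key: string): Decision {
    const nowSec = this.now() / 1000;
    const b = this.buckets.get(key) ?? { tokens: this.opts.capacity, last: nowSec };
    const elapsed = Math.max(0, nowSec - b.last);
    b.tokens = Math.min(this.opts.capacity, b.tokens + elapsed * this.opts.refillRate);
    b.last = nowSec;
    const allowed = b.tokens >= 1;
    if (allowed) b.tokens -= 1;
    this.buckets.set(key, b);
    return { allowed };
  }
}
```

Two caveats apply to this sketch. In local mode each server enforces its own bucket, so the effective limit across N servers is up to N times the configured one. The bucket map also grows without bound here; a real deployment would need eviction for idle keys.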
Distributed considerations
A token-bucket limiter against one Redis works for a single API up to ~50,000 RPS limit-checks. Past that, you need:
Redis Cluster or sharded keys. Distribute keys across shards. Token-bucket keys are independent (per user), so they shard cleanly.
Local-first, eventually consistent limiting. Stripe’s old public design notes describe local in-process limiters that periodically reconcile with a global Redis. Adds complexity; gains ability to handle 1M+ RPS.
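If you shard manually rather than using Redis Cluster, each bucket key needs a stable mapping to one Redis instance. The sketch below shows one simple option, an FNV-1a-style hash over the key's UTF-16 code units followed by a modulo. The function names are hypothetical, and the choice of hash is an assumption; any stable, well-distributed hash would do.

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units (identical to byte-wise
// FNV-1a for ASCII keys such as 'rate:user:42').
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by the FNV prime, keep 32 bits
  }
  return hash >>> 0;
}

// Maps a bucket key to a shard index in [0, shardCount).
function shardFor(key: string, shardCount: number): number {
  if (!Number.isInteger(shardCount) || shardCount < 1) {
    throw new Error('shardCount must be a positive integer');
  }
  return fnv1a32(key) % shardCount;
}
```

Because the script touches a single key, every call for a given user lands on one shard and stays atomic there. Changing the shard count with plain modulo remaps most keys, which for a rate limiter only means those buckets start full again once.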
For 99% of applications, one Redis with a Lua token bucket is enough. The decision to go more complex should be driven by measured Redis CPU at peak, not theoretical scale.
Testing it
The most common bug in rate-limiter implementations is “it allows the limit on average but spikes 2× at boundaries.” A test against a real Redis instance makes this visible. This one runs on the real clock, so it includes a one-second sleep:
test('token bucket: capacity 10, refill 5/s, burst is allowed once', async () => {
  await redis.del('test-bucket');

  // 10 requests in quick succession: all should succeed (they use the burst capacity).
  for (let i = 0; i < 10; i++) {
    const r = await rateLimit('test-bucket', 10, 5);
    expect(r.allowed).toBe(true);
  }

  // 11th request immediately: should fail.
  expect((await rateLimit('test-bucket', 10, 5)).allowed).toBe(false);

  // Wait for partial refill.
  await new Promise(r => setTimeout(r, 1000)); // 1 second → 5 new tokens

  // 5 requests in quick succession: should succeed.
  for (let i = 0; i < 5; i++) {
    expect((await rateLimit('test-bucket', 10, 5)).allowed).toBe(true);
  }

  // 6th: should fail.
  expect((await rateLimit('test-bucket', 10, 5)).allowed).toBe(false);
});
Run this in CI. It is the test that catches the bug where someone “optimizes” the Lua script and breaks the math.
The takeaway
Rate limiting is one of those features that looks simple and isn’t. Token bucket is the algorithm to use because it allows real-world traffic patterns (bursts after quiet periods) without ever exceeding the long-term limit. Thirty lines of Lua running on Redis is the cheapest correct implementation. Return the X-RateLimit-* headers so clients can back off intelligently. Layer per-user, per-key, per-IP limits for defense-in-depth. Decide your fail-open vs fail-closed policy explicitly.
The next time someone says “we should add rate limiting,” the answer is not “let’s pick a library” — it is “we’ll write the Lua script, deploy it Tuesday, and the limit shows up in the API headers by Friday.”
A note from Yojji
The kind of cross-cutting reliability work that prevents one badly-behaved client from taking down an API for the rest — token-bucket limits, fail-open policies, headers that let clients back off correctly — is the unglamorous backend engineering that decides whether a public API is robust or fragile. It is the kind of work Yojji’s teams build into the production systems they ship.
Yojji is an international custom software development company founded in 2016, with offices in Europe, the US, and the UK. Their teams specialize in the JavaScript ecosystem (React, Node.js, TypeScript), cloud platforms (AWS, Azure, GCP), and full-cycle backend engineering — including the rate-limiting and abuse-prevention work that decides whether your API stays up under load.