Building Scalable APIs with Rate Limiting and Caching
Building APIs that scale requires careful attention to resource management. This guide covers essential patterns for handling high traffic.
Why Rate Limiting Matters
Without rate limiting, a single user or bot can overwhelm your API, leading to:
- Excessive load on servers
- Degraded performance for legitimate users
- Increased infrastructure costs
- Potential security vulnerabilities
Implementing Rate Limiting
Token Bucket Algorithm
The token bucket algorithm is one of the most popular approaches. Each client gets a bucket of tokens that refills at a steady rate; a request spends one token, so short bursts are allowed up to the bucket's capacity:
```python
import time
from collections import defaultdict

class RateLimiter:
    def __init__(self, rate, capacity):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum bucket size (burst allowance)
        self.tokens = defaultdict(lambda: capacity)
        self.last_update = defaultdict(time.time)

    def allow_request(self, client_id):
        current_time = time.time()
        time_passed = current_time - self.last_update[client_id]

        # Add new tokens based on time passed
        self.tokens[client_id] = min(
            self.capacity,
            self.tokens[client_id] + time_passed * self.rate
        )
        self.last_update[client_id] = current_time

        if self.tokens[client_id] >= 1:
            self.tokens[client_id] -= 1
            return True
        return False
```
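As a quick illustration, here is one way the limiter above might guard a request handler. The `handle_request` function and the 429-style response dictionaries are assumptions for this sketch, not tied to any particular framework:

```python
# Hypothetical wiring: allow bursts of up to 10 requests per client,
# refilling at 5 tokens per second.
limiter = RateLimiter(rate=5, capacity=10)

def handle_request(client_id):
    if not limiter.allow_request(client_id):
        # In an HTTP API this would typically become a 429 Too Many Requests
        return {"status": 429, "error": "rate limit exceeded"}
    return {"status": 200, "data": "ok"}
```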
Redis-Based Rate Limiting
For distributed systems, an in-memory limiter is not enough, because each API instance would keep its own counts. Store the counts in Redis instead so every instance sees the same state:
```python
import time

import redis

class DistributedRateLimiter:
    def __init__(self, redis_client, rate_limit, window_seconds):
        self.redis = redis_client
        self.rate_limit = rate_limit
        self.window = window_seconds

    def is_allowed(self, key):
        current = int(time.time())
        window_key = f"rate:{key}:{current // self.window}"

        # Count this request and make sure the window key eventually expires
        pipe = self.redis.pipeline()
        pipe.incr(window_key)
        pipe.expire(window_key, self.window)
        result = pipe.execute()

        return result[0] <= self.rate_limit
```
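A rough usage sketch follows; the connection settings and limits are placeholders, not recommendations:

```python
# Hypothetical setup: a shared fixed window of 100 requests per 60 seconds.
client = redis.Redis(host="localhost", port=6379, db=0)
limiter = DistributedRateLimiter(client, rate_limit=100, window_seconds=60)

if limiter.is_allowed("user:42"):
    ...  # process the request
else:
    ...  # reject, e.g. with 429 Too Many Requests
```

Note that this is a fixed-window counter: simple and cheap, but it can admit a brief burst around window boundaries, which is often an acceptable trade-off.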
Caching Strategies
Response Caching
Cache frequently requested data:
```python
import json

class APICache:
    def __init__(self, redis_client, default_ttl=300):
        self.redis = redis_client
        self.default_ttl = default_ttl

    def get_or_set(self, key, fetch_func, ttl=None):
        # Return the cached value if present, otherwise fetch and store it
        cached = self.redis.get(key)
        if cached:
            return json.loads(cached)

        data = fetch_func()
        self.redis.setex(
            key,
            ttl or self.default_ttl,
            json.dumps(data)
        )
        return data

    def invalidate(self, pattern):
        # Delete every key matching the pattern, e.g. "user:*"
        keys = self.redis.keys(pattern)
        if keys:
            self.redis.delete(*keys)
```
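A minimal usage sketch, assuming a local Redis instance and a placeholder fetch function standing in for the real database query:

```python
import redis

cache = APICache(redis.Redis(host="localhost", port=6379), default_ttl=300)

def get_user(user_id):
    # The lambda is a stand-in for the real database lookup.
    return cache.get_or_set(
        f"user:{user_id}",
        lambda: {"id": user_id, "name": "example"},
        ttl=60,
    )

# After a write, drop any cached entries for that user:
# cache.invalidate("user:123*")
```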
Cache Invalidation Patterns
```python
# Write-through cache: update the database and the cache together
def update_user(user_id, data):
    # Update database
    db.users.update(user_id, data)
    # Immediately update cache
    cache.set(f"user:{user_id}", data)

# Write-behind (async) cache: update the cache now, persist later
async def update_user_async(user_id, data):
    # Update cache immediately
    cache.set(f"user:{user_id}", data)
    # Queue database update
    await queue.put({"action": "update_user", "id": user_id, "data": data})
```
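The trade-off between the two is consistency versus write latency: write-through keeps the cache and database in sync on every write but makes the caller wait for both, while write-behind responds faster but can lose queued updates if the worker or queue fails before they are persisted.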
Load Distribution
Consistent Hashing
Distribute load across multiple cache servers:
```python
import hashlib
from bisect import bisect_right

class ConsistentHash:
    def __init__(self, nodes, replicas=100):
        self.replicas = replicas  # virtual nodes per physical node
        self.ring = []
        self.nodes = {}
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.replicas):
            key = self._hash(f"{node}:{i}")
            self.ring.append(key)
            self.nodes[key] = node
        self.ring.sort()

    def get_node(self, key):
        if not self.ring:
            return None
        hash_key = self._hash(key)
        idx = bisect_right(self.ring, hash_key) % len(self.ring)
        return self.nodes[self.ring[idx]]

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)
```
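For illustration, a short sketch of how keys might be routed to cache servers with this class; the node names are made up:

```python
# Hypothetical usage: route keys across three cache servers.
ring = ConsistentHash(["cache-a:6379", "cache-b:6379", "cache-c:6379"])

node = ring.get_node("user:42")  # the same key always maps to the same node
print(node)

# Adding a server remaps only a fraction of keys, unlike naive modulo hashing.
ring.add_node("cache-d:6379")
```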
Monitoring and Metrics
Track these key metrics:
- Request rate per endpoint
- Cache hit/miss ratio
- Rate limit violations
- Response times (p50, p95, p99)
- Error rates
```python
from prometheus_client import Counter, Histogram

request_count = Counter(
    'api_requests_total', 'Total API requests', ['endpoint', 'status']
)
request_latency = Histogram(
    'api_request_latency_seconds', 'Request latency', ['endpoint']
)
cache_hits = Counter('cache_hits_total', 'Cache hits', ['cache_type'])
cache_misses = Counter('cache_misses_total', 'Cache misses', ['cache_type'])
rate_limit_exceeded = Counter(
    'rate_limit_exceeded_total', 'Rate limit exceeded', ['endpoint']
)
```
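As a sketch of how these metrics might be recorded around a handler (the wrapper function itself is an assumption, not part of the original examples):

```python
import time

def instrumented(endpoint, process_func):
    # Record request count, status, and latency for one call.
    start = time.time()
    try:
        result = process_func()
        request_count.labels(endpoint=endpoint, status="200").inc()
        return result
    except Exception:
        request_count.labels(endpoint=endpoint, status="500").inc()
        raise
    finally:
        request_latency.labels(endpoint=endpoint).observe(time.time() - start)
```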
Best Practices
- Use appropriate TTLs: Balance freshness vs performance
- Implement cache warming: Pre-populate caches during low traffic
- Handle cache failures gracefully: Fall back to the database (see the sketch after this list)
- Use cache tags: Enable targeted invalidation
- Monitor cache health: Alert on high miss rates
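A minimal sketch of that graceful fallback, assuming the APICache class from earlier and a Redis-backed cache; the wrapper function is an assumption for illustration:

```python
import redis

def get_or_set_safe(cache, key, fetch_func, ttl=None):
    # If the cache is unreachable, serve directly from the source of truth.
    try:
        return cache.get_or_set(key, fetch_func, ttl)
    except redis.exceptions.ConnectionError:
        return fetch_func()
```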
Conclusion
Scalable APIs require a combination of rate limiting, caching, and proper load distribution. Start with simple implementations and evolve based on your specific traffic patterns and requirements.