Rate Limiting

Rate limiting is the practice of capping how many requests a client may make to a service within a given window of time. A public API that placed no ceiling on traffic would be at the mercy of any client, well-behaved or not: a runaway script, an aggressive crawler, or a denial-of-service attack could exhaust the server’s capacity and degrade the experience for everyone else. By enforcing a quota, the service protects its own stability and shares its finite resources fairly across many callers.

The mechanism for refusing excess traffic was standardized in RFC 6585, “Additional HTTP Status Codes,” published in 2012. It defines the 429 Too Many Requests status code, which a server returns when “the user has sent too many requests in a given amount of time.” The RFC specifies that the response “MAY include a Retry-After header indicating how long to wait before making a new request,” and its worked example shows a 429 response carrying “Retry-After: 3600.” This gives a polite client a concrete instruction: back off for 3,600 seconds (one hour) rather than hammering on the door.
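As an illustration of how a client might read that instruction, the sketch below converts a Retry-After value into a number of seconds to wait. It assumes the two forms HTTP permits for this header, a count of seconds or an HTTP date; the function name retry_after_seconds and the optional now parameter are hypothetical and exist only for this example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(value: str, now: Optional[datetime] = None) -> Optional[float]:
    """Return the wait, in seconds, requested by a Retry-After value.

    Returns None when the value is neither a whole number of seconds
    nor a parseable HTTP date.
    """
    value = value.strip()
    if value.isdigit():
        # Delta-seconds form, e.g. "3600".
        return float(value)
    try:
        # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # Treat a date without zone information as UTC (an assumption).
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means no further waiting is needed.
    return max(0.0, (when - now).total_seconds())
```

Called with the value from the RFC's example, retry_after_seconds("3600") yields 3600.0, the one-hour wait described above.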

Returning a 429 tells a client it has already exceeded the limit, but it would be better for the client to know the limit in advance and avoid hitting it at all. That is the purpose of the IETF HTTPAPI working group’s draft “RateLimit header fields for HTTP.” Early versions of the draft define three response header fields: RateLimit-Limit (the quota in the window), RateLimit-Remaining (how much of the quota is left), and RateLimit-Reset (the number of seconds until the window resets). Later revisions fold the same information into combined RateLimit and RateLimit-Policy fields. In either form, a server can advertise its policy and a cooperating client can pace itself before being throttled.
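A cooperating client could use these fields roughly as follows. The sketch assumes the RateLimit-Remaining and RateLimit-Reset header names described above, with the reset expressed in seconds. The function name pacing_delay and the strategy of spreading the remaining quota evenly across the window are illustrative choices, not something the draft prescribes.

```python
from typing import Mapping, Optional


def pacing_delay(headers: Mapping[str, str]) -> Optional[float]:
    """Suggest how many seconds to wait before the next request.

    Returns None when the rate-limit headers are missing or malformed,
    in which case the caller has no advertised policy to follow.
    """
    # HTTP header names are case-insensitive, so normalize the keys.
    lowered = {name.lower(): value for name, value in headers.items()}
    try:
        remaining = int(lowered["ratelimit-remaining"])
        reset = float(lowered["ratelimit-reset"])
    except (KeyError, ValueError):
        return None
    reset = max(0.0, reset)
    if remaining <= 0:
        # Quota exhausted: wait for the window to reset.
        return reset
    # Spread the remaining requests evenly over the time left.
    return reset / remaining
```

For example, with 20 requests remaining and 60 seconds until reset, the function suggests waiting 3.0 seconds between requests.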

Underneath these signaling conventions sits an algorithm that decides whether a request is within budget. One of the most widely used is the token bucket: a bucket is refilled with tokens at a steady rate up to some maximum, each request consumes a token, and a request is rejected when the bucket is empty. Because the bucket can hold a reserve of tokens, this approach permits short bursts of traffic while still bounding the long-run average rate. That matches the bursty behavior of real clients more closely than a rigid fixed-window counter does.
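The token bucket described above can be sketched in a few lines. This is a minimal single-threaded illustration, not a production implementation: the class name TokenBucket, its parameters, and the injectable clock (included so the behavior can be exercised without real waiting) are all assumptions made for this example.

```python
import time
from typing import Callable


class TokenBucket:
    """Minimal token bucket: refills at `rate` tokens per second up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity  # start full, so an initial burst is allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Consume `cost` tokens and return True, or return False if too few remain."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Add the tokens earned since the last call, capped at capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

With rate=1.0 and capacity=3.0, three back-to-back requests succeed and a fourth is rejected. After two seconds, two more requests are admitted. The capacity sets the size of the permitted burst, and the rate sets the long-run average.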

Major API providers each publish their own limits and headers, and the values vary widely by product and by whether the caller is authenticated. The common thread is that rate limiting is now a baseline expectation of any serious networked service: it marks the boundary between a system that survives a traffic spike and one that falls over. It pairs naturally with client-side retry-with-backoff, since a client that respects a 429 and waits the indicated interval shows precisely the cooperative behavior the limit is designed to encourage.
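One way to picture client-side retry-with-backoff is the sketch below. It assumes a hypothetical send callable that performs one request and returns a status code with a mapping of response headers; call_with_backoff and its parameters are likewise invented for illustration. When a 429 carries a Retry-After value in seconds, the client waits that long. Otherwise it falls back to exponential backoff with random jitter. For brevity, the HTTP-date form of Retry-After is not handled here.

```python
import random
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str]]


def call_with_backoff(send: Callable[[], Response],
                      max_attempts: int = 5,
                      base_delay: float = 1.0,
                      max_delay: float = 60.0,
                      sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call `send`, retrying on 429 until it succeeds or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, headers = send()
        if status != 429 or attempt == max_attempts - 1:
            # Either not throttled, or out of attempts: hand back the result.
            return status, headers
        lowered = {name.lower(): value for name, value in headers.items()}
        retry_after = lowered.get("retry-after", "").strip()
        if retry_after.isdigit():
            # The server said how long to wait; respect it.
            delay = float(retry_after)
        else:
            # No usable hint: exponential backoff with full jitter.
            delay = random.uniform(0.0, min(max_delay, base_delay * 2 ** attempt))
        sleep(delay)
    raise AssertionError("unreachable")
```

The jitter keeps many clients that were throttled at the same moment from retrying in lockstep, which would otherwise recreate the spike the limit was meant to absorb.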
