What is API Throttling and Rate Limiting?

Leaky Bucket Algorithm

Throttling is an important concept when designing resilient systems. It’s also important if you’re trying to use a public API such as Google Maps or the Twitter API. These APIs apply a rate limiting algorithm to keep your traffic in check and throttle you if you exceed those rates. Whether you’re designing a system to protect yourself from clients or you’re just trying to call an API, throttling is an important thing to know about.

The concept itself is a fairly simple one: “just control the amount of traffic to an application”. I find that a lot of articles these days, such as this one from Microsoft, tend to over-complicate this very simple concept with an overload of detail.

In this article I want to help you understand throttling from a practical perspective. I’m going to give you the important bits that are applicable in real world systems that I have worked with. So first, let’s define throttling.

“Throttling, also sometimes called Rate Limiting, is a technique that allows a service to control the consumption of resources used by an instance of an application, an individual tenant, or an entire service”

Source: Microsoft.com

One thing I want to emphasize here is the direction of the relationship between Client and Server when talking about Throttling.

Throttling is a policy that the Server enforces and the Client respects.

Clients can respect this policy with techniques such as Retries and Exponential Backoff (more on that later). Either way, it’s important to understand who does what from the get-go, as much of the remaining article assumes this knowledge.

So now we understand what throttling is as a concept. But the question is, why do applications rate limit their clients? Let’s explore that below.

Why do Applications Rate Limit?

When a software engineer builds an API, he or she provisions a certain number of servers to satisfy the expected incoming demand. For instance, if I have a very unpopular system, I may only allocate a couple of servers to handle and process incoming traffic. Conversely, if I have a very popular API, I will go ahead and configure a large number of servers. This is even more important if you have a public API for a well known website (e.g. Twitter, Google Maps, or LinkedIn).

Applications rate limit for some very basic reasons:

  1. System owners do not want a single client to overwhelm their system with requests, affecting traffic for other clients.
  2. System owners want their systems to behave in a predictable way and meet a certain SLA (Service Level Agreement). In order to do so, they must control the rate of traffic coming from individual clients so that it can stay within expected bounds.
  3. System owners want to keep their costs under control. This is especially true if an API consumes a large amount of resources or is linked to another ‘paid’ API.

For example, here are the rate limits that the Twitter API applies to developers using it:

Twitter rate limiting / throttling settings for developers
Notice the requests are per 15 minute window. The time window is an important concept in rate limiting to be discussed below.

And here’s an example from LinkedIn. It’s not as detailed as the Twitter image from above, but it’s the same idea:

LinkedIn Rate Limiting / Throttling section from the developer console.

How do Applications Rate Limit?

Applications can use a variety of techniques to rate limit their clients. The basic outcome from the client side is the same though: if you exceed a certain number of requests per time window, your requests will be rejected and the API will throw you a ThrottlingException. Throttling exceptions indicate what you would expect – you’re either calling too much, or your rate limits are too low. Either way, you should slow down your rate of calling.

The most popular rate limiting or throttling technique that I’ve encountered in the real world is the Token Bucket Algorithm. In fact, it’s the most popular method used in Amazon Web Services APIs, so it’s important to be familiar with it if you’re using AWS. Let’s explore it below.

The Token Bucket Algorithm

The Token Bucket Algorithm has two major components, burst and refill (sometimes called sustain). We define them below.

  1. Burst – Burst corresponds to the number of ‘Tokens’ that are available for a client. One Token is consumed every time a request comes in. In this example, imagine that a token maps 1:1 to a drop of water in a bucket. Burst actually refers to the size of the bucket (aka the maximum number of drops it can hold for consumption). The more burst capacity you have, the more receptive the server is to high volume, low frequency traffic. Keep in mind here that each client has its own bucket.
  2. Refill/Sustain – Refill or Sustain corresponds to the rate at which the backend service ‘refills’ water into your bucket. It is essentially the replenishment rate: how fast the backend will give you more opportunities to call the API.

An important concept of the token bucket algorithm is the time unit used to define the burst/refill. Usually this operates in seconds for most respectable APIs. This means that your burst capacity is calculated on a per second basis (too many requests exceeding the burst rate in a single second will cause Throttling). Similarly, refill/sustain is usually on a per second basis (you receive new tokens or ‘water’ every second). The time unit here could be anything though, from milliseconds to seconds to minutes to hours – it’s really up to you.

Another thing to keep in mind is that the burst rate is usually greater than or equal to the refill / sustain rate. Practically speaking, this makes sense – the capacity of the bucket we are pouring water into is usually greater than the amount we pour in per time unit. For example, if we have a bucket with 1 Litre of capacity, it wouldn’t make much sense to have a sustain rate or ‘pouring rate’ of 2 Litres per second. It would mean that you’re pouring at a rate that constantly makes the bucket overflow, so the extra water is simply wasted.
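To make burst and refill concrete, here is a minimal single-client sketch of a token bucket in Python. It is an illustration under assumptions, not how AWS or any particular API implements it: the class name `TokenBucket`, the `allow` method and the injectable `clock` parameter are all hypothetical names chosen for this example, and a real service would keep one bucket per client and worry about concurrency.

```python
import time


class TokenBucket:
    """Illustrative token bucket for one client (hypothetical names)."""

    def __init__(self, burst, refill_per_sec, clock=time.monotonic):
        self.capacity = float(burst)              # bucket size ("burst")
        self.refill_per_sec = float(refill_per_sec)  # tokens added per second
        self.tokens = float(burst)                # the bucket starts full
        self.clock = clock                        # injectable so time can be faked
        self.last = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last call, capped at capacity.
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)

    def allow(self, cost=1):
        # Spend `cost` tokens if available; otherwise the request is throttled.
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With `burst=10` and `refill_per_sec=1`, a client can fire 10 requests at once, after which it earns back one request per second. Once emptied, the bucket takes burst / refill = 10 seconds to fill up completely again.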

So let’s think about some different combinations of high/low burst and sustain limits and the implications they have for the client of a rate limited API.

High Burst, Low Refill/Sustain – This combination means that the client will be allowed to make infrequent, bursty calls to an API. However, if the client makes too many bursts before the bucket has gone back to maximum capacity, some calls in the next burst will fail.

Equal Burst and Refill/Sustain – If your burst is equivalent to your refill/sustain, you essentially have a static limit per time unit, e.g. 10 requests allowed per second. This configuration is more similar to the Time Window rate limiting algorithm.
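For comparison, a fixed time window limiter could look roughly like the sketch below. The names `FixedWindowLimiter`, `limit` and `window_sec` are assumptions made for illustration. It counts requests in the current window and resets the count when a new window starts, which is the "static limit per time unit" behaviour described above.

```python
import time


class FixedWindowLimiter:
    """Illustrative fixed-window counter (hypothetical names)."""

    def __init__(self, limit, window_sec=1.0, clock=time.monotonic):
        self.limit = limit            # max requests allowed per window
        self.window_sec = window_sec  # window length in seconds
        self.clock = clock            # injectable so time can be faked
        self.window_id = None         # index of the window we are counting in
        self.count = 0

    def allow(self):
        # Identify the current window; start a fresh count when it changes.
        window_id = int(self.clock() // self.window_sec)
        if window_id != self.window_id:
            self.window_id = window_id
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```

Unlike a token bucket, unused capacity does not carry over here: each window starts from zero regardless of how quiet the previous one was.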

The main strength of the Token Bucket Algorithm is that it allows you to build an API that is accommodating of low frequency, bursty workloads. At the same time, you can prevent too many bursts from occurring from a single client by controlling their refill rate.

Implications of a Rate Limited API

Server Perspective

From the server perspective, Rate Limiting means that you need to either use existing rate limiting features in web servers, or build your own to control traffic. For example, here’s a configuration from NGINX that rate limits an API by client IP address at a rate of 1 request / second:

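http {
    # Zone "one" (10 MB) tracks clients by IP at 1 request/second; apply it with limit_req in a location block.
    limit_req_zone $binary_remote_addr zone=one:10m rate=1r/s;
}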

Using the IP address is a bit ill advised here since it’s very easy for customers to change their IP addresses at will. Ideally, you would want to use some client identity name or access token to identify a client in order to control their rates. This is how it works with the Twitter API, where you attach your developer access token to each request. This lets Twitter identify you consistently and apply rate limits to just you and nobody else.
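As a rough sketch of keying limits by identity rather than IP, the function below picks the key that a limiter would track. Everything here is an assumption for illustration: the function name `client_key`, the use of an `Authorization: Bearer ...` header, and the fallback to the IP address are hypothetical, and this is not a description of how Twitter identifies clients.

```python
import hashlib


def client_key(headers, remote_addr):
    """Return the identity that rate limits are tracked against (hypothetical)."""
    auth = headers.get("Authorization", "")
    prefix = "Bearer "
    if auth.startswith(prefix) and len(auth) > len(prefix):
        token = auth[len(prefix):]
        # Hash the token so the raw credential is never used as a key or logged.
        return "token:" + hashlib.sha256(token.encode("utf-8")).hexdigest()
    # No token supplied: fall back to the (easily changed) IP address.
    return "ip:" + remote_addr


# The same token maps to the same key, whatever IP the request arrives from.
key_a = client_key({"Authorization": "Bearer abc123"}, "203.0.113.5")
key_b = client_key({"Authorization": "Bearer abc123"}, "198.51.100.7")
```

The returned key can then be used to look up that client’s own bucket or counter, so a client who switches IP addresses still draws from the same allowance.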

Alternatively, if you’re not happy with any of the libraries providing this functionality, you could always roll your own. Be wary though, rate limiting (especially in a distributed environment) can get a bit tricky to implement well. Only go here if the libraries available really don’t solve your problem.

Client Perspective

If you’re a Client, or a user of a rate limited API, there are some important things to be aware of. Most importantly, you as the client need to be aware of your rate limits. This helps you design your system in such a way that you won’t exceed the rates provisioned by your resource server.

Secondly, it’s important to implement a robust retry policy with exponential backoff when faced with a ThrottlingException from a rate limited API. In the real world, I usually see a retry policy that consists of 3 attempts, with an exponentially increasing sleep duration between attempts. This means your request sleeps progressively longer between each attempt, giving the resource server the opportunity to ‘catch up’ and assign you more tokens.
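A retry policy like the one described above could be sketched as follows. This is a minimal example under assumptions: `ThrottlingException` here is a stand-in class for whatever error your API client actually raises, and `call_with_backoff`, `base_delay` and the injectable `sleep` parameter are hypothetical names for illustration.

```python
import time


class ThrottlingException(Exception):
    """Stand-in for the error your API client raises when throttled."""


def call_with_backoff(operation, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call `operation`, retrying on throttling with exponentially longer sleeps."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottlingException:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            # Sleep base_delay, then 2x, then 4x, ... before the next attempt.
            sleep(base_delay * (2 ** attempt))
```

With the defaults, a call that keeps getting throttled is tried 3 times, sleeping 0.5 seconds and then 1 second in between, before the exception is passed on to the caller.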


It’s important to know what Rate Limiting / Throttling is from both the client and the server perspective. If you’re a developer using an open source API, I guarantee you that you will at some point be facing the dreaded ThrottlingException or RateLimitedExceedException from these APIs. It’s important to know how to handle them in any case.

If you’re a resource owner / service builder, Rate Limiting / Throttling is an important concept that helps regulate the resources of your service per client so that you can ensure a consistent experience for ALL users. It’s important to know how it works, and some of the algorithms that are available to you. The Token Bucket is arguably the most popular, and it’s my go-to choice when choosing a rate limiting algorithm.

  1. This is a remarkable article by the way. I am going to go ahead and bookmark this post for my brother to read later on tonight. Keep up the good quality work.

  2. Would it be fair to say:

    Throttling is a server side response where feedback is provided to the caller indicating that there are too many requests coming in from that client or that the server is overloaded and needs clients to slow down their rate of requests.

    Rate Limiting is a client side response to the maximum capacity of a channel.
