How I Would Design… An API Rate Limiter!

A system design demonstration

James Collerton
9 min readNov 19, 2021
Slow down, you’re moving too fast


This article is the next in my series of how I would design popular applications. It is recommended (although not entirely necessary) to read the previous posts I’ve helpfully compiled in a list here. We will expect a basic familiarity with architecture principles and AWS, but hopefully this post is approachable for most engineers.


Initially, let’s look at our problem statement.

The System to Design

Rate limiting is actually something I’m working on at the moment for The BBC! Let’s say we have a public facing API, or perhaps an internal one that has the potential to be hit by a lot of traffic. We want to be able to limit the amount of requests a user or system can make. This will prevent us from being overwhelmed if we become too popular, or if a nefarious actor starts trying to DoS us!

In our system we need to limit the number of requests an entity can send to an API within a certain time. As we will see, this will be part of a distributed system, therefore we will need to make this limit consistent across multiple machines.

In non-functional terms we need the regulars (available, resilient and highly scalable), but we also need to avoid slowing down any requests using our service.

The Approach

We have a standard approach to system design which is explained more thoroughly in the article here. However the steps are summarised below:

  1. Requirements clarification: Making sure we have all the information before starting. This may include how many requests or users we are expecting.
  2. Back of the envelope estimation: Doing some quick calculations to gauge the necessary system performance. For example, how much storage or bandwidth do we need?
  3. System interface design: What will our system look like from the outside, how will people interact with it? Generally this is the API contract.
  4. Data model design: What our data will look like when we store it. At this point we could be thinking about relational vs non-relational models.
  5. Logical design: Fitting it together in a rough system! At this point I’m thinking at a level of ‘how would I explain my idea to someone who knows nothing about tech?’
  6. Physical design: Now we start worrying about servers, programming languages and the implementation details. We can superimpose these on top of the logical design.
  7. Identify and resolve bottlenecks: At this stage we will have a working system! We now refine the design.

Requirements clarification

At this point I would have a few questions about our system.

As our system is all about traffic, I would want to know the approximate traffic patterns we expect. For example, will there be lots of spikes, in which case we need to cater for this. Equally, what is the expected average traffic, and what can our downstream services handle?

On top of this, what do we need to limit by? Is it an IP address, or a specific user’s credentials? If it is the former we can take this from the HTTP request. However, if the request is made by a single system (not using the X-Forwarded-For header), then we will get the same IP address each time!

If we are looking at a user we need to extract their information another way. It could be that an upstream service extracts their identifier from a cookie or token, and forwards it as a header.

Ideally I’d be looking to design around the maximum capacity of our downstream system. This means if our service gets popular overnight we haven’t unnecessarily capped any users.

For the sake of argument let’s say we’re expecting 100 requests per second, with an average size of 1KB, and about a million users. There are no spikes, and this (conveniently) is also roughly the capacity of what our downstream system can take. We will also have access to a user’s identifiers.

Back of the envelope estimation

Now we have our rough requirements, let’s turn these into some figures we can deal with. If we have 100RPS with a size of 1KB we’re expecting 100KB/s bandwidth. Depending on what rate limiting approach we take we will have different storage requirements. Let’s dive a bit deeper and see what we find.

System interface design

If we plan on treating our rate limiter like a reverse proxy for a single API endpoint, we can reduce complexity by making the two system interfaces as similar as possible.

The only thing we may need to implement on top of the regular API responses is something to say that the request has been rate limited. Luckily, we have the 429 ‘Too Many Requests’ response code. We can send this back to our client with an optional Retry-After header telling the client how long to back off.

Data model design

As we will see in our logical design, our data model depends on how we intend to approach the problem. However, one thing we can say straight off the bat is that we will be best using a fast, non-relational storage like Redis. We don’t want to introduce any latency, and our querying requirements are minimal, so this makes most sense.

Logical design

This is the meat of the problem. There are a number of different algorithms we could use to implement our rate limiting. In this section we will discuss them and decide on where we would like to host them.

Leaky Bucket/ Token Bucket

Aside from acting as great content for nursery rhymes, holes in buckets also act as inspiration for our first rate limiting algorithm.

The core of the leaky bucket algorithm is based on a first-in-first-out (FIFO) queue. We allow requests into our queue at any rate, and we release messages from our queue at a constant one.When our queue is full we begin to discard incoming requests.

The token bucket algorithm is slightly different. We feed tokens into a bucket at a given rate (say one a minute), up until a maximum number of tokens are accrued. Each time a user makes a request they look in their bucket to see if a token is free. If there is we allow them to make a request, if there isn’t then we don’t.


  • Cool names.
  • Requests come through at a constant rate.
  • Comparably easy to implement.


  • No guarantee of the amount of time taken to service a request.
  • Old requests take priority.
  • How do we do leaky bucket on a per-user basis? We could have a queue per user, but then we would need to monitor how many queues we have as well as how full the queues were. If we have one queue a single bad user can fill up our whole queue!

Fixed Window Counter

This is a little different as it limits requests per time window. For example, we may split the day into ten minute segments: 9:00am, 9:10am, 9:20am and limit the number of requests allowed per segment (say a maximum of ten). From there we implement a counter. When a user makes a request they increment their counter in the window. If they are above their maximum requests we block their call.


  • Again, pretty easy to implement.
  • We can separate out windows per user.


  • If we get lots of requests in the first second of the window, then we block the person for the rest of it.
  • If we send back a retry header for the next time window everyone will pile in at that timestamp and block our system.
  • If a user sends a lot of requests at 9:09:59am they can send a lot more the next second when we roll to the next window (boundary conditions).

Sliding Window Logs

When a user sends a request the timestamp is noted. We then periodically (or when a new request is made) go through and remove all requests older than a certain timestamp. We calculate the current request rate from the existing requests, and if it is acceptable we add the new requests to the queue.


  • More accurate than previously, we prioritise newer requests.
  • No boundary conditions.


  • More complex to implement.
  • More space and time consuming. We now need to store an extra timestamp with each requests. Additionally we need to do the calculations to remove the old requests and calculate request rates.

Sliding Window Counter

This is a sort of hybrid approach between fixed window and sliding logs. Strangely, the internet seems to have slightly divided opinions over how this is implemented (which is very unusual as it’s not like people on the internet to disagree!).

Imagine we have the following request limits for the given periods:

9:00 - 9:10: Max 10 requests: 5 requests received
9:10 - 9:20: Max 20 requests: 10 requests received
9:20 - 9:30: Max 30 requests: 15 requests received

If we have five requests come in by 09:15 then we weight the number of requests in the previous window by 50%. We have 5/2 = 2 and our total requests for the ten minute period is 5 + 2 = 7. We then weight the maximum possible requests equally using 10 / 2 + 20 / 2 = 15 and get an effective window of 15, with 7 requests coming in.


  • Gets rid of the boundary conditions.
  • Less memory intensive than sliding logs


  • We make an assumption that requests in the previous window were uniform.

Now we have discussed the various options for our rate limiting algorithm, we need to think about some of the implementation problems.

Initially, where do we store this information? In a modern, distributed system we will likely have multiple nodes behind a load balancer. If we store the information on a single node then other nodes won’t be able to see how many requests a user has made.

An example of a basic setup with information stored on the nodes

In the above example requests can go to either node, each with their mutually exclusive copy of the rate information. If 10 requests go to node A, node B has no idea!

One solution to this is to use sticky sessions. In this case a person’s session is always routed to the same node. In some ways this is a good approach as a person’s requests will always go to the same node with their rate limiting information. However, it brings up issues around problematic user’s requests all going to the same place.

A better solution might be to implement a centralised data store. This brings with it its own issues.

Initially, there will be latency on making requests to a different physical location. Another issue (especially in high traffic systems), is that if we retrieve some information, alter it, then store it again, we may have another request doing the same thing at the same time! This means the initial request will have retrieved an out of date value. Let’s call this a get-then-set approach.

One option is to use locks on a row. However, this will slow down other requests while we free up the lock.

Another method is to rely on a set-then-get approach. In this case we do our writing, then retrieve our value, which is much more atomic. In the sliding or fixed window counters we use an inbuilt function to increment the counter, sliding logs is set-then-get by default.

A really good explanation of all of this can be found in the article here.

Now we have our main concepts ironed out, let’s make a logical design.

Basic logical design

In the basic logical design we have a load balancer in front of our rate limiter nodes to distribute traffic, and a central information store. Valid requests are forwarded onto the load balancer of our downstream service for distribution!

Physical design

The physical design is a lot simpler. We can let AWS handle load balancing and hosting our nodes for us.

A simple physical diagram

We sit the rate limiting service on an ECS cluster and have our information available in a Redis Elasticache. I’ve assumed our downstream service is also on ECS, but it could be on anything.

Identify and resolve bottlenecks

There aren’t many bottlenecks I can see! We’ve already discussed a lot of them covering the pros and cons of different algorithms, data consistency and our data storage solution. We can probably leave it here.


In conclusion we have covered the main rate limiting algorithms and how we might implement them in the real world.



James Collerton

Senior Software Engineer at Spotify, Ex-Principal Engineer at the BBC