How I Would Design… Uber’s Backend!

A System Design Demonstration

James Collerton
9 min read · Dec 11, 2021
Here in my car, I feel safest of all

Audience

This article is the next in my series of how I would design popular applications. It is recommended (although not entirely necessary) to read the previous posts, which I've helpfully compiled in a list here. We will assume a basic familiarity with architecture principles and AWS, but hopefully this post is approachable for most engineers.

A bit of a disclaimer on this one. The Uber architecture turned out to be way more complicated than I thought, so it’s part system design, part case study. Hope you don’t mind!

Argument

Initially, let’s look at our problem statement.

The System to Design

With luck you’ve used Uber before, but if not, let’s recap.

Uber is an application for booking taxi rides. A rider opens the application on their phone and enters a destination. They are then shown a number of surrounding taxis that can pick them up.

Simultaneously, the nearest drivers are sent a notification and the option to pick up the rider. On selecting a rider, the driver is shown the route to their pickup. Both parties are also shown each other's locations in real time.

Once a rider is safely at their destination the driver can mark their journey as complete, opening them up for a new pickup!

Our usual non-functional requirements stand. It must be reliable, scalable and available.

The Approach

We have a standard approach to system design, which is explained more thoroughly in the article here. However, the steps are summarised below:

  1. Requirements clarification: Making sure we have all the information before starting. This may include how many requests or users we are expecting.
  2. Back of the envelope estimation: Doing some quick calculations to gauge the necessary system performance. For example, how much storage or bandwidth do we need?
  3. System interface design: What will our system look like from the outside, and how will people interact with it? Generally this is the API contract.
  4. Data model design: What our data will look like when we store it. At this point we could be thinking about relational vs non-relational models.
  5. Logical design: Fitting it together in a rough system! At this point I’m thinking at a level of ‘how would I explain my idea to someone who knows nothing about tech?’
  6. Physical design: Now we start worrying about servers, programming languages and the implementation details. We can superimpose these on top of the logical design.
  7. Identify and resolve bottlenecks: At this stage we will have a working system! We now refine the design.

With that said, let’s get stuck in!

Requirements Clarification

The first thing I would be wondering about is how many riders and drivers would be online per day. I would also be thinking about roughly how many rides we'd be expecting.

Let’s say we have ten million customers and five million drivers online at any one time. We average around ten million rides a day.

I would also be curious about how often we need to be sending data back and forth between the various parties and our infrastructure. Let’s say we send location information back every five seconds.

Back of the Envelope Estimation

Initially, ten million rides a day works out to roughly 115 rides a second. We haven't covered any of our data structures yet, but we can see that's a lot of information. Additionally, if we have 15 million parties sending us location updates every five seconds, that's roughly 3 million data points a second. Wow!
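
As a quick sanity check on those numbers, here's the arithmetic as a minimal sketch (using the assumed figures above, not real Uber traffic):

// Back-of-the-envelope estimates using the assumed figures above
const ridesPerDay = 10_000_000;
const secondsPerDay = 24 * 60 * 60;                 // 86,400
const ridesPerSecond = ridesPerDay / secondsPerDay; // ≈ 115

const activeParties = 15_000_000;                   // 10M riders + 5M drivers
const updateIntervalSeconds = 5;
const locationUpdatesPerSecond = activeParties / updateIntervalSeconds; // 3,000,000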

System Interface Design

There are a number of core interaction points with our system. What's interesting is that we need to send messages both from the client and the server. Two-way communication was covered more thoroughly in my article on Facebook Messenger here. However, it is enough to know that we will be relying on WebSockets for our communication.

The only exception will be sending location data from the clients to the central service. This happens at such a scale that we will need another approach, and instead we will be using HTTP calls.

The initial part of the system interface is sketched below in the form of functions we might use to make the WebSocket calls.
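
As a rough illustration (these names and signatures are my own, not Uber's API), a client-side TypeScript wrapper around the WebSocket connection might look something like this:

// Simplified stand-ins for the schemas defined in the data model section
type Location = { latitude: number; longitude: number };
type Trip = { id: string; riderId: string; driverId: string };

// Hypothetical client-side wrapper around the WebSocket connection
interface DispatchSocket {
  // Rider asks for a trip from an origin to a destination
  requestRide(riderId: string, origin: Location, destination: Location): void;

  // Driver accepts a trip they have been notified about
  acceptRide(driverId: string, tripId: string): void;

  // Driver marks the trip as complete, freeing both parties up again
  completeRide(driverId: string, tripId: string): void;

  // Server-pushed events: a driver was assigned, or the other party moved
  onDriverAssigned(handler: (trip: Trip) => void): void;
  onLocationUpdate(handler: (location: Location) => void): void;
}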

We can also define a /users/{id}/locations endpoint which we POST to every few seconds. This will return a 201 on success, or the usual 4XX/5XX codes on error. The location object forming the contents of this call will be covered shortly.
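
For illustration, the client side of that endpoint could be as simple as the sketch below (the host name is a placeholder):

// Hypothetical client-side call, fired every few seconds with the user's position
async function postLocation(userId: string, latitude: number, longitude: number): Promise<void> {
  const response = await fetch(`https://api.example.com/users/${userId}/locations`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ latitude, longitude }),
  });

  if (response.status !== 201) {
    throw new Error(`Location update failed with status ${response.status}`);
  }
}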

Data Model Design

To understand the data model it might help to read the previous article on Yelp here. In the Yelp article we rely on breaking down the globe into a grid. A lot of people seem to think Uber uses Google S2 to do this, but it looks like in fact they now have their own solution, H3.

We also need to decide how we will store our data. As we will have such a huge amount of it, NoSQL would seem to be the most sensible. Interestingly Uber themselves designed a new database ‘Schemaless’, which works like Google BigTable and is built on top of MySQL. Very recently they’ve evolved this into DocStore, a general purpose transactional database.

Let’s say we’re going to use this to store our data. We now need to define a number of schemas to use in it.

Rider/ Driver/ User

Riders and drivers are very similar in many respects: they both represent people, have locations and so on. A sensible approach may be to model them as separate entities, but have both inherit from a single user structure. This gives us a little flexibility on things like their state (riders will be waiting, drivers will be picking up etc.), but allows us to share a lot of the common functionality.

{
  "id": "<Id of the user>",
  "location": "<Location object representing user's whereabouts>",
  "state": "<Whether the person is waiting, riding, etc.>"
}
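
A minimal TypeScript sketch of that inheritance idea (the state values are my own guesses at what the waiting, riding, etc. states might be):

// Mirrors the Location schema in the next section
interface Location { id: string; latitude: number; longitude: number }

// Shared structure for anyone in the system
interface User {
  id: string;
  location: Location;
}

// Riders and drivers extend the common structure with their own states
interface Rider extends User {
  state: 'waiting' | 'riding' | 'idle';
}

interface Driver extends User {
  state: 'available' | 'picking_up' | 'driving';
}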

Location

{
  "id": "<Id of the location>",
  "latitude": "<Latitude number>",
  "longitude": "<Longitude number>"
}

Assuming we’re using H3, chances are that the coordinates won’t be done in longitude and latitude. However, for simplicity let’s assume they are.

Route

{
  "id": "<Id of the route>",
  "routePoints": [
    ... list of points on the route
  ]
}

The route points will work alongside Google Maps. In reality the API will be much richer than made out here, but for the sake of argument we can just say that the points are where the car turns, and we join them up to make the route.

Trip

{
  "origin": "<Where the trip started>",
  "destination": "<Where the trip's going>",
  "rider": "<Rider user object>",
  "driver": "<Driver user object>",
  "state": "<Current state of the trip, waiting to pick up etc.>",
  "route": "<Route object for the trip>"
}

Logical Design

Now that we have an idea of how we might interact with the system and how we might store our data, let's think about how to pin it all together.

A basic logical design

Initially, we need a way of having messages come in from and go out to riders and drivers. We do this through a gateway service, which separates traffic into location data and request data: the former goes to an HTTP service, the latter to a WebSockets service.

The HTTP service pushes location information directly into location storage. The WebSockets service communicates with and orchestrates the dispatch group of services, which is responsible for setting up rides. It is this dispatch functionality that contains the meat of the Uber design.

When a rider requests a trip they send a message to the dispatch group of services. Within the group there is a rider service. It is this service’s responsibility to mark the grid squares within a given radius of the rider with the rider Id, and to change the state of the rider. This is done through a dispatch data service which interacts with the dispatch data store.

It also sends a message to a driver service, alerting it of a new ride at a given grid square. The driver service contacts the data service to find the Ids of all the available drivers in squares within a given radius, sending out a push notification alerting them of the new rider.

Within this message is the route to the new rider. This is calculated using a further maps service, which connects to external map APIs to calculate paths given the origin and destination.
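
Pulled together, that rider-to-drivers fan-out might look roughly like this (all of these service interfaces are hypothetical, not Uber's):

// Hypothetical interfaces for the dispatch data service and push notifications
interface DispatchDataService {
  markCellsWithRider(riderId: string, cells: string[]): Promise<void>;
  findAvailableDrivers(cells: string[]): Promise<string[]>; // returns driver Ids
}

interface PushService {
  notifyDriver(driverId: string, payload: { riderId: string; routeToRider: string }): Promise<void>;
}

// Rider and driver services cooperating to fan a new ride out to nearby drivers
async function dispatchNewRide(
  riderId: string,
  riderCells: string[],   // grid cells around the rider, e.g. from gridDisk
  routeToRider: string,   // route Id calculated by the maps service
  data: DispatchDataService,
  push: PushService,
): Promise<void> {
  await data.markCellsWithRider(riderId, riderCells);
  const driverIds = await data.findAvailableDrivers(riderCells);
  await Promise.all(driverIds.map(id => push.notifyDriver(id, { riderId, routeToRider })));
}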

The driver service is also responsible for polling the location data store, looking for free drivers who have moved between grid squares to where a rider is waiting. On finding one, it sends a push notification.

When a driver accepts a ride, the rider service checks whether the rider Id is still on the grid (i.e. the ride hasn't already been accepted by another driver), and removes it if so. The rider and driver services then respectively alter the states of both rider and driver to occupied, and the map service works out and disseminates the route.
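
The acceptance step is essentially a compare-and-claim against the grid, along these lines (again, the store API is hypothetical):

// Hypothetical store call: atomically removes the rider Id from the marked grid cells
// if it is still there, returning true only for the first driver to claim the ride
interface RiderClaimStore {
  claimRider(riderId: string, driverId: string): Promise<boolean>;
}

async function acceptRide(driverId: string, riderId: string, store: RiderClaimStore): Promise<boolean> {
  const claimed = await store.claimRider(riderId, driverId);
  if (!claimed) {
    // Another driver got there first, nothing more to do
    return false;
  }
  // From here the rider and driver services flip both states to occupied,
  // and the maps service finds and sends out the pickup route
  return true;
}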

The driver and rider services additionally poll the data service for drivers and riders on the same trip and update the opposite parties with their location data.

Finally, on ending the ride both rider and driver statuses change back to available and the whole thing starts again!

Physical Design

I will admit, things get a bit hairy here. Uber uses and develops a lot of their own technologies, so we're going to stray from the normal AWS implementations. This section acts more as a case study than a design document.

Previously, Uber used RingPop, an open source Node.js project, for their dispatch services. RingPop allowed you to create clusters of nodes in order to carry out functionality (like in our dispatch services). These nodes were capable of discovering and coordinating with each other to assign work in appropriate places. For example, if node A was very busy, work would be assigned to node B. Additionally, if node B went down, node C could come up and node A would discover it without central configuration.

The previous fulfilment architecture can be found in one of their engineering blog articles here.

The design was based around two entities (trip and supply) and two services (demand and supply). A trip is a ride in our context (travelling from A to B), and supply represents our drivers in all their states (picking up, travelling, available etc.).

Originally these entities were stored in Cassandra and Redis, with an abstraction sitting on top known as the Marketplace Storage Gateway (MSG).

However, they discovered issues with consistency (as they prioritised low latency and availability) and scalability (as RingPop used peer-to-peer coordination, the total number of nodes that could be added hit a ceiling). It also used Node.js (which they wanted to move away from), and from the look of things was just a nightmare to maintain.

Their new architecture uses Apache Helix and ZooKeeper for centralised cluster management (instead of peer-to-peer), a serial queue to ensure only one transaction occurs at a time, and their new DocStore storage engine pinned on top of Google Spanner.

It also uses the Google Maps Platform for route calculation. They don’t use Flutter, but it’s very trendy these days, so let’s do the front end in that. I’ve also put the API Gateway as an AWS one, but I highly doubt that’s what they use. Finally, I’ve added a Kafka stream to deal with the vast amounts of location data.
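
As a sketch of that location ingest path, assuming the kafkajs client library (topic name and broker addresses are made up):

import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'location-ingest', brokers: ['broker-1:9092'] });
const producer = kafka.producer();

// Call once at service start-up
export async function start(): Promise<void> {
  await producer.connect();
}

// Each incoming HTTP location update becomes a message on a Kafka topic,
// keyed by user Id so a given user's updates stay ordered on one partition
export async function publishLocation(userId: string, latitude: number, longitude: number): Promise<void> {
  await producer.send({
    topic: 'user-locations',
    messages: [{ key: userId, value: JSON.stringify({ userId, latitude, longitude, at: Date.now() }) }],
  });
}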

Let’s translate this into a physical diagram.

Physical diagram

Identify and Resolve Bottlenecks

With all that said and done, we can look at how we might optimise the system. As most data needs to be live, there’s not a lot of caching we can do. Most of the optimisation is actually done through the technology choices we’ve made.

We could potentially do some work splitting up dispatch storage, keeping our location and trip information in separate places. Additionally, we could separate out the WebSockets service to have one service responsible for communicating and one responsible for orchestrating the various dispatch services.

We could also add a number of extras to our system, like analytics, fraud detection or payments, or extend the exercise to cover how we could translate this architecture onto Uber Eats.

Conclusion

In conclusion, we’ve examined the Uber backend, done a bit of a case study, and come up with our own proposed design.

