How I Would Design… Twitter!
A System Design Demonstration
Audience
This article is the next in my series on how I would design popular applications. It is recommended (although not entirely necessary) to read the previous posts, which I’ve helpfully compiled in a list here. I will assume a basic familiarity with architecture principles and AWS, but hopefully this post is approachable for most engineers.
Argument
First, let’s look at our problem statement.
The System to Design
We are recreating the popular social media site Twitter. The premise (for those of you who aren’t familiar) is that users can post short text tweets up to a maximum length (let’s say 140 characters). They can also add hashtags, which are a way of grouping together tweets of a similar theme.
People follow each other to see the content they post. The list of content a user sees is their ‘home timeline’, and the list of content they have posted is their ‘user timeline’. For the purposes of this exercise we have the following requirements:
- Allow users to post tweets
- Let users retrieve their home and user timelines
- Provide search for tweets by hashtag
- Surface the trending hashtags for a user.
We will assume that the ‘following’ function is taken care of for us. If it is of interest you can find a proposed solution in my Instagram design article here.
The Approach
We have a standard approach to system design which is explained more thoroughly in the article here. However, the steps are summarised below:
- Requirements clarification: Making sure we have all the information before starting. This may include how many requests or users we are expecting.
- Back of the envelope estimation: Doing some quick calculations to gauge the necessary system performance. For example, how much storage or bandwidth do we need?
- System interface design: What will our system look like from the outside, and how will people interact with it? Generally this is the API contract.
- Data model design: What our data will look like when we store it. At this point we could be thinking about relational vs non-relational models.
- Logical design: Fitting it together in a rough system! At this point I’m thinking at a level of ‘how would I explain my idea to someone who knows nothing about tech?’
- Physical design: Now we start worrying about servers, programming languages and the implementation details. We can superimpose these on top of the logical design.
- Identify and resolve bottlenecks: At this stage we will have a working system! We now refine the design.
With that said, let’s get stuck in!
Requirements Clarification
The first thing I would be wondering is how many users we’re expecting and how often they read, write and search tweets. Let’s say we have 500 million users. Each user reads 100 tweets a day and posts 1. They typically search and view trending hashtags once a day.
Back of the Envelope Estimation
If we have 500 million users each reading 100 tweets of 140 characters, this is equivalent to 500,000,000 * 100 * 140 bytes = 7 terabytes of reads a day, which works out at roughly 81MB/s, or nearly 600,000 requests a second!
For our write/search/view traffic we have just 70GB a day, around 810KB/s and roughly 6,000 requests a second. Already we can see we have a read-heavy application…
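If you want to sanity-check those figures, here is the same arithmetic as a short Python script. The inputs are simply the assumptions from the requirements clarification step above:

# Back-of-the-envelope estimation for the figures quoted above.
USERS = 500_000_000
TWEET_BYTES = 140                 # 140 characters, assuming ~1 byte per character
READS_PER_USER_PER_DAY = 100
WRITES_PER_USER_PER_DAY = 1
SECONDS_PER_DAY = 24 * 60 * 60

read_bytes_per_day = USERS * READS_PER_USER_PER_DAY * TWEET_BYTES
write_bytes_per_day = USERS * WRITES_PER_USER_PER_DAY * TWEET_BYTES

print(f"Reads:  {read_bytes_per_day / 1e12:.1f} TB/day, "
      f"{read_bytes_per_day / SECONDS_PER_DAY / 1e6:.0f} MB/s, "
      f"{USERS * READS_PER_USER_PER_DAY / SECONDS_PER_DAY:,.0f} requests/s")
print(f"Writes: {write_bytes_per_day / 1e9:.0f} GB/day, "
      f"{write_bytes_per_day / SECONDS_PER_DAY / 1e3:.0f} KB/s, "
      f"{USERS * WRITES_PER_USER_PER_DAY / SECONDS_PER_DAY:,.0f} requests/s")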
System Interface Design
Having clarified the requirements, it’s time to design our interfaces. We need endpoints for posting tweets, retrieving timelines, searching by hashtag and surfacing trending hashtags.
To post a tweet we could make a POST request to a /users/{id}/tweets endpoint to create a new tweet object. The regular response codes 201, 4XX and 5XX would all apply. We will assume the request comes with a cookie we can validate and use to check that the caller is signed in as the user corresponding to the path parameter.
Retrieving a user’s tweets could then be done via a GET request to /users/{id}/tweets, with response codes 200, 4XX and 5XX. Retrieving a user’s home timeline could be done via /tweets/homes/users/{id}.
Search would use an endpoint /tweets?hashtag=<enteredValue>, and we could return trending hashtags using /hashtags/trending. The last endpoint is up for debate (if not all of them; URL structure is an ongoing war to be waged). We could do something more flexible with pagination and sorting, but it seems inappropriate to expose that to the user.
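To make the contract a little more concrete, here is a minimal sketch of those endpoints using FastAPI. The framework choice, handler names and stubbed responses are illustrative assumptions rather than part of the design:

from fastapi import FastAPI, status

app = FastAPI()

@app.post("/users/{user_id}/tweets", status_code=status.HTTP_201_CREATED)
def post_tweet(user_id: str, tweet: dict) -> dict:
    # Validate the session cookie, then hand the tweet to the write service.
    return {"id": "generated-tweet-id", **tweet}

@app.get("/users/{user_id}/tweets")
def user_timeline(user_id: str) -> list:
    # The user's own tweets: their 'user timeline'.
    return []

@app.get("/tweets/homes/users/{user_id}")
def home_timeline(user_id: str) -> list:
    # The precomputed 'home timeline' for this user.
    return []

@app.get("/tweets")
def search(hashtag: str) -> list:
    # Search tweets by hashtag, e.g. GET /tweets?hashtag=worldcup
    return []

@app.get("/hashtags/trending")
def trending() -> list:
    # The currently trending hashtags.
    return []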
Data Model Design
Now that we know how to access our data, what will it look like? We are mainly focusing on tweets and hashtags. Our tweets will look like:
{
  "user_name": "<Name of the user>",
  "text": "<Text of the tweet>",
  "hash_tags": "<Array of hashtags>"
}
Our hashtag object can simply be
{
  "text": "<Text of the hashtag>"
}
Using a hashtag object instead of just a string allows for some flexibility if we want to add other information (such as hit count) at a later date.
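If it helps to see the same model in code, the objects could be expressed as simple dataclasses. The field names mirror the JSON above; everything else here is illustrative:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Hashtag:
    text: str  # e.g. "worldcup"; extra fields such as a hit count can be added later

@dataclass
class Tweet:
    user_name: str       # name of the posting user
    text: str            # tweet body, capped at 140 characters
    hash_tags: List[Hashtag] = field(default_factory=list)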
We also want to think about what kind of data storage we would like to use. From our back-of-the-envelope calculations we saw that we have read-heavy requirements. This suggests we will use non-relational, key-value databases to optimise our reads. The entities and columns could be similar to:
Users
id VARCHAR PRIMARY KEY
name VARCHAR

Tweets
id VARCHAR PRIMARY KEY
text VARCHAR
user_id VARCHAR, FOREIGN KEY (user_id) REFERENCES Users(id)
created_at TIMESTAMP

Hashtag Associations
id VARCHAR PRIMARY KEY
tweet_id VARCHAR, FOREIGN KEY (tweet_id) REFERENCES Tweets(id)
hashtag_id VARCHAR, FOREIGN KEY (hashtag_id) REFERENCES Hashtags(id)

Hashtags
id VARCHAR PRIMARY KEY
text VARCHAR
Note, we have omitted the association table relating users to each other for following; however, this is out of the scope of the exercise.
Logical Design
We now move onto the logical design. There are a few requirements that are reasonably separate, so we can focus on them individually and then fit them together. The sections we will focus on are reading and writing, searching and trending hashtags.
Initially, we will be using a CQRS-style (Command Query Responsibility Segregation) pattern for reading and writing.
The idea behind this is to separate out our two concerns. We sit a thin gateway layer in front of all requests, directing them to the relevant service. When someone writes a tweet it is written to the write database, which in turn synchronises separate, fully formed views for the user and home feeds in the read database. Notice this means we take up more space, but we reduce load times.
To read a user’s tweets, or their home feed, we can then query the read database via the read service, which will be able to return both lists fully formed and ready to go!
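To make the fan-out concrete, here is a minimal in-memory sketch of the write and read paths. The dictionaries stand in for the read database, and get_followers is a hypothetical hook into the out-of-scope following service:

from collections import defaultdict

# Stand-ins for the read database: fully formed, ready-to-serve timelines.
user_timelines = defaultdict(list)   # user_id -> that user's own tweets
home_timelines = defaultdict(list)   # user_id -> tweets from people they follow

def get_followers(user_id: str) -> list:
    # Assumed to be provided by the (out-of-scope) following service.
    return []

def write_tweet(author_id: str, tweet: dict) -> None:
    # Write side: record the tweet, then synchronise the read-side views.
    user_timelines[author_id].insert(0, tweet)
    for follower_id in get_followers(author_id):
        home_timelines[follower_id].insert(0, tweet)

def read_home_timeline(user_id: str, limit: int = 50) -> list:
    # Read side: the timeline is already materialised, so this is a cheap lookup.
    return home_timelines[user_id][:limit]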
Searching is different again. We can use an inverted index, which stores each hashtag against the IDs of the tweets containing it. We can then retrieve the tweets from the data store. Note, this data store could be our write database, or to reduce load it could be another data store we write to specifically for searching.
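A toy version of that inverted index is sketched below; in our design OpenSearch does the real work, this just shows the shape of the data:

from collections import defaultdict

# hashtag text -> set of IDs of tweets containing it
inverted_index = defaultdict(set)

def index_tweet(tweet_id: str, hashtags: list) -> None:
    for tag in hashtags:
        inverted_index[tag.lower()].add(tweet_id)

def search_by_hashtag(tag: str) -> set:
    # Returns tweet IDs; the tweets themselves are then fetched from the data store.
    return inverted_index.get(tag.lower(), set())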
Finally we need a method of finding the top hashtags. This can be done via a set of streaming services that can aggregate hashtag counts as tweets come in.
To aggregate hashtag information we take in tweets as a stream, then pass them to a parser which separates out the hashtag and location data. Although location isn’t a requirement, it makes sense to track: imagine there’s a trending football game in Portugal; chances are users in Mumbai may not be interested!
The location and hashtag counters operate over a window and aggregate the data as it comes in, returning the counts of tags and locations over that window (a window may be five minutes for example).
These counts are then streamed to another set of services responsible for persisting the data to a data store. Interactions with the data store will be significantly reduced as we will only be receiving batches of data per window.
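As a rough sketch of that windowed aggregation (Spark Streaming would provide the real windowing; the five-minute window and the (location, hashtag) key are the assumptions above):

from collections import Counter

WINDOW_SECONDS = 5 * 60  # a five-minute tumbling window, as suggested above

def aggregate_window(parsed_tweets: list) -> Counter:
    # Count (location, hashtag) pairs for one window of parsed tweets.
    # parsed_tweets is the output of the parser: dicts with 'location' and
    # 'hash_tags' keys. Only these per-window counts are persisted downstream,
    # which keeps writes to the data store small.
    counts = Counter()
    for tweet in parsed_tweets:
        for tag in tweet.get("hash_tags", []):
            counts[(tweet.get("location", "unknown"), tag)] += 1
    return counts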
Let’s put all this together into one final logical design!
Physical Design
Now we can convert this to a physical diagram.
The above represents how we might physically implement our system. We can use an AWS API Gateway as our gateway service, which then forwards write requests onto a Kinesis stream, and read/search requests to Lambdas. We would need to be careful we didn’t exhaust our Lambda pool (especially with 6,000 RPS), but in theory this is fine.
The write Kinesis stream would then be read by three different applications.
- A Spark Streaming service on AWS EMR, responsible for parsing tweets into location and hashtag information.
- A Kinesis Firehose which would push data into an Amazon Redshift data warehouse.
- A Kinesis Firehose which would push data into AWS OpenSearch, powering our inverted index search.
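As an illustration of the write path, the gateway-to-stream hop could be little more than a boto3 put_record call. The stream name and partition key below are assumptions for the sketch:

import json
import boto3

kinesis = boto3.client("kinesis")

def publish_tweet(tweet: dict) -> None:
    # Push the tweet onto the write stream; the three consumers above
    # (EMR, the Redshift Firehose and the OpenSearch Firehose) read from it independently.
    kinesis.put_record(
        StreamName="tweets-write-stream",        # assumed stream name
        Data=json.dumps(tweet).encode("utf-8"),
        PartitionKey=tweet["user_name"],         # shard by author
    )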
The trending hashtags component of our problem is handled by the Spark Streaming services on EMR. There may need to be additional streams in between services, but these have been omitted for brevity.
Our read databases are implemented using AWS DynamoDB, which is Amazon’s highly resilient and performant non-relational database.
The final part of our system to discuss is writing to Redshift and to DynamoDB. From the diagram you can see we are streaming all tweets directly into the read database, conflating the separation of read and write data stores in CQRS. For our case this is fine, as we continue to separate out the read and write services.
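As a sketch of the read side, a Lambda handler backed by DynamoDB might serve a user’s home timeline like this. The table name, key schema and response shape are assumptions for illustration:

import json
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
home_timeline_table = dynamodb.Table("home_timelines")   # assumed table name

def handler(event, context):
    # Lambda behind the API Gateway for GET /tweets/homes/users/{id}
    user_id = event["pathParameters"]["id"]
    response = home_timeline_table.query(
        KeyConditionExpression=Key("user_id").eq(user_id),
        ScanIndexForward=False,   # newest first, assuming a sort key on created_at
        Limit=50,
    )
    return {"statusCode": 200, "body": json.dumps(response["Items"], default=str)}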
Identify and Resolve Bottlenecks
The key bottleneck we will focus on is when a very popular user issues a tweet. If a celebrity with 100 million followers sends out a tweet then we will need to update millions of views in the read database, which will be time consuming!
To optimise this we could introduce a separate relational database holding celebrity tweets. The read service would then merge these popular tweets with the user’s precomputed view at read time, creating a more performant solution.
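One way to realise this is a hybrid read path: regular accounts are served from the precomputed view, while celebrity tweets are pulled and merged in at read time. The helper functions below are hypothetical stand-ins for the read service and the celebrity store:

import heapq

def fetch_precomputed_timeline(user_id: str) -> list:
    # Assumed call into the read database; returns tweets newest-first.
    return []

def followed_celebrities(user_id: str) -> list:
    # Assumed lookup of which celebrity accounts this user follows.
    return []

def fetch_celebrity_tweets(celebrity_ids: list) -> list:
    # Assumed query of the separate celebrity tweet store; newest-first.
    return []

def get_home_timeline(user_id: str, limit: int = 50) -> list:
    # Regular accounts: served from the precomputed (fan-out-on-write) view.
    precomputed = fetch_precomputed_timeline(user_id)
    # Celebrities: pulled on demand, so one celebrity tweet does not trigger
    # millions of read-database updates.
    celebrity = fetch_celebrity_tweets(followed_celebrities(user_id))
    # Both lists arrive sorted newest-first, so merge them lazily by recency.
    merged = heapq.merge(precomputed, celebrity,
                         key=lambda t: t["created_at"], reverse=True)
    return list(merged)[:limit]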
Conclusion
In conclusion, we have covered how we would design and implement a large-scale, text-based social media platform. I hope you have enjoyed it!
N.B. I am in no way an expert at large scale data processing! This is my best guess and should be treated as so. Let me know if anything seems incorrect!