How I Would Design… PasteBin!

A System Design Demonstration

Audience

This is another article in the series, aimed at engineers interested in how another developer might approach system design. It is not a definitive method, more a way of generating an approximate first-draft architecture.

I would thoroughly recommend reading the initial article in this series here. It explores designing a URL Shortener, and we will recycle some of the ideas in this article.

We assume you have a reasonable grasp of architecture and, in the later stages, of AWS. However, even without those you should be able to get the gist.

Argument

Initially, let’s look at our problem statement.

We are recreating an existing service, PasteBin. It will help to have a play around with it.

The requirements are as below:

  1. Users can either enter text or upload a text file to receive a unique, short, unguessable URL allowing them to access and modify the text.
  2. It should count how many times the URL is visited.

We have a standard approach to system design which is explained more thoroughly in the article here. However the steps are summarised below:

  1. Requirements clarification: Making sure we have all the information before starting. This may include how many requests or users we are expecting.
  2. Back of the envelope estimation: Doing some quick calculations to gauge the necessary system performance. For example, how much storage or bandwidth do we need?
  3. System interface design: What will our system look like from the outside, how will people interact with it? Generally this is the API contract.
  4. Data model design: What our data will look like when we store it. At this point we could be thinking about relational vs non-relational models.
  5. Logical design: Fitting it together in a rough system! At this point I’m thinking at a level of ‘how would I explain my idea to someone who knows nothing about tech?’
  6. Physical design: Now we start worrying about servers, programming languages and the implementation details. We can superimpose these on top of the logical design.
  7. Identify and resolve bottlenecks: At this stage we will have a working system! We now refine the design.

The problem statement inspires at least a couple of questions. What is the maximum size of a paste? How many users are we expecting? What is our read/write ratio?

Let’s say the maximum paste size is 1MB and we are expecting 10 million ‘pastes’ a month, with a read/write ratio of 10:1.

Two things we may want to calculate are our storage and bandwidth requirements.

10 million pastes a month is roughly equivalent to 4 pastes a second. This equates to 40 reads a second, making our overall bandwidth requirements (40 + 4) x 1MB = 44 MB per second.
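The arithmetic above can be reproduced in a few lines. This is only a sketch: it assumes a 30-day month and that every request carries a maximum-size paste, and the class and method names are ours, purely for illustration.

```java
// Back-of-the-envelope figures, using the numbers assumed in this article.
class PasteEstimates {
    static final long PASTES_PER_MONTH = 10_000_000L;
    static final long SECONDS_PER_MONTH = 30L * 24 * 60 * 60; // assumes a 30-day month
    static final long READS_PER_WRITE = 10;                   // the 10:1 read/write ratio
    static final long MAX_PASTE_MB = 1;

    // Writes per second, rounded up so we do not under-provision.
    static long writesPerSecond() {
        return (PASTES_PER_MONTH + SECONDS_PER_MONTH - 1) / SECONDS_PER_MONTH;
    }

    static long readsPerSecond() {
        return writesPerSecond() * READS_PER_WRITE;
    }

    // Worst case: every read and write moves a maximum-size paste.
    static long peakBandwidthMbPerSecond() {
        return (writesPerSecond() + readsPerSecond()) * MAX_PASTE_MB;
    }

    public static void main(String[] args) {
        System.out.println("writes/s: " + writesPerSecond());
        System.out.println("reads/s: " + readsPerSecond());
        System.out.println("bandwidth MB/s: " + peakBandwidthMbPerSecond());
    }
}
```

As most pastes will be far smaller than 1MB, the real bandwidth figure is likely to be well below this upper bound.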

To find out our storage requirements we will need to know the maximum amount of time a paste will be stored for. Let’s assume a year. This makes our maximum storage requirement for pastes (12 x 10,000,000) x 1MB = 120TB.

We could also worry about the space required by the other components of our project (keys, statistics etc.), however we will omit this for the time being, as we have covered it in the URL shortener example here.

Our API requires a number of properties as dictated by the requirements. We need to specify some text and an optional expiration time. In Java Spring Boot we may define something similar to the below.

// Creates a paste; the generated key is returned in the response body.
@PostMapping("/paste")
public ResponseEntity<Paste> createPaste(
        @RequestBody Paste paste
) {
    ...
}

Where the Paste object resembles:

{
    "text": "<The text we would like to save as a paste>",
    "key": "<Key to retrieve the paste, request should not specify>"
}
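On the Java side, the object could be sketched as below. The class name PasteDto, the choice to reject empty text, and the decision to measure the 1MB limit in UTF-8 bytes are all our assumptions rather than anything dictated by the requirements.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical request/response body for the paste endpoints.
final class PasteDto {
    static final int MAX_BYTES = 1_000_000; // the 1MB limit assumed earlier (decimal MB)

    final String text;
    final String key; // null on create requests; filled in by the server

    PasteDto(String text, String key) {
        this.text = text;
        this.key = key;
    }

    // True when the body would be accepted; false would map to a 400 response.
    boolean isValid() {
        return text != null
                && !text.isEmpty()
                && text.getBytes(StandardCharsets.UTF_8).length <= MAX_BYTES;
    }
}
```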

Our response codes could then be:

  • 201: The paste was successfully created.
  • 400: Client error, there is an issue with the request being sent.

The request would not specify a key, however it will be returned in the response body.

Retrieving a paste can be done via:

// Looks up a paste by the key taken from the URL path.
@GetMapping("/{key}")
public ResponseEntity<Paste> getPaste(
        @PathVariable("key") String key
) {
    ...
}

Which will return the same object, allowing us to create a URL using the key. The response codes could be:

  • 200: Success, returning the paste.
  • 404: The key cannot be found.

Finally, as we have a requirement to be able to amend the pastes we will introduce a PATCH request. Note, we will ignore the key if it is specified in the request body.

// Updates the paste identified by the path key; any key in the body is ignored.
@PatchMapping("/{key}")
public ResponseEntity<Paste> editPaste(
        @PathVariable("key") String key,
        @RequestBody Paste paste
) {
    ...
}

This will then return the following status codes:

  • 200: Successfully updated.
  • 400: Client error, there is an issue with the request being sent.
  • 404: There is no entry with that key to be updated.
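To see how the three endpoints and their status codes fit together, here is a small in-memory stand-in written in plain Java. It is a sketch, not the Spring implementation: the class name, the Result type and the sequential placeholder keys are all hypothetical, and checking for a bad request before a missing key is our own choice.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for the create, read and update endpoints.
class InMemoryPasteApi {
    // HTTP-style status plus the key and stored text (text is null on errors).
    static final class Result {
        final int status;
        final String key;
        final String text;

        Result(int status, String key, String text) {
            this.status = status;
            this.key = key;
            this.text = text;
        }
    }

    private final Map<String, String> pastes = new HashMap<>();
    private int nextId = 0; // placeholder key source, NOT unguessable

    Result create(String text) {
        if (text == null || text.isEmpty()) {
            return new Result(400, null, null);
        }
        String key = "k" + nextId++;
        pastes.put(key, text);
        return new Result(201, key, text);
    }

    Result get(String key) {
        String text = pastes.get(key);
        return text == null ? new Result(404, key, null) : new Result(200, key, text);
    }

    // The key comes from the path; a key supplied in the body would be ignored.
    Result update(String key, String text) {
        if (text == null || text.isEmpty()) {
            return new Result(400, key, null);
        }
        if (!pastes.containsKey(key)) {
            return new Result(404, key, null);
        }
        pastes.put(key, text);
        return new Result(200, key, text);
    }
}
```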

Now we must decide the format in which we would like to store our data. As our keys will be unique it makes sense to have the below format.

  • key VARCHAR PRIMARY KEY
  • paste MEDIUMTEXT
  • hits INT
  • expiry_date_time TIMESTAMP

The key column is where we store the shortened part of the URL, the paste is the value we are storing, hits is a counter of the number of times we have used a paste, and the expiry date time allows us to clear up the item when it is no longer needed.

We could use a non-relational store. However, our schema is well defined and we wouldn’t be leveraging any of the merits of the non-relational paradigm, so for this example we will stick to relational.

Additionally, if the pastes were larger we could instead store each one as a text file in an S3 bucket and keep only a reference to it in the database. This would avoid the size limitations of a conventional database.

As the designs are so similar a lot of the logical design is covered in the previous article. Let’s recap.

Overview of the PasteBin design

The key service is responsible for generating and maintaining a list of keys that can be used for pastes. It is also responsible for tracking which keys are in use when a client tries to specify their own key.
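A minimal sketch of the key service's core might look like the below. Note the assumptions: it generates keys on demand rather than maintaining a pre-generated list, it uses 8-character base-62 keys, and it tracks usage in memory. The class name and key length are ours.

```java
import java.security.SecureRandom;
import java.util.HashSet;
import java.util.Set;

// Hands out random, hard-to-guess keys and remembers which ones are taken.
class KeyGenerator {
    private static final String ALPHABET =
            "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    private static final int KEY_LENGTH = 8; // assumed length, not from the requirements

    private final SecureRandom random = new SecureRandom();
    private final Set<String> inUse = new HashSet<>();

    // Returns a key that is not currently in use and marks it as taken.
    synchronized String nextKey() {
        while (true) {
            StringBuilder sb = new StringBuilder(KEY_LENGTH);
            for (int i = 0; i < KEY_LENGTH; i++) {
                sb.append(ALPHABET.charAt(random.nextInt(ALPHABET.length())));
            }
            String key = sb.toString();
            if (inUse.add(key)) { // add() returns false if the key is already taken
                return key;
            }
        }
    }

    // Puts a key back into circulation, for example after a paste expires.
    synchronized void release(String key) {
        inUse.remove(key);
    }

    synchronized boolean isInUse(String key) {
        return inUse.contains(key);
    }
}
```

We use SecureRandom rather than Random because the requirements ask for unguessable URLs, and the output of a non-cryptographic generator can be predicted.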

The clean up service is used to constantly check for expired pastes, remove them from the database and inform the key service that the key is back in circulation.
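A single clean up pass could be sketched as follows. Here a map of key to expiry time stands in for the database table, and the returned list represents the keys we would hand back to the key service. The names are hypothetical, and treating a paste as expired at exactly its expiry time is our assumption.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

class CleanUpSketch {
    // Removes every entry whose expiry is at or before `now` and returns
    // the freed keys so they can be put back into circulation.
    static List<String> removeExpired(Map<String, Instant> expiryByKey, Instant now) {
        List<String> freed = new ArrayList<>();
        Iterator<Map.Entry<String, Instant>> it = expiryByKey.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Instant> entry = it.next();
            if (!entry.getValue().isAfter(now)) {
                freed.add(entry.getKey());
                it.remove(); // safe removal while iterating
            }
        }
        return freed;
    }
}
```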

The paste service is the core of the system. It is where all API requests go to and is responsible for creating, reading and updating pastes, whilst also incrementing the hit counter for individual entries.
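As many readers may hit the same paste at once, the hit counter needs to be safe under concurrency. The below is one in-memory way to sketch that in Java; in the real system the count lives in the hits column, so this class (a name of our own invention) only illustrates the idea.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Thread-safe per-key hit counter.
class HitCounter {
    private final ConcurrentHashMap<String, LongAdder> hits = new ConcurrentHashMap<>();

    // Called once for every successful read of a paste.
    void recordHit(String key) {
        hits.computeIfAbsent(key, k -> new LongAdder()).increment();
    }

    // Returns 0 for keys that have never been read.
    long hitsFor(String key) {
        LongAdder adder = hits.get(key);
        return adder == null ? 0 : adder.sum();
    }
}
```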

Similarly, we can translate this into a physical design. We use Amazon ECS for the services we would like to remain up consistently, and then a Lambda for the clean up service. As we covered previously, our core data is mainly relational, and so this is a good opportunity to use Aurora. Our key service could be backed by DynamoDB, as it is essentially a key-value store.

A test physical diagram, demonstrating how we can implement our logical design using AWS

The final stage is to look for areas where we can make improvements in our design. There are two that we can recycle from our URL shortener example:

  1. A cache for reading pastes.
  2. A batch method for retrieving keys.

Our final design
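To give a flavour of the read cache, the sketch below builds a small least-recently-used cache on top of Java's LinkedHashMap. It is illustrative only: the class name and capacity are ours, it is not thread-safe, and a production system would more likely use a dedicated cache in front of the database.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Least-recently-used cache mapping paste keys to paste text.
class PasteCache extends LinkedHashMap<String, String> {
    private final int capacity;

    PasteCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true orders entries by last access
        this.capacity = capacity;
    }

    // Called by LinkedHashMap after each insert; returning true evicts the
    // least recently accessed entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
        return size() > capacity;
    }
}
```

With a read/write ratio of 10:1, popular pastes would be served from memory most of the time, leaving the database to deal mainly with writes and cache misses.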

There’s always more we could do! Security, more advanced analytics, allowing pastes of images, video, audio, the list goes on. It’s often useful to think about these things and how you might adapt your system. This kind of mindset can help ensure your designs remain flexible and extensible.

Conclusion

In conclusion we have covered the definition of a PasteBin problem, a method for addressing it, and explored a potential solution.

Principal Software Engineer at the BBC