How I would design… Dropbox!

A system design demonstration

James Collerton
10 min read · Sep 24, 2021

Audience

This article is the next in my series on how I would design popular applications. It is recommended (although not entirely necessary) to read the previous posts here, here and here. We will assume a basic familiarity with architecture principles and AWS, but hopefully this post is approachable for most engineers.

Argument

Initially, let’s look at our problem statement.

The System to Design

We are going to recreate the popular cloud storage solution Dropbox. The idea is that we are able to upload files to, and download them from, a location on the internet, giving us access whenever needed. If you’ve never used a similar system, you can sign up and give it a go here.

The requirements of our system are:

  • A user should be able to upload, download, update and delete files automatically using a desktop client synced to a folder.
  • We should maintain different versions of our files.
  • We should be able to sync files, so if a file is uploaded on one client, this should be reflected in another.

As always, the system should be reliable, scalable, and highly available.

The Approach

We have a standard approach to system design, which is explained more thoroughly in the article here. However, the steps are summarised below:

  1. Requirements clarification: Making sure we have all the information before starting. This may include how many requests or users we are expecting.
  2. Back of the envelope estimation: Doing some quick calculations to gauge the necessary system performance. For example, how much storage or bandwidth do we need?
  3. System interface design: What will our system look like from the outside, and how will people interact with it? Generally this is the API contract.
  4. Data model design: What our data will look like when we store it. At this point we could be thinking about relational vs non-relational models.
  5. Logical design: Fitting it together in a rough system! At this point I’m thinking at a level of ‘how would I explain my idea to someone who knows nothing about tech?’
  6. Physical design: Now we start worrying about servers, programming languages and the implementation details. We can superimpose these on top of the logical design.
  7. Identify and resolve bottlenecks: At this stage we will have a working system! We now refine the design.

With our approach in place, let’s get started!

Requirements Clarification

Initially we should be wondering how many people will be using our system, and roughly how many files will be uploaded/downloaded per day. It would also be good to have an estimate of the average file size.

Let’s say we have 100 million users a day, each of which uploads and downloads ten files. The average size of a file is 100KB.

Back of the Envelope Estimation

This means we have 100,000,000 x 10 x 100KB = 100TB of files stored per day, which over ten years equates to 365PB of files. It also means 1.16GB of bandwidth will be used per second (considering upload and download each in isolation).
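
As a quick sanity check, the arithmetic is easy to script. The following is a throwaway Java sketch of the sums above (using decimal units, 1KB = 1,000 bytes), not part of the design itself.

public class EnvelopeEstimate {
    public static void main(String[] args) {
        long usersPerDay = 100_000_000L;
        long filesPerUser = 10;            // ten uploads (and downloads) each
        long avgFileSizeBytes = 100_000L;  // 100KB

        long bytesPerDay = usersPerDay * filesPerUser * avgFileSizeBytes;  // 10^14 bytes = 100TB
        double petabytesTenYears = bytesPerDay * 365.0 * 10 / 1e15;        // ~365PB
        double gbPerSecond = bytesPerDay / 86_400.0 / 1e9;                 // ~1.16GB/s each way

        System.out.printf("Per day: %dTB, ten years: %.0fPB, bandwidth: %.2fGB/s%n",
                bytesPerDay / 1_000_000_000_000L, petabytesTenYears, gbPerSecond);
    }
}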

We can use this to inform some of our later physical designs.

System Interface Design

Now we need to decide how we would like to interact with our system. We will assume that we have a client application sitting on whichever machine would like to use our Dropbox solution.

At the core of it there are only two different functionalities we need: uploading and downloading. If we were to charge in with the first solution that came to mind, we might just upload and download whole files. However, with a little thought we can see this would be inefficient.

Ideally, if only part of a file is changed we only want to upload or download that part of the file. We’re going to call these parts ‘chunks’. Note, these chunks are just a subsection of the bytes of the final file!

Let’s expand on this idea. Say we have File A on our computer, belonging to User A, which is split into four chunks. How do we keep track of this, and know when the chunks have changed?

Whenever a file is saved locally we might create a file object.

{
    "id": "<The Id of the file>",
    "name": "<The name of the file, complete with extension>",
    "userId": "<The Id of the user the file belongs to>",
    "version": "<The version of the file>"
}

We will assume we don’t need to worry about user registration, and that we already have a list of users in the backend, modelled as below.

{
    "id": "<The Id of the user>",
    "name": "<Name of the user>"
}

The chunks of the file can then be associated with our new file object.

{
    "id": "<The Id of the chunk>",
    "fileId": "<The Id of the file the chunk belongs to>",
    "hash": "<The hash of the chunk, so we can detect changes>",
    "order": "<What order in the file this chunk comes in>",
    "versions": "<Which versions of the file this chunk represents>",
    "bytes": "<The bytes of the file chunk>"
}

If we revisit our requirements, this should allow us to carry out the necessary work.

When a file is created we send requests in order to create our new file object and add the chunks. We also maintain a list of this information locally for use later.

When a file is updated we split it into chunks and hash each one. We check against the local list to see which chunks have changed. We send requests to add all of the new chunks remotely, and to add the new version to the versions list of all of the existing ones.

When we want to download a file we look for all of the chunks belonging to that file for a given version, download them and reassemble them. Deletion consists of deleting all of the chunks of that file, then the file itself.
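
To make the update flow concrete, here is a minimal sketch of how a client might detect changed chunks. The fixed 4MB chunk size, the SHA-256 hash and the class name are all illustrative assumptions rather than anything mandated by the design.

import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HexFormat;
import java.util.List;

public class ChunkDiffer {

    private static final int CHUNK_SIZE = 4 * 1024 * 1024; // assumed 4MB chunks

    // Split the file's bytes into fixed-size chunks and hash each one.
    public static List<String> hashChunks(byte[] fileBytes) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        List<String> hashes = new ArrayList<>();
        for (int offset = 0; offset < fileBytes.length; offset += CHUNK_SIZE) {
            int end = Math.min(offset + CHUNK_SIZE, fileBytes.length);
            byte[] chunk = Arrays.copyOfRange(fileBytes, offset, end);
            hashes.add(HexFormat.of().formatHex(digest.digest(chunk)));
        }
        return hashes;
    }

    // Compare against the locally stored hashes: any differing or new index needs uploading.
    public static List<Integer> changedChunks(List<String> stored, List<String> current) {
        List<Integer> changed = new ArrayList<>();
        for (int i = 0; i < current.size(); i++) {
            if (i >= stored.size() || !stored.get(i).equals(current.get(i))) {
                changed.add(i);
            }
        }
        return changed;
    }
}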

This means we need endpoints for uploading, downloading, updating and deleting our files.

Uploading a file can be done with an endpoint similar to the below. Note, we only want to create the file object itself, not the chunks.

@PostMapping(path = "/file")
public ResponseEntity<File> createFile(
        @RequestBody File file
) {
    ...
}

We use a separate endpoint for sending chunks, to keep request sizes down and allow us to retry failed requests.

@PostMapping(path = "/file/{id}/chunk")
public ResponseEntity<Chunk> createChunk(
        @PathVariable("id") String id,
        @RequestBody Chunk chunk
) {
    ...
}

Something important to note is that we will need to encode our chunk bytes to Base64 when sending them over HTTP (and decode them on arrival), as JSON bodies can only carry text.
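
Java’s built-in codec is enough for this. A tiny sketch (the class name is just for illustration):

import java.util.Base64;

public class ChunkCodec {

    // Encode the raw chunk bytes into a Base64 string for the JSON body.
    public static String encode(byte[] chunkBytes) {
        return Base64.getEncoder().encodeToString(chunkBytes);
    }

    // Decode the Base64 string back into raw bytes on receipt.
    public static byte[] decode(String encoded) {
        return Base64.getDecoder().decode(encoded);
    }
}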

Downloading files is similar; however, we would like to give clients the option of specifying a version to download. We introduce an optional parameter, returning the latest version if none is specified.

@GetMapping(path = "/file/{id}")
public ResponseEntity<File> getFile(
        @PathVariable("id") String id,
        @RequestParam(required = false) Integer version
) {
    ...
}

As we are downloading the file all in one go, we could think about the file object having all of the chunks included. An alternative would be to split up requests between file and chunks.

Updating files is slightly different. We will send a PATCH request to our endpoint containing a file object. This can be used to update things like the filename. We then have a separate endpoint for updating a chunk. With both we would need some extra application logic to handle versioning.

@PatchMapping(path = "/file/{fileId}")
public ResponseEntity<File> updateFile(
        @PathVariable("fileId") String fileId,
        @RequestBody File file
) {
    ...
}

@PatchMapping(path = "/file/{fileId}/chunk/{chunkId}")
public ResponseEntity<Chunk> updateChunk(
        @PathVariable("fileId") String fileId,
        @PathVariable("chunkId") String chunkId,
        @RequestBody Chunk chunk
) {
    ...
}

Finally we have our deletion request, which can delete the file and all associated chunks.

@DeleteMapping(path = "/file/{id}")
public ResponseEntity<Void> deleteFile(
        @PathVariable("id") String id
) {
    ...
}

We could also implement endpoints for downloading all of a user’s files to a new client, but we will say that is out of scope for the current article.

I haven’t worried about specifying the response codes, but you can assume they follow the usual conventions.

Data Model Design

As part of our system interface design we created a number of objects we would like to store. Initially, we need to make a choice between a relational and non-relational data solution.

The ACID properties of a traditional RDBMS are very useful, as we need to know that all chunks of a file have been saved successfully in order to properly restore it. There may also be multiple clients working on the same file. One of the main issues with an RDBMS is scaling; however, we can address this using sharding or similar techniques.
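
To give a flavour of the sharding idea, below is a minimal sketch that routes a user’s metadata to one of N database shards. Hashing on the user id is an assumption for illustration; we haven’t prescribed a shard key.

public class ShardRouter {

    private final int shardCount;

    public ShardRouter(int shardCount) {
        this.shardCount = shardCount;
    }

    // Pick a shard for a user. Math.floorMod keeps the result non-negative
    // even when hashCode() is negative.
    public int shardFor(String userId) {
        return Math.floorMod(userId.hashCode(), shardCount);
    }
}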

We propose the following tables. Note, we have separated out the versions from the chunks. Each chunk can belong to multiple versions, and versions can have multiple chunks.

We will be storing our chunk bytes in cloud storage (think Amazon S3), so we will keep a reference to the object in a bucket rather than storing the blob in the table.

File

  • id VARCHAR PRIMARY KEY
  • user_id VARCHAR FOREIGN KEY REFERENCES user(id)
  • file_name VARCHAR
  • version INT

Chunk

  • id VARCHAR PRIMARY KEY
  • file_id VARCHAR FOREIGN KEY REFERENCES file(id)
  • hash VARCHAR
  • chunk_order INT (order is a reserved word in SQL)
  • url VARCHAR

Chunk to Version Association table

  • id VARCHAR PRIMARY KEY
  • chunk_id VARCHAR FOREIGN KEY REFERENCES chunk(id)
  • version_id VARCHAR FOREIGN KEY REFERENCES version(id)

Version

  • id VARCHAR PRIMARY KEY
  • file_id VARCHAR FOREIGN KEY REFERENCES file(id)
  • version INT

Note, each version belongs to a file, hence the file_id foreign key.

User

  • id VARCHAR PRIMARY KEY
  • name VARCHAR

Logical Design

A lot of our logical design has been thrashed out in deciding our data model and API! Let’s summarise it below.

An initial logical design

As stated in the summary we will have a desktop client as well as some services hosted in the cloud.

The desktop client we can split into four main components.

  1. The Watcher: This is responsible for monitoring the files in the given folder and alerting the chunker when there are changes. It is also responsible for synchronising files when there have been remote changes (see the sketch just after this list).
  2. The Chunker: This is responsible for splitting files into chunks and reassembling files from incoming updated chunks.
  3. The Indexer: As described previously, we need a component to keep track of all of the various pieces of data flying about. The indexer does this by managing the local database.
  4. The Local DB: This contains all of the information for the local user.
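
Although the physical design below opts for Electron, the watcher’s job can be sketched in Java, with WatchService standing in for whatever file-watching API the real client would use. The class name and wiring are illustrative only.

import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

public class Watcher {

    public static void watch(Path folder) throws Exception {
        WatchService watchService = FileSystems.getDefault().newWatchService();
        folder.register(watchService,
                StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_MODIFY,
                StandardWatchEventKinds.ENTRY_DELETE);
        while (true) {
            WatchKey key = watchService.take(); // blocks until an event arrives
            for (WatchEvent<?> event : key.pollEvents()) {
                Path changed = folder.resolve((Path) event.context());
                // In the real client we would hand this to the chunker/indexer.
                System.out.println(event.kind() + ": " + changed);
            }
            key.reset();
        }
    }
}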

Our cloud services can be split into the following.

  1. The File API: This is the backend service we communicate with from the desktop application in order to work with our data.
  2. The File Queue: The File API is a thin layer we put in front of a queue. The queue gives us some resiliency in case of downstream failures.
  3. The File Processor: This takes messages from the queue, deserialises them (including decoding from Base64) and writes to both the central database, which holds all users’ file information, and the file storage.
  4. File Metadata: This is all users’ file information compiled from all of their local databases. It can be used to sync new clients when they sign in.
  5. File Storage: This is the object storage for the file chunks.
  6. File Synchroniser: When an object is created in the object storage, a file has been updated, so the other clients should be alerted to fetch the new version.
  7. Alert Queues: It is not guaranteed that all clients will be online at this point, so we introduce a queue before the next component. When clients come online the next component can be triggered to sync them.
  8. Chunk Alerter: This is used to alert the various clients that there are updates they need to perform.

Now we have a reasonable logical design, let’s move onto the physical one.

Physical Design

An example physical design

A potential physical design is demonstrated above. Our desktop client is written in Electron, which uses HTML, CSS and JavaScript. This means the watcher, chunker and indexer can be separated out into separate JS classes. We can use a small embedded database to provide for our local needs; SQLite is a good solution.

Our cloud solution is based in AWS. The thin gateway layer is an API Gateway persisting directly to an SQS queue. This queue triggers a Lambda, which writes to an Aurora DB and S3. Depending on how long these writes take, this could be swapped out for an ECS-based service to prevent Lambdas staying alive too long.
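
As a sketch of that processing step, assuming the standard aws-lambda-java-events library (the class name and the elided parsing are illustrative):

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.SQSEvent;

public class FileProcessor implements RequestHandler<SQSEvent, Void> {

    @Override
    public Void handleRequest(SQSEvent event, Context context) {
        for (SQSEvent.SQSMessage message : event.getRecords()) {
            String body = message.getBody(); // the JSON chunk object from the File API
            // ... parse the chunk, Base64-decode its bytes, then:
            // 1. Put the bytes in the file storage bucket (S3).
            // 2. Insert the chunk metadata row (hash, order, URL) into Aurora.
        }
        return null;
    }
}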

We can configure an S3 trigger to set off a Lambda that alerts our various clients there has been a change. This pushes to a queue, which in turn triggers another Lambda.
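
The synchronising Lambda could look something like the below, again assuming aws-lambda-java-events; pushing to the per-client alert queues is left as a comment.

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.S3Event;

public class FileSynchroniser implements RequestHandler<S3Event, Void> {

    @Override
    public Void handleRequest(S3Event event, Context context) {
        event.getRecords().forEach(record -> {
            String key = record.getS3().getObject().getKey();
            // Look up which clients track this file and send a message to
            // each of their alert queues (e.g. one SQS sendMessage per client).
            System.out.println("Chunk updated: " + key);
        });
        return null;
    }
}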

Identify and Resolve Bottlenecks

Although I’m sure there are plenty of optimisations we can make across the design, the core one would be the reading of metadata from our Aurora DB. By introducing a caching layer we can minimise read times for the hottest data.
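
A minimal sketch of the cache-aside pattern this implies: the in-memory map stands in for a real cache such as ElastiCache, and the loader function stands in for the Aurora query. Both names are assumptions for illustration.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class FileMetadataCache {

    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> dbLoader; // fileId -> metadata JSON

    public FileMetadataCache(Function<String, String> dbLoader) {
        this.dbLoader = dbLoader;
    }

    // Serve from the cache when possible; otherwise load from the DB and remember.
    public String get(String fileId) {
        return cache.computeIfAbsent(fileId, dbLoader);
    }
}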

Our final potential design

Extras

There’s always more we could do! This includes non-functional requirements like security and more advanced analytics, or it could be functional additions such as limiting the amount of data a user can store. It’s often useful to think about these things and how you might adapt your system. This kind of mindset can help ensure your designs remain flexible and extensible.

Conclusion

In conclusion, we have talked through how we might go about designing Dropbox, including a concrete physical architecture.


James Collerton

Senior Software Engineer at Spotify, Ex-Principal Engineer at the BBC