System Design Need to Knows
High Level Points for Designing a System
Audience
The purpose of this article is to clarify some core ideas that might be useful for system design. It is aimed at engineers with a reasonable understanding of architecture, and who would like a list of factors to contemplate when drafting ideas. The areas are inspired by the article here.
We will cover the following:
- Load Balancing
- Proxies
- Redundancy/ replication
- CAP theorem
- PACELC theorem
- Consistent hashing
- Quorum
- Leader and Follower
- Heartbeat
- Checksum
The following areas are covered in the linked articles, just scroll to find the relevant section!
- Caching
- Partitioning/ sharding/ indexing
- Bloom filters
- Long polling, web sockets and server side events
- CDNs
There are two final articles in the works, so look out for them soon!
- NoSQL
- Streaming
Argument
Let’s work our way through!
Load Balancing
Load balancers are a way of distributing traffic between multiple servers. Usually these servers will be identical. They act as a reverse proxy: requests come into the load balancer, which then distributes them to the nodes in its cluster.
This improves latency (as we can choose to distribute to the least congested server), availability (as if one server is down we can use another), and maintainability (if we need to patch/ update one server the load balancer can redirect to the others).
Below we have a diagram of how a load balancer may sit in an architecture.
The load balancer sits in front of the servers making up the respective services (A and B). Requests are directed at the load balancer, which then distributes them.
The eagle-eyed amongst you will have noticed that the load balancer can be a point of failure, which means it is often backed up by replicas, ready to take over when required.
Load balancers have the important job of deciding whether or not nodes in the cluster are healthy. Usually they will send a request to an endpoint and look for a 200 response. If they don’t receive one they will assume the service is unhealthy and replace it.
They can also provide more sophisticated functionality like SSL/ TLS termination, and sticky sessions for when we want to keep routing a given session to the same node.
When we want to distribute our load between servers, we can choose from the following load balancing algorithms (a small sketch of two of them follows the list):
- Least connection: Server with fewest active connections.
- Least response time: Server with fewest active connections and lowest average response time.
- Least bandwidth: Currently serving lowest MB/s traffic.
- Round robin: Cycles through a list and sends each request to the next one.
- Weighted round robin: Each server is assigned a weighting according to its processing power. We then assign traffic round robin, but servers with higher weights receive proportionally more requests.
- IP hash: Hash the IP address to assign the server.
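To make a couple of these concrete, here is a minimal sketch of round robin and least connection selection. The Server class and the server names are made up for illustration; real load balancers also track health, timeouts and much more.

```python
from dataclasses import dataclass
from itertools import cycle

@dataclass
class Server:
    name: str
    active_connections: int = 0

servers = [Server("app-1"), Server("app-2"), Server("app-3")]

# Round robin: cycle through the list, sending each request to the next server.
round_robin = cycle(servers)

def pick_round_robin() -> Server:
    return next(round_robin)

# Least connection: pick the server with the fewest active connections.
def pick_least_connections() -> Server:
    return min(servers, key=lambda s: s.active_connections)

# Route a request and track the connection it opens.
target = pick_least_connections()
target.active_connections += 1
print(f"Routing request to {target.name}")
```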
Proxies
A proxy server is an intermediary server that sits between a client and their destination, translating traffic between protocols or networks where needed. Requests go through the proxy server and out to the target machine, usually flowing back through the same proxy.
They are useful for a number of reasons:
- You can control and log access to sites.
- Proxies can cache data/ compress traffic to improve performance. A CDN is an example of a proxy that does some caching.
- You can get better privacy, or access blocked resources, as you can disguise the details of the client machine.
These benefits (especially anonymity) play into the different types of proxy you can have: transparent, which forwards your IP address, anonymous, which removes your IP address, distorting, which fakes an IP address, and high anonymity, which periodically changes the fake IP address (think TOR).
However, there are also a number of risks associated with using proxies. They have access to all of your traffic and so can do things like log your requests. It’s very important to encrypt your traffic when using these services, as they can see some of what you’re sending!
Proxies are generally split into two: forward and reverse.
Forward proxies sit between clients and external networks. Requests go through the forward proxy before they go externally. These are things like VPNs and your work site filters.
Reverse proxies are the opposite and receive requests coming into your network from the outside world. Varnish is a good example of this, and is used for caching. For more information check out my article here.
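As a rough illustration of the reverse case, here is a minimal sketch of a reverse proxy that forwards GET requests to a single hypothetical backend on localhost:9000. Real reverse proxies like Varnish or nginx add caching, connection pooling, TLS and much more.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

# Hypothetical backend this proxy sits in front of.
BACKEND = "http://localhost:9000"

class ReverseProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        # Forward the incoming request to the backend, then relay the response.
        with urlopen(BACKEND + self.path) as upstream:
            status, body = upstream.status, upstream.read()
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Clients talk to port 8080 and never see the backend directly.
    HTTPServer(("localhost", 8080), ReverseProxy).serve_forever()
```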
Redundancy/ Replication
In computing, things fall over. It’s a tragic fact of life. Redundancy is the art of making multiple copies of a service to deal with these issues. For example, we may have multiple copies of the same application code on different servers sitting behind a load balancer. We may also have multiple copies of our database in case one fails.
When a part of our architecture dies we want to swap (automatically or manually) from using this particular instance to another one. This process is known as a failover.
Often when we talk about replication we will talk about shared-nothing architectures and transparency. A shared-nothing architecture means that nodes in our system are independent. If one goes down, we don’t need to worry about it physically affecting the others, and we can fail over comfortably.
Transparency is the idea that if we add a new replica of a service, no other services are aware of it. For example, we don’t want to have to tell our other services the address of our new replica!
All of this plays into the next definition, replication. Replication is the process by which we ensure different nodes across our service have the same state. This could mean replicating data, application code, or configurations. Replication can be a headache in its own right and, as we see in later sections, sometimes we sacrifice some consistency for the sake of speed.
All in all, this is helpful for three main reasons:
- Reliability: If we have some form of replication then we remove single points of failure, and can handle it when infrastructure goes down.
- Performance: As in our load balancer example, having multiple nodes means we can spread work.
- Scalability: We can add more replicas in order to serve more clients!
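As a small sketch of the failover idea above, here is a client that tries a list of hypothetical replicas in order and moves to the next one when a replica is unreachable. The URLs are made up for illustration.

```python
import urllib.error
import urllib.request

# Hypothetical replicas of the same service, in preferred order.
REPLICAS = [
    "http://replica-1.internal/resource",
    "http://replica-2.internal/resource",
    "http://replica-3.internal/resource",
]

def fetch_with_failover(urls=REPLICAS) -> bytes:
    """Try each replica in turn, failing over when one is unreachable."""
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=2) as response:
                return response.read()
        except (urllib.error.URLError, TimeoutError):
            continue  # This replica is down; fail over to the next one.
    raise RuntimeError("All replicas are unavailable")
```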
CAP theorem
The CAP theorem states a distributed system can have at most two of the following properties:
- Consistency
- Availability
- Partition Tolerance
To help our explanation, let’s imagine the below system. This is based off a very useful article found here. We have a client that can send read/write requests to our system, which comprises two nodes, each holding a value.
If we imagine a load balancer, similar to the one we discussed earlier, in between components we can see how this relates to a more real-life scenario.
Consistency is the idea that once we update the value on any server, we get the same value when we read it from any server (or newer if it has been updated again).
Availability is the idea that if we send a request to a server, then as long as the server is up, we should get a response. There can be no ignored requests!
Partition Tolerance is the idea that we can lose an arbitrary number of messages between our servers (server 1 and server 2 in our case). Losing messages is a partition.
So how can we prove the CAP theorem? We can use proof by contradiction (an exciting throwback to my days as a university maths student). Assume we have a system that has consistency, availability and partition tolerance.
Let’s say our system has all three properties. We send a write message to our first server, updating the value from A to B. The system has partition tolerance meaning we can lose messages between servers. We lose the message saying the value has been updated from A to B.
Our client then sends a read request to server 2. Server 2 must send a response (availability), and it has no knowledge of the update on server 1. It replies with value A. This violates consistency! QED!
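The argument is easy to play out in code. Below is a toy model of the two servers and a lossy link between them; the structure is made up purely to mirror the proof.

```python
# Two replicas joined by a link that can drop messages (a partition).
class Node:
    def __init__(self, value):
        self.value = value

server_1, server_2 = Node("A"), Node("A")
partitioned = True  # the replication message between the servers is lost

def write(node, other, value):
    node.value = value
    if not partitioned:
        other.value = value  # replication only succeeds if the link is up

write(server_1, server_2, "B")  # client updates the value from A to B
print(server_2.value)           # server 2 stays available and answers... "A"
# The read returned stale data: with a partition we kept availability
# but gave up consistency, exactly the contradiction in the proof.
```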
PACELC theorem
We can extend this notion into the Partition, Availability, Consistency, Else, Latency, Consistency (PACELC) theorem, which was proposed as a critique of the CAP theorem. Looking at CAP, we notice that we only have to choose between the properties if there is a partition. If no messages are lost we can have all three!
Using PACELC, if our system has a partition (is losing messages) we must select between availability and consistency, else we must select between latency and consistency.
Let’s take this back to our example. Imagine we have a partition. Then we must choose between our second server being available or consistent. It can’t get the updated value from the first server, so it either refuses to respond, or it returns the old value.
Now if we have no partition, then we can either wait for the second server to update from the first (latency), or we can respond immediately with the old value (violating consistency). A diagram is included below to demonstrate this visually.
Consistent hashing
To understand the benefit of consistent hashing, it might be helpful to read this article, specifically the section on data partitioning. It also helps to have understood the previous section on replication. This, and the article here, provided a lot of motivation for the next section.
When we partition data, we want to spread it across multiple servers. This helps our system become more scalable and performant. Additionally, we may want to replicate our data to multiple servers to make it more resilient.
Let’s say we have a set of data and we would like to partition it across five servers. We could theoretically hash each value, then use this hash to pick a server, for example by taking the hash modulo the number of servers.
However, what happens if we want to add or remove a server? Suddenly almost all of our mappings change and our system breaks.
Consistent hashing is used to distribute data so as to minimise the rearrangement needed when servers are added or removed from a system. It functions by arranging servers in a ring.
In the above we have four servers, and our hash function has values between 1 and 40. We have spread these values evenly between our nodes. Each one is provided a token, mapping to the start of their range. Server A has token 1, server B has token 11, and so on.
Whenever we read or write data we take an MD5 hash of the value, compare the result to the value ranges for our servers, and assign it to the appropriate node. In our previous example “New string to save!” would still go to server C.
The difference is in how we redistribute data when a node is added or removed. Previously we had to recalculate all of the mappings for each server. In this case only the next node is affected, as we add or remove data from it.
The important thing to notice is that this can lead to the uneven distribution of data. If server C is removed, passing all of its information to server D, then server B is removed, again passing all of its information to server D, we end up with a large imbalance of storage!
Additionally, we might store backups of our data on other nodes. For example, server A may hold the primary range 0–10, but also a backup of server B’s range 11–20. If server B goes down, suddenly server A is handling double the traffic!
To solve this we utilise the ‘virtual nodes’ concept. At the moment we assign a single range to a single machine. However, we could easily assign multiple, smaller ranges to the same machine.
We distribute virtual nodes across physical servers randomly, trying not to put two contiguous nodes on the same server (remember how adding and subtracting data affects the next node).
Virtual nodes speed up rebalancing, as when a physical server is added or removed its virtual nodes can be spread more evenly across the remaining ones. Additionally, this helps avoid hotspots.
It also means we can use lots of different machines in our cluster. We can just add more virtual nodes to larger servers!
The final thing we will touch on is replication. In our virtual nodes example we replicated each virtual node once. However, we can define the number of replicas we would like using a replication factor, N. A replication factor of two means two copies on different nodes, a factor of three means three copies on different nodes, and so on. When we write a value to a node, this coordinator node is responsible for copying the value to the next N - 1 nodes.
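Here is a minimal sketch of a consistent hashing ring with virtual nodes and a replication factor, using MD5 as above. The server names and the number of virtual nodes are made up for illustration; real implementations (Cassandra, Dynamo and friends) are far more involved.

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, servers, vnodes=8, replication_factor=2):
        self.vnodes = vnodes
        self.replication_factor = replication_factor
        self.ring = []  # sorted list of (token, server) pairs
        for server in servers:
            self.add_server(server)

    def _token(self, key: str) -> int:
        # MD5 hash interpreted as a position on the ring.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_server(self, server: str):
        # Each physical server owns several virtual nodes spread around the ring.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._token(f"{server}#{i}"), server))

    def remove_server(self, server: str):
        self.ring = [(t, s) for t, s in self.ring if s != server]

    def nodes_for(self, key: str):
        # Walk clockwise from the key's token, collecting N distinct servers:
        # the first is the coordinator, the rest hold the replicas.
        start = bisect.bisect(self.ring, (self._token(key), ""))
        owners = []
        for i in range(len(self.ring)):
            _, server = self.ring[(start + i) % len(self.ring)]
            if server not in owners:
                owners.append(server)
            if len(owners) == self.replication_factor:
                break
        return owners

ring = ConsistentHashRing(["server-a", "server-b", "server-c", "server-d"])
print(ring.nodes_for("New string to save!"))  # coordinator plus replica for this key
```

Adding or removing a server only moves the keys adjacent to its virtual nodes, which is exactly the property we were after.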
So when do we use consistent hashing? Mainly we use it when we need to scale up and down storage/ cache usage, or any system that wants to shard data.
Quorum
On the subject of distributed systems and replication, let’s introduce the notion of quorum.
Imagine we have a system with three nodes. A client can read/ write from/ to any of them. The client updates a value on a single node, and this update must be replicated to the others.
Our client writes to server A, which then replicates to server B and server C. Notice, our user is expecting a success response. How do we decide when a write has been successful?
Here we introduce a dichotomy. We could wait until all servers had been updated, which would be safe but slow. Alternatively we could acknowledge the write without waiting for any of the other servers to update, which would be fast but means we may lose data! It sounds as if the best solution is somewhere in between.
Usually we wait until a majority of servers have been updated. So in our three-node cluster the quorum is 2 (3 divided by 2 is 1.5, which rounds up to 2). We can then tolerate the total number of nodes minus the quorum in failures (1 failure here), as we can guarantee the data will be on at least 2 nodes.
We can reverse engineer this to calculate the number of nodes we need for a given number of tolerated failures: to tolerate f failures we need 2f + 1 nodes.
We might use the notion of quorum to update data in a cluster of servers, or for leader election (which we will see in the following section).
Something to recognise is that the number of servers not only affects our resilience, but also our write latency. In practice most clusters choose three or five nodes.
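The arithmetic above is simple enough to write down directly. A small sketch:

```python
def quorum(total_nodes: int) -> int:
    """A majority of the cluster: more than half the nodes."""
    return total_nodes // 2 + 1

def tolerated_failures(total_nodes: int) -> int:
    return total_nodes - quorum(total_nodes)

def nodes_needed(failures: int) -> int:
    """The reverse direction: to tolerate f failures we need 2f + 1 nodes."""
    return 2 * failures + 1

for n in (3, 5, 7):
    print(f"{n} nodes -> quorum {quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```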
Leader and Follower
Having a leader in a distributed system is just the concept of giving one node within your cluster some greater powers. This could be distributing work, altering data, or handling all requests.
Leader election has good and bad points. Pros include:
- Improves efficiency, reduces coordination, simplifies architectures.
- Easier to think about.
- Improves performance by providing a single place to store cached data.
Cons tend to focus on moving away from a distributed model: in general we introduce a single point of failure, a scaling bottleneck, and a single node we must trust. If something goes wrong with our leader it can affect everything.
We can see this relates to the previous issue of replicating data between servers. In the quorum section we covered when we consider a write finished and stable. However, there are cases where we can end up with inconsistent data: for example, a write may reach only three out of five servers, leaving the other two stale. To battle this we can elect one server as the leader. It takes charge and copies results to all of the servers.
Leader election happens as the servers start up. Each one begins an election, and can be in the leader/ follower/ searching-for-leader state. We use a heartbeat (covered shortly) to know if a leader has gone down.
The first criterion for a leader is that it is one of the most up-to-date servers, identified by its generation clock and the latest entry in its write-ahead log (slightly outside the scope of this article). If there are multiple candidates we can use other factors, such as the server rank, ID, or which server started the election.
The server which receives the majority of votes is the leader. The majority is defined using quorum!
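As a very rough sketch of those two checks, the snippet below picks the most up-to-date candidate and tests whether a candidate has a quorum of votes. The server names and numbers are made up; real election protocols (Raft, ZAB) are considerably more involved.

```python
# Hypothetical snapshot of each candidate's state at election time.
candidates = {
    "server-1": {"generation": 4, "log_index": 120},
    "server-2": {"generation": 5, "log_index": 98},
    "server-3": {"generation": 5, "log_index": 101},
}

def most_up_to_date(candidates: dict) -> str:
    # Prefer the highest generation clock, then the furthest write-ahead log.
    return max(candidates, key=lambda name: (candidates[name]["generation"],
                                             candidates[name]["log_index"]))

def is_elected(votes_received: int, cluster_size: int) -> bool:
    # A candidate wins once it holds a majority (quorum) of the votes.
    return votes_received >= cluster_size // 2 + 1

print(most_up_to_date(candidates))                    # server-3
print(is_elected(votes_received=2, cluster_size=3))   # True
```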
Running a leader election within a data cluster works well for smaller groups of nodes. For large data clusters, it’s easier to use an external tool like ZooKeeper. Usually we will then have a single server marked as the controller node, which makes all decisions.
Finally, we want to know when to use this pattern. We will concentrate on three main cases.
- When we require a leader, and there isn’t an obvious candidate.
- When our cluster is doing complex workloads that need good coordination.
- When we need strong consistency, as our leader can guarantee this.
Heartbeat
Heartbeats are checks to ensure a server is still available. In our distributed system each server periodically sends a request to all other servers indicating that it is still available.
Generally, if it takes 1 second for a heartbeat to reach a server, we’ll set a timeout comfortably larger than that (say 5 seconds) and expect a heartbeat within that window. This prevents us from falsely diagnosing servers as failed.
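A minimal sketch of the receiving side, a simple failure detector that marks a peer as down once it has been silent for longer than the timeout. The server names are placeholders.

```python
import time

HEARTBEAT_TIMEOUT = 5.0  # seconds; comfortably larger than the expected delay

# Last time we heard from each peer (hypothetical servers).
last_heartbeat = {"server-a": time.monotonic(), "server-b": time.monotonic()}

def record_heartbeat(server: str):
    """Called whenever a heartbeat arrives from a peer."""
    last_heartbeat[server] = time.monotonic()

def failed_servers() -> list:
    """Any peer silent for longer than the timeout is presumed down."""
    now = time.monotonic()
    return [s for s, seen in last_heartbeat.items()
            if now - seen > HEARTBEAT_TIMEOUT]
```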
Checksum
A checksum (sometimes called a hash sum, or hash) is a string we can use to check data for deviations from an original. Let’s say we have a file, and we want to check if it has been corrupted, contains errors, or has been changed. If we have the original checksum, we can generate a new checksum and compare the two.
The most commonly used algorithms for generating checksums are MD5, SHA-1, and SHA-256. But how do they work?
Let’s imagine our data is a text file containing the string, “Hello, world”. We run it through the SHA-256 algorithm on our terminal.
```
shasum -a 256 example.txt
37980c33951de6b0e450c3701b219bfeee930544705f637cd1158b63827bb390
```
The first line is our command, the second is our generated checksum. Now let’s remove the comma, to just have “Hello World” (a single character difference).
```
shasum -a 256 example.txt
1894a19c85ba153acbf743ac4e43fc004c891604b26f8c69e1e83ea2afc7c48f
```
Same command, same file, totally different checksum.
Generally we use this when downloading files. It helps us notice if the file has been corrupted, or if someone has maliciously swapped it out.
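If we want to automate that check rather than eyeball two hex strings, a small sketch in Python looks like this (the expected checksum would come from wherever the file was published):

```python
import hashlib

def sha256_of(path: str) -> str:
    """Equivalent of `shasum -a 256 <file>`: hash the file's bytes."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path: str, expected_checksum: str) -> bool:
    # Compare against the checksum published alongside the download.
    return sha256_of(path) == expected_checksum
```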
An interesting question is: do these functions always generate different checksums? In fact, security researchers have found collisions for the MD5 and SHA-1 functions, that is, two different files with the same checksum. Hence our use of SHA-256 here.
Conclusion
In conclusion, we’ve gone through a number of the core concepts to do with system design, and there are a number of follow-on articles to look forward to!