How I Would Design… A Search Engine!
A System Design Demonstration
This article is the next in my series on how I would design popular applications. It is recommended (although not strictly necessary) to read the previous posts, which I’ve helpfully compiled in a list here. We will assume a basic familiarity with architecture principles and AWS, but hopefully this post is approachable for most engineers.
First, let’s look at our problem statement.
The System to Design
We would like to create a system similar to Google. Users should be able to enter a term and retrieve a list of web pages relevant to that term.
We also want to be able to highlight the term in the text, showing where it is located in the document.
The system should be low latency, highly available, resilient and scalable.
We have a standard approach to system design which is explained more thoroughly in the article here. However, the steps are summarised below:
- Requirements clarification: Making sure we have all the information before starting. This may include how many requests or users we are expecting.
- Back of the envelope estimation: Doing some quick calculations to gauge the necessary system performance. For example, how much storage or bandwidth do we need?
- System interface design: What will our system look like from the outside, how will people interact with it? Generally this is the API contract.
- Data model design: What our data will look like when we store it. At this point we could be thinking about relational vs non-relational models.
- Logical design: Fitting it together in a rough system! At this point I’m thinking at a level of ‘how would I explain my idea to someone who knows nothing about tech?’
- Physical design: Now we start worrying about servers, programming languages and the implementation details. We can superimpose these on top of the logical design.
- Identify and resolve bottlenecks: At this stage we will have a working system! We now refine the design.
With that said, let’s get stuck in!
Requirements Clarification
The questions I would ask are: how many searches are we expecting a day, how big are our documents, how many documents are we expected to search, what is the average number of documents in a result, and what are the traffic patterns (do we have lots of common searches, or a peak time for searching)?
Let’s say we expect a billion searches a day, and have a billion documents of size 100KB each. Our traffic patterns are fairly consistent and we return 10 documents a search.
Back of the Envelope Estimation
If we have a billion documents of size 100KB, this is 100,000,000,000 KB of data to store, or 100TB, which is a lot. Equally, a billion searches a day is around 11,600 searches a second. If a search returns 10 documents, that is around 1MB of data per search, so we’re sending roughly 11–12 gigabytes of data a second.
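The estimates above can be checked with a quick script. The constants are the figures from our requirements; the results are rough orders of magnitude, not a precise capacity plan.

```python
# Back-of-the-envelope sketch of the estimates above.
DOCUMENTS = 1_000_000_000         # one billion documents
DOC_SIZE_KB = 100                 # 100 KB per document
SEARCHES_PER_DAY = 1_000_000_000  # one billion searches a day
DOCS_PER_RESULT = 10              # documents returned per search

storage_tb = DOCUMENTS * DOC_SIZE_KB / 1_000_000_000   # KB -> TB
searches_per_second = SEARCHES_PER_DAY / (24 * 60 * 60)
result_size_mb = DOCS_PER_RESULT * DOC_SIZE_KB / 1_000  # KB -> MB
egress_gb_per_second = searches_per_second * result_size_mb / 1_000

print(f"Storage: {storage_tb:.0f} TB")            # 100 TB
print(f"Searches/s: {searches_per_second:,.0f}")  # ~11,574
print(f"Egress: {egress_gb_per_second:.1f} GB/s") # ~11.6 GB/s
```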
We can see it’s a fairly high traffic system!
System Interface Design
Now we need to decide how we would like to interact with our service. As we will see in a later section, we can assume that the documents to return are already fed to us, so we only need to design the search endpoint.
For this we can have a single endpoint /search?query=<query text> which returns a 200 response and a body similar to the below.
Error responses will use the standard HTTP error codes.
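As an illustration, a successful response body might contain the matched documents plus the offsets needed for highlighting. The field names here (documentId, highlights and so on) are assumptions for illustration, not a fixed contract:

```python
# A hypothetical 200 response body for /search?query=rapid.
# Field names are illustrative assumptions, not a fixed contract.
example_response = {
    "query": "rapid",
    "results": [
        {
            "documentId": "doc-123",
            "url": "https://example.com/page",
            "title": "An Example Page",
            # Character offsets where the term appears, for highlighting.
            "highlights": [{"start": 42, "end": 47}],
        }
    ],
    "totalResults": 1,
}
```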
Data Model Design
The meat of the problem lies in how we model our data. A large component of this section is taken from the video here.
A naive approach would see us iterate through our collection of documents looking for a term each time we search. When we find the term we would add our document to our result, then return the list. However, each search requires us to look at all of our documents, which is very slow.
A quicker solution is to do some pre-processing and use an index. Our service will trawl the documents, find the terms within the document, then create a map of terms to documents. This means when we search for a term we only need to find it in a single list!
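The contrast between the two approaches can be sketched in a few lines. The three toy documents here are made up for illustration; the point is that the naive scan touches every document per query, while the index is built once and queried with a single map lookup.

```python
# Naive scan vs. inverted-index lookup (illustrative toy corpus).
documents = {
    1: "rapid joyful news",
    2: "joyful bed",
    3: "rapid rapid bed",
}

def naive_search(term):
    # O(total text): scan every document on every query.
    return [doc_id for doc_id, text in documents.items() if term in text.split()]

# Pre-processing: build the term -> documents map once, up front.
index = {}
for doc_id, text in documents.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def indexed_search(term):
    # A single map lookup instead of scanning every document.
    return sorted(index.get(term, set()))
```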
So how do we create this map?
We can assume our system is fed into by a web crawler. If you aren’t familiar with web crawlers you can read up using my article here. This means our system will receive a steady flow of documents that require indexing.
Initially we need to do some processing of our documents. We will strip out all of the HTML tags, image links and technical components of the page, as we won’t need these for searching.
We then need to do some additional work. We convert all words to lower case, carry out stemming and lemmatisation (converting words like rapid, rapidly, rapidity to just rapid for simplicity) and remove stop words (the, and, I etc.). We then have a corpus of documents with meaningful terms.
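A minimal sketch of that normalisation step is below. Real systems use proper stemmers and lemmatisers (e.g. the Porter stemmer); the suffix list and stop-word set here are deliberately tiny, simplified assumptions.

```python
# Simplified normalisation: lower-case, strip stop words, crude stemming.
STOP_WORDS = {"the", "and", "i", "a", "to"}
SUFFIXES = ("ity", "ly")  # tiny illustrative suffix list, not a real stemmer

def normalise(text):
    terms = []
    for word in text.lower().split():
        word = word.strip(".,!?")
        if word in STOP_WORDS:
            continue  # drop stop words entirely
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]  # rapidly/rapidity -> rapid
                break
        terms.append(word)
    return terms
```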
From there we list all of the terms and mark off which web pages contain them.
There are a couple of ways we could store this information. The first is a binary representation, with one bit per document. For example, rapid maps to 111 (it appears in documents 1, 2 and 3), bed maps to 100 (document 1 only), and joyful maps to 110 (documents 1 and 2). This is concise, but doesn’t give us enough detail to do our required highlighting.
To highlight words we need to develop the idea a little further. We maintain a list of which documents our terms occur in, and where in the document they happen.
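This positional variant maps each term to the documents it occurs in and the word positions within each. A small sketch, using a made-up two-document corpus:

```python
# A positional inverted index: term -> {doc_id: [word positions]}.
# Positions are what let us highlight where a term sits in a document.
from collections import defaultdict

documents = {
    1: ["rapid", "joyful", "news", "today", "rapid"],
    2: ["joyful", "rapid", "bed"],
}

index = defaultdict(dict)
for doc_id, terms in documents.items():
    for position, term in enumerate(terms):
        index[term].setdefault(doc_id, []).append(position)

# Where does "rapid" appear in document 1? Positions 0 and 4.
print(index["rapid"][1])  # [0, 4]
```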
The next thing we need to consider is how we combine terms in a search. For example, if someone searches ‘rapid joyful’ we might want to return the locations for pages 1 and 2, but not page 3 (conjunction, we need both words). Alternatively, for ‘joyful news’ we may want to return all pages (disjunction, we need either word). These can be implemented using an intersection or union of the result sets.
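Conjunction and disjunction fall out naturally from set operations on the postings lists. The postings below match the example: rapid in pages 1 and 2, joyful in all three, news in page 3 only.

```python
# Combining terms with set intersection (AND) and union (OR).
index = {
    "rapid":  {1, 2},
    "joyful": {1, 2, 3},
    "news":   {3},
}

def search_and(*terms):
    # Conjunction: every term must appear, so intersect the postings.
    sets = [index.get(t, set()) for t in terms]
    return sorted(set.intersection(*sets)) if sets else []

def search_or(*terms):
    # Disjunction: any term may appear, so union the postings.
    return sorted(set().union(*(index.get(t, set()) for t in terms)))
```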
We can also do some calculations to see how relevant our results are. For example, the minimum distance between rapid and joyful in document 1 is 5, whereas in document 2 it’s 4, which suggests document 2 is more relevant.
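That proximity score is just the smallest gap between any pair of positions for the two terms; the positional postings below are made up to match the figures in the example.

```python
# Proximity scoring: smaller minimum distance between two terms
# suggests they are more closely related in the text.
def min_distance(positions_a, positions_b):
    return min(abs(a - b) for a in positions_a for b in positions_b)

# Hypothetical positional postings for two documents.
doc1 = {"rapid": [0], "joyful": [5]}
doc2 = {"rapid": [2], "joyful": [6]}

print(min_distance(doc1["rapid"], doc1["joyful"]))  # 5
print(min_distance(doc2["rapid"], doc2["joyful"]))  # 4 -> more relevant
```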
From all of the above, it appears a NoSQL-based approach fits: we don’t need ACID properties, our traffic requires high availability, and we can use a key (term) to value (documents and placements) store.
When we retrieve the documents themselves we can do something similar using a key (document Id) to value (document information as specified in the system interface section).
Logical Design
A basic logical design is contained below.
We will assume we don’t need to worry about the client application. Our user makes a request to our API Gateway, which we will use for throttling and other cross-cutting tasks.
This forwards search requests to our search service. The search service has two responsibilities. The first is querying the search index to retrieve the list of documents and placements. The second is querying the search storage to retrieve the document summaries and aggregating them with the index results.
Going on behind the scenes is the indexer. This is fed into by a web crawler which is responsible for trawling the internet for documents. The indexer extracts the relevant information and stores them in the search index and search storage to be queried later.
Note, we could have used a single table design and stored our indexes and search information in the same table, however it is conceptually easier to understand them as two separate entities.
Physical Design
We can then transpose that into a physical design.
In this physical design our API Gateway uses AWS API Gateway. It forwards requests on to an ECS-based service to do the searching of indexes and information. We could have used a Lambda; however, with this many requests it doesn’t make sense to pay a start-up cost per invocation, so we may as well have an always-on, scalable service in place.
The storage solutions both use DynamoDB, which provides scalable, sharded data storage with consistent response times. The indexer itself also sits on ECS.
Identify and Resolve Bottlenecks
The main bottleneck we may have is hot data, or terms that are queried often. To resolve this we can use a caching system. For this we could cache at the API Gateway or Search Service level. We will pick the Search Service. We can use a Least Recently Used or Least Frequently Used eviction policy.
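An LRU cache of this kind is straightforward to sketch. In practice it would sit inside the search service (or a shared cache such as ElastiCache); the capacity and in-process implementation here are illustrative assumptions.

```python
# A minimal LRU cache for hot query results, built on OrderedDict.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, query):
        if query not in self.entries:
            return None
        self.entries.move_to_end(query)  # mark as most recently used
        return self.entries[query]

    def put(self, query, results):
        self.entries[query] = results
        self.entries.move_to_end(query)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
```

A Least Frequently Used policy would instead track hit counts per query, which suits stable “hot term” traffic at the cost of adapting more slowly to new trends.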
In conclusion, we have summarised how one might go about designing a search service. There are already plenty of implementations (OpenSearch and Elasticsearch), but understanding the problem from the ground up will be helpful when you go and put them to use!