A Crash Course in NoSQL vs SQL

Storing your data!

10 min readJan 14, 2022

If you like this article, you’ll love the SQL!

Audience

This article is aimed at developers looking to explore the difference between relational and non-relational data stores. It is aimed at people with at least a fleeting understanding of the differences already, and looks to expand on those themes.

If you’ve never encountered either paradigm then I would do a bit of reading beforehand. Otherwise, plough on!

Argument

Over the rest of this article we’re going to explore a lot of the common threads that are drawn out when comparing the two patterns. We’ll aim to explain them in a little more depth than usual, and also examine some of the exceptions or rule-bending that goes along with them.

Schema vs No Schema

One of the core things that people talk about when discussing SQL vs NoSQL is the use of schemas. In relational databases we have a relational model. There are tables with columns that can be joined and queried.

How NoSQL databases store data differs depending on the type of database. Some of the core ones are:

Document based: CouchDB, Cosmos DB, MongoDB, Amazon DocumentDB
Key-value pairs: DynamoDb (kind of), Redis
Graph databases: Neo4j, Amazon Neptune, Oracle Spatial and Graph
Column-oriented databases: MariaDB Columnstore, Apache Hbase, Amazon Redshift

Let’s explore them.

A document database stores data in JSON, XML or BSON. We can nest documents, and index them to speed up searching. Let’s have a look at a document to start us off.

This is a user JSON document. We have a nested address object, and a nested list of purchases. This is the opposite of normalisation! Note, in engines like MongoDB it is still possible to normalise data models, but this comes with its own issues.

Document databases are particularly good as they provide a flexible schema, removing the need to consistently be updating SQL table structure.

Use cases include:

Content management such as blogs or video platforms. We can store each post as a document, and if we change the model we don’t need to update all the existing ones.
Catalogs such as lists of products. Products have different attributes, so it makes sense to have different schemas.
Comments or social media posts. We can store these in a document if they have different fields (attached images, reactions etc.)

Key-Value pairs are even simpler still, and comprise of a key (which we use to access the data), and a value (the data itself).

Use cases include:

Shopping carts. If we have a busy day on our shopping site we will need to service lots of customers. We can store a customers Id as their key, and their cart as the value for quick, reliable retrieval.
User preferences/ profiles. When a user logs into a site we may want to retrieve things like their recommendations and settings. We can store it alongside their user Id, as we will never want to retrieve it by anything but this.
Cache (for in-memory data stores)

Graph Databases focus on relationships between data. To brush up on graphs you can see my article here. We have our nodes, which are our pieces of data: places, things etc. These nodes have properties: name, date of birth etc. We also then have edges.

The edges are first class citizens in the fact that they can also have properties or weighting.

This type of database is best when we’re looking to understand the relationship between data.

Use cases include:

Fraud detection: A full white paper is available here, but you can use graphs to detect synthetic identities.
Community detection: We can discover tightly connected nodes. For example, this can be useful for finding suggested friends in a social media site. A good article can be found here.
Recommendation systems. Imagine we have a TV show A (a node), which has a number of features (action, drama, romance, also nodes). We then want to find a recommendation for viewers who like TV show A. We could traverse to the features, then out again to other shows to find similar ones. Having weighting amplifies this, because we can indicate how much action, drama or romance our show has.

An example of using a graph, we can traverse from TV Show A to a different TV show which shares similar features.

Finally, column-oriented databases. These are mainly used for analytics. Imagine we have the below data.

We have essentially pivoted the data. The boxes at the bottom demonstrate the data held in memory. We now see all of the price data is held contiguously, and is much easier to access.

Another benefit is compression. As we have the same data type in the same chunk of memory we can use compression techniques fitting to that data type to reduce our storage needs.

Use cases include:

Data warehouses and analytics: Updating can be hard as we need to update several places in memory. However with data warehouses we tend to batch update. We then run analytics on top of this, and it’s easy to pull things like all price data quickly.

Vertical vs Horizontal Scaling

Another thing that comes up frequently with relational vs non-relational storage is vertical vs horizontal scaling.

It is recommended to read this article first, especially the part on partitioning and sharding. I will assume you’re familiar with both. To summarise:

Vertical scaling: Making the machine with our data on more powerful.
Horizontal scaling: Adding more machines to store our data on.

Relational databases are generally scaled vertically. Think about our table structure. If we want to query it, and the information lies on different physical machines we need to go to the separate machines, retrieve the data, aggregate it and bring it back. This can be costly and time consuming (though not impossible!).

This is compounded if we need to orchestrate things like transactions and referential integrity.

Non-relational databases are slightly different. Initially, a lot of them prioritise availability over consistency. This means we don’t need to worry too much about the above.

Additionally, we don’t tend to do complex queries relying on joins of data stored on separate machines.

In a document collection we might use a shard key to distribute the data between nodes. This key is comprised of the field (or fields) we query by. When we query an orchestrator knows where each of the documents are physically stored and can go and retrieve them.

In a key-value store we can shard by the key. This uses the same principle, but there is only one option for the key.

Sharding graph databases is a little bit different, and not as common as in the other three types. An excellent explanation is found here. However, if we find we have a graph that suits a level of separation it can work. We may need to redundantly copy some entities between shards to maintain complete graphs.

We finish with columnar data storage. This works similar to document collections and key value stores. In RedShift we have sort and distribution keys. Sort keys order the data on disk. Distribution keys are used to spread our data between nodes.

Approaches to CAP and PACELC

If you aren’t familiar with the CAP and PACELC theorems you can brush up using my article here. This next section mainly focuses on the CAP component, I’ll assume you have a rough understanding.

Relational databases tend to focus on consistency and availability over partition tolerance. This is played into by the fact they normally sit on a single machine.

In the case of having a replica of your write server we can either guarantee that any write will have happened on the replica (high consistency) or do this asynchronously. In the former case this adds extra latency and dependencies.

In comparison, NoSQL databases favour availability and partition tolerance over consistency. This has a lot to do with the fact they are often sharded. Consistency can be achieved using consensus or quorum (covered in the article here).

Replicas are spread between nodes, so in non-relational databases you can still complete writes even if a node falls over or disconnected. It is then the duty of the other nodes to update the new or reconnected node.

Complexity of Queries

So far in the article we have seen that relational databases tend to be better for complex query patterns involving a lot of joins, whereas non-relational databases are better for larger data sets which require less sophisticated or specific querying.

Document collections tend to provide an interface that allows us to query by multiple fields (see here for MongoDb). Key-value stores tend to only let us query by key. Columnar databases generally use a richer SQL language.

Graph databases also allow for more rich query methods. We can express relationships as part of a request. For example, in Neo4j we may use something like:

MATCH (:Person {name: 'James'})-[:PRESENTED]-&gt;(programme:Programme)
RETURN programme

Which returns all of the programmes James has presented!

Something particularly interesting is the use of the single table pattern. We will investigate this within the parameters of DynamoDb.

Let’s imagine we have the below table structure.

In this relational model we have a channel, which hosts a programme. This programme has a voting system where users can vote for their favourite contestant. We collect these votes each second to later total them up.

Let’s say we want to put this information into a key-value store like DynamoDb. How would we allow the rich querying we need?

We first of all need to decide how we would like to access our data. Non-relational data stores can require us to know a lot about our data up front in order to partition them effectively.

Let’s say we have the following data access requirements:

Get the results for each vote by contestant.
Get a list of programmes by channel.
Get the winner for all votes in a channel.
Get the top three from a programme.
Get the second-by-second results for contestants over all programmes.

To do this we use something from graph theory called an adjacency list. You can read more about it in my article here. The idea is that each data item links to connected relevant ones.

Now, in a key-value store we need to query by the key. However, there’s no rule saying that our key needs to represent the same thing every time. We could have some keys that map to one thing, and some keys that map to another. Let’s expand with an example.

The sort key was briefly discussed earlier, but is a way of ordering data within a partition. We also need to introduce two new concepts:

Local Secondary Indexes: A separate sort key using the same partition key.
Global Secondary Indexes: Can use a different partition key and sort key.

Now let’s say we have these three properties of the table.

Partition key and sort key as displayed above.
Local secondary index with second sort key as ‘numeric’.
Global secondary index with partition key as sort key and second sort key as ‘numeric’.

How can we fulfil all the requirements of our system?

Get the results for each vote by contestant: Search for partition key with the contestant Id. This returns all results for that contestant. We can aggregate their results to a total.
Get a list of programmes by channel: Search for partition key with channel Id to return the list of programmes partition keys
Get the winner for all votes in a channel: Get programmes as before, then query by partition key and filter results.
Get the top three from a programme: Search for partition key with programme Id and sort by numeric using the local secondary index.
Get the second-by-second results for contestants over all programmes: Search for partition key with contestant Id.

ACID vs BASE

To better understand ACID I would read my article here. Assuming you’ve caught up let’s discuss BASE. The acronym stands for:

Basic Availability: We have access to the data almost all the time.
Soft-state: We don’t need our data to be consistent all the time.
Eventual consistency: All of our data will be consistent at some point.

This should be familiar from our conversations in the CAP theorem section!

When to Use

Although we covered use cases previously, this section is designed to outline when we want to pick one model over the other.

NoSQL is good when:

We have lots of data of different structures.
We have a very large amount of data and we can’t easily handle it in a SQL model, or we need to shard it.
We need zero downtime and can sacrifice it at the cost of consistency.

SQL is good when:

We have well-structured data.
Our data requires lots of complex querying.
We require ACID properties and transactions (although some non-relational models like MongoDB can do the latter).
We need data integrity using things like foreign keys.

Reading the article I have no doubt you’ve seen that non-relational comes in many flavours and there is no golden set of rules for picking one over the other. You just need to understand your use case and your options!

NewSQL, the shape of things to come…

The very final (I promise) thing we’ll touch on is NewSQL. One of the popular open-source implementations is CockroachDb, whereas Google offers Spanner.

The idea behind both is to provide a SQL implementation that can be distributed (but maintain ACID properties). Cockroach does this by converting SQL ideas to a key-value paradigm. Spanner does this by using consensus protocols (read more in my article here). This is a very broad stroke explanation, but you can dig deeper into their documentation!

Conclusion

In conclusion we’ve covered the basics of SQL vs No-SQL, when we might use it and the relevant tradeoffs.

A Crash Course in NoSQL vs SQL

Storing your data!

Audience

Argument

Schema vs No Schema

Vertical vs Horizontal Scaling

Approaches to CAP and PACELC

Complexity of Queries

ACID vs BASE

When to Use

NewSQL, the shape of things to come…

Conclusion

Written by James Collerton

No responses yet