A Crash Course in Google Bigtable

There will be no sequel

James Collerton
5 min read · Feb 18, 2023
A Big Table

Audience

This article is aimed at engineers with a reasonable understanding of cloud computing, specifically Google Cloud Platform. For a refresher you can use my article here.

You will also need a base understanding of NoSQL and its use cases, which you can pick up from another one of my pieces here!

Within the article we will be exploring GCP Bigtable, its properties and its use cases.

Argument

To best understand Bigtable, perhaps we should start with the storage model.

Data Storage Model

First up we have the table. This is the top-level logical entity where we store information. We tend to have a small number of tables, each of which contains similar data.

For example, if we were building a commenting system we may store individual comments and the top-level thread information in the same table.

Bigtable is a sorted key-value store. The underlying data model resembles a table of rows, each of which can only be retrieved by its row key. Rows are sorted lexicographically by key, so we can store slightly different kinds of data in the same table: give each kind its own key prefix and similar rows end up stored next to each other!

I’ve added an example below to demonstrate, where the key prefix is comm. We’ll also use this to illustrate the other properties.

An example of a table for storing comments. For more information on storing similar data in a single table, look at the ‘single table design pattern’ in my article here.

We are familiar with the notion of columns from traditional data storage. However, Bigtable uses a column family model, in which similar columns are grouped together.

Expanding on the example above, there are three column families:

  1. Author: This is the person who wrote the comment.
  2. Content: This is the content of the comment.
  3. Link: This represents all the links in the comment. For example, in comm:2 Andy added links to the words ‘news’ and ‘social’ which point to the described sites.

The words (e.g. ‘news’ and ‘social’) are known as column qualifiers and we use these in combination with the column family to identify unique columns.

This highlights an interesting property of Bigtable: it can be very sparse. Most comments won't share the same link text, so most rows have no entry in most link columns. This isn't an issue, as empty cells take up no space.

Additionally, you’ll notice the [t1, t2, t3] entries. These represent cells within a column: each time we update a value, Bigtable writes a new timestamped cell, keeping previous versions available for querying.
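To make this concrete, here’s a minimal sketch using the Python client library (google-cloud-bigtable). The project, instance and table IDs are placeholders, and it assumes a table with the author, content and link column families already exists.

    # A minimal sketch: writing one comment row with the Python client.
    # Project, instance and table IDs are placeholders.
    import datetime

    from google.cloud import bigtable

    client = bigtable.Client(project="my-project")
    table = client.instance("my-instance").table("comments")

    # The 'comm' prefix keeps comment rows stored next to each other.
    row = table.direct_row(b"comm:2")

    # Column family plus qualifier identifies a column. Every write adds
    # a new timestamped cell, giving us the [t1, t2, t3] history.
    now = datetime.datetime.now(datetime.timezone.utc)
    row.set_cell("author", "name", "Andy", timestamp=now)
    row.set_cell("content", "text", "Some news and social links!", timestamp=now)
    row.set_cell("link", "news", "https://news.example.com", timestamp=now)
    row.set_cell("link", "social", "https://social.example.com", timestamp=now)
    row.commit()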

Physical Implementation

Now that we have a conceptual model, let’s think about how it works under the hood.

The physical model for Bigtable

The important thing to note is the separation between the Bigtable cluster (orange and green) and the storage layer (red).

Requests are handled by the Bigtable cluster, which means adding more nodes increases throughput, while adding more clusters lets you route different types of requests (say, batch analytics versus user-facing serving) to separate hardware.

A Bigtable instance is a data container. It has one or more clusters, each in a different zone, and each cluster contains one or more nodes. Clusters replicate between themselves in an eventually consistent manner.

Tables are defined at the instance level. We can have more than one table per instance.
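As a rough sketch of how the hierarchy fits together, here’s the Python admin client creating an instance with a single cluster, then a table inside it. All IDs and the zone are placeholders, and the client needs admin=True.

    # A rough sketch of the instance > cluster > table hierarchy.
    # IDs and zone are placeholders; requires an admin client.
    from google.cloud import bigtable
    from google.cloud.bigtable import enums

    client = bigtable.Client(project="my-project", admin=True)

    # The instance is the data container...
    instance = client.instance(
        "my-instance", instance_type=enums.Instance.Type.PRODUCTION
    )

    # ...holding one or more clusters, each pinned to a single zone.
    cluster = instance.cluster(
        "my-cluster-a",
        location_id="europe-west2-a",
        serve_nodes=3,
        default_storage_type=enums.StorageType.SSD,
    )
    instance.create(clusters=[cluster])

    # Tables live at the instance level, declared with column families.
    # None means no garbage-collection rule (not covered in this article).
    table = instance.table("comments")
    table.create(column_families={"author": None, "content": None, "link": None})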

However, the data itself is stored on Colossus, Google’s distributed file system, sharded by contiguous ranges of row keys into tablets, which are held in SSTable format.

What this means is rebalancing and recovery are very fast: moving a tablet between nodes only updates metadata on the Bigtable nodes, while the underlying data stays where it is.

Scaling

As covered previously, we can scale by manually adding more nodes to the cluster. Google also offers autoscaling, which tunes the number of nodes automatically.

You can drive this using CPU or storage utilization targets, and set the minimum and maximum number of nodes you would like in your clusters. You can configure all of this separately per cluster!
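For illustration, enabling autoscaling from the command line looks roughly like the below. Treat the exact flag names as an assumption from memory, to be verified against the gcloud docs.

    # Assumed gcloud flags - double-check against the current docs.
    gcloud bigtable clusters update my-cluster-a \
      --instance=my-instance \
      --autoscaling-min-nodes=1 \
      --autoscaling-max-nodes=10 \
      --autoscaling-cpu-target=60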

App Profiles

Something often heard in relation to Bigtable is the notion of app profiles. An application connecting to Bigtable is responsible for providing one to tell the instance two things (sketched in code after the list):

  1. Which clusters to route the application’s incoming traffic to.
  2. Whether your application allows single-row transactions (slightly out of the scope of this article).
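Here’s a hedged sketch of both, assuming the Python client’s app profile API and placeholder IDs: we create a profile pinned to one cluster, then open a table through it.

    # A hedged sketch: create an app profile routed to one cluster, then
    # open a table through it. All IDs are placeholders.
    from google.cloud import bigtable
    from google.cloud.bigtable import enums

    client = bigtable.Client(project="my-project", admin=True)
    instance = client.instance("my-instance")

    profile = instance.app_profile(
        "comment-writer",
        routing_policy_type=enums.RoutingPolicyType.SINGLE,  # one cluster
        cluster_id="my-cluster-a",
        allow_transactional_writes=True,  # opt in to single-row transactions
    )
    profile.create()

    # Requests made through this table object use the profile's routing.
    table = instance.table("comments", app_profile_id="comment-writer")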

Resiliency

Within your instance you can only have one cluster per zone. However, once you have multiple clusters spread out over multiple zones, replication between them begins automatically. By default this happens in an eventually consistent manner.

To ensure resiliency Bigtable can failover. There are two ways this can be done: manually (we adjust an application’s app profile to point to a responsive cluster) and automatically (an app profile specifies multiple clusters and Bigtable routes to the appropriate one).

Reads, Writes and CBT

The final thing we will do is look at how interacting with a Bigtable instance works in practice.

There are multiple write methods for Bigtable, each sketched in Python after the list.

  1. Simple write: As it says on the tin, this mutates a row based on its row key.
  2. Increment and append: This is used to update (increment or append to) an existing value in a table.
  3. Conditionally write: Writes based on a condition. If the condition is not satisfied nothing happens.
  4. Batch writes: Groups writes together, removing some of the latency of sending them one at a time.
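Here’s a rough sketch of all four in the Python client. The IDs are placeholders from the earlier snippets, and the stats column family used for the counter is assumed to exist.

    # A rough sketch of the four write styles. IDs are placeholders and
    # a 'stats' column family is assumed for the counter example.
    from google.cloud import bigtable
    from google.cloud.bigtable import row_filters

    client = bigtable.Client(project="my-project")
    table = client.instance("my-instance").table("comments")

    # 1. Simple write: mutate a row directly by its key.
    row = table.direct_row(b"comm:3")
    row.set_cell("author", "name", "Priya")
    row.commit()

    # 2. Increment and append: read-modify-write on existing values.
    rmw = table.append_row(b"comm:3")
    rmw.increment_cell_value("stats", "likes", 1)  # 64-bit counter
    rmw.append_cell_value("content", "text", b" (edited)")
    rmw.commit()

    # 3. Conditional write: only applied if the filter matches the row.
    cond = table.conditional_row(
        b"comm:3", filter_=row_filters.ColumnQualifierRegexFilter(b"name")
    )
    cond.set_cell("author", "name", "Priya K", state=True)  # on match
    cond.commit()

    # 4. Batch write: group mutations to cut per-request latency.
    rows = []
    for i in (4, 5, 6):
        r = table.direct_row(f"comm:{i}".encode())
        r.set_cell("content", "text", f"Comment number {i}")
        rows.append(r)
    statuses = table.mutate_rows(rows)  # one status per row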

Reads have a slightly simpler API, again sketched in Python after the list.

  1. Single row reads: Used to read one row based on a row key.
  2. Scans: Reads multiple rows based on row key prefixes or range start and end keys. With scans you can also add filter clauses to remove some of the results.
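And a similar hedged sketch for reads, with the same placeholder names:

    # A rough sketch of single-row reads and scans.
    from google.cloud import bigtable
    from google.cloud.bigtable import row_filters
    from google.cloud.bigtable.row_set import RowSet

    client = bigtable.Client(project="my-project")
    table = client.instance("my-instance").table("comments")

    # 1. Single row read by key (returns None if the row doesn't exist).
    row = table.read_row(b"comm:2")
    if row is not None:
        cell = row.cells["author"][b"name"][0]  # newest cell first
        print(cell.value, cell.timestamp)

    # 2. Scan by key prefix, filtered to the latest cell per column.
    row_set = RowSet()
    row_set.add_row_range_with_prefix("comm:")
    for row in table.read_rows(
        row_set=row_set, filter_=row_filters.CellsColumnLimitFilter(1)
    ):
        print(row.row_key)

    # Or scan an explicit key range with start and end keys.
    for row in table.read_rows(start_key=b"comm:1", end_key=b"comm:5"):
        print(row.row_key)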

One way of carrying out this functionality is to use the cbt CLI tool.

This is a neat little wrapper around some command line functionality for interacting with Bigtable. We make a small config file telling cbt which project and instance to look at. In return we can run commands like cbt ls for listing tables, and cbt ls <table-name> for listing the column families of that table.
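For example, with a ~/.cbtrc along these lines (values are placeholders):

    project = my-project
    instance = my-instance

we can run commands like:

    cbt ls                                   # list tables in the instance
    cbt ls comments                          # list its column families
    cbt read comments prefix=comm: count=5   # scan a few rows by prefix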

Conclusion

This covers the very basics of Bigtable. We haven’t addressed things like encrypting data, garbage collection or backups! However, this is enough to get you started.


James Collerton

Senior Software Engineer at Spotify, Ex-Principal Engineer at the BBC