A Crash Course in Google BigQuery

Data Warehouse Project

James Collerton
3 min read · Feb 20, 2023
Me in the office 💁‍♀️

Audience

This article is aimed at engineers looking to gain an understanding of BigQuery, Google’s managed data warehouse solution.

It will help to have a surface-level understanding of Google Cloud Platform (covered in my article here). Aside from that, as long as you have a good grounding in general data storage you should be OK.

Argument

First, what is a data warehouse? In general terms, it’s a data system designed to support business intelligence. It takes in data from multiple sources, often consolidating it into its own storage, and may offer analytics tools on top.

BigQuery is Google’s fully managed offering. We don’t need to worry about provisioning machines; we just focus on our data. It offers easy-to-use methods of ingesting information, and data can be queried using SQL via the console, the CLI, or client libraries.
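As a quick taste of the client-library route, here is a minimal sketch of running a query from Python. It assumes the google-cloud-bigquery package is installed and you have authenticated (for example via `gcloud auth application-default login`); the query runs against one of Google’s public datasets.

```python
# A minimal sketch of querying BigQuery from Python. Assumes the
# google-cloud-bigquery library is installed and default credentials
# are configured; `bigquery-public-data.usa_names` is a Google-hosted
# public dataset.

QUERY = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""


def run_query(sql: str):
    """Submit a SQL query and return the result rows as a list."""
    # Deferred import so the sketch can be read/loaded without the library.
    from google.cloud import bigquery

    client = bigquery.Client()  # picks up default credentials and project
    return list(client.query(sql).result())


if __name__ == "__main__":
    for row in run_query(QUERY):
        print(row["name"], row["total"])
```

The same SQL could equally be pasted into the console’s query editor or run with the `bq` CLI; the client library is just the programmatic option.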

Additionally, many popular visualisation tools build on top of it. We can choose whether to store our data in BigQuery itself, or just use it as a query layer on top of existing storage.

Storing Data in BigQuery

Before we go into how to get data in and out of BigQuery, let’s think about the underlying data storage. The two main components we will focus on are datasets and tables, although we can briefly cover some of the others.

Datasets are the top level of the hierarchy and organise and control access to tables (and views). Datasets are based in a given location; they aren’t global. They don’t store data themselves, they just define a group that other resources can belong to.

Tables then belong to a dataset, and are where the data resides. They are defined by a schema containing the column names, data types and other metadata. They come in three flavours: native tables, whose data is held in BigQuery’s own storage; external tables, which sit over data held outside BigQuery (for example in Cloud Storage); and views, which are virtual tables defined by a SQL query.

From there they look almost identical to SQL tables. On top of this you have functionality like routines and search. Routines let you define and call stored-procedure-like functionality, whereas search lets you find data elements in unstructured data like text or JSON.
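The dataset → table → schema hierarchy described above can be sketched with the Python client library. This is a hedged example: the project, dataset, table, and column names are all hypothetical, and it assumes the library is installed and authenticated.

```python
# A sketch of creating a dataset and a table inside it. The names
# (my_project, analytics, page_views) are hypothetical placeholders.

DATASET_ID = "my_project.analytics"     # dataset: groups tables, controls access
TABLE_ID = f"{DATASET_ID}.page_views"   # table: lives inside the dataset

# The schema lists each column's name, data type, and mode.
SCHEMA = [
    ("user_id", "STRING", "REQUIRED"),
    ("url", "STRING", "REQUIRED"),
    ("viewed_at", "TIMESTAMP", "NULLABLE"),
]


def create_dataset_and_table():
    from google.cloud import bigquery

    client = bigquery.Client()

    dataset = bigquery.Dataset(DATASET_ID)
    dataset.location = "EU"  # datasets are regional, not global
    client.create_dataset(dataset, exists_ok=True)

    fields = [bigquery.SchemaField(n, t, mode=m) for n, t, m in SCHEMA]
    client.create_table(bigquery.Table(TABLE_ID, schema=fields), exists_ok=True)
```

Note how the location is set on the dataset, not the table — every table inherits its dataset’s location, which matters when joining data across regions.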

Getting Data In

So now we have a good idea of how data is stored, let’s think about how we might get it in. We can do this by:

  1. Batch uploading a set of data records. You can do this by uploading from local or cloud storage files, or from Google or third-party applications.
  2. Streaming records. This sends records in near real time. For more information on streaming, read my article here.
  3. Generating data by querying. This does what it says on the tin: you can run a SQL query on your data and reinsert parts of it.
  4. Using third parties. Some third parties provide loaders for BigQuery.
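The first two options can be sketched with the Python client library. Again this is an illustrative sketch: the table and bucket names are hypothetical, and it assumes the library is installed and credentials are configured.

```python
# Sketches of batch loading (option 1) and streaming (option 2).
# TABLE_ID and the Cloud Storage URI are hypothetical placeholders.

TABLE_ID = "my_project.analytics.page_views"


def batch_load_from_gcs(uri: str = "gs://my-bucket/page_views.csv"):
    """Option 1: batch load a CSV file sitting in Cloud Storage."""
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        autodetect=True,  # infer the schema from the file
    )
    # .result() blocks until the load job finishes.
    client.load_table_from_uri(uri, TABLE_ID, job_config=job_config).result()


def stream_rows(rows: list[dict]):
    """Option 2: stream JSON records in near real time."""
    from google.cloud import bigquery

    client = bigquery.Client()
    errors = client.insert_rows_json(TABLE_ID, rows)
    if errors:
        raise RuntimeError(f"Streaming insert failed: {errors}")
```

Batch loads are free and suit bulk historical data; streaming inserts cost money but make rows queryable within seconds, so the choice usually comes down to latency requirements.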

Getting Data Insights Out

Now we know how our data is stored, and how we get data in, how do we get useful data insights?

The primary way we do this is via SQL querying. Once we have our data in a table within our dataset we can query it using the Google console or third-party tools.
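When querying programmatically, it’s worth parameterising queries rather than interpolating values into SQL strings. A hedged sketch, reusing the hypothetical page-views table from earlier and assuming the client library is installed:

```python
# A sketch of a parameterised query. The table name is a hypothetical
# placeholder; @since is bound safely via a query parameter rather
# than string interpolation.

SQL = """
    SELECT url, COUNT(*) AS views
    FROM `my_project.analytics.page_views`
    WHERE viewed_at >= @since
    GROUP BY url
    ORDER BY views DESC
"""


def top_pages(since: str):
    """Return the most-viewed pages since the given ISO timestamp."""
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("since", "TIMESTAMP", since),
        ]
    )
    return list(client.query(SQL, job_config=job_config).result())
```

Parameters protect against SQL injection and let BigQuery cache the query shape across different values.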

Google also offers a number of other analysis tools for things like geospatial analysis, machine learning and business intelligence.

Conclusion

This rounds out our article nicely! At the start we said the purpose of a data warehouse was to enable business intelligence. Since then we’ve introduced Google’s offering in the space, seen how it organises data into datasets and tables, and covered how to get data in and insights out.


James Collerton

Senior Software Engineer at Spotify, Ex-Principal Engineer at the BBC