Lightning fast database querying with the R API.
What is Google Big Query?
Big Query is a highly performant cloud data storage service which started in 2011. You can manage it inside the Google Cloud Console and query the storage with standard SQL commands from the bq console or via the API. It is easy to set up, auto scales and there are a variety of established connectors to Google and other services. In this article I will show you the advantages of working with Big Query and how to use the API from R and build queries with dplyr functions.
What are the advantages of Big Query?
Google Big Query has a few advantages in comparison to other cloud data storages. Other technologies will be comparable or even better, but the combination of all the advantages and especially the Google integration is what makes Big Query really outstanding. Advantages are:
- Lightning fast query speeds: BQ had similar benchmarking results to other modern data base technologies. BQ was also compared to other Data Warehouse solution that have similar features such as Amazon Redshift, Snowflake, Microsoft Azure and Presto and all showed more or less similar performances and pricing.
- Low costs: BQ has similar costs to other big data warehouse solutions. Costs as of today are 0.02$/GB per Month storage and 5$/TB data querying. 10 GB storage per month is free as well as 1 TB per month data querying. Many operations as e.g. data loading, copying, exporting, deleting as well as failed queries are free. Furthermore there is query caching, you do not have to pay if you run a query again on the same, unchanged data. There are flat-rate prices available, too.
- Easy integration with Google services: Data from Google Analytics 360 can be easily stored in BQ. This is a big advantage as Google Analytics has a limit on stored rows and only enables reports on sample data. You can get a more detailed customer journey and combine every dimension with every metric if you store your Analytics data in BQ as you can access all of your tracking data. Additionally, datasets on Google Cloud storage and Google Drive can be queried via BQ without manually importing the dataset.
- Easy integration with other tools: BQ has it’s own machine learning suite with Big Query ML engine which enables you to import tensorflow models for prediction. There also is a BQ BI engine, but both seem not particularly useful to me yet as the functionalities are limited. Many services such as Tableau, Qlik or Looker have connectors to BQ.
- No-Ops management: No prior data base management knowledge is needed to set up BQ and manage security and recovery.
- Public datasets: You have a nice selection of publicly available data on BQ, some of the datasets are constantly updated!
Use Big Query with R
Enable Big Query and get your credentials
- Go to the Google Cloud Platform and login with your Google account. At the top left corner go to “Choose Project” and start a new project. If you go on your home dashboard to “Go to APIs overview” you will see the activated APIs of the Google Cloud Service. “BigQuery API” and “BigQuery Storage API” should be activated by default for all new projects.
2. Get your API key as described by the gargle R package here. In short, go to the Credentials section at the Google Cloud Platform in the dashboard shown above and create credentials > API key. You can rename your API key and restrict it to only certain APIs like “BigQuery API”. If you need application access to BQ you will need a service account token which you can download as JSON.
Querying with R
For querying BQ we will use the R library bigrquery. The other prominent R library for BQ is bigQueryR which in contrast to bigrquery depends on library googleAuthR which makes it more compatible with Shiny and other packages.
First, we get the libraries and authenticate either with our created API key or with the downloaded service account token JSON.
Now we can start querying our Big Query data or public datasets. We will query the Real-time Air Quality dataset from openAQ. This is an open source project which provides real time data (if you stretch the definition of “real time) from 5490 world wide air quality measurement stations, which is awesome! You can see the dataset and a short description on Big Query here if you are logged into Google. To find open datasets in the Cloud Console, scroll down on the left menu, there you should see “Big Query” under the header “Big Data”. If you go then to “+Add Data” you will be able to browse public data sets.
We will wrap the bigrquery API with DBI to be able to use it with dplyr verbs, however the bigrquery package provides an low level API, too.
In this example you can see querying with dplyr functions that are converted to SQL queries, however you can do not get the full flexibility that direct SQL querying provides. For this you could send SQL queries via DBI::dbGetQuery() from R. The global air quality data set gets updated regularly, however older entries are omitted, probably to save storage costs. Check out my next post on how to build a Dockerized Cron-job to get the newest air pollution data from India while keeping older records.
This article was also published on https://www.r-bloggers.com/.