Access your data in Google BigQuery with Python and R


Some time ago we discussed how you can access data stored in Amazon Redshift and PostgreSQL with Python and R. Let’s say you have found an easy way to store a pile of data in your BigQuery data warehouse and keep it in sync. Now you want to start messing with it using statistical techniques, maybe build a model of your customers’ behavior, or try to predict your churn rate.

To do that, you will need to extract your data from BigQuery and use a framework or language that is well suited for data analysis; the most popular are Python and R. In this short tutorial we will see how to extract data stored in Google BigQuery, load it with Python or R, and then use the numerous analytic libraries and algorithms that exist for these two languages.

READ: How to access your data in Amazon Redshift and PostgreSQL with Python and R

Python

Connect to BigQuery with Python

In order to pull data out of BigQuery, or any other database, we first need to connect to our instance. To do so, we need a cloud client library for the Google BigQuery API. Although there are quite a few options, we are going to work with the google-cloud-bigquery library, which is Google-supported.

However, feel free to explore other third-party options like the BigQuery-Python library by tylertreat, which is also widely used, well documented and actively maintained.

In order to install the Google Cloud client library, we assume that you have already set up a Python development environment. If this is not the case, you can refer to the Python Development Environment Setup Guide.

The next step is to install the library, which is easily achieved by executing the following command:

pip install --upgrade google-cloud-bigquery

In order to connect to the database, you need to download the .json file that contains the necessary credentials for your service account. If you don’t already have a service account, you will have to create one and then download the previously mentioned JSON key file to your local machine.
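If you have not generated a key yet, one way to do it is with the gcloud command-line tool. The following is a sketch: the output path matches the placeholder used below, and the service account email is a placeholder you should replace with your own:

gcloud iam service-accounts keys create path/to/file.json \
    --iam-account=my-service-account@my-bq.iam.gserviceaccount.com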

Now that everything is set, we can move on and write some Python code to initialize the connection.

from google.cloud import bigquery
from google.oauth2 import service_account

# Load the service account credentials from the JSON key file
credentials = service_account.Credentials.from_service_account_file(
    'path/to/file.json')

# The project ID from the Google Cloud Console
project_id = 'my-bq'

client = bigquery.Client(credentials=credentials, project=project_id)

As you can see, the parameters you need to specify are the project_id and the location of the JSON key file. In BigQuery, a project is the top-level container and provides default access control across all of its datasets.
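As a quick sanity check that the connection works, you can list the datasets visible to the client. This is a minimal sketch using the client object created above:

# Print the ID of every dataset the authorized account can see in the project
for dataset in client.list_datasets():
    print(dataset.dataset_id)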

Executing Queries with Python

With the BigQuery client, we can execute raw queries on a dataset using the query method, which inserts a query job into the BigQuery queue.

These queries are executed asynchronously: the query method returns immediately, while the job runs in the background, and the program only blocks when we explicitly wait for the results.

According to the documentation, the query method returns a QueryJob instance which, among other things, gives us access to the produced results.

For more details, you can always refer to the official documentation here.

The required Python code follows:

query_job = client.query("""
  SELECT *
  FROM dataset.my_table
  LIMIT 1000""")

results = query_job.result()  # Waits for the job to complete.

Note that the above query uses standard SQL syntax, as the client library defaults to it. If you wish, you can change the SQL dialect to legacy by passing a job configuration, as follows:

job_config = bigquery.QueryJobConfig()
job_config.use_legacy_sql = True

query_job = client.query("""
  SELECT *
  FROM dataset.my_table
  LIMIT 1000""", job_config=job_config)

results = query_job.result()  # Waits for the job to complete.
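Once the job is done, the returned rows can be iterated directly, or loaded into a pandas DataFrame for further analysis. A minimal sketch follows; the column name is a placeholder, and the to_dataframe call assumes pandas is installed:

# Each row behaves like a dictionary keyed by column name
for row in results:
    print(row["some_column"])  # "some_column" is a placeholder column name

# Alternatively, load the whole result set into a pandas DataFrame
df = query_job.to_dataframe()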

 

R

If Python is not your cup of tea and you prefer R instead, you are still covered. Getting your data out of Google BigQuery is just as easy as in Python, if not easier.

Connect to BigQuery with R

Again we are going to use an open source library, called bigrquery, which is created and maintained by Hadley Wickham, Chief Scientist at RStudio. To install it, we simply have to run the following command from within the R console:

install.packages("bigrquery")

Executing Queries with R

According to the library documentation, the first time we use it we will need to authorize the application to access Google Cloud services. As documented in the Authentication section of the bigrquery GitHub page, we follow the prompts within R to open the authorization URL and then copy the authorization code back into R. You will only need to authorize the library once; requests performed after the first authorized one will refresh the access credentials.
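If you would rather authenticate non-interactively, for example on a server, older releases of bigrquery also expose a set_service_token function that accepts the same service account JSON key we used in the Python section. Treat the following as a sketch, since the available authentication helpers depend on your bigrquery version:

# Sketch: non-interactive authentication with a service account key
# (set_service_token is available in older bigrquery releases)
library(bigrquery)
set_service_token("path/to/file.json")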

So the overall procedure that we are going to follow can be summarized as:

  • Import the bigrquery library.
  • Specify the project ID from the Google Cloud Console, as we did with Python.
  • Form your query string.
  • Call query_exec with your project ID and query string.
# Import the library
library(bigrquery)

# Your project ID from the Google Cloud Console
project_id <- "your-project-id"

# Example query
sql_string <- "SELECT * FROM dataset.my_table LIMIT 1000"

# Execute the query and store the result
query_results <- query_exec(sql_string, project = project_id, useLegacySql = FALSE)

Again, if you wish, you can execute queries using legacy SQL by changing the value of useLegacySql to TRUE in the query_exec function.
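For example, re-using the query string from above:

# Same example query, this time executed with the legacy SQL dialect
query_results <- query_exec(sql_string, project = project_id, useLegacySql = TRUE)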

Conclusion

As you can see, getting your data from BigQuery for further analysis in Python and R is really easy. The true power of a database that stores your data, in comparison with CSV files and the like, is that you have SQL as an additional tool. Invest some time in learning how to work with SQL and you will not regret it; having structured data in a database and using SQL to pre-process your data before you start building your statistical models will save you time and resources.

Although in this article we focused mainly on BigQuery, using any other database is equally easy. The main difference will be selecting the appropriate library for Python and for R.

Would love to hear your thoughts in the comments below.
