So you found a way to store a pile of data in Amazon Redshift and keep them in sync. Now you want to start messing with it using statistical techniques, maybe build a model of your customers’ behavior, or try to predict your churn rate. To do that, you will need to extract your data from Redshift and use a framework or language that is best suited for data analysis and the most popular so far are Python and R. In this small tutorial we will see how we can extract data that is stored in Amazon Redshift to load it with Python or R, and then use the numerous analytic libraries and algorithms that exist for these two languages.
Redshift is compatible with PostgreSQL, in general and for any language you can safely assume that you can access Redshift using the official PostgreSQL libraries for the language of your choice, which means that more or less what you will see in this guide also applies for the case where you want to use data coming from a PostgreSQL database instance. For JDBC and ODBC drivers you can check the documentation here.
When it comes to Python, the most popular libraries for data analytics are the following:
- NumPy which is probably the most fundamental library in Python for scientific computing.
- pandas which is a Python Data Analysis library which provides high performance data structures for operating with table-like structures.
No matter what kind of analysis you wish to do, from linear models to using frameworks like Tensorflow, you will need to use one of the above libraries to represent your initial data. For this reason in this post we will see how to connect to Amazon Redshift and load our data in structures of NumPy and pandas.
Connect to Redshift with Python
In order to pull data out of Redshift, or any other database, we first need to connect to our instance. To do that we need to use a library or driver for Python to connect to Amazon Redshift. You can use any of the available libraries for Python, the one that PostgreSQL recommends is Psycopg so we will use this one for this tutorial but feel free to experiment with any other you might want.
import psycopg2 con=psycopg2.connect(dbname= 'dbname', host='host', port= 'port', user= 'user', password= 'pwd')
The above code is all we need to connect to our Redshift (or PostgreSQL) instance with Python using the Psycopg library, with the connection that we get back a connection variable with which we can start executing queries to pull data out of our database. Of course it is highly recommended that if you intent to use this code as part of your pipeline, to wrap in in a function that will also handle any errors. The parameters that you need are the typical ones for databases:
- Database name
- Host name
- User name
Execute queries with Python using Psycopg
When you get your DB connection you need to start pulling data, to do that you will need to execute some kind of SQL query to narrow down the amount of data you will get based on the type of analysis you want to perform. To do that with Psycopg you need to perform the following steps:
- First get a cursor from your DB connection:
cur = con.cursor()
- Execute a select query to pull data where
tableis the table you want to get data from :
cur.execute("SELECT * FROM `table`;")
- After the successful execution of your query you need to instruct Psycopg how to fetch your data from the database, there are plenty of option there depending on how you plan to process your data. In most cases in analytics you will need the complete dataset for analysis so the most appropriate command would be the following:
- Of course after you are done do not forget to close your cursor & connection
The most important part here is the SQL query that you execute, SQL is a very powerful an expressive language and you can do a large part of your data pre-processing using it. For example you might join different tables, or even create new features by using the aggregation functions that are supported by Redshift and PostggreSQL.
Load data to NumPy
To turn your data into a NumPy array is pretty straight forward, all you need to do is to initialize a new NumPy array passing as a parameter our query results.
import numpy as np data = np.array(cur.fetchall())
cur is the cursor we have created previously. That’s all, your data from Redshift as a NumPy array 🙂
Load data to pandas
If instead of NumPy you plan to work with pandas, you can avoid using the previous steps altogether. You can use the read_sql method with which you can read an SQL query or database table directly into a DataFrame. In order to do that you will also need to use SQLAlchemy.
from sqlalchemy import create_engine import pandas as padas engine = create_engine('postgresql://scott:[email protected]_host:5439/mydatabase') data_frame = padas.read_sql_query('SELECT * FROM `table`;', engine)
With the above code you will end up having a pandas data frame that contains the results of the SQL query that you have provided to it.
If Python is not your cup of tea and you prefer R instead, you are still covered. Getting your data from Amazon Redshift or PostgreSQL is equally easy as in Python. As in Python we again need to first take care of how we will connect to our database and execute queries to it. To do that we will need the “RPostgreSQL” package.
With the above code we make the package available and visible by R. Now we need to proceed in creating connections and executing queries. At the end the command to close the connection, don’t do that before you execute queries though 🙂 but always remember to close the connection when you are done with pulling data out of the database.
Getting your data from the database into an R data frame is just one command away:
As you can see, getting your data from Redshift or PostgreSQL for further analysis in Python and R is really easy. The true power of a database that stores your data in comparison with CSV files etc. is that you have SQL as an additional tool. Invest some time learning how to work with SQL and you will not regret it, having structured data in a database and using SQL to pre-process your data before you start building your statistical models will save you time and resources.
Although in this article we focused mainly on Amazon Redshift and PostgreSQL, using any other database is equally easy. The main difference will be the selection of the appropriate library for Python in the case of NumPy and for R. The usage of SQLAlchemy by Pandas makes it easier for you as the only change required is the configuration string for the database engine.
Would love to hear your thoughts in the comments or Facebook bellow.