Data visualization is important not only for communicating results but also as a powerful technique for exploratory data analysis. Each plot type, such as scatter plots, line graphs, bar charts and histograms, has its own purpose and can be created effectively with the ggplot2 package.
- Understand the different roles of data visualization
- Understand the different plot types available
- Get an overview of the ggplot2 package
Introduction to data visualization
A picture is worth a thousand words.
Data visualization is one of the quickest and most powerful ways to understand new and existing information. During an initial exploration phase, data scientists try to reveal the underlying features of a dataset, such as distributions, correlations or other visible patterns. This process is called exploratory data analysis (EDA) and typically marks the starting point of a data science project.
The graphs produced during EDA show the data scientist the direction of the journey ahead. Revealed patterns can inspire hypotheses about the underlying processes, features of the dataset to be extracted or modelling techniques to be tested. Last but not least, visualizations uncover outliers and data errors which the data scientist needs to take care of.
The biggest role of data visualization is the communication of data science findings to colleagues and customers through presentations, reports or dashboards. Effort invested in EDA and visualization is time well spent, since the resulting graphs can be reused directly to communicate findings.
Quiz: Visualization Phase
For which phases is data visualization important in the data science workflow?
- Exploratory Data Analysis (EDA).
- Detection of outliers.
- Communication of Results.
Available Plot Types
There are many plot types available which help to understand different features and relationships in the dataset.
During the exploratory data analysis phase we typically want to detect the most obvious patterns, either by looking at each variable in isolation or by examining relationships between variables. The choice of plot type also depends on the data type of the input variables, such as numeric or categorical.
Scatter plots are used to visualize the relationship between two numeric variables. The position of each point represents the values of the two variables on the x- and y-axes.
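As a minimal sketch of a scatter plot, assuming the ggplot2 package is installed and using the `mtcars` dataset that ships with R (the choice of dataset and of the `wt` and `mpg` columns is ours, purely for illustration):

```r
library(ggplot2)

# mtcars ships with R; wt (weight) and mpg (miles per gallon) are both numeric.
# The dataset and column choice are illustrative assumptions.
p_scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()  # one point per car, positioned by wt and mpg

print(p_scatter)
```

Each row of the data becomes exactly one point, so the plot shows at a glance whether heavier cars tend to have lower fuel economy.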
Line graphs visualize the trajectory of one numeric variable against another, with consecutive points connected by lines. They are well suited to values that change continuously, such as temperature over time.
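The temperature-over-time case could be sketched as follows, assuming ggplot2 is installed. The example uses the `airquality` dataset bundled with R (daily measurements in New York, May to September 1973); the helper column `date` is something we construct here for illustration.

```r
library(ggplot2)

# airquality ships with R and stores Month and Day as separate columns.
# We build a Date column ourselves (illustrative helper, not part of the data).
air <- airquality
air$date <- as.Date(paste(1973, air$Month, air$Day, sep = "-"))

# Consecutive daily temperatures are connected by a line.
p_line <- ggplot(air, aes(x = date, y = Temp)) +
  geom_line()

print(p_line)
```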
Bar Charts and Histograms
Bar charts visualize numeric values grouped by categories. Each category is represented by one bar whose height is defined by its numeric value. Histograms are a special kind of bar chart that summarizes the number of occurrences of numeric values within a set of value ranges (or bins). They are typically used to determine the distribution of numeric values.
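The difference between the two can be sketched with `mtcars`, assuming ggplot2 is installed. The summary statistic (mean `mpg` per cylinder count) and the bin width of 5 are arbitrary choices made for this illustration.

```r
library(ggplot2)

# Bar chart: one numeric value (mean mpg) per category (number of cylinders).
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
p_bar <- ggplot(mpg_by_cyl, aes(x = factor(cyl), y = mpg)) +
  geom_col()  # bar height is taken directly from the mpg column

# Histogram: count how many cars fall into each mpg range (bin).
p_hist <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 5)  # binwidth is an illustrative choice

print(p_bar)
print(p_hist)
```

In the bar chart the heights are supplied by the data, whereas in the histogram ggplot2 computes the counts per bin itself.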
Other frequently used plot types in data science include:
- Box plots: Show distributional information of numeric values grouped by category as boxes. Great for quickly comparing multiple distributions.
- Violin plots: Like box plots, but show the shape of each distribution as a violin-shaped density outline.
- Heat maps: Show interactions of variables, typically correlations, as a raster image highlighting areas of high interaction.
- Network graphs: Show connections between nodes.
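As a rough sketch of the first two plot types in the list, assuming ggplot2 is installed, the same grouping can be drawn once as box plots and once as violin plots. The dataset and variables are again illustrative choices.

```r
library(ggplot2)

# Shared base: mpg (numeric) grouped by number of cylinders (categorical).
base <- ggplot(mtcars, aes(x = factor(cyl), y = mpg))

p_box    <- base + geom_boxplot()  # median, quartiles and outliers per group
p_violin <- base + geom_violin()   # density outline per group

print(p_box)
print(p_violin)
```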
Quiz: Distribution Comparison Plots
Which plot types are typically used to compare distributions of numeric variables?
- Box plots
- Network graphs
- Violin plots
- Line graphs
Due to the importance of visualization for data science and statistics, R offers a rich set of tools and packages. The core R language already provides many plotting functions and plot types. These plotting functions require users to specify how to plot each element on the canvas step by step. By contrast, the ggplot2 package allows plots to be specified through a set of plotting layers. The package then works out the steps needed to produce the graph.
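The contrast between the two styles might look like this in practice. This is only a sketch, assuming ggplot2 is installed; the fitted regression line is an illustrative second element we chose to add, not something prescribed by either system.

```r
library(ggplot2)

# Base R: draw the canvas first, then add further elements step by step.
plot(mtcars$wt, mtcars$mpg, xlab = "wt", ylab = "mpg")
abline(lm(mpg ~ wt, data = mtcars))  # drawn onto the existing plot

# ggplot2: describe the plot as data + aesthetic mapping + layers.
p_layers <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +                                           # layer 1: points
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)  # layer 2: linear fit

print(p_layers)
```

In the ggplot2 version nothing is drawn until the object is printed, so layers can be added, removed or reordered before rendering.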
With its predefined set of geometric layers, facets and themes, ggplot2 enables users to create attractive graphs in very little time. ggplot2 is also the most widely adopted plotting library in the R community.
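To illustrate how geometric layers, facets and themes combine, here is a small sketch assuming ggplot2 is installed. Splitting by `cyl` and using `theme_minimal()` are arbitrary choices for the example.

```r
library(ggplot2)

p_facets <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +        # geometric layer
  facet_wrap(~ cyl) +   # one panel per number of cylinders
  theme_minimal()       # predefined theme controlling the overall look

print(p_facets)
```

Swapping `theme_minimal()` for another built-in theme such as `theme_bw()` changes the appearance without touching the data or layers.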
Quiz: ggplot2 Facts
Which statements about data visualization and ggplot2 are correct?
- ggplot2 is the only way to create plots in R.
- ggplot2 facilitates the creation of good looking graphs quickly.
- ggplot2 requires users to specify the plotting commands in a step-by-step fashion.
- ggplot2 enables users to specify plots in a declarative way.