
In data science, before doing almost anything else, you need to know your data.

This is why visualization is one of the pillars of data science: visualization allows you to see your data and “know” it in a way that your mind is wired for.

(And, it’s why I emphasize mastering data visualization before almost anything else.)

In practice, “knowing your data” typically begins by using data visualizations and summary statistics to examine individual variables. You need to ask and answer questions: What’s in the variable? How is it distributed? What is the mean?

For answering some of these questions about individual variables, there are few visualization techniques that are simpler, or more useful, than the histogram.

Let’s take a look at a simple histogram in ggplot2 that has a few extra details and annotations to provide some additional information.

## Code

#-------------
# LOAD GGPLOT2
#-------------
library(ggplot2)

#--------------------------------------
# CREATE VARIABLE, NORMALLY DISTRIBUTED
#--------------------------------------

# set "seed" for random numbers
set.seed(42)

# create variable
xvar_rand_norm <- rnorm(1000, mean = 5)

#--------------------------------
# CREATE DATA FRAME FROM VARIABLE
#--------------------------------
df.xvar <- data.frame(xvar_rand_norm)

#---------------------------------
# CALCULATE MEAN
#  we'll use this in an annotation
#---------------------------------

xvar_mean <- mean(xvar_rand_norm)

#-----------------------------------------------
# PLOT
#  Here, we're going to plot the histogram
#  We'll also add a line at the calculated mean
#  and also add an annotation to specify the
#  value of the calculated mean
#-----------------------------------------------

ggplot(data = df.xvar, aes(x = xvar_rand_norm)) +
  geom_histogram() +
  geom_vline(xintercept = xvar_mean, color = "dark red") +
  annotate("text",
           label = paste("Mean: ", round(xvar_mean, digits = 2), sep = ""),
           x = xvar_mean, y = 30, color = "white", size = 5)



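### The output plot

[Figure: histogram of xvar_rand_norm with a dark red vertical line at the mean and a white text label showing the mean]

## How this code works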

This is a pretty straightforward histogram with the addition of a vertical line to indicate where the mean is (geom_vline()) and a text annotation to indicate the value of the mean (annotate()).

In case you’re not familiar with how ggplot2 works, I’ll run through it:

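First, we load the ggplot2 package with library(ggplot2), and we call set.seed(42) so that the random numbers (and therefore the plot) are reproducible.

Next, we’re using the rnorm() function to generate 1000 normally distributed random numbers. If you look a little closer, we’re using the mean = parameter to specify that we want these numbers to be drawn from a normal distribution with a mean of 5.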
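To see what the mean = argument does, here is a small sketch that compares the requested mean with the sample mean that actually comes out. The object name check_norm is only an illustrative name and is not part of the code above; sd is left at its default of 1.

```r
# Illustrative sketch: rnorm() draws from a normal distribution
# with the requested mean (sd defaults to 1).
set.seed(42)
check_norm <- rnorm(1000, mean = 5)   # hypothetical name, for illustration only

length(check_norm)   # 1000 values were generated
mean(check_norm)     # close to 5, but not exactly 5
sd(check_norm)       # close to the default sd of 1
```

The sample mean will rarely equal 5 exactly, which is why the post calculates it from the data rather than hard-coding it in the annotation.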

After creating the variable itself, we’re using data.frame() to create a data frame. This newly created data frame has the name df.xvar. Keep in mind that the prefix “df.” has no special meaning in R. In contrast to languages like Python or Java, where “.” accesses an attribute or method, R allows “.” as an ordinary character in object names (the main exceptions are S3 method names such as print.data.frame, and a leading “.”, which hides an object from ls()). Adding a prefix like “df.” is simply a personal naming convention I use to keep data organized, which can be useful in large projects.
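A quick way to confirm what data.frame() produced is to inspect the result. This sketch rebuilds the same objects so that it runs on its own; the inspection calls are my addition, not part of the original code.

```r
# Rebuild the variable and the data frame from the post
set.seed(42)
xvar_rand_norm <- rnorm(1000, mean = 5)
df.xvar <- data.frame(xvar_rand_norm)

names(df.xvar)     # "xvar_rand_norm": the column takes the vector's name
dim(df.xvar)       # 1000 rows, 1 column
head(df.xvar, 3)   # first few rows
```

The column name matters because it is the name you refer to later inside aes().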

Next, we calculate the mean. This is extremely straightforward. The mean() function calculates the mean.

Finally, we use ggplot2 to plot all of this information.

We initially call the ggplot() function to initiate plotting. ggplot() essentially says “we’re going to plot something.” The data = parameter indicates the exact data frame that we’ll plot; it says “we’re going to be plotting some variables that are inside of the data frame df.xvar.”

The aes() function allows us to specify a variable mapping. The idea of “mapping variables to aesthetic attributes” is a critical part of the ggplot2 conceptual framework, but it’s somewhat beyond the scope of this post. To put this simply, the code “x = xvar_rand_norm” inside of the aes() function lets ggplot know that you want to plot the variable xvar_rand_norm, and that visually, it is to be plotted on the x-axis.

After specifying which variable we’re going to include in the plot, on the next line, geom_histogram() specifies the type of plot we want to draw. Again, the concept of a “geom” (AKA “geometric object”) is slightly outside of the scope of this post, but to put it simply, a “geom” is a geometric object that we want to draw. So geom_histogram() indicates that we want to draw a histogram.
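One detail worth knowing: when geom_histogram() is called with no arguments, it falls back to a default of 30 bins and prints a message suggesting that you pick a better value. Here is a minimal sketch of setting the bin width explicitly. The value 0.25 is an arbitrary choice for illustration, and p_hist is a hypothetical object name.

```r
library(ggplot2)

# Rebuild the data so the sketch runs on its own
set.seed(42)
df.xvar <- data.frame(xvar_rand_norm = rnorm(1000, mean = 5))

# binwidth = 0.25 is an arbitrary, illustrative choice
p_hist <- ggplot(data = df.xvar, aes(x = xvar_rand_norm)) +
  geom_histogram(binwidth = 0.25)

p_hist   # printing the object draws the plot
```

Try a few different widths; the shape you see can change quite a bit depending on how coarse or fine the bins are.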

geom_vline() tells ggplot to add a vertical line. We’re indicating exactly where to draw that line with the parameter xintercept =. So here, we’re specifying that we want to draw a vertical line that intercepts the x-axis at xvar_mean, which we already calculated in a previous part of the code. color = "dark red" does exactly what it looks like: it sets the color of the line to dark red.

Lastly, we’re using annotate() to add a text annotation. Essentially, label = indicates the exact text to use in the annotation. You can see that we start with the string "Mean: " and then use the paste() function to join it to the value of our calculated mean (xvar_mean), which round() has rounded to two decimal places. So, the annotation ends up as “Mean: 4.97” (assuming that you’ve run this code with the seed 42). The x = and y = parameters set where the text is placed, and color = and size = control how it looks.
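Because paste() is doing the string-building here, it helps to try it on its own. The sketch below uses a hard-coded stand-in value (example_mean, a hypothetical name) rather than the real xvar_mean, and the label_* names are also just for illustration.

```r
example_mean <- 4.9742   # hypothetical stand-in for xvar_mean

# paste() joins its arguments with a single space by default
label_default <- paste("Mean:", round(example_mean, digits = 2))
label_default   # "Mean: 4.97"

# a trailing space inside the string is added to that separator,
# so the result has two spaces
label_double <- paste("Mean: ", round(example_mean, digits = 2))
label_double    # "Mean:  4.97"

# sep = "" turns the separator off, so the string's own space is the only one
label_nosep <- paste("Mean: ", round(example_mean, digits = 2), sep = "")
label_nosep     # "Mean: 4.97"
```

The difference is a single space, but it is the kind of thing that shows up in a finished chart.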

## You need to master the histogram

We’re using a few useful techniques in this visualization, but the critical piece that you really need to know is the first two lines of ggplot() code.

ggplot(data = df.xvar, aes(x = xvar_rand_norm)) +
  geom_histogram()


I’ve been beating this drum for well over a year now, but this bears repeating:

The histogram is one of the plots that you need to master. You’ll use it constantly in analysis and reporting. When you move on to more advanced topics like machine learning, you’ll need to use the histogram to examine how your variables are distributed (although you can also use its fraternal twin, the density plot).
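For comparison, here is a minimal sketch of that density plot. It uses the same data and the same aes() mapping as the histogram, and only the geom changes. The object name p_density is hypothetical, and the data is rebuilt so the sketch runs on its own.

```r
library(ggplot2)

# Rebuild the data so the sketch runs on its own
set.seed(42)
df.xvar <- data.frame(xvar_rand_norm = rnorm(1000, mean = 5))

# Same data and mapping as the histogram; swap in geom_density()
p_density <- ggplot(data = df.xvar, aes(x = xvar_rand_norm)) +
  geom_density()

p_density   # printing the object draws the plot
```

Swapping one geom for another while leaving the first line alone is a good small exercise for building fluency with ggplot2.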

Here’s what I mean by “master”: you should be able to write the code for a histogram “in your sleep”.

You should be able to write the code to create a histogram with your eyes closed.

## Fluency with the basics: a critical milestone

And not just the histogram.

You need the same level of fluency with all of the other primary tools of visualization, like the scatterplot, the bar chart, and the line chart. You also need that level of fluency with the basic data wrangling techniques of dplyr and tidyr.

## You need to practice

Achieving that level of fluency isn’t hard, but most people never practice, so they never get there.

I want to make this clear: in order to master R, master data science, master data visualization, and master machine learning, you need to practice.

In this regard, learning data science is much like learning a musical instrument: you need to practice. Ideally, you need to practice every day.

Sound like work? It is. But the rewards are profound.

## Discover how to rapidly master data science

The big barrier is that most people don’t know how to practice writing code.

If you want to discover how to practice data science and rapidly master the techniques, then sign up for the Sharp Sight Labs email list.

In the near future, Sharp Sight Labs will be publishing a lot more material about the strategies and systems for practicing data science and achieving rapid results.

The post A simple histogram (and why you need to practice it) appeared first on SHARP SIGHT LABS.