books<-read_csv("2017_books.csv", col_names = TRUE)
## Warning: Duplicated column names deduplicated: 'Author' => 'Author_1' 
One analysis I conducted with this dataset was to look at the correlation between book length (number of pages) and read time (number of days it took to read the book). We can also generate a scatterplot to visualize this relationship.
## Pearson's product-moment correlation
## data: books$Pages and books$Read_Time
## t = 3.1396, df = 51, p-value = 0.002812
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1482981 0.6067498
## sample estimates:
scatter <- ggplot(books, aes(Pages, Read_Time)) +
geom_point(size = 3) +
labs(title = "Relationship Between Reading Time and Page Length") +
ylab("Read Time (in days)") +
xlab("Number of Pages") +
There’s a significant positive correlation here, meaning the longer books take more days to read. It’s a moderate correlation, and there are certainly other variables that may explain why a book took longer to read. For instance, nonfiction books may take longer. Books read in October or November (while I was gearing up for and participating in NaNoWriMo, respectively) may also take longer, since I had less spare time to read. I can conduct regressions and other analyses to examine which variables impact read time, but one of the most important parts of sharing results is creating good data visualizations. How can I show the impact these other variables have on read time in an understandable and visually appealing way?
gghighlight will let me draw attention to different parts of the plot. For example, I can ask gghighlight to draw attention to books that took longer than a certain amount of time to read, and I can even ask it to label those books.