EDA Before CDA

John Myles White

9 years ago

[This article was first published on John Myles White » Statistics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

One Paragraph Summary

Always explore your data visually. Whatever specific hypothesis you have when you go out to collect data is likely to be worse than any of the hypotheses you’ll form after looking at just a few simple visualizations of that data. The most effective hypothesis testing framework in existence is the test of intraocular trauma.

Context

This morning, I woke up to find that Neil Kodner had discovered a very convenient CSV file that contains geospatial data about every valid US zip code. I’ve been interested in the relationship between places and zip codes recently, because I spent my summer living in the 98122 zip code after having spent my entire life living in places with zip codes below 20000. Because of the huge gulf between my Seattle zip code and my zip codes on the East Coast, I’ve on-and-off wondered if the zip codes were originally assigned in terms of the seniority of states. Specifically, the original thirteen colonies seem to have some of the lowest zip codes, while the newer states had some of the highest zip codes.

While I could presumably find this information through a few web searches or could gather the right data set to test my idea formally, I decided to blindly plot the zip code data instead. I think the results help to show why a few well-chosen visualizations can be so much more valuable than regression coefficients. Below I’ve posted the code I used to explore the zip code data in the exact order of the plots I produced. I’ll let the resulting pictures tell the rest of the story.

zipcodes <- read.csv("zipcodes.csv")
 
ggplot(zipcodes, aes(x = zip, y = latitude)) +
  geom_point()
ggsave("latitude_vs_zip.png", height = 7, width = 10)
ggplot(zipcodes, aes(x = zip, y = longitude)) +
  geom_point()
ggsave("longitude_vs_zip.png", height = 7, width = 10)
ggplot(zipcodes, aes(x = latitude, y = longitude, color = zip)) +
  geom_point()
ggsave("latitude_vs_longitude_color.png", height = 7, width = 10)
ggplot(zipcodes, aes(x = longitude, y = latitude, color = zip)) +
  geom_point()
ggsave("longitude_vs_latitude_color.png", height = 7, width = 10)
ggplot(subset(zipcodes, longitude < 0), aes(x = longitude, y = latitude, color = zip)) +
  geom_point()
ggsave("usa_color.png", height = 7, width = 10)

Picture

(Latitude, Zipcode) Scatterplot

(Longitude, Zipcode) Scatterplot

(Latitude, Longitude) Heatmap

(Longitude, Latitude) Heatmap

(Longitude, Latitude) Heatmap without Non-States

To leave a comment for the author, please follow the link and comment on their blog: John Myles White » Statistics.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.