ggplot “Doodling” with HIBP Breaches

July 29, 2018
By

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

After reading this interesting analysis of “How Often Are Americans’ Accounts Breached?” by Gaurav Sood (which we need more of in cyber-land) I gave in to the impulse to do some gg-doodling with the “Have I Been Pwnd” JSON data he used.

It’s just some basic data manipulation with some heavy ggplot2 styling customization, so no real need for exposition beyond noting that there are many other ways to view the data. I just settled on centered segments early on and went from there. If you do a bit of gg-doodling yourself, drop a note in the comments with a link.

You can see a full-size version of the image via this link.

library(hrbrthemes) # use github or gitlab version
library(tidyverse)

# get the data

dat_url <- "https://raw.githubusercontent.com/themains/pwned/master/data/breaches.json"

jsonlite::fromJSON(dat_url) %>% 
  mutate(BreachDate = as.Date(BreachDate)) %>% 
  tbl_df() -> breaches

# selected breach labels df
group_by(breaches, year = lubridate::year(BreachDate)) %>% 
  top_n(1, wt=PwnCount) %>% 
  ungroup() %>% 
  filter(year %in% c(2008, 2015, 2016, 2017)) %>% # pick years where labels will fit nicely
  mutate(
    lab = sprintf("%s\n%sM accounts", Name, as.integer(PwnCount/1000000))
  ) %>% 
  arrange(year) -> labs

# num of known breaches in that year for labels
count(breaches, year = lubridate::year(BreachDate)) %>% 
  mutate(nlab = sprintf("n=%s", n)) %>% 
  mutate(lab_x = as.Date(sprintf("%s-07-02", year))) -> year_cts

mutate(breaches, p_half = PwnCount/2) %>% # for centered segments
  ggplot() +
  geom_segment( # centered segments
    aes(BreachDate, p_half, xend=BreachDate, yend=-p_half), 
    color = ft_cols$yellow, size = 0.3
  ) +
  geom_text( # selected breach labels
    data = labs, aes(BreachDate, PwnCount/2, label = lab),
    lineheight = 0.875, size = 3.25, family = font_rc,
    hjust = c(0, 1, 1, 0), vjust = 1, nudge_x = c(25, -25, -25, 25),
    nudge_y = 0,  color = ft_cols$slate
  ) +
  geom_text( # top year labels
    data = year_cts, aes(lab_x, Inf, label = year), family = font_rc, 
    size = 4, vjust = 1, lineheight = 0.875, color = ft_cols$gray
  ) +
  geom_text( # bottom known breach count totals
    data = year_cts, aes(lab_x, -Inf, label = nlab, size = n), 
    vjust = 0, lineheight = 0.875, color = ft_cols$peach,
    family = font_rc, show.legend = FALSE
  ) +
  scale_x_date( # break on year
    name = NULL, date_breaks = "1 year", date_labels = "%Y"
  ) +
  scale_y_comma(name = NULL, limits = c(-450000000, 450000000)) + # make room for labels
  scale_size_continuous(range = c(3, 4.5)) + # tolerable font sizes 
  labs(
    title = "HIBP (Known) Breach Frequency & Size",
    subtitle = "Segment length is number of accounts; n=# known account breaches that year",
    caption = "Source: HIBP via "
  ) +
  theme_ft_rc(grid="X") +
  theme(axis.text.y = element_blank()) +
  theme(axis.text.x = element_blank())

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)