
Über Tuesday has come and almost gone (some state results will take a while to coalesce) and I’m relieved to say that {catchpole} did indeed work, with the example code from before producing this on first run:

If we tweak the buffer space around the squares, I think the cartogram looks better:

That said, you should likely use a different palette (see this Twitter thread for examples).

I noted in the previous post that borders might be possible. While I haven’t solved that use-case for individual states, I did manage to come up with a method for making a light version of the cartogram usable:

library(sf)
library(hrbrthemes)
library(catchpole)
library(tidyverse)

candidates_expanded <- expand_candidates()

gsf <- left_join(delegates_map(), candidates_expanded, by = c("state", "idx"))

m <- delegates_map()

# split off each "area" on the map so we can make a border+background
list(
  setdiff(state.abb, c("HI", "AK")),
  "AK", "HI", "DC", "VI", "PR", "MP", "GU", "DA", "AS"
) %>%
  map(~{
    suppressWarnings(suppressMessages(st_buffer(
      x = st_union(m[m$state %in% .x, ]),
      dist = 0.0001,
      endCapStyle = "SQUARE"
    )))
  }) -> m_borders

gg <- ggplot()
for (mb in m_borders) {
  gg <- gg + geom_sf(data = mb, col = "#2b2b2b", size = 0.125)
}

gg +
  geom_sf(
    data = gsf,
    aes(fill = candidate),
    col = "white", shape = 22, size = 3, stroke = 0.125
  ) +
  scale_fill_manual(
    name = NULL,
    na.value = "#f0f0f0",
    values = c(
      "Biden" = '#f0027f',
      "Sanders" = '#7fc97f',
      "Warren" = '#beaed4',
      "Buttigieg" = '#fdc086',
      "Klobuchar" = '#ffff99',
      "Gabbard" = '#386cb0',
      "Bloomberg" = '#bf5b17'
    ),
    limits = intersect(unique(delegates$candidate), names(delegates_pal))
  ) +
  guides(
    fill = guide_legend(
      override.aes = list(size = 4)
    )
  ) +
  coord_sf(datum = NA) +
  theme_ipsum_es(grid="") +
  theme(legend.position = "bottom")


### {ssdeepr}

Researcher pals over at Binary Edge added web page hashing (pre- and post-javascript scraping) to their platform using ssdeep. This approach falls into the category of context-triggered piecewise hashes (CTPH) (or locality-sensitive hashing), similar to my R adaptation/packaging of Trend Micro’s tlsh.

Since I’ll be working with BE’s data off-and-on and the ssdeep project has a well-crafted library (plus we might add ssdeep support at $DAYJOB), I went ahead and packaged that up as well.

I recommend using the hash_con() function if you need to read large blobs, since it doesn’t require you to read everything into memory first. (hash_file() doesn’t either, but that’s a direct low-level call to the underlying ssdeep library’s file reader and is not as flexible as R connections are.)
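To make the memory point concrete, here is a minimal base-R sketch of the general pattern that connection-based hashing can rely on: read fixed-size chunks and fold each one into a running state, so the whole blob never sits in memory at once. The `chunked_digest()` helper and its toy additive checksum are hypothetical stand-ins for illustration only; this is not ssdeep's algorithm, and it is not a claim about how {ssdeepr} is implemented internally.

```r
# Hypothetical helper: fold a connection's bytes into a toy checksum,
# reading at most `chunk_size` bytes at a time.
chunked_digest <- function(con, chunk_size = 4096L) {
  # Some connections (e.g. rawConnection()) are created already open
  if (!isOpen(con)) open(con, "rb")
  on.exit(close(con), add = TRUE)

  state <- 0  # running checksum (a stand-in for a real hash state)
  total <- 0  # total number of bytes seen so far

  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break  # end of stream
    state <- (state + sum(as.integer(chunk))) %% 65521
    total <- total + length(chunk)
  }

  list(bytes = total, checksum = state)
}

# An in-memory connection stands in for url() or file() here
payload <- as.raw(rep(1:10, times = 1000))  # 10,000 bytes
res <- chunked_digest(rawConnection(payload), chunk_size = 512L)
res$bytes
res$checksum
```

Only one chunk is held at a time, and the chunk size does not change the result, which is the property that makes a connection-based reader attractive for large inputs.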

These types of hashes are great at seeing if something has changed on a website (or at seeing how similar two things are to each other). For instance, how closely do CRAN mirrors match the mothership?

library(ssdeepr) # see the links above for installation

cran1 <- hash_con(url("https://cran.r-project.org/web/packages/available_packages_by_date.html"))
cran2 <- hash_con(url("https://cran.biotools.fr/web/packages/available_packages_by_date.html"))
cran3 <- hash_con(url("https://cran.rstudio.org/web/packages/available_packages_by_date.html"))

hash_compare(cran1, cran2)
## [1] 0

hash_compare(cran1, cran3)
## [1] 94


I picked on cran.biotools.fr as I saw they were well behind CRAN-proper on the monitoring page.
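ssdeep comparison scores run from 0 (no match found) to 100 (matching signatures), so one way to act on them is to pick a cutoff and label each mirror. The sketch below uses only base R and the two scores reported above (they are copied in, not recomputed); the `mirror_status()` helper and the cutoff of 90 are assumptions made for illustration, not anything provided by {ssdeepr}.

```r
# Scores copied from the comparisons above (not recomputed here)
scores <- c("cran.biotools.fr" = 0, "cran.rstudio.org" = 94)

# Hypothetical helper: label each mirror relative to an assumed cutoff
mirror_status <- function(scores, cutoff = 90) {
  stopifnot(is.numeric(scores), all(scores >= 0 & scores <= 100))
  ifelse(scores >= cutoff, "close match", "diverged")
}

mirror_status(scores)
```

A sensible cutoff depends on how much a page normally changes between syncs, so treat the value here as a placeholder to tune against your own data.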

I noted that BE was doing pre- and post-javascript hashing as well. Why, you may ask? Well, websites behave differently with javascript running, plus they can behave differently when different user-agents are set. Let’s grab a page from Wikipedia a few different ways to show how they are not alike at all, depending on the retrieval context. First, let’s grab some web content!

library(httr)
library(ssdeepr)
library(splashr)

# regular grab
h1 <- hash_con(url("https://en.wikipedia.org/wiki/Donald_Knuth"))

# you need Splash running for javascript-enabled scraping this way
sp <- splash(host = "mysplashhost", user = "splashuser", pass = "splashpass")

# js-enabled with one ua
sp %>%
  splash_user_agent(ua_macos_chrome) %>%
  splash_go("https://en.wikipedia.org/wiki/Donald_Knuth") %>%
  splash_wait(2) %>%
  splash_html(raw_html = TRUE) -> js1

# js-enabled with another ua
sp %>%
  splash_user_agent(ua_ios_safari) %>%
  splash_go("https://en.wikipedia.org/wiki/Donald_Knuth") %>%
  splash_wait(2) %>%
  splash_html(raw_html = TRUE) -> js2

h2 <- hash_raw(js1)
h3 <- hash_raw(js2)

# same way {rvest} does it
res <- httr::GET("https://en.wikipedia.org/wiki/Donald_Knuth")

h4 <- hash_raw(content(res, as = "raw"))


Now, let’s compare them:

hash_compare(h1, h4) # {ssdeepr} built-in vs httr::GET() => not surprising that they're equal
## [1] 100

# things look way different with js-enabled

hash_compare(h1, h2)
## [1] 0
hash_compare(h1, h3)
## [1] 0

# and with variations between user-agents

hash_compare(h2, h3)
## [1] 0

hash_compare(h2, h4)
## [1] 0

# only doing this for completeness

hash_compare(h3, h4)
## [1] 0


For this example, content size alone would (mostly) have been enough to tell the difference. Note, though, how the h1 and h4 hashes are equal despite a larger count coming back with the {httr} method:

length(js1)
## [1] 432914

length(js2)
## [1] 270538

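# NOTE: the inner argument was missing from the source; reading the same
# page via url() is an assumed reconstruction
nchar(
  paste0(
    readLines(url("https://en.wikipedia.org/wiki/Donald_Knuth"), warn = FALSE),
    collapse = "\n"
  )
)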
## [1] 373078

length(content(res, as = "raw"))
## [1] 374099
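One likely reason the character count and the raw length above differ for the same page is that nchar() counts characters while length() on a raw vector counts bytes, and non-ASCII characters take more than one byte in UTF-8. A small base-R illustration, using a made-up string rather than the actual page content:

```r
# A short string with one non-ASCII character (an en dash, U+2013)
x <- enc2utf8("Knuth \u2013 TeX")

n_chars <- nchar(x, type = "chars")  # counts characters
n_bytes <- length(charToRaw(x))      # counts bytes, like length() on a raw vector

c(chars = n_chars, bytes = n_bytes)
```

The en dash is a single character but three bytes in UTF-8, so the byte count comes out two higher than the character count. This is only a plausible explanation for the gap seen above, not a verified one.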


### FIN

If you were in a U.S. state with a primary yesterday and were eligible to vote (and had something to vote for, either a (D) candidate or a state/local bit of business) I sure hope you did!

The ssdeep library works on Windows, so I’ll be figuring out how to get that going in {ssdeepr} fairly soon (mostly to try out the Rtools 4.0 toolchain vs deliberately wanting to support legacy platforms).

As usual, drop issues/PRs/feature requests where you’re comfortable for any of these or other packages.