Articles by hrbrmstr

Identify & Analyze Web Site Tech Stacks With rappalyzer

September 30, 2017 | hrbrmstr

Modern websites are complex beasts. They house photo galleries, interactive visualizations, web fonts, analytics code and other diverse types of content. Despite the potential for diversity, many web sites share similar “tech stacks” — the components that come together to make them what they are. These stacks consist of web servers (...
[Read more...]

SODD — StackOverflow Driven-Development

September 28, 2017 | hrbrmstr

I occasionally hang out on StackOverflow and often use an answer as an opportunity to fill a package void for a particular need. docxtractr and qrencoder are two (of many) packages that were birthed from SO answers. I usually try to answer with inline code first then expand the functionality ... [Read more...]

Speeding Up Digital Arachnids

September 25, 2017 | hrbrmstr

spiderbar, spiderbar Reads robots rules from afar. Crawls the web, any size; Fetches with respect, never lies. Look Out! Here comes the spiderbar. Is it fast? Listen bud, It's got C++ under the hood. Can you scrape, from a site? Test with can_fetch(), TRUE == alright Hey, there There goes ...
[Read more...]

Pirating Web Content Responsibly With R

September 19, 2017 | hrbrmstr

International Code Talk Like A Pirate Day almost slipped by without me noticing (September has been a crazy busy month), but it popped up in the calendar notifications today and I was glad that I had prepped the meat of a post a few weeks back. There will be no ‘...
[Read more...]

Mapping Fall Foliage with sf

September 18, 2017 | hrbrmstr

I was socially engineered by @yoniceedee into creating today’s post due to being prodded with this tweet from Vox (@voxdotcom) on September 18, 2017: “Where to see the best fall foliage, based on your location: https://t.co/12pQU29ksB”. Since there aren’t nearly enough sf and geom_... [Read more...]

It’s a FAKE (?)! Revisiting Trust In FOSS Ecosystems

September 15, 2017 | hrbrmstr

I’ve blathered about trust before (1, 2), but said blatherings were in a “what if” context. Unfortunately, the if has turned into a when, which begged for further blathering on a recent FOSS ecosystem cybersecurity incident. The gg_spiffy @thomasp85 linked to a post by the SK-CSIRT detailing the discovery and ...
[Read more...]

Revisiting Readability With RStudio

September 13, 2017 | hrbrmstr

I’ve blogged about my in-development R package hgr before and it’s slowly getting to a CRAN release. There are two new features to it that are more useful in an interactive session than in a programmatic context. Since they build on each other, we’ll take them ...
[Read more...]

Teasing Out Top Daily Topics with GDELT’s Television Explorer

September 9, 2017 | hrbrmstr

Earlier this year, the GDELT Project released their Television Explorer that enabled API access to closed-caption text from television news broadcasts. They’ve done an incredible job expanding and stabilizing the API and just recently released “top trending tables” which summarise what the “top” topics and phrases are across news ...
[Read more...]

Readability Redux

September 4, 2017 | hrbrmstr

I recently posted about using a Python module to convert HTML to usable text. Since then, a new package has hit CRAN dubbed htm2txt that is 100% R and uses regular expressions to strip tags from text. I gave it a spin so folks could compare some basic output, but ... [Read more...]

New CRAN Package Announcement: splashr

August 29, 2017 | hrbrmstr

I’m pleased to announce that splashr is now on CRAN. (That image was generated with splashr::render_png(url = "https://cran.r-project.org/web/packages/splashr/")). The package is an R interface to the Splash javascript rendering service. It works in a similar fashion to Selenium but is far ...
[Read more...]

Unbottling “.msg” Files in R

August 25, 2017 | hrbrmstr

There was a discussion on Twitter about the need to read in “.msg” files using R. The “MSG” file format is one of the many binary abominations created by Microsoft to lock folks and users into their platform and tools. Thankfully, they (eventually) provided documentation for the MSG file format ...
[Read more...]

Reticulating Readability

August 24, 2017 | hrbrmstr

I needed to clean some web HTML content for a project. I usually use hgr::clean_text() for that, and it generally works pretty well. The clean_text() function uses an XSLT stylesheet to try to remove all non-“main text content” from an HTML document and it usually ... [Read more...]

Caching httr Requests? This means WAR[C]!

August 22, 2017 | hrbrmstr

I’ve blathered about my crawl_delay project before and am just waiting for a rainy weekend to be able to crank out a follow-up post on it. Working on that project involved sifting through thousands of Web Archive (WARC) files. While I have a nascent package on GitHub to ... [Read more...]

R⁶ — Reticulating Parquet Files

August 1, 2017 | hrbrmstr

The reticulate package provides a very clean & concise interface bridge between R and Python which makes it handy to work with modules that have yet to be ported to R (going native is always better when you can do it). This post shows how to use reticulate to create parquet ... [Read more...]

R⁶ — General (Attys) Distributions

July 25, 2017 | hrbrmstr

Matt @stiles is a spiffy data journalist at the @latimes and he posted an interesting chart on U.S. Attorneys General longevity (given that the current US AG is on thin ice): Only Watergate and the Civil War have prompted shorter tenures as AG (if Sessions were to leave now). ...
[Read more...]