
I’ve blathered about trust before (1, 2), but said blatherings were in a “what if” context. Unfortunately, the if has turned into a when, which begged for further blathering on a recent FOSS ecosystem cybersecurity incident.

The gg_spiffy @thomasp85 linked to a post by the SK-CSIRT detailing the discovery and take-down of a series of malicious Python packages. Here’s their high-level incident summary:

SK-CSIRT identified malicious software libraries in the official Python package repository, PyPI, posing as well known libraries. A prominent example is a fake package urllib-1.21.1.tar.gz, based upon a well known package urllib3-1.21.1.tar.gz.

Such packages may have been downloaded by unwitting developer or administrator by various means, including the popular “pip” utility (pip install urllib). There is evidence that the fake packages have indeed been downloaded and incorporated into software multiple times between June 2017 and September 2017.

Words are great, but unlike some other FOSS projects (*cough* R *cough*), the PyPI folks have authoritative log data regarding package downloads from PyPI. This means we can begin to quantify the exposure. The Google BigQuery SQL was pretty straightforward:

SELECT timestamp, file.project as package, country_code, file.version AS version
FROM (
)
WHERE file.project IN ('acqusition', 'apidev-coop', 'bzip', 'crypt', 'django-server',
'pwd', 'setup-tools', 'telnet', 'urlib3', 'urllib')
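Once the query results are exported, tallying them per day takes only a few lines of Python. Here's a minimal sketch, assuming each row is a dict keyed by the query's column names (`timestamp`, `package`, `country_code`, `version`). The `daily_downloads` helper and the sample rows are made up for illustration; they are not real PyPI log data.

```python
from collections import Counter
from datetime import datetime


def daily_downloads(rows):
    """Count downloads per (ISO date, package) pair.

    Assumes each row is a dict with an ISO-8601 'timestamp' string and a
    'package' name, mirroring the columns selected by the query.
    """
    counts = Counter()
    for row in rows:
        day = datetime.fromisoformat(row["timestamp"]).date().isoformat()
        counts[(day, row["package"])] += 1
    return counts


# Hypothetical rows for illustration only (NOT real download records).
sample_rows = [
    {"timestamp": "2017-06-05T10:15:00", "package": "urllib",
     "country_code": "US", "version": "1.21.1"},
    {"timestamp": "2017-06-05T18:42:00", "package": "urllib",
     "country_code": "DE", "version": "1.21.1"},
    {"timestamp": "2017-06-06T09:01:00", "package": "urlib3",
     "country_code": "US", "version": "1.21.1"},
]

for (day, package), n in sorted(daily_downloads(sample_rows).items()):
    print(day, package, n)
```

Keying on the date and the package name together keeps the tally in a shape that drops straight into a per-package time series chart.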

Let’s see what the daily downloads of the malicious packages look like:

But we need counts of the mal-package doppelgangers (i.e. the good packages) to truly understand the scope of exposure:

Thankfully, the SK-CSIRT folks caught this in time and the exposure was limited. But those are some popular tools that were targeted and it’s super-easy to sneak these into requirements.txt and scripts since the names are similar and the functionality is duplicated.
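To get a feel for just how close these names are, here's a rough sketch that flags a dependency name when it is nearly, but not exactly, the same as a package you actually trust. The `near_misses` helper, the 0.8 cutoff, and the short `trusted` list are all assumptions for illustration; a real allowlist would come from your own projects.

```python
import difflib


def near_misses(name, known_names, cutoff=0.8):
    """Return trusted names that `name` resembles without matching exactly.

    An exact match is treated as fine and yields an empty list; anything at
    or above `cutoff` similarity is reported as a possible lookalike.
    """
    if name in known_names:
        return []
    return difflib.get_close_matches(name, known_names, n=3, cutoff=cutoff)


# Hypothetical allowlist of packages a project really depends on.
trusted = ["urllib3", "requests", "setuptools"]

for candidate in ["urllib", "setup-tools", "requests"]:
    print(candidate, "->", near_misses(candidate, trusted))
```

String similarity is a blunt instrument: it will miss fakes with unrelated names and can raise false alarms on legitimately similar ones, so treat a hit as a prompt to look closer rather than a verdict.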

I’ll further note that the crypto package was “good” at some point in time then went away and was replaced with the nefarious one. That seems like a pretty big PyPI oversight (vis-a-vis package retirement & name re-use), but I’m not casting stones. R’s devtools::install_github() and wanton source()ing are just as bad, and the non-CRAN ecosystem is an even more varmint-prone “wild west” environment.

Furthermore, this is a potential exposure issue in many FOSS package repository ecosystems. On the one hand, these are open environments with tons of room for experimentation, creativity and collaboration. On the other hand, they’re all-too-easy targets for malicious hackers to prey upon.

I, unfortunately, have no quick-fix solutions to offer. “Review your code and dependencies” is about the best I can suggest until individual ecosystems work on better integrity & authenticity controls or there is a cross-ecosystem effort to establish “best practices” and perhaps even staffed, verified, audited, free services that work like a sheriff+notary to help ensure the safety of projects relying on open source components.

Python folks: double check that you weren’t a victim here (it’s super easy to type some of those package names wrong, and hopefully you’ve noticed builds failing if you had done so).
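If you'd rather not eyeball every requirements file, here's a minimal sketch of an automated check. The package names come straight from the query above; the `flagged_requirements` helper and its deliberately simple line parsing are my own assumptions, and it only looks at plain `name==version` style lines (not URLs, editable installs, or nested `-r` includes).

```python
import re

# Names taken from the WHERE clause of the BigQuery query above.
MALICIOUS = {
    "acqusition", "apidev-coop", "bzip", "crypt", "django-server",
    "pwd", "setup-tools", "telnet", "urlib3", "urllib",
}


def normalize(name):
    """Lowercase and collapse runs of '-', '_' and '.' into a single '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def flagged_requirements(lines):
    """Return requirement names (as written) that match a known-bad package."""
    hits = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match and normalize(match.group(0)) in MALICIOUS:
            hits.append(match.group(0))
    return hits


# Hypothetical requirements content for illustration.
example = ["requests==2.18.4", "urllib==1.21.1", "urllib3>=1.21", "# a comment"]
print(flagged_requirements(example))
```

To run it against a real file, something like `flagged_requirements(open("requirements.txt"))` should do, since iterating over a file object yields its lines.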

R folks: don’t be smug, watch your GitHub dependencies and double check your projects.

You can find the data and the scripts used to generate the charts (ironically enough) on GitHub.

Finally: I just want to close with a “thank you!” to PyPI’s Donald Stufft who (quickly!) pointed me to a blog post detailing the BigQuery setup.