
Most modern operating systems keep secrets from you in many ways. One of these ways is by associating extended file attributes with files. These attributes can serve useful purposes. For instance, macOS uses them to identify when files have passed through Gatekeeper or to store the URLs of files that were downloaded via Safari (though most other browsers add the com.apple.metadata:kMDItemWhereFroms attribute now, too).

Attributes are nothing more than a series of key/value pairs. The key must be a unique character value, and it’s fairly standard practice to keep the value component under 4K. Apart from that, you can put anything in the value: text, binary content, etc.

When you’re in a terminal session you can tell that a file has extended attributes by looking for an @ sign near the permissions column:

$ cd ~/Downloads
$ ls -l
total 264856
-rw-r--r--@ 1 user  staff     169062 Nov 27  2017 1109.1968.pdf
-rw-r--r--@ 1 user  staff     171059 Nov 27  2017 1109.1968v1.pdf
-rw-r--r--@ 1 user  staff     291373 Apr 27 21:25 1804.09970.pdf
-rw-r--r--@ 1 user  staff    1150562 Apr 27 21:26 1804.09988.pdf
-rw-r--r--@ 1 user  staff     482953 May 11 12:00 1805.01554.pdf
-rw-r--r--@ 1 user  staff  125822222 May 14 16:34 RStudio-1.2.627.dmg
-rw-r--r--@ 1 user  staff    2727305 Dec 21 17:50 athena-ug.pdf
-rw-r--r--@ 1 user  staff      90181 Jan 11 15:55 bgptools-0.2.tar.gz
-rw-r--r--@ 1 user  staff    4683220 May 25 14:52 osquery-3.2.4.pkg


You can work with extended attributes from the terminal with the xattr command, but do you really want to go to the terminal every time you want to examine these secret settings (now that you know your OS is keeping secrets from you)?

I didn’t think so. Thus begat the xattrs package.

Data scientists are (generally) inquisitive folk and tend to accumulate things. We grab papers, data, programs (etc.) and some of those actions are performed in browsers. Let’s use the xattrs package to rebuild a list of download URLs from the extended attributes on the files located in ~/Downloads (if you’ve chosen a different default for your browsers, use that directory).

We’re not going to work with the entire package in this post (it’s really straightforward to use and has a README on the GitHub site along with extensive examples) but I’ll use one of the example files from the directory listing above to demonstrate a couple functions before we get to the main example.

First, let’s see what is hidden with the RStudio disk image:

library(xattrs)
library(reticulate) # not 100% necessary but you'll see why later
library(tidyverse) # we'll need this later

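list_xattrs("~/Downloads/RStudio-1.2.627.dmg")
## [1] "com.apple.diskimages.fsck"            "com.apple.diskimages.recentcksum"
## [3] "com.apple.metadata:kMDItemWhereFroms" "com.apple.quarantine"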


There are four keys we can poke at, but the one that will help transition us to a larger example is com.apple.metadata:kMDItemWhereFroms. This is the key Apple has standardized on to store the source URL of a downloaded item. Let’s take a look:

get_xattr_raw("~/Downloads/RStudio-1.2.627.dmg", "com.apple.metadata:kMDItemWhereFroms")
##   [1] 62 70 6c 69 73 74 30 30 a2 01 02 5f 10 4c 68 74 74 70 73 3a 2f 2f 73 33 2e 61 6d 61
##  [29] 7a 6f 6e 61 77 73 2e 63 6f 6d 2f 72 73 74 75 64 69 6f 2d 69 64 65 2d 62 75 69 6c 64
##  [57] 2f 64 65 73 6b 74 6f 70 2f 6d 61 63 6f 73 2f 52 53 74 75 64 69 6f 2d 31 2e 32 2e 36
##  [85] 32 37 2e 64 6d 67 5f 10 2c 68 74 74 70 73 3a 2f 2f 64 61 69 6c 69 65 73 2e 72 73 74
## [113] 75 64 69 6f 2e 63 6f 6d 2f 72 73 74 75 64 69 6f 2f 6f 73 73 2f 6d 61 63 2f 08 0b 5a
## [141] 00 00 00 00 00 00 01 01 00 00 00 00 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00 00
## [169] 00 00 00 89


Why “raw”? Well, as noted above, the value component of these attributes can store anything and this one definitely has embedded nul[l]s (0x00) in it. We can try to read it as a string, though:

get_xattr("~/Downloads/RStudio-1.2.627.dmg", "com.apple.metadata:kMDItemWhereFroms")
## [1] "bplist00\xa2\001\002_\020Lhttps://s3.amazonaws.com/rstudio-ide-build/desktop/macos/RStudio-1.2.627.dmg_\020,https://dailies.rstudio.com/rstudio/oss/mac/\b\vZ"
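
As a quick aside on why the raw form matters, here's a tiny base R sketch (it does not use the xattrs package, and the six bytes are a made-up, shortened stand-in for a real attribute value). R character strings cannot hold 0x00, so a raw vector is the only lossless way to carry a value like this around:

```r
# A short, made-up stand-in for an attribute value with an embedded nul byte
value <- as.raw(c(0x62, 0x70, 0x6c, 0x00, 0x68, 0x69))

# R strings cannot contain 0x00, so a direct conversion raises an error
direct <- tryCatch(rawToChar(value), error = function(e) "conversion failed")

# Dropping the nul bytes first gives a readable (but lossy) string
lossy <- rawToChar(value[value != as.raw(0x00)])

print(direct)
print(lossy)
```

Any string view of such a value is therefore an approximation of what is really stored.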


So, we can kinda figure out the URL but it’s definitely not pretty. The general practice of Safari (and other browsers) is to use a binary property list to store metadata in the value component of an extended attribute (at least for these URL references).
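
If you're curious what is actually inside that blob, here's a rough base R sketch that walks the binary plist layout for the exact bytes printed above. This is not how the xattrs package or any plist library does it. It only handles the two object types present in this particular value (one array and ASCII strings), and the helper names be_int() and read_object() are made up for illustration:

```r
# Bytes copied from the get_xattr_raw() output shown earlier in this post
hex <- c(
  "62 70 6c 69 73 74 30 30 a2 01 02 5f 10 4c 68 74 74 70 73 3a 2f 2f 73 33 2e 61 6d 61",
  "7a 6f 6e 61 77 73 2e 63 6f 6d 2f 72 73 74 75 64 69 6f 2d 69 64 65 2d 62 75 69 6c 64",
  "2f 64 65 73 6b 74 6f 70 2f 6d 61 63 6f 73 2f 52 53 74 75 64 69 6f 2d 31 2e 32 2e 36",
  "32 37 2e 64 6d 67 5f 10 2c 68 74 74 70 73 3a 2f 2f 64 61 69 6c 69 65 73 2e 72 73 74",
  "75 64 69 6f 2e 63 6f 6d 2f 72 73 74 75 64 69 6f 2f 6f 73 73 2f 6d 61 63 2f 08 0b 5a",
  "00 00 00 00 00 00 01 01 00 00 00 00 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00 00",
  "00 00 00 89"
)
bytes <- as.raw(strtoi(unlist(strsplit(hex, " ")), 16L))

# Hypothetical helper: big-endian raw bytes -> number
be_int <- function(b) sum(as.integer(b) * 256^(rev(seq_along(b)) - 1))

# The last 32 bytes are the trailer, which describes the rest of the file
n           <- length(bytes)
trailer     <- bytes[(n - 31):n]
offset_size <- as.integer(trailer[7])   # bytes per offset-table entry
ref_size    <- as.integer(trailer[8])   # bytes per object reference
num_objects <- be_int(trailer[9:16])
top_object  <- be_int(trailer[17:24])   # 0-based number of the root object
table_start <- be_int(trailer[25:32])   # 0-based file offset of the offset table

# The offset table holds one 0-based file offset per object
offsets <- vapply(seq_len(num_objects) - 1, function(i) {
  start <- table_start + i * offset_size
  be_int(bytes[(start + 1):(start + offset_size)])
}, numeric(1))

# Hypothetical helper: decode one object given its 0-based object number
read_object <- function(index) {
  pos    <- offsets[index + 1] + 1      # 1-based position of the marker byte
  marker <- as.integer(bytes[pos])
  type   <- marker %/% 16L              # high nibble: object type
  len    <- marker %% 16L               # low nibble: length (15 = length follows)
  pos    <- pos + 1
  if (len == 15L) {                     # real length is stored as an int object
    int_bytes <- 2^(as.integer(bytes[pos]) %% 16L)
    len <- be_int(bytes[(pos + 1):(pos + int_bytes)])
    pos <- pos + 1 + int_bytes
  }
  if (type == 5L) {                     # ASCII string
    rawToChar(bytes[pos:(pos + len - 1)])
  } else if (type == 10L) {             # array of object references
    refs <- vapply(seq_len(len) - 1, function(i) {
      be_int(bytes[(pos + i * ref_size):(pos + (i + 1) * ref_size - 1)])
    }, numeric(1))
    vapply(refs, read_object, character(1))
  } else {
    stop("object type not handled in this sketch")
  }
}

urls <- read_object(top_object)
print(urls)
```

It's a toy (no dictionaries, dates, Unicode strings or nested arrays), but it shows the value is just a small, structured container around the two URLs.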

There will eventually be a native Rust-backed property list reading package for R. Until then, we can work with that binary plist data in two ways: via the read_bplist() function that comes with the xattrs package, which wraps Linux/BSD or macOS system utilities (and is super expensive, since it means writing the data out to a file each time), or by turning to Python, which already has this capability. We’re going to use the latter.

I like to prime the Python setup with invisible(py_config()) but that is not really necessary (I do it mostly b/c I have a wild number of Python installs, don’t judge, and use the RETICULATE_PYTHON env var for the one I use with R). You’ll need to install the biplist module via pip3 install biplist or pip install biplist depending on your setup. I highly recommend using Python 3.x vs 2.x, though.

biplist <- import("biplist", as="biplist")

biplist$readPlistFromString(
  get_xattr_raw(
    "~/Downloads/RStudio-1.2.627.dmg", "com.apple.metadata:kMDItemWhereFroms"
  )
)
## [1] "https://s3.amazonaws.com/rstudio-ide-build/desktop/macos/RStudio-1.2.627.dmg"
## [2] "https://dailies.rstudio.com/rstudio/oss/mac/"


That's much better. Let's work with metadata for the whole directory:

# read the extended attributes of every file in ~/Downloads that has any
list.files("~/Downloads", full.names = TRUE) %>%
  keep(has_xattrs) %>%
  set_names(basename(.)) %>%
  map_df(read_xattrs, .id="file") -> xdf

xdf
## # A tibble: 24 x 4
##    file            name                                  size contents
##
##  1 1109.1968.pdf   com.apple.lastuseddate#PS               16
##  2 1109.1968.pdf   com.apple.metadata:kMDItemWhereFroms   110
##  3 1109.1968.pdf   com.apple.quarantine                    74
##  4 1109.1968v1.pdf com.apple.lastuseddate#PS               16
##  5 1109.1968v1.pdf com.apple.metadata:kMDItemWhereFroms   116
##  6 1109.1968v1.pdf com.apple.quarantine                    74
##  7 1804.09970.pdf  com.apple.metadata:kMDItemWhereFroms    86
##  8 1804.09970.pdf  com.apple.quarantine                    82
##  9 1804.09988.pdf  com.apple.lastuseddate#PS               16
## 10 1804.09988.pdf  com.apple.metadata:kMDItemWhereFroms   104
## # ... with 14 more rows

count(xdf, name, sort=TRUE)
## # A tibble: 5 x 2
##   name                                     n
##
## 1 com.apple.metadata:kMDItemWhereFroms     9
## 2 com.apple.quarantine                     9
## 3 com.apple.lastuseddate#PS                4
## 4 com.apple.diskimages.fsck                1
## 5 com.apple.diskimages.recentcksum         1


Now we can focus on the task at hand, recovering the URLs:

list.files("~/Downloads", full.names = TRUE) %>%
  keep(has_xattrs) %>%
  set_names(basename(.)) %>%
  map_df(read_xattrs, .id="file") %>%
  filter(name == "com.apple.metadata:kMDItemWhereFroms") %>% # keep only the source-URL attribute
  mutate(where_from = map(contents, biplist$readPlistFromString)) %>% # decode each binary plist
  select(file, where_from) %>%
  unnest(where_from) %>% # one row per URL
  filter(!where_from == "") # drop empty entries
## # A tibble: 15 x 2
##    file                where_from
##
##  1 1109.1968.pdf       https://arxiv.org/pdf/1109.1968.pdf
##  3 1109.1968v1.pdf     https://128.84.21.199/pdf/1109.1968v1.pdf
##  5 1804.09970.pdf      https://arxiv.org/pdf/1804.09970.pdf
##  6 1804.09988.pdf      https://arxiv.org/ftp/arxiv/papers/1804/1804.09988.pdf
##  7 1805.01554.pdf      https://arxiv.org/pdf/1805.01554.pdf
##  8 athena-ug.pdf       http://docs.aws.amazon.com/athena/latest/ug/athena-ug.pdf
## 10 bgptools-0.2.tar.gz http://nms.lcs.mit.edu/software/bgp/bgptools/bgptools-0.2.tar.gz
## 11 bgptools-0.2.tar.gz http://nms.lcs.mit.edu/software/bgp/bgptools/
## 12 osquery-3.2.4.pkg   https://osquery-packages.s3.amazonaws.com/darwin/osquery-3.2.4.p…
## 14 RStudio-1.2.627.dmg https://s3.amazonaws.com/rstudio-ide-build/desktop/macos/RStudio…
## 15 RStudio-1.2.627.dmg https://dailies.rstudio.com/rstudio/oss/mac/


(There are multiple URL entries because some browsers preserve the path you traversed to get to the final download.)
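
If you only want one URL per file, a small base R sketch like the one below can trim the result. It assumes (as in the output above, but not guaranteed for every browser) that the first entry for a file is the download itself and later entries are the pages that led to it. The data frame here is a hand-copied subset of the rows above, not something the package returns:

```r
# A few rows copied from the output above (only the untruncated URLs)
urls <- data.frame(
  file = c("1109.1968.pdf", "bgptools-0.2.tar.gz", "bgptools-0.2.tar.gz"),
  where_from = c(
    "https://arxiv.org/pdf/1109.1968.pdf",
    "http://nms.lcs.mit.edu/software/bgp/bgptools/bgptools-0.2.tar.gz",
    "http://nms.lcs.mit.edu/software/bgp/bgptools/"
  ),
  stringsAsFactors = FALSE
)

# Keep only the first URL recorded for each file
first_url <- urls[!duplicated(urls$file), ]
print(first_url)
```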

Note: if Python is not an option for you, you can use the hack-y read_bplist() function in the package, but it will be much, much slower and you'll need to deal with an ugly list object vs some quaint text vectors.

### FIN

Have some fun exploring what other secrets your OS may be hiding from you. If you're on Windows, give this a go. I have no idea if it will compile or work there, but if it does, definitely report back!

Remember that the package lets you set and remove extended attributes as well, so you can use them to store metadata with your data files (they don't always survive file or OS transfers but if you keep things local they can be an interesting way to tag your files) or clean up items you do not want stored.