Access the Internet Archive Advanced Search/Scrape API with wayback (+ links to a new vignette & pkgdown site)

[This article was first published on R –, and kindly contributed to R-bloggers].

The wayback package has been updated to retrieve mementos more efficiently and now supports working with the Internet Archive’s advanced search+scrape API.


The search/scrape interface lets you examine the IA collections and download what you are after (programmatically). The main function is ia_scrape() but you can also paginate through results with the helper functions provided.
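
As a rough illustration of what paginating involves, here is a minimal, self-contained sketch of cursor-style paging in R. It does not call the wayback helpers: fetch_page() and fetch_all() are hypothetical names that only mimic a paged API locally, on the assumption that each response hands back a marker for the next page.

```r
# Hypothetical stand-in for one API request (NOT part of wayback).
# Returns up to `count` ids after `cursor`, plus the cursor for the
# next page, or NULL when there is nothing left to fetch.
fetch_page <- function(cursor = 0L, count = 3L, total = 8L) {
  if (cursor >= total) {
    return(list(items = integer(0), cursor = NULL))
  }
  last <- min(cursor + count, total)
  list(
    items  = seq(cursor + 1L, last),
    cursor = if (last < total) last else NULL
  )
}

# Collect every page: keep requesting until no cursor comes back.
fetch_all <- function(count = 3L, total = 8L) {
  pages <- list()
  page <- fetch_page(0L, count, total)
  repeat {
    pages[[length(pages) + 1L]] <- page$items
    if (is.null(page$cursor)) break
    page <- fetch_page(page$cursor, count, total)
  }
  unlist(pages)
}

all_items <- fetch_all()
print(all_items)
```

The loop stops on the missing cursor rather than on an empty page, so a final short page is still collected.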

To demonstrate, let’s peruse the IA NASA collection and then grab one of the images. First we search the collection, then choose a target URL to retrieve, and finally download it. The identifier is the key element: it is what we pass to ia_retrieve() to get the information about a particular entry in the collection.


library(wayback)

(nasa <- ia_scrape("collection:nasa", count=100L))

## # A tibble: 100 x 3
##    identifier addeddate            title                                       
##    <chr>      <chr>                <chr>                                       
##  1 00-042-154 2009-08-26T16:30:09Z International Space Station exhibit         
##  2 00-042-32  2009-08-26T16:30:12Z Swamp to Space historical exhibit           
##  3 00-042-43  2009-08-26T16:30:16Z Naval Meteorology and Oceanography Command …
##  4 00-042-56  2009-08-26T16:30:19Z Test Control Center exhibit                 
##  5 00-042-71  2009-08-26T16:30:21Z Space Shuttle Cockpit exhibit               
##  6 00-042-94  2009-08-26T16:30:24Z RocKeTeria restaurant                       
##  7 00-050D-01 2009-08-26T16:30:26Z Swamp to Space exhibit                      
##  8 00-057D-01 2009-08-26T16:30:29Z Astro Camp 2000 Rocketry Exercise           
##  9 00-062D-03 2009-08-26T16:30:32Z Launch Pad Tour Stop                        
## 10 00-068D-01 2009-08-26T16:30:34Z Lunar Lander Exhibit                        
## # ... with 90 more rows

(item <- ia_retrieve(nasa$identifier[1]))

## # A tibble: 6 x 4
##   file                       link                                                               last_mod          size 
## 1 00-042-154.jpg                06-Nov-2000 15:34 1.2M 
## 2 00-042-154_archive.torrent 06-Jul-2018 11:14 1.8K 
## 3 00-042-154_files.xml       06-Jul-2018 11:14 1.7K 
## 4 00-042-154_meta.xml        03-Jun-2016 02:06 1.4K 
## 5 00-042-154_thumb.jpg       26-Aug-2009 16:30 7.7K 
## 6 __ia_thumb.jpg                06-Jul-2018 11:14 26.6K

# mode = "wb" keeps the binary image intact on Windows
download.file(item$link[1], file.path("man/figures", item$file[1]), mode = "wb")

I just happened to know this would take me to an image. You can add the media type to the result (along with a host of other fields) to help with programmatic filtering.
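
If you'd rather not rely on knowing which row holds the image, one option is to filter the file listing by name. The sketch below is a minimal base R example that uses the file names from the ia_retrieve() output above. The thumbnail rule (dropping names that contain "thumb") is an assumption that fits this listing, not something the package defines.

```r
# File names copied from the ia_retrieve() listing shown above.
# The link column is left out because the listing does not show its values.
item_files <- data.frame(
  file = c(
    "00-042-154.jpg",
    "00-042-154_archive.torrent",
    "00-042-154_files.xml",
    "00-042-154_meta.xml",
    "00-042-154_thumb.jpg",
    "__ia_thumb.jpg"
  ),
  stringsAsFactors = FALSE
)

# Keep rows whose file name ends in .jpg, .jpeg or .png (case-insensitive).
is_image <- grepl("\\.(jpe?g|png)$", item_files$file, ignore.case = TRUE)

# Assumption: thumbnails carry "thumb" in their name, as they do here.
is_thumb <- grepl("thumb", item_files$file, fixed = TRUE)

images <- item_files[is_image & !is_thumb, , drop = FALSE]
print(images$file)
```

With the real tibble, the same logical index could be applied to item so that the matching link value is passed to download.file().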

The API is not yet set in stone, so you're encouraged to submit questions and suggestions.


The vignette is embedded below and frame-busted here. It covers a practical use case raised recently by an OP on Stack Overflow.

There's also a new pkgdown-gen'd site for the package.

Issues & PRs welcome at your community coding site of choice.
