Dr. Evil meets the robotstxt package

I am fairly new to web scraping in R using rvest, and one question is whether a site gives permission for scraping. This information is often contained in the robots.txt file on a website. So I'm going to briefly explore the rOpenSci robotstxt package by Peter Meissner, which provides easy access to a domain's robots.txt file from R.

I'm slowly working on a new R data package for underwater geographic feature names as part of biospolar, a Norwegian Research Council funded project on innovation involving biodiversity in marine polar areas. One of the main data sources for the package is the General Bathymetric Chart of the Oceans or GEBCO Gazetteer. I'm also going to be bringing in data from the InterRidge database of hydrothermal vents, so I wanted to understand whether I am free to just go ahead.

The robots.txt content is advisory, and of course we could always choose to be Dr. Evil. (If my wife would let me have a cat, it would definitely be called Mr. Bigglesworth.) But it strikes me that building a package for a data source that tries to prohibit scraping might not be a brilliant idea.

There are a bunch of functions in the robotstxt package, but I'm just going to use the main one, robotstxt(); take a look at the vignette for more information. For a very quick check on whether scraping a path is allowed, try the paths_allowed() function. I'll come back to that at the end.

The first place I am going to look is the main GEBCO domain.

library(robotstxt)
gebco <- robotstxt("https://www.gebco.net")
gebco
## $domain
## [1] "https://www.gebco.net"
## 
## $text
## [1] "Sitemap: https://www.gebco.net/sitemap.xml \r\n\r\nUser-agent: *\r\nHost: www.gebco.net\r\nDisallow: /cgi-bin/\r\nDisallow: /perl/\r\nDisallow: /css/\r\nDisallow: /js/\r\nDisallow: /_mm/\r\nDisallow: /_notes/\r\n\n[... 36 lines omitted ...]"
## 
## $bots
## [1] "*"                "Googlebot"        "Googlebot-Image" 
## [4] "Googlebot-Mobile"
## 
## $comments
## [1] line    comment
## <0 rows> (or 0-length row.names)
## 
## $permissions
##                         field useragent     value
## 1                    Disallow         * /cgi-bin/
## 2                    Disallow         *    /perl/
## 3                    Disallow         *     /css/
## 4                    Disallow         *      /js/
## 5                    Disallow         *     /_mm/
## 6                    Disallow         *  /_notes/
## 7                                                
## 8 [...  31 items omitted ...]                    
## 
## $crawl_delay
## [1] field     useragent value    
## <0 rows> (or 0-length row.names)
## 
## $host
##   field useragent         value
## 1  Host         * www.gebco.net
## 
## $sitemap
##     field useragent                             value
## 1 Sitemap         * https://www.gebco.net/sitemap.xml
## 
## $other
## [1] field     useragent value    
## <0 rows> (or 0-length row.names)
## 
## $robexclobj
## <Robots Exclusion Protocol Object>
## $check
## function (paths = "/", bot = "*") 
## {
##     spiderbar::can_fetch(obj = self$robexclobj, path = paths, 
##         user_agent = bot)
## }
## <bytecode: 0x7fc3af22a750>
## <environment: 0x7fc3af24bef8>
## 
## attr(,"class")
## [1] "robotstxt"

This returns a list built from the robots.txt file, and the main bit I am interested in is the data frame under gebco$permissions, shown below (with a short subsetting sketch after the table).

field useragent value
Disallow * /cgi-bin/
Disallow * /perl/
Disallow * /css/
Disallow * /js/
Disallow * /_mm/
Disallow * /_notes/
Disallow * /_baks/
Disallow * /MMWIP/
Disallow Googlebot /cgi-bin/
Disallow Googlebot /perl/
Disallow Googlebot /css/
Disallow Googlebot /js/
Disallow Googlebot /_mm/
Disallow Googlebot /_notes/
Disallow Googlebot /_baks/
Disallow Googlebot /MMWIP/
Disallow Googlebot /*templates
Disallow Googlebot */log.gif
Disallow Googlebot /*_baks
Disallow Googlebot /*_notes
Disallow Googlebot /js
Disallow Googlebot *.csi
Disallow Googlebot *.vcf
Disallow Googlebot-Image /cgi-bin/
Disallow Googlebot-Image /perl/
Disallow Googlebot-Image /css/
Disallow Googlebot-Image /js/
Disallow Googlebot-Image /_mm/
Disallow Googlebot-Image /_notes/
Disallow Googlebot-Image /_baks/
Disallow Googlebot-Image /MMWIP/
Disallow Googlebot-Image */log.gif
Disallow Googlebot-Mobile /*templates
Disallow Googlebot-Mobile */log.gif
Disallow Googlebot-Mobile /*_baks
Disallow Googlebot-Mobile /*_notes

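Because gebco$permissions is an ordinary data frame, it is also easy to subset before interpreting it. A minimal base R sketch (output not shown):

# rules that apply to every crawler (user agent "*")
subset(gebco$permissions, useragent == "*")

# how many paths is each bot asked to stay out of?
table(gebco$permissions$useragent)
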
What is of interest here are the entries under value, which can be a bit difficult to interpret. With the help of the handy Wikipedia article on the Robots Exclusion Standard (and a quick paths_allowed() check after the list to confirm), I can see that:

  • Disallow under user agent * means the rule applies to all bots; here they are asked to stay out of specific directories rather than the whole site (that would be Disallow: /).

  • Disallow + /xyz/ means stay out of that specific directory.

  • Disallow under a named bot such as Googlebot means that the named bot should stay out of either the whole website or, as in this case, specific directories. Note that Googlebot appears to be in the naughty seat, because the site is more specific about what Googlebot should stay out of, while other bots would presumably be free to enter those paths.
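
As a quick sanity check on that reading, we can ask the package directly whether particular paths are allowed for a given user agent. This is only a sketch (output not shown); the paths are taken from the disallow list above and, if the reading is right, /cgi-bin/ should come back as not allowed for any bot.

# are these paths open to all crawlers?
paths_allowed(
  paths  = c("/", "/cgi-bin/", "/js/"),
  domain = "https://www.gebco.net",
  bot    = "*"
)

# and to Googlebot, which has a longer disallow list?
paths_allowed(
  paths  = c("/", "/cgi-bin/", "/js/"),
  domain = "https://www.gebco.net",
  bot    = "Googlebot"
)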

However, the GEBCO data files that I am interested in are not hosted on the gebco.net domain but on the NOAA National Centers for Environmental Information domain.

noaa <- robotstxt(domain = "https://www.ngdc.noaa.gov")
noaa
## $domain
## [1] "https://www.ngdc.noaa.gov"
## 
## $text
## [1] "User-agent: *\nCrawl-delay: 60\nDisallow: /cgi-bin\nDisallow: /dmsp/cgi-bin\nDisallow: /dmsp/data\nDisallow: /dmsp/include\nDisallow: /dmsp/protected\nDisallow: /eog\nDisallow: /geomag/cdroms\nDisallow: /geomag/data\n\n[... 67 lines omitted ...]"
## 
## $bots
## [1] "*"                                                                                            
## [2] "LinkChecker"                                                                                  
## [3] "siteimprove"                                                                                  
## [4] "Mozilla/5.0(compatible;MSIE10.0;WindowsNT6.1;Trident/6.0)LinkCheckbySiteimprove.com"          
## [5] "Mozilla/5.0(compatible;MSIE10.0;WindowsNT6.1;Trident/6.0)SiteCheck-sitecrawlbySiteimprove.com"
## [6] "HTMLValidatorbysiteimprove.com/1.3"                                                           
## 
## $comments
## [1] line    comment
## <0 rows> (or 0-length row.names)
## 
## $permissions
##                         field useragent           value
## 1                    Disallow         *        /cgi-bin
## 2                    Disallow         *   /dmsp/cgi-bin
## 3                    Disallow         *      /dmsp/data
## 4                    Disallow         *   /dmsp/include
## 5                    Disallow         * /dmsp/protected
## 6                    Disallow         *            /eog
## 7                                                      
## 8 [...  73 items omitted ...]                          
## 
## $crawl_delay
##         field useragent value
## 1 Crawl-delay         *    60
## 
## $host
## [1] field     useragent value    
## <0 rows> (or 0-length row.names)
## 
## $sitemap
## [1] field     useragent value    
## <0 rows> (or 0-length row.names)
## 
## $other
## [1] field     useragent value    
## <0 rows> (or 0-length row.names)
## 
## $robexclobj
## <Robots Exclusion Protocol Object>
## $check
## function (paths = "/", bot = "*") 
## {
##     spiderbar::can_fetch(obj = self$robexclobj, path = paths, 
##         user_agent = bot)
## }
## <bytecode: 0x7fc3af22a750>
## <environment: 0x7fc3aee6a4e0>
## 
## attr(,"class")
## [1] "robotstxt"

The NOAA robots.txt provides some different information. For example, NOAA specifies a crawl delay of 60 seconds, which tells me to build a 60-second delay into any repeated calls; a rough sketch of what that might look like with rvest is below, followed by the full robots.txt text.
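
The sketch below is only an illustration of honouring the crawl delay: the gazetteer URLs are placeholders rather than real pages, and Sys.sleep() is just one simple way to space out requests.

library(rvest)

# placeholder URLs, for illustration only
urls <- c(
  "https://www.ngdc.noaa.gov/gazetteer/page1",
  "https://www.ngdc.noaa.gov/gazetteer/page2"
)

pages <- lapply(urls, function(url) {
  Sys.sleep(60)  # respect the 60 second crawl delay between requests
  read_html(url)
})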

noaa$text
## User-agent: *
## Crawl-delay: 60
## Disallow: /cgi-bin
## Disallow: /dmsp/cgi-bin
## Disallow: /dmsp/data
## Disallow: /dmsp/include
## Disallow: /dmsp/protected
## Disallow: /eog
## Disallow: /geomag/cdroms
## Disallow: /geomag/data
## Disallow: /geomag/EMM/data
## Disallow: /geomag/pmag/datafiles
## Disallow: /geomag/WMM/data
## Disallow: /globe
## Disallow: /hazard/data
## Disallow: /hazard/img 
## Disallow: /IAGA/cgi-bin
## Disallow: /idb
## Disallow: /ionosonde
## Disallow: /mgg/cgi-bin
## Disallow: /mgg/curator/data
## Disallow: /mgg/curator/userfiles
## Disallow: /mgg/dat
## Disallow: /mgg/ecs/data
## Disallow: /mgg/gdas/data
## Disallow: /mgg/geology/data
## Disallow: /mgg/geology/odp/data
## Disallow: /mgg/grids/data
## Disallow: /mgg/oracle
## Disallow: /mgg/tmp
## Disallow: /mgg/trk
## Disallow: /ngdc/cgi-bin
## Disallow: /ngdc/hn
## Disallow: /ngdc/Counter
## Disallow: /ngdc/NOAAServer/adm
## Disallow: /ngdc/NOAAServer/converters
## Disallow: /ngdc/NOAAServer/gif
## Disallow: /ngdc/NOAAServer/java
## Disallow: /ngdc/NOAAServer/lib
## Disallow: /ngdc/NOAAServer_N
## Disallow: /ngdc/Store
## Disallow: /nmmr
## Disallow: /nndc
## Disallow: /paleo
## Disallow: /riwebapp/rest
## Disallow: /seg/cgi-bin
## Disallow: /stp/bin
## Disallow: /stp/cgi-bin
## Disallow: /stp/drap/data
## Disallow: /stp/include
## Disallow: /stp/image
## Disallow: /stp/images
## Disallow: /stp/include
## Disallow: /stp/iono/drap
## Disallow: /stp/iono/ustec/products
## Disallow: /stp/satellite/poes/dataaccess.html
## Disallow: /stp/satellite/goes/dataaccess.html
## Disallow: /sxi/servlet/sxibrowse
## Disallow: /sxi/servlet/sximovie
## Disallow: /sxi/servlet/sxisearch
## Disallow: /stp/IONO/ionosonde
## Disallow: /thredds
## Disallow: /wdc/cgi-bin
## 
## 
## User-agent: LinkChecker
## Disallow:
## 
## User-agent: siteimprove
## Disallow: /
## User-agent: Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0) LinkCheck by Siteimprove.com
## Disallow: /
## User-agent: Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0) SiteCheck-sitecrawl by Siteimprove.com
## Disallow: /
## User-agent: HTML Validator by siteimprove.com/1.3
## Disallow: /

We then see a list of disallowed directories, and in this case I am interested in https://www.ngdc.noaa.gov/gazetteer/.

The directory I am interested in for the package is not on the list, so I think I am free to go ahead… yay!

If I wanted to do this more quickly, I would use the paths_allowed() function.

paths_allowed("https://www.ngdc.noaa.gov/gazetteer/")
## [1] TRUE
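
For contrast, checking a path from the disallow list above, such as /cgi-bin, should return FALSE (not run here):

# a path that robots.txt disallows for all user agents
paths_allowed("https://www.ngdc.noaa.gov/cgi-bin/")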

So, there we have it. If we prefer to be good web scraping citizens rather than the Dr. Evil of web scraping in R, then the robotstxt package will help us out. On the other hand, we could just be evil and see what happens. I'm off to stroke Mr. Bigglesworth.
