With a mixture of R’s command-line tool, a batch file, and the Windows Task Scheduler, a simple automated web-scraper can be built.
Invoking R at the command-line
It is possible to invoke R from the Windows command-line by entering the full path name of the executable, such as
C:\"Program Files"\R\R-3.3.0\bin\R --vanilla
--vanilla is shorthand for several options (--no-save, --no-restore, --no-site-file, --no-init-file, and --no-environ) which, in short, tell R not to load any files at startup and not to ask the user whether to save the workspace image upon exit. If you were to invoke R from within the bin/ directory, you could enter the much simpler command
R --vanilla
And in the spirit of keeping things simple, if you place the bin/ directory in your PATH variable, then no matter what your current directory is, you can always use the simpler command to invoke R. Here’s how to add that bin/ directory to your PATH variable:
- Press the Windows Key.
- Type systempropertiesadvanced (all one word) and press Enter; alternatively, type sysdm.cpl, press Enter, and click the “Advanced” tab.
- Click “Environment Variables”.
- Select “Path” and click “Edit”.
- Place the cursor at the very end of the “Variable value” field.
- Type the path name of the bin/ directory, preceded by a ; (path names are ;-delimited). For the installation shown at the top of this section, the text to append would be:
;C:\Program Files\R\R-3.3.0\bin
- Click all the “OK” buttons until you have exited.
Congratulations. You can now invoke R from anywhere within the command-line.
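To double-check the change from within R, a minimal sketch like the following can help. The helper name bin_on_path is hypothetical (it is not part of R), and the directory in the usage lines is the example installation path from this post; substitute your own. Open a new command-line window or R session first, since programs that were already running keep the old PATH.

```r
# Hypothetical helper: is `bin_dir` one of the entries in a PATH-style string?
bin_on_path <- function(bin_dir,
                        path = Sys.getenv("PATH"),
                        sep = .Platform$path.sep) {
  # Normalise slashes, trailing slashes, and case so the comparison is lenient.
  clean <- function(p) tolower(sub("/+$", "", gsub("\\\\", "/", p)))
  entries <- strsplit(path, sep, fixed = TRUE)[[1]]
  clean(bin_dir) %in% clean(entries)
}

# Usage (example path from this post; adjust to your installation):
# bin_on_path("C:/Program Files/R/R-3.3.0/bin")
# Sys.which("R")  # full path of the R executable found via PATH, or "" if none
```

Sys.which() is the quicker check; the helper is only useful when you want to confirm that a specific directory made it into the variable.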
The BATCH tool
The above invocation launches an interactive R session in the command-line window, just as though you were using the console in RStudio or the R GUI. However, the command line also gives access to several CMD “tools”, which are run through R CMD rather than being called directly (or from a GUI).
One such tool, BATCH, allows the user to run R files at the command-line (similar to using source() in the interactive GUI). The command
R --vanilla CMD BATCH file.R file.Rout
will run file.R and save the output to file.Rout, assuming you are within (your working directory is) the directory that contains file.R. The .Rout file, if not given, is created in the same directory as the .R file and is given the same name but with extension .Rout. In the above example, once R CMD BATCH has finished executing file.R, it calls proc.time() and inserts the returned value in the .Rout file, giving an indication of how long it took to execute the file. Warning messages and errors are also written to the .Rout file.
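Because warnings and errors end up in the .Rout file rather than on screen, it can be handy to scan that file after an unattended run. The following is a minimal sketch, not a complete log parser: the function name rout_problems is hypothetical, and the patterns assume R's English-language messages, which begin lines with "Error", "Warning message", or "Execution halted".

```r
# Hypothetical helper: return the lines of an .Rout file that look like
# errors or warnings (assumes English-language R messages).
rout_problems <- function(rout_file) {
  lines <- readLines(rout_file, warn = FALSE)
  grep("^(Error|Warning message|Execution halted)", lines, value = TRUE)
}

# Usage, after a batch run has produced file.Rout:
# rout_problems("file.Rout")
```

An empty result (character(0)) suggests the run finished without R reporting a problem, although it does not prove that the script did what you intended.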
About batch files
Instead of repeatedly entering an
R CMD BATCH command to run an R file, the command can be both stored in and executed from a batch file. Batch files, which have extension
.bat, are plain text files whose content can be read and executed by the shell. These files can be created and edited using any text editing program (including RStudio).
Here is a batch file based on the above example:
@echo off
R --vanilla CMD BATCH file.R file.Rout
- @echo off = do not print the lines of code as they are executed.
- This assumes the batch file is saved to and executed from the same directory as file.R; if not, change the working directory or specify the full file path.
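Since a batch file is just plain text, it can also be generated from R itself. The sketch below is one way to do that under a few assumptions: write_batch_file is a hypothetical helper name, the R file name contains no spaces, and R's bin/ directory is already on your PATH. It adds a cd line so the batch file no longer depends on where it is executed from.

```r
# Hypothetical helper: write a .bat file that runs `r_file` via R CMD BATCH
# from the directory in which `r_file` lives.
write_batch_file <- function(r_file, bat_file) {
  r_dir  <- normalizePath(dirname(r_file), winslash = "\\", mustWork = FALSE)
  r_name <- basename(r_file)
  # Output file: same base name, with extension .Rout.
  rout   <- paste0(sub("\\.[Rr]$", "", r_name), ".Rout")
  lines <- c(
    "@echo off",
    sprintf("cd /d \"%s\"", r_dir),  # /d also switches drive if needed
    sprintf("R --vanilla CMD BATCH %s %s", r_name, rout)
  )
  writeLines(lines, bat_file)
  invisible(lines)
}

# Usage (both paths are placeholders):
# write_batch_file("C:/Users/Luke/Documents/R/file.R", "run_file.bat")
```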
Windows Task Scheduler
The Windows Task Scheduler allows users to schedule various types of tasks. One such task that can be scheduled is the execution of a batch file.
Using the GUI, it is possible to schedule an R file to execute daily by telling the scheduler to run a batch file which runs an
R CMD BATCH command to execute that R file. Using the Task Scheduler GUI is a straightforward process:
- Press the Windows Key, type either
taskschd.msc or “task scheduler”, and press enter to open the program.
- Click on “Create Task”.
- Assign a name and give a description.
- Create a new trigger and action to execute a batch file on a daily basis.
- Select additional conditions and settings as needed (such as “Wake to run” and “Run task as soon as possible after a scheduled start is missed”).
There are other features you can use, such as “Hidden” or “Run whether user is logged on or not”, but the above should be good enough.
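If you would rather not click through the GUI, Windows also ships a command-line tool, schtasks, that can create the same kind of daily task. The sketch below only builds the argument list in R and leaves the actual call commented out. The helper name schtasks_args, the task name, the batch file path, and the start time are all made-up placeholders, and this is an alternative to the GUI steps above, not something those steps require.

```r
# Hypothetical helper: build arguments for `schtasks /Create` to run a
# batch file once a day at `start_time` (24-hour HH:MM).
schtasks_args <- function(task_name, bat_file, start_time = "06:00") {
  c("/Create",
    "/SC", "DAILY",                           # schedule type
    "/TN", shQuote(task_name, type = "cmd"),  # task name
    "/TR", shQuote(bat_file, type = "cmd"),   # program to run
    "/ST", start_time)                        # start time
}

args <- schtasks_args("rigcount", "C:\\Users\\Luke\\Documents\\R\\rigcount.bat")

# On Windows, the task could then be registered with:
# system2("schtasks", args)
```

Options such as “Wake to run” are not covered by these arguments, so the GUI remains the simpler route if you need them.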
Putting it all together
I have taken some web-scraping code from a previous post on scraping North Dakota rig count data and modified and saved it in a file called
rigcount.data.R. You can find the modified code below, plus some caveats about writing R files that are executed by
R CMD BATCH, at the end of this post.
Here is all that is needed to create a simple automated web-scraper based on rigcount.data.R:
- Create a batch file to execute rigcount.data.R. When launched by the Task Scheduler, the batch file will by default run in the C:\Windows\System32 directory, so be sure to change the directory to where your R file is located, such as
@echo off
cd %USERPROFILE%\R\
R --vanilla CMD BATCH rigcount.data.R rigcount.data.Rout
- Use the task scheduler to create a task that will execute the above batch file on a daily basis.
There you have it. With a scheduled task to execute the batch file, you have just created a simple automated web-scraper.
Because you are executing an R file in batch mode, there will be a few changes to how R normally works when used with a program such as RStudio (which redirects standard input and output among other things).
- The library path to your %USERPROFILE%\R directory that is normally available when using RStudio will not be seen when using R CMD BATCH. That is why, before calling library(), it is necessary to specify that path, as in my case
.libPaths("C:/Users/Luke/Documents/R/win-library/3.3")
- When using write.csv() to create a new CSV file within RStudio, you normally don’t need to create and connect to that file. Using R CMD BATCH, however, you will need to do this, such as
fname <- "C:/Users/Luke/Documents/R/newFile.csv"
file.create(fname)
fcon <- file(fname, open = "w")
# Write through the open connection, then close it.
write.csv(some.object, fcon, row.names = FALSE)
close(fcon)
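One thing to watch for with the library path: a typo in the directory name will not stop the script at that point, and the failure only shows up later when library() cannot find the package. A small guard, sketched below, makes the problem visible in the .Rout file. The helper name add_lib_path is hypothetical, and the path in the usage line is the one from this post.

```r
# Hypothetical helper: prepend `lib_dir` to the library search path,
# warning (so the message lands in the .Rout file) if it does not exist.
add_lib_path <- function(lib_dir) {
  if (dir.exists(lib_dir)) {
    .libPaths(c(lib_dir, .libPaths()))
  } else {
    warning("Library directory not found: ", lib_dir)
  }
  invisible(.libPaths())
}

# Usage, before any library() call:
# add_lib_path("C:/Users/Luke/Documents/R/win-library/3.3")
```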
Here is the code for rigcount.data.R:
# Scrape Rig Count Data ---------------------------------------------------

# Load Dependencies
.libPaths("C:/Users/Luke/Documents/R/win-library/3.3")
library(rvest)

# Set today's date; to be used in file name.
today <- Sys.Date()

# Create and load URL; scrape table nodes and attributes ("summary").
url <- "https://www.dmr.nd.gov/oilgas/riglist.asp"
html <- url %>% read_html()
table <- html %>% html_nodes("table")
table.summary <- table %>% html_attr("summary")

# Find the table with rig count data, which is called "results".
table.filter <- grep("results", table.summary)
rig.table <- table[table.filter] %>% html_table()

# Extract the table from the list; find and apply the header to the table.
rig.table <- rig.table[[1]]
rig.table.header <- table[table.filter] %>%
  html_nodes("thead") %>%
  html_nodes("th") %>%
  html_text()
colnames(rig.table) <- rig.table.header

# Add "Publication Date" and make it the first column.
rig.table[ncol(rig.table) + 1L] <- today
names(rig.table)[ncol(rig.table)] <- "Publication Date"
rig.table <- rig.table[, c(ncol(rig.table), 1:(ncol(rig.table) - 1L))]

# Write table to CSV file through an explicitly opened connection.
fname <- paste0(getwd(), "/", today, ".csv")
file.create(fname)
fcon <- file(fname, open = "w")
write.csv(rig.table, fcon, row.names = FALSE)
close(fcon)
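The “Publication Date” step is easy to try out without hitting the website. The sketch below applies the same idea (append a date column, then move it to the front) to a made-up two-row data frame. The function name add_publication_date and the toy columns are illustrative only, and it assigns the column by name rather than by position, which is a slight variation on the script above.

```r
# Hypothetical helper: add a "Publication Date" column and move it first.
add_publication_date <- function(df, date = Sys.Date()) {
  df[["Publication Date"]] <- date  # recycled to every row
  df[, c(ncol(df), seq_len(ncol(df) - 1L)), drop = FALSE]
}

# Toy data standing in for the scraped rig table.
toy <- data.frame(Rig = c("A", "B"),
                  County = c("X", "Y"),
                  stringsAsFactors = FALSE)

out <- add_publication_date(toy, as.Date("2016-06-01"))
names(out)
# "Publication Date" "Rig" "County"
```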