[This article was first published on Pareto's Playground, and kindly contributed to R-bloggers].
After high school I made my way from Johannesburg, situated in the northern part of South Africa, to the famous wine country known as Stellenbosch in the south. Here for the first time I got a ton of exposure to wine and the countless varietals that make up this “drink of the gods”.
The one trick to wine tasting and to exploring vini- and viticulture is that the best way to learn about it all is to go out to the farm and drink the wine yourself. Over the years I have become familiar with all the farms in the region, and it can be a real prize to discover a cellar which I have not yet visited.
As one can see from the map, there are quite a few farms that are ‘officially’ listed, but believe me there are a few hidden ones which one will never know of without some research.
To ease this process I have used R to compile a sort of wine farm repository for myself, as a go-to guide whenever I visit the region. I did this as an exercise to explore Hadley Wickham’s amazing rvest library, which makes online data collection a breeze.
The idea was to compile wine review data of the Stellenbosch wine region from winemag.com’s online database and use it to gain an analytical insight into this beautiful wine region in the Cape.
The first step in collecting the data was to deal with some housekeeping. First, I assign the base URL from which to start collecting and read the number of result pages from the pagination block in the HTML of the Stellenbosch search results. Secondly, I collect the basic information about each wine, such as name, points and price. Lastly, I use the information on this first page to create a function, info_xtr(), which extracts further information about each wine from its own page: the written review, the date of the review and the alcohol content.
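The first two housekeeping steps can be sketched roughly as follows. This is a minimal illustration rather than the original code: the inline HTML is a made-up stand-in for a search-results page, and every class name (`.review-item`, `.title`, `.rating`, `.price`, `.pagination`) is an assumption about the markup, not something taken from winemag.com.

```r
library(rvest)

# Hypothetical stand-in for the first search-results page. The real
# winemag.com markup is not shown here, so every class name is an assumption.
first_page <- read_html('
<html><body>
  <ul>
    <li class="review-item">
      <a href="https://example.com/wine-a">
        <h3 class="title">Example Estate 2012 Shiraz (Stellenbosch)</h3>
        <span class="rating">90 Points</span>
        <span class="price">$25</span>
      </a>
    </li>
    <li class="review-item">
      <a href="https://example.com/wine-b">
        <h3 class="title">Sample Cellars 2013 Chenin Blanc (Stellenbosch)</h3>
        <span class="rating">87 Points</span>
        <span class="price">$14</span>
      </a>
    </li>
  </ul>
  <div class="pagination"><a>1</a><a>2</a><a>3</a><a>Next</a></div>
</body></html>')

# Page count: the largest numeric label inside the pagination block
page_labels <- first_page %>% html_nodes(".pagination a") %>% html_text(trim = TRUE)
n_pages <- max(suppressWarnings(as.integer(page_labels)), na.rm = TRUE)

# One row per wine on this page: name, points, price and link
items <- first_page %>% html_nodes(".review-item")
wines <- data.frame(
  name   = items %>% html_node(".title") %>% html_text(trim = TRUE),
  points = as.integer(gsub("[^0-9]", "", items %>% html_node(".rating") %>% html_text())),
  price  = as.numeric(gsub("[^0-9.]", "", items %>% html_node(".price") %>% html_text())),
  href   = items %>% html_node("a") %>% html_attr("href"),
  stringsAsFactors = FALSE
)

n_pages
wines
```

Calling `html_node()` on the set of items, rather than `html_nodes()` on the whole page, keeps the columns aligned: each item contributes exactly one value per column, even when a field such as the price is missing.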
The info_xtr() function returns a data.frame that contains the review, the date of the specific review and the alcohol content.
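The original function is not reproduced here, but a helper of the same shape might look something like the sketch below. The name `info_xtr_sketch()` and the three CSS selectors are hypothetical, chosen only for illustration; the selectors would need to be replaced with whatever the real wine pages use.

```r
library(rvest)

# Hypothetical helper in the spirit of info_xtr(); not the original function.
# `src` is anything read_html() accepts (a URL or an HTML string).
# The selectors below are assumptions about the page layout.
info_xtr_sketch <- function(src) {
  doc <- read_html(src)
  grab <- function(css) doc %>% html_node(css) %>% html_text(trim = TRUE)
  data.frame(
    review  = grab(".review-text"),
    date    = grab(".review-date"),
    alcohol = grab(".alcohol"),
    stringsAsFactors = FALSE
  )
}

# Made-up wine page, so the function can be tried without a network call
demo_page <- '<html><body>
  <p class="review-text">Ripe plum and spice with a firm finish.</p>
  <span class="review-date">1/1/2015</span>
  <span class="alcohol">14.5%</span>
</body></html>'

info_xtr_sketch(demo_page)
```

Because `html_node()` returns a missing node when a selector matches nothing, a wine page without, say, an alcohol entry should still yield one row, with `NA` in that column.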
The wine selection data from the page is collated in a nice data.frame() for ease of reading. With all the information I need, I start my scraper to run through the website and collect the information on every wine.
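A crawl loop of this kind could be sketched as below. It is an illustration under stated assumptions, not the original scraper: `crawl_pages()`, `fake_fetch()` and `parse_titles()` are hypothetical names, the `example.com` URL pattern is invented, and the fetch step is stubbed out so the loop can be run offline.

```r
library(rvest)

# Hypothetical crawl loop. `fetch` defaults to read_html() but can be swapped
# for a stub, and `parse_page` turns one parsed page into a data.frame.
crawl_pages <- function(urls, parse_page, fetch = read_html, delay = 1) {
  cellar <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    res <- tryCatch(parse_page(fetch(urls[i])), error = function(e) NULL)
    if (!is.null(res)) cellar[[i]] <- res  # a failed page stays NULL
    Sys.sleep(delay)                       # pause between requests
  }
  cellar
}

# Assumed URL pattern: a base search URL followed by the page number
base_url <- "https://example.com/search?region=Stellenbosch&page="
urls <- paste0(base_url, 1:3)

# Offline stub standing in for a real page download
fake_fetch <- function(url) {
  page_no <- sub(".*page=", "", url)
  read_html(paste0('<html><body><h3 class="title">Wine from page ',
                   page_no, '</h3></body></html>'))
}

# Minimal parser: one row per title found on the page
parse_titles <- function(doc) {
  data.frame(name = doc %>% html_nodes(".title") %>% html_text(trim = TRUE),
             stringsAsFactors = FALSE)
}

Wine_cellar <- crawl_pages(urls, parse_titles, fetch = fake_fetch, delay = 0)
length(Wine_cellar)
```

Wrapping each page in `tryCatch()` means one failed request does not abort the whole run, and the delay between requests keeps the load on the site modest.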
With my scraper having completed its task of retrieving all the necessary information from the website, I bind the information from the Wine_cellar list object. For brevity I truncate my sample by removing the href and review columns, which take up a lot of space.
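The binding and truncating step can be illustrated with base R alone. The two rows below are invented placeholder data, and the column names other than href and review are assumptions about what the scraped table contains.

```r
# Hypothetical stand-in for the Wine_cellar list: one data.frame per page
Wine_cellar <- list(
  data.frame(name = "Example Estate 2012 Shiraz", points = 90L, price = 25,
             href = "https://example.com/wine-a",
             review = "Ripe plum and spice with a firm finish.",
             stringsAsFactors = FALSE),
  data.frame(name = "Sample Cellars 2013 Chenin Blanc", points = 87L, price = 14,
             href = "https://example.com/wine-b",
             review = "Fresh citrus and a touch of honey.",
             stringsAsFactors = FALSE)
)

# Stack the per-page data frames into a single table
wine_df <- do.call(rbind, Wine_cellar)

# Drop the bulky columns before printing
wine_small <- wine_df[, !(names(wine_df) %in% c("href", "review"))]
head(wine_small)
```

`do.call(rbind, ...)` skips `NULL` entries, so pages that failed during the crawl simply drop out of the combined table.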
As you can see from the information displayed, we now have a complete data set of all the reviews of the wines in the Stellenbosch region. This information can now be used to analyse the traits of the wine region in a number of ways. I will explore this data in the next post and see what interesting insights we can draw from it.
I am familiar with collecting online data using the RCurl package and have been wanting to see how I can integrate the rvest package into some of the work I do. I must admit, I love exploring online data with this package, as the html_node() function makes the collection a breeze. I would definitely recommend rvest as the go-to library for any online scraping needs.