SQL Saturday statistics – Web Scraping with R and SQL Server

[This article was first published on R – TomazTsql, and kindly contributed to R-bloggers.]

I wanted to answer a simple question: how many times has a particular topic been presented, and by how many different presenters?

Sounds interesting. Tackling it should not be hard, though the final numbers may vary slightly, since some text analysis is involved.

First of all, some web scraping to get the information from the SQLSaturday web page. With the R/Python integration in SQL Server, reading the information from the website is a fairly straightforward task:

EXEC sp_execute_external_script
 @language = N'R'
 ,@script = N'
 library(rvest)
 library(XML)
 library(dplyr)

 # URL to schedule
 url_schedule <- ''http://www.sqlsaturday.com/687/Sessions/Schedule.aspx''

 # Read HTML
 webpage <- read_html(url_schedule)

 # Event schedule cells
 schedule_info <- html_nodes(webpage, ''.session-schedule-cell-info'')

 # Extract the text content; titles and speakers alternate in the node list
 ht <- html_text(schedule_info)
 df <- data.frame(data = ht, stringsAsFactors = FALSE)

 # Pair every odd row (title) with the following even row (speaker)
 df_res <- data.frame(title = c(), speaker = c())
 for (i in seq(1, nrow(df), by = 2)) {
   df_res <- rbind(df_res, data.frame(title = df$data[i], speaker = df$data[i + 1],
                                      stringsAsFactors = FALSE))
 }
 OutputDataSet <- df_res'

Python offers the BeautifulSoup library, which will do pretty much the same (or even better) job as the rvest and XML packages combined. Nevertheless, once we have the data from a test page (in this case I am reading the Slovenian SQLSaturday 2017 schedule, simply because it is awesome), we can “walk through” the whole website and generate all the needed information.

The SQLSaturday website has every event enumerated, making it very easy to parametrize the web scraping process:

[Screenshot: SQLSaturday #687 – Slovenia 2017, Sessions / Schedule page]

So we will scrape through the last 100 events by simply incrementing the event number; the input parameter will be parsed into URLs such as:

http://www.sqlsaturday.com/600/Sessions/Schedule.aspx

http://www.sqlsaturday.com/601/Sessions/Schedule.aspx

http://www.sqlsaturday.com/602/Sessions/Schedule.aspx

and so on, regardless of whether a given event page exists or not. Results will be returned to the SQL Server database.

Creating a stored procedure will do the job:

USE SqlSaturday;
GO

CREATE OR ALTER PROCEDURE GetSessions
 @eventID SMALLINT
AS

DECLARE @URL NVARCHAR(500)
SET @URL = N'http://www.sqlsaturday.com/' + CAST(@eventID AS NVARCHAR(5)) + N'/Sessions/Schedule.aspx'

PRINT @URL

DECLARE @TEMP TABLE
(
 SqlSatTitle NVARCHAR(500)
 ,SQLSatSpeaker NVARCHAR(200)
)

DECLARE @RCODE NVARCHAR(MAX)
SET @RCODE = N' 
 library(rvest)
 library(XML)
 library(dplyr)
 library(httr)
 library(curl)
 library(selectr)
 
 #URL to schedule
 url_schedule <- "'
 
DECLARE @RCODE2 NVARCHAR(MAX) 
SET @RCODE2 = N'"
 # Read HTML
 webpage <- html_session(url_schedule) %>%
            read_html()

 # Event schedule cells
 schedule_info <- html_nodes(webpage, ''.session-schedule-cell-info'')

 # Extract the text content; titles and speakers alternate in the node list
 ht <- html_text(schedule_info)
 df <- data.frame(data = ht, stringsAsFactors = FALSE)

 # Pair every odd row (title) with the following even row (speaker)
 df_res <- data.frame(title = c(), speaker = c())
 for (i in seq(1, nrow(df), by = 2)) {
   df_res <- rbind(df_res, data.frame(title = df$data[i], speaker = df$data[i + 1],
                                      stringsAsFactors = FALSE))
 }
 OutputDataSet <- df_res';

DECLARE @FINAL_RCODE NVARCHAR(MAX)
SET @FINAL_RCODE = CONCAT(@RCODE, @URL, @RCODE2)

INSERT INTO @TEMP
EXEC sp_execute_external_script
 @language = N'R'
 ,@script = @FINAL_RCODE

INSERT INTO SQLSatSessions (SqlSat, SqlSatTitle, SQLSatSpeaker)
SELECT
 @eventID AS SqlSat
 ,SqlSatTitle
 ,SQLSatSpeaker
FROM @TEMP

 

Before you run this, just a little environment setup:

USE [master];
GO

CREATE DATABASE SQLSaturday;
GO

USE SQLSaturday;
GO

CREATE TABLE SQLSatSessions
(
 id SMALLINT IDENTITY(1,1) NOT NULL
,SqlSat SMALLINT NOT NULL
,SqlSatTitle NVARCHAR(500) NOT NULL
,SQLSatSpeaker NVARCHAR(200) NOT NULL
)

 

There you go! Now you can run the stored procedure for a particular event (in this case SQLSaturday Slovenia 2017):

EXECUTE GetSessions @eventID = 687

or you can run this procedure against multiple SQLSaturday events and scrape the data from the SQLSaturday.com website in one go.
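A minimal sketch of such a batch run, looping over the event range used later in this post (SQLSat600 to SQLSat690); the WHILE loop and the TRY/CATCH guard are my additions, assuming you want a missing or malformed event page to be skipped rather than abort the whole run:

DECLARE @eventID SMALLINT = 600;

WHILE @eventID <= 690
BEGIN
    BEGIN TRY
        -- Scrape one event's schedule into SQLSatSessions
        EXECUTE GetSessions @eventID = @eventID;
    END TRY
    BEGIN CATCH
        -- Assumed guard: skip events whose schedule page fails to load or parse
        PRINT CONCAT('Skipping event ', @eventID, ': ', ERROR_MESSAGE());
    END CATCH;

    SET @eventID += 1;
END

Each successful call appends that event's sessions and speakers to the SQLSatSessions table.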

For the Slovenian SQLSaturday, I get the following list of sessions and speakers:

[Screenshot: query results with the scraped sessions and speakers for SQLSaturday #687]

Please note: if you are running this code behind a firewall or proxy, some additional proxy or firewall configuration might be needed!

So, back to the original question: how many times has Query Store been presented at SQLSaturdays (from SQLSat600 until SQLSat690)? Here is the frequency table:

[Screenshot: frequency table of Query Store sessions per SQLSaturday event]
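A table like this comes from a simple aggregation over the scraped sessions. A minimal sketch, assuming a naive keyword match on the session title (the post's actual numbers involve some additional text analysis, so counts may differ):

SELECT
 SqlSat
 ,COUNT(*) AS QueryStoreSessions
 ,COUNT(DISTINCT SQLSatSpeaker) AS DistinctSpeakers
FROM SQLSatSessions
WHERE SqlSatTitle LIKE N'%query store%' -- naive match; case-insensitive under the default collation
GROUP BY SqlSat
ORDER BY SqlSat;

The DISTINCT count also answers the second half of the original question: how many different presenters covered the topic.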

Or presented as a pandas graph:

[Graph: session statistics per topic across SQLSaturday events]

Query Store is popular, more so than all the R, Python or Azure ML topics, but PowerShell is gaining popularity like crazy. Good work, PowerShell people!

As always, the code is available on GitHub.

 

