Web Scraping to Item Response Theory – A College Football Adventure

Posted on December 4, 2015 by Educate-R - R in R bloggers

This article was first published on Educate-R - R, and kindly contributed to R-bloggers.

Web Scraping to Item Response Theory: A College Football Adventure

Brandon LeBeau, Andrew Zieffler, and Kyle Nickodem

University of Iowa & University of Minnesota

Background

Project began after Tim Brewster was fired as Minnesota's head coach
Wanted to try to predict the next great coach

Data Available

Data is available at three levels
1. Coach
2. Game by Game
3. Team

Coach

Data
- Overall record
- Team history
Not Available
- Coordinator history

Example Coach Data

##   Year Team Win Loss Tie     Pct  PF  PA Delta        coach
## 1 2010 Iowa   8    5   0 0.61538 376 221   155 Kirk Ferentz
## 2 2011 Iowa   7    6   0 0.53846 358 310    48 Kirk Ferentz
## 3 2012 Iowa   4    8   0 0.33333 232 275   -43 Kirk Ferentz
## 4 2013 Iowa   8    5   0 0.61538 342 246    96 Kirk Ferentz
## 5 2014 Iowa   7    6   0 0.53846 367 333    34 Kirk Ferentz
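
The derived columns in this output appear to follow directly from the raw counts. The sketch below rebuilds Pct and Delta from the five rows shown, assuming Pct is wins divided by games played and Delta is points for minus points against. How ties enter Pct is an assumption, since every row here has zero ties.

```r
# Coach-level rows copied from the example output above
coach <- data.frame(
  Year = 2010:2014,
  Win  = c(8, 7, 4, 8, 7),
  Loss = c(5, 6, 8, 5, 6),
  Tie  = c(0, 0, 0, 0, 0),
  PF   = c(376, 358, 232, 342, 367),
  PA   = c(221, 310, 275, 246, 333)
)

# Assumed definitions, checked only against these five rows
coach$Pct   <- round(coach$Win / (coach$Win + coach$Loss + coach$Tie), 5)
coach$Delta <- coach$PF - coach$PA
coach
```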

Game by Game

Data
- Final score of each game
- Date played
- Location
Not Available
- Within-game information (only final scores are recorded)

Example GBG Data

##    Team           Official Year       Date WL          Opponent PF PA
## 1  Iowa University of Iowa 2014  8/30/2014  W     Northern Iowa 31 23
## 2  Iowa University of Iowa 2014   9/6/2014  W     Ball St. (IN) 17 13
## 3  Iowa University of Iowa 2014  9/13/2014  L          Iowa St. 17 20
## 4  Iowa University of Iowa 2014  9/20/2014  W   Pittsburgh (PA) 24 20
## 5  Iowa University of Iowa 2014  9/27/2014  W       Purdue (IN) 24 10
## 6  Iowa University of Iowa 2014 10/11/2014  W           Indiana 45 29
## 7  Iowa University of Iowa 2014 10/18/2014  L          Maryland 31 38
## 8  Iowa University of Iowa 2014  11/1/2014  W Northwestern (IL) 48  7
## 9  Iowa University of Iowa 2014  11/8/2014  L         Minnesota 14 51
## 10 Iowa University of Iowa 2014 11/15/2014  W          Illinois 30 14
## 11 Iowa University of Iowa 2014 11/22/2014  L         Wisconsin 24 26
## 12 Iowa University of Iowa 2014 11/28/2014  L          Nebraska 34 37
## 13 Iowa University of Iowa 2014   1/2/2015  L         Tennessee 28 45
##              Location
## 1       Iowa City, IA
## 2       Iowa City, IA
## 3       Iowa City, IA
## 4      Pittsburgh, PA
## 5  West Lafayette, IN
## 6       Iowa City, IA
## 7    College Park, MD
## 8       Iowa City, IA
## 9     Minneapolis, MN
## 10      Champaign, IL
## 11      Iowa City, IA
## 12      Iowa City, IA
## 13   Jacksonville, FL

Team

Data
- Overall team record
- Team statistics
- Rankings
- Conference Affiliation
Data is very similar to that of the coach level

Web Scraping

Data were obtained from many sources
- Much from http://cfbdatawarehouse.com
- Also used Wikipedia, ESPN, and Rivals

Iowa Coaches Over Time

Iowa State Coaches Over Time

Strengths in web scraping

Data is relatively easily obtained
Structured process for obtaining data
Can be easily updated

Challenges of web scraping

At the mercy of the website
- Many sites are old
- Not up to date on current design standards
Data validation can be difficult and time-consuming
Need some basic knowledge of HTML

When is Web Scraping Worthwhile?

Best when scraping many pages
- Particularly when web addresses are not structured
Useful when data need to be updated

Not useful if only scraping a single page/table

HTML Basics

HTML is structured by start tags (e.g. <table>) and end tags (e.g. </table>)
Common tags

<h1> – <h6>
<b> <i>
<a href="http://www.google.com">
<table>

<p>
<ul> & <li>
<div>
<img>

Highly structured pages are the easiest to scrape

HTML Code Example
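
As a stand-in illustration, the sketch below holds a small hypothetical page in an R string and parses it with rvest, so the start and end tags can be seen next to the values they wrap. The page content is invented for this example and is not taken from any of the scraped sites.

```r
library(rvest)

# A hypothetical page, written inline so the example needs no network access
page_source <- '
<html>
  <body>
    <h1>Iowa Football</h1>
    <p>Head coach: <a href="/wiki/Kirk_Ferentz">Kirk Ferentz</a></p>
    <ul>
      <li>2013</li>
      <li>2014</li>
    </ul>
  </body>
</html>'

page <- read_html(page_source)

# Each start tag pairs with an end tag; the parser turns them into nested nodes
heading    <- page %>% html_nodes("h1") %>% html_text()
list_items <- page %>% html_nodes("li") %>% html_text()
link_href  <- page %>% html_nodes("a") %>% html_attr("href")
```

Here heading holds the text inside the h1 tag, list_items holds one entry per li tag, and link_href holds the address stored in the a tag's href attribute.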

Tools for web scraping

R
- rvest: http://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/
- XML: http://www.omegahat.org/RSXML/
Python
- Beautiful Soup: http://www.crummy.com/software/BeautifulSoup/
Misc
- SelectorGadget: http://selectorgadget.com/

Basics of rvest

read_html is the most basic function: it reads a page and parses it into a document
html_node or html_nodes select elements from that document
- These functions need CSS selectors or XPath expressions
- SelectorGadget is the easiest way to get these
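
A minimal sketch of the difference between html_node and html_nodes, using a hypothetical inline table rather than a live page. It also shows the same selection written once as a CSS selector and once as XPath.

```r
library(rvest)

# Hypothetical fragment standing in for a scraped page
page <- read_html('<table class="vcard">
  <tr><th>Title</th><td>Head coach</td></tr>
  <tr><th>Team</th><td>Iowa</td></tr>
</table>')

# html_node returns only the first match
first_cell <- page %>% html_node(".vcard td") %>% html_text()

# html_nodes returns every match
all_cells <- page %>% html_nodes(".vcard td") %>% html_text()

# The same cells selected with XPath instead of a CSS selector
xpath_cells <- page %>%
  html_nodes(xpath = "//table[@class='vcard']//td") %>%
  html_text()
```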

SelectorGadget

SelectorGadget is a JavaScript add-on for web browsers
Can quickly identify a CSS selector or XPath expression to select the correct portion of a web page
Demo:
- https://en.wikipedia.org/wiki/Kirk_Ferentz

Combine SelectorGadget with rvest

library(rvest)
# Download and parse the page
wiki_kirk <- read_html("https://en.wikipedia.org/wiki/Kirk_Ferentz")
# Select every td and th cell inside elements with class "vcard"
wiki_kirk_extract <- wiki_kirk %>%
    html_nodes(".vcard td , .vcard th")
head(wiki_kirk_extract)

## {xml_nodeset (6)}
## [1] <td colspan="2" style="text-align:center"><a href="/wiki/File:Kirk_p ...
## [2] <th scope="row">Sport(s)</th>
## [3] <td class="category">\n  <a href="/wiki/American_football" title="Am ...
## [4] <th colspan="2" style="text-align:center;background-color: lightgray ...
## [5] <th scope="row">Title</th>
## [6] <td>\n  <a href="/wiki/Head_coach" title="Head coach">Head coach</a> ...

Extract text

Use the html_text function

wiki_kirk_extract <- wiki_kirk %>%
  html_nodes(".vcard td , .vcard th") %>%
  html_text()
head(wiki_kirk_extract)

## [1] "\nFerentz at the 2010 Orange Bowl\n"
## [2] "Sport(s)"                           
## [3] "Football"                           
## [4] "Current position"                   
## [5] "Title"                              
## [6] "Head coach"
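
Extracted text often carries leading and trailing newlines or spaces from the page source. A small sketch, using a hypothetical inline cell, of removing them with the trim argument of html_text:

```r
library(rvest)

# Hypothetical cell with surrounding whitespace, similar to the first result above
cell <- read_html(
  "<table><tr><td>\n  Ferentz at the 2010 Orange Bowl\n</td></tr></table>"
) %>%
  html_node("td")

raw_text   <- html_text(cell)               # keeps the surrounding whitespace
clean_text <- html_text(cell, trim = TRUE)  # strips it
```

Base R's trimws(raw_text) gives the same result when the text has already been extracted.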

Encoding problems

Two functions help with encoding problems
- guess_encoding: list the likely encodings of the text, with a confidence for each
- repair_encoding: attempt to fix the encoding after the fact

wiki_kirk %>%
  html_nodes(".vcard td , .vcard th") %>%
  html_text() %>%
  guess_encoding()

##       encoding language confidence
## 1        UTF-8                1.00
## 2 windows-1252       en       0.36
## 3 windows-1250       ro       0.18
## 4 windows-1254       tr       0.13
## 5     UTF-16BE                0.10
## 6     UTF-16LE                0.10

Fix Encoding Problems

Best practice to reload page with correct encoding

wiki_kirk <- read_html("https://en.wikipedia.org/wiki/Kirk_Ferentz", 
                       encoding = 'UTF-8')

Can also repair encoding after the fact

wiki_kirk_extract <- wiki_kirk %>%
  html_nodes(".vcard td , .vcard th") %>%
  html_text() %>% 
  repair_encoding()
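
When the source encoding is already known, base R's iconv can do the conversion directly. This is a sketch under the assumption that the text arrived as Latin-1 bytes; it is an alternative to repair_encoding, which newer rvest releases are understood to deprecate.

```r
# The bytes for "café" as they would appear in a Latin-1 encoded page
latin1_text <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))

# Convert from the known source encoding to UTF-8
utf8_text <- iconv(latin1_text, from = "latin1", to = "UTF-8")
```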

Extract html tags

Use the html_name function

wiki_kirk_extract <- wiki_kirk %>%
  html_nodes(".vcard td , .vcard th") %>%
  html_name()
head(wiki_kirk_extract)

## [1] "td" "th" "td" "th" "th" "td"

Extract html attributes

Use the html_attrs function

wiki_kirk_extract <- wiki_kirk %>%
  html_nodes(".vcard td , .vcard th") %>%
  html_attrs()
head(wiki_kirk_extract)

## [[1]]
##             colspan               style 
##                 "2" "text-align:center" 
## 
## [[2]]
## scope 
## "row" 
## 
## [[3]]
##      class 
## "category" 
## 
## [[4]]
##                                          colspan 
##                                              "2" 
##                                            style 
## "text-align:center;background-color: lightgray;" 
## 
## [[5]]
## scope 
## "row" 
## 
## [[6]]
## named character(0)

Extract links

Use the html_attr function to pull a single attribute, here href

wiki_kirk_extract <- wiki_kirk %>%
  html_nodes(".vcard a") %>%
  html_attr('href')
head(wiki_kirk_extract)

## [1] "/wiki/File:Kirk_pressconference_orangebowl2010.JPG"
## [2] "/wiki/American_football"                           
## [3] "/wiki/Head_coach"                                  
## [4] "/wiki/Iowa_Hawkeyes_football"                      
## [5] "/wiki/Big_Ten_Conference"                          
## [6] "/wiki/Iowa_City,_Iowa"
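
The two attribute functions are easy to mix up. A minimal sketch on hypothetical inline links: html_attrs returns every attribute of each node as a list, while html_attr returns one named attribute as a character vector, with NA where a node lacks it.

```r
library(rvest)

# Two hypothetical links; only the first has a title attribute
links <- read_html('<p>
  <a href="/wiki/Head_coach" title="Head coach">Head coach</a>
  <a href="/wiki/Big_Ten_Conference">Big Ten</a>
</p>') %>%
  html_nodes("a")

all_attributes <- html_attrs(links)          # list of named character vectors
hrefs          <- html_attr(links, "href")   # one value per node
titles         <- html_attr(links, "title")  # NA for the second link
```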

Valid Links

The extracted links are relative; the paste0 function is helpful for prepending the site address

valid_links <- paste0('https://www.wikipedia.org', wiki_kirk_extract)
head(valid_links)

## [1] "https://www.wikipedia.org/wiki/File:Kirk_pressconference_orangebowl2010.JPG"
## [2] "https://www.wikipedia.org/wiki/American_football"                           
## [3] "https://www.wikipedia.org/wiki/Head_coach"                                  
## [4] "https://www.wikipedia.org/wiki/Iowa_Hawkeyes_football"                      
## [5] "https://www.wikipedia.org/wiki/Big_Ten_Conference"                          
## [6] "https://www.wikipedia.org/wiki/Iowa_City,_Iowa"
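
Another option, sketched here, is url_absolute from the xml2 package, which resolves relative links against the address of the page they came from. The two relative links are copied from the output above; using the page URL as the base is a choice made for this example.

```r
library(xml2)

# Address of the page the links were scraped from
page_url <- "https://en.wikipedia.org/wiki/Kirk_Ferentz"

# Two of the relative links extracted earlier
relative_links <- c("/wiki/American_football", "/wiki/Head_coach")

# Resolve each relative link against the page address
absolute_links <- url_absolute(relative_links, base = page_url)
```

This avoids typing the host by hand and also copes with links that are relative to the current directory rather than the site root.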

Extract Tables

The html_table function is useful for scraping well-formatted tables

# Find all tables with class "wikitable", keep the first, and convert it
record_kirk <- wiki_kirk %>%
  html_nodes(".wikitable") %>%
  .[[1]] %>%
  html_table(fill = TRUE)
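
To see what html_table returns without depending on the live page, here is a sketch on a small hypothetical table. The table markup is invented; only the two rows of numbers echo the coach data shown earlier.

```r
library(rvest)

# Hypothetical page containing one well-formed table
page <- read_html('<table class="wikitable">
  <tr><th>Year</th><th>Team</th><th>Win</th><th>Loss</th></tr>
  <tr><td>2013</td><td>Iowa</td><td>8</td><td>5</td></tr>
  <tr><td>2014</td><td>Iowa</td><td>7</td><td>6</td></tr>
</table>')

# html_nodes returns a set of tables; pick one before converting
tables <- page %>% html_nodes(".wikitable")
record <- html_table(tables[[1]])

# The header row becomes the column names
str(record)
```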

Caveats to Web Scraping

Keep in mind that scraping uses the website's bandwidth
- Do not want to repeatedly do expensive bandwidth operations
- Better to scrape once, then run only to update data
Some websites restrict scraping through copyright or their terms of use, so check before scraping
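
One way to scrape once and reuse the result is to cache each page on disk. The function below is a minimal sketch, not the authors' workflow: get_page, its fetch argument, and the one-second default delay are all hypothetical names and values chosen for illustration.

```r
# Return the page source for `url`, downloading only when no cached copy exists.
# `fetch` is a hypothetical hook so the download step can be swapped out.
get_page <- function(url, cache_file,
                     fetch = function(u) {
                       paste(readLines(u, warn = FALSE), collapse = "\n")
                     },
                     delay = 1) {
  if (file.exists(cache_file)) {
    # Reuse the saved copy and make no request at all
    return(paste(readLines(cache_file, warn = FALSE), collapse = "\n"))
  }
  Sys.sleep(delay)                     # pause before each real request
  page_source <- fetch(url)
  writeLines(page_source, cache_file)  # save for later runs
  page_source
}
```

The cached file can then be passed to read_html by path, and deleting it forces a fresh download when the data need updating.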

Data Modeling

Research Questions
1. Who is the next great coach?
2. What characteristics do these coaches have in common?

IRT modeling

So far we have explored the win/loss records of teams in the BCS era with item response theory (IRT)
IRT is commonly used to model assessment data to estimate item parameters and person 'ability'
We recode the Win/Loss/Tie game by game results
- 1 = Win
- 0 = Otherwise
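
The recoding can be sketched with the Iowa 2014 results from the game-by-game example. The WL values are copied from that output; the column name wingbg matches the outcome used in the model code, and treating every non-"W" value as 0 follows the rule above.

```r
# W/L results for Iowa's 13 games in 2014, in the order shown earlier
gbg <- data.frame(
  WL = c("W", "W", "L", "W", "W", "W", "L", "W", "L", "W", "L", "L", "L"),
  stringsAsFactors = FALSE
)

# 1 = win, 0 = otherwise (a loss or a tie)
gbg$wingbg <- as.integer(gbg$WL == "W")

# Total wins, which can be compared with the coach-level record
sum(gbg$wingbg)
```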

Example code with lme4

A one-parameter multilevel IRT model can be fitted using glmer in the lme4 package

library(lme4)
fm1a <- glmer(wingbg ~ 0 + (1|coach) + (1|Team), 
              data = yby_coach, family = binomial)
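
In this formula, 0 + drops the fixed intercept, and (1|coach) and (1|Team) give each coach and each team its own random intercept. The log-odds of a win is therefore the sum of a coach effect and a team effect. The sketch below shows what that implies for win probabilities; the effect values are hypothetical and are not estimates from the fitted model.

```r
# Hypothetical effects on the log-odds scale (not estimates from fm1a)
coach_effect <- c("Coach A" = 0.5, "Coach B" = -0.2)
team_effect  <- c("Team X" = 0.3, "Team Y" = -0.5)

# Win probability implied by the model: inverse logit of coach + team effect
win_prob <- function(coach, team) {
  plogis(coach_effect[[coach]] + team_effect[[team]])
}

win_prob("Coach A", "Team X")
win_prob("Coach A", "Team Y")
```

After fitting, the estimated effects for each coach and team can be pulled from the model with ranef(fm1a).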

Plot Showing Team Ability

Connect

e-mail: brandon-lebeau (at) uiowa.edu
Twitter: @blebeau11; https://twitter.com/blebeau11
Linkedin: https://www.linkedin.com/in/lebeaubr
Website: http://educate-r.org
- http://educate-r.org/2015/12/04/centraliowaruser/
