Web Scraping to Item Response Theory – A College Football Adventure

December 4, 2015
(This article was first published on Educate-R - R, and kindly contributed to R-bloggers)

Brandon LeBeau, Andrew Zieffler, and Kyle Nickodem

University of Iowa & University of Minnesota

Background

  • Began after Tim Brewster was fired
  • Wanted to try to predict the next great coach

Data Available

  • Data is available at three levels
    1. Coach
    2. Game by Game
    3. Team

Coach

  • Data
    • Overall record
    • Team history
  • Not Available
    • Coordinator history

Example Coach Data

##   Year Team Win Loss Tie     Pct  PF  PA Delta        coach
## 1 2010 Iowa   8    5   0 0.61538 376 221   155 Kirk Ferentz
## 2 2011 Iowa   7    6   0 0.53846 358 310    48 Kirk Ferentz
## 3 2012 Iowa   4    8   0 0.33333 232 275   -43 Kirk Ferentz
## 4 2013 Iowa   8    5   0 0.61538 342 246    96 Kirk Ferentz
## 5 2014 Iowa   7    6   0 0.53846 367 333    34 Kirk Ferentz

Game by Game

  • Data
    • Final score of each game
    • Date played
    • Location
  • Not Available
    • No information within a game

Example GBG Data

##    Team           Official Year       Date WL          Opponent PF PA
## 1  Iowa University of Iowa 2014  8/30/2014  W     Northern Iowa 31 23
## 2  Iowa University of Iowa 2014   9/6/2014  W     Ball St. (IN) 17 13
## 3  Iowa University of Iowa 2014  9/13/2014  L          Iowa St. 17 20
## 4  Iowa University of Iowa 2014  9/20/2014  W   Pittsburgh (PA) 24 20
## 5  Iowa University of Iowa 2014  9/27/2014  W       Purdue (IN) 24 10
## 6  Iowa University of Iowa 2014 10/11/2014  W           Indiana 45 29
## 7  Iowa University of Iowa 2014 10/18/2014  L          Maryland 31 38
## 8  Iowa University of Iowa 2014  11/1/2014  W Northwestern (IL) 48  7
## 9  Iowa University of Iowa 2014  11/8/2014  L         Minnesota 14 51
## 10 Iowa University of Iowa 2014 11/15/2014  W          Illinois 30 14
## 11 Iowa University of Iowa 2014 11/22/2014  L         Wisconsin 24 26
## 12 Iowa University of Iowa 2014 11/28/2014  L          Nebraska 34 37
## 13 Iowa University of Iowa 2014   1/2/2015  L         Tennessee 28 45
##              Location
## 1       Iowa City, IA
## 2       Iowa City, IA
## 3       Iowa City, IA
## 4      Pittsburgh, PA
## 5  West Lafayette, IN
## 6       Iowa City, IA
## 7    College Park, MD
## 8       Iowa City, IA
## 9     Minneapolis, MN
## 10      Champaign, IL
## 11      Iowa City, IA
## 12      Iowa City, IA
## 13   Jacksonville, FL

Team

  • Data
    • Overall team record
    • Team statistics
    • Rankings
    • Conference Affiliation
  • Data is very similar to that of the coach level

Web Scraping

Iowa Coaches Over Time

Iowa State Coaches Over Time

Strengths in web scraping

  • Data is relatively easily obtained
  • Structured process for obtaining data
  • Can be easily updated

Challenges of web scraping

  • At the mercy of the website
    • Many sites are old
    • Not up to date on current design standards
  • Data validation can be difficult and time-consuming
  • Need some basic knowledge of HTML

When is Web Scraping Worthwhile?

  • Best when scraping many pages
    • Particularly when web addresses are not structured
  • Useful when data need to be updated

  • Not useful if only scraping a single page/table

HTML Basics

  • HTML is structured by start tags (e.g. <table>) and end tags (e.g. </table>)
  • Common tags

    HTML Code Example
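
As a minimal sketch of how start and end tags nest, the fragment below is a hypothetical one-row table (it is not taken from any real page) using a few common tags: table, tr, th, td, and a. Parsing it with rvest and listing the tag names shows the structure a scraper works against.

```r
library(rvest)

# Hypothetical HTML fragment: a one-row table with a link inside a cell.
fragment <- '
<table class="vcard">
  <tr>
    <th scope="row">Title</th>
    <td><a href="/wiki/Head_coach">Head coach</a></td>
  </tr>
</table>'

page <- read_html(fragment)

# Each start tag (<table>, <tr>, <th>, <td>, <a>) is closed by a matching end tag.
# Select the table and everything nested inside it, then list the tag names.
tags <- page %>%
  html_nodes("table, table *") %>%
  html_name()
tags
```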


    Tools for web scraping

    Basics of rvest

    • read_html is the most basic function
    • html_node or html_nodes
      • These functions need css selectors or xpath
      • SelectorGadget is the easiest way to get this
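
A small sketch of the difference between html_node and html_nodes, and between a CSS selector and XPath. The two-row table is hypothetical and only stands in for a real page.

```r
library(rvest)

# Hypothetical two-row table used only for illustration.
page <- read_html('
<table class="vcard">
  <tr><th>Sport(s)</th><td>Football</td></tr>
  <tr><th>Title</th><td>Head coach</td></tr>
</table>')

# html_nodes() returns every match; html_node() returns only the first.
all_cells  <- page %>% html_nodes(".vcard td")
first_cell <- page %>% html_node(".vcard td")

# The same cells selected with XPath instead of a CSS selector.
xpath_cells <- page %>% html_nodes(xpath = "//table[@class='vcard']//td")

length(all_cells)      # number of <td> cells matched
html_text(first_cell)  # text of the first match
```

SelectorGadget produces the CSS form; XPath is useful when a selection depends on something CSS cannot express easily.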

    SelectorGadget

    Combine SelectorGadget with rvest

    library(rvest)
    wiki_kirk <- read_html("https://en.wikipedia.org/wiki/Kirk_Ferentz")
    wiki_kirk_extract <- wiki_kirk %>%
        html_nodes(".vcard td , .vcard th")
    head(wiki_kirk_extract)
    
    ## {xml_nodeset (6)}
    ## ...

    Extract text

    • Use the html_text function
    wiki_kirk_extract <- wiki_kirk %>%
      html_nodes(".vcard td , .vcard th") %>%
      html_text()
    head(wiki_kirk_extract)
    
    ## [1] "\nFerentz at the 2010 Orange Bowl\n"
    ## [2] "Sport(s)"                           
    ## [3] "Football"                           
    ## [4] "Current position"                   
    ## [5] "Title"                              
    ## [6] "Head coach"
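
Extracted text often carries leading and trailing whitespace such as newlines. As a hedged sketch, the trim argument of html_text strips it. The single-cell table here is a hypothetical stand-in for the Wikipedia infobox.

```r
library(rvest)

# Hypothetical cell whose text is padded with newline characters.
cell <- read_html(
  "<table><tr><td>\nFerentz at the 2010 Orange Bowl\n</td></tr></table>"
) %>%
  html_node("td")

raw_text     <- html_text(cell)               # whitespace left as parsed
trimmed_text <- html_text(cell, trim = TRUE)  # leading/trailing whitespace removed
trimmed_text
```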
    

    Encoding problems

    • Two functions help with encoding problems
      • guess_encoding: suggest likely encodings
      • repair_encoding: fix encoding problems
    wiki_kirk %>%
      html_nodes(".vcard td , .vcard th") %>%
      html_text() %>%
      guess_encoding()
    
    ##       encoding language confidence
    ## 1        UTF-8                1.00
    ## 2 windows-1252       en       0.36
    ## 3 windows-1250       ro       0.18
    ## 4 windows-1254       tr       0.13
    ## 5     UTF-16BE                0.10
    ## 6     UTF-16LE                0.10
    

    Fix Encoding Problems

    • Best practice to reload page with correct encoding
    wiki_kirk <- read_html("https://en.wikipedia.org/wiki/Kirk_Ferentz", 
                           encoding = 'UTF-8')
    
    • Can also repair encoding after the fact
    wiki_kirk_extract <- wiki_kirk %>%
      html_nodes(".vcard td , .vcard th") %>%
      html_text() %>% 
      repair_encoding()
    

    Extract html tags

    • Use the html_name function
    wiki_kirk_extract <- wiki_kirk %>%
      html_nodes(".vcard td , .vcard th") %>%
      html_name()
    head(wiki_kirk_extract)
    
    ## [1] "td" "th" "td" "th" "th" "td"
    

    Extract html attributes

    • Use the html_attrs function
    wiki_kirk_extract <- wiki_kirk %>%
      html_nodes(".vcard td , .vcard th") %>%
      html_attrs()
    head(wiki_kirk_extract)
    
    ## [[1]]
    ##             colspan               style 
    ##                 "2" "text-align:center" 
    ## 
    ## [[2]]
    ## scope 
    ## "row" 
    ## 
    ## [[3]]
    ##      class 
    ## "category" 
    ## 
    ## [[4]]
    ##                                          colspan 
    ##                                              "2" 
    ##                                            style 
    ## "text-align:center;background-color: lightgray;" 
    ## 
    ## [[5]]
    ## scope 
    ## "row" 
    ## 
    ## [[6]]
    ## named character(0)
    

    Extract links

    • Use the html_attr function to extract a single attribute, here href
    wiki_kirk_extract <- wiki_kirk %>%
      html_nodes(".vcard a") %>%
      html_attr('href')
    head(wiki_kirk_extract)
    
    ## [1] "/wiki/File:Kirk_pressconference_orangebowl2010.JPG"
    ## [2] "/wiki/American_football"                           
    ## [3] "/wiki/Head_coach"                                  
    ## [4] "/wiki/Iowa_Hawkeyes_football"                      
    ## [5] "/wiki/Big_Ten_Conference"                          
    ## [6] "/wiki/Iowa_City,_Iowa"
    

    Valid Links

    • The extracted links are relative; the paste0 function is helpful for making them absolute
    valid_links <- paste0('https://www.wikipedia.org', wiki_kirk_extract)
    head(valid_links)
    
    ## [1] "https://www.wikipedia.org/wiki/File:Kirk_pressconference_orangebowl2010.JPG"
    ## [2] "https://www.wikipedia.org/wiki/American_football"                           
    ## [3] "https://www.wikipedia.org/wiki/Head_coach"                                  
    ## [4] "https://www.wikipedia.org/wiki/Iowa_Hawkeyes_football"                      
    ## [5] "https://www.wikipedia.org/wiki/Big_Ten_Conference"                          
    ## [6] "https://www.wikipedia.org/wiki/Iowa_City,_Iowa"
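
An alternative to pasting a prefix by hand is to resolve each link against the address of the page it came from. This is a sketch using url_absolute from the xml2 package (which rvest builds on); the two links are a subset of those extracted above.

```r
library(xml2)

# Relative links of the kind returned by html_attr('href').
relative_links <- c("/wiki/American_football", "/wiki/Head_coach")

# Resolve each link against the address of the scraped page.
absolute_links <- url_absolute(
  relative_links,
  base = "https://en.wikipedia.org/wiki/Kirk_Ferentz"
)
absolute_links
```

This approach also copes with links that are already absolute, which a plain paste0 would corrupt.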
    

    Extract Tables

    • The html_table function is useful to scrape well-formatted tables
    record_kirk <- wiki_kirk %>%
      html_nodes(".wikitable") %>%
      .[[1]] %>%
      html_table(fill = TRUE)
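
To see what html_table returns without depending on a live page, here is a minimal sketch on a hypothetical table. The values are copied from the example coach data shown earlier.

```r
library(rvest)

# Hypothetical table; the numbers come from the example coach data.
page <- read_html('
<table class="wikitable">
  <tr><th>Year</th><th>Win</th><th>Loss</th></tr>
  <tr><td>2013</td><td>8</td><td>5</td></tr>
  <tr><td>2014</td><td>7</td><td>6</td></tr>
</table>')

# The header row becomes the column names; the other rows become data.
record <- page %>%
  html_node(".wikitable") %>%
  html_table()

record$Win - record$Loss  # columns are converted to numbers, so arithmetic works
```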
    

    Caveats to Web Scraping

    • Keep in mind when scraping we are using their bandwidth
      • Do not want to repeatedly do expensive bandwidth operations
      • Better to scrape once, then run only to update data
    • Some websites are copyrighted or prohibit scraping in their terms of use (i.e. scraping them may be illegal)
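
One way to scrape once and rerun only to update is to cache results on disk. The sketch below is an assumption about how that could look: scrape_cached is a hypothetical helper (not part of rvest), and fake_fetch stands in for a real scraping function so the example runs without touching any website.

```r
# scrape_cached() is a hypothetical helper, not part of rvest.
# fetch is any function that takes a URL and returns a plain R object
# (e.g. a data frame from html_table; parsed documents do not survive saveRDS).
scrape_cached <- function(url, cache_file, fetch, refresh = FALSE, delay = 1) {
  if (file.exists(cache_file) && !refresh) {
    return(readRDS(cache_file))  # reuse the saved copy, no request is made
  }
  Sys.sleep(delay)               # pause so repeated calls do not hammer the site
  result <- fetch(url)
  saveRDS(result, cache_file)
  result
}

# Stand-in fetch function that counts how often it is called.
n_requests <- 0
fake_fetch <- function(url) {
  n_requests <<- n_requests + 1
  data.frame(url = url, Win = 7, Loss = 6)
}

cache_file <- tempfile(fileext = ".rds")
first  <- scrape_cached("https://example.com/coach", cache_file, fake_fetch)
second <- scrape_cached("https://example.com/coach", cache_file, fake_fetch)
n_requests  # the second call is served from the cache
```

Passing refresh = TRUE forces a new request when the data need updating.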

    Data Modeling

    • Research Questions
      1. Who is the next great coach?
      2. What characteristics do these coaches have in common?

    IRT modeling

    • So far we have explored the win/loss records of teams in the BCS era with item response theory (IRT)
    • IRT is commonly used to model assessment data to estimate item parameters and person ‘ability’
    • We recode the Win/Loss/Tie game by game results
      • 1 = Win
      • 0 = Otherwise
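
The recoding can be sketched in one line. The rows below are modeled on the game by game example; the column name wingbg matches the model code in this post, while the data frame name gbg is an assumption.

```r
# A few rows modeled on the game by game example data.
gbg <- data.frame(
  Opponent = c("Northern Iowa", "Iowa St.", "Indiana"),
  WL       = c("W", "L", "W"),
  stringsAsFactors = FALSE
)

# 1 = Win, 0 = otherwise (a loss or a tie).
gbg$wingbg <- as.integer(gbg$WL == "W")
gbg
```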

    Example code with lme4

    • A one-parameter multilevel IRT model can be fitted using glmer in the lme4 package
    library(lme4)
    fm1a <- glmer(wingbg ~ 0 + (1|coach) + (1|Team), 
                  data = yby_coach, family = binomial)
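
To make the model formula concrete: with no fixed intercept, the log-odds of a win is the sum of the coach's and the team's random intercepts. The sketch below uses made-up effect values, not estimates from the fitted model, to show how the two combine into a win probability.

```r
# Hypothetical effects on the logit scale; not estimates from fm1a.
coach_effect <- 0.8
team_effect  <- -0.3

# The inverse logit of the summed effects gives the modeled win probability.
p_win <- plogis(coach_effect + team_effect)
p_win
```

In a fitted model, the estimated effects for each coach and team can be inspected with ranef from lme4.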
    

    Plot Showing Team Ability
