Web Scraping to Item Response Theory – A College Football Adventure

[This article was first published on Educate-R - R, and kindly contributed to R-bloggers.]

Brandon LeBeau, Andrew Zieffler, and Kyle Nickodem

University of Iowa & University of Minnesota

Background

  • The project began after Tim Brewster was fired
  • We wanted to try to predict the next great coach

Data Available

  • Data is available at three levels
    1. Coach
    2. Game by Game
    3. Team

Coach

  • Data
    • Overall record
    • Team history
  • Not Available
    • Coordinator history

Example Coach Data

##   Year Team Win Loss Tie     Pct  PF  PA Delta        coach
## 1 2010 Iowa   8    5   0 0.61538 376 221   155 Kirk Ferentz
## 2 2011 Iowa   7    6   0 0.53846 358 310    48 Kirk Ferentz
## 3 2012 Iowa   4    8   0 0.33333 232 275   -43 Kirk Ferentz
## 4 2013 Iowa   8    5   0 0.61538 342 246    96 Kirk Ferentz
## 5 2014 Iowa   7    6   0 0.53846 367 333    34 Kirk Ferentz

Game by Game

  • Data
    • Final score of each game
    • Date played
    • Location
  • Not Available
    • Information within a game

Example GBG Data

##    Team           Official Year       Date WL          Opponent PF PA
## 1  Iowa University of Iowa 2014  8/30/2014  W     Northern Iowa 31 23
## 2  Iowa University of Iowa 2014   9/6/2014  W     Ball St. (IN) 17 13
## 3  Iowa University of Iowa 2014  9/13/2014  L          Iowa St. 17 20
## 4  Iowa University of Iowa 2014  9/20/2014  W   Pittsburgh (PA) 24 20
## 5  Iowa University of Iowa 2014  9/27/2014  W       Purdue (IN) 24 10
## 6  Iowa University of Iowa 2014 10/11/2014  W           Indiana 45 29
## 7  Iowa University of Iowa 2014 10/18/2014  L          Maryland 31 38
## 8  Iowa University of Iowa 2014  11/1/2014  W Northwestern (IL) 48  7
## 9  Iowa University of Iowa 2014  11/8/2014  L         Minnesota 14 51
## 10 Iowa University of Iowa 2014 11/15/2014  W          Illinois 30 14
## 11 Iowa University of Iowa 2014 11/22/2014  L         Wisconsin 24 26
## 12 Iowa University of Iowa 2014 11/28/2014  L          Nebraska 34 37
## 13 Iowa University of Iowa 2014   1/2/2015  L         Tennessee 28 45
##              Location
## 1       Iowa City, IA
## 2       Iowa City, IA
## 3       Iowa City, IA
## 4      Pittsburgh, PA
## 5  West Lafayette, IN
## 6       Iowa City, IA
## 7    College Park, MD
## 8       Iowa City, IA
## 9     Minneapolis, MN
## 10      Champaign, IL
## 11      Iowa City, IA
## 12      Iowa City, IA
## 13   Jacksonville, FL

Team

  • Data
    • Overall team record
    • Team statistics
    • Rankings
    • Conference Affiliation
  • Data is very similar to that of the coach level

Web Scraping

Iowa Coaches Over Time

Iowa State Coaches Over Time

Strengths in web scraping

  • Data is relatively easily obtained
  • Structured process for obtaining data
  • Can be easily updated

Challenges of web scraping

  • At the mercy of the website
    • Many sites are old
    • Not up to date on current design standards
  • Data validation can be difficult and time consuming
  • Need some basic knowledge of html

When is Web Scraping Worthwhile?

  • Best when scraping many pages
    • Particularly when web addresses are not structured
  • Useful when data need to be updated

  • Not useful if only scraping a single page/table

HTML Basics

  • HTML is structured by start tags (e.g. <table>) and end tags (e.g. </table>)
  • Common tags



    HTML Code Example
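
As a minimal sketch of the start-tag/end-tag structure described above, the snippet below stores a tiny HTML document in an R string and parses it. The table values come from the coach data shown earlier; the object names (`html_snippet`, `page`, `tags`) and the `record` class are illustrative assumptions, not part of the original slides.

```r
library(rvest)

# A tiny HTML document: every start tag (<table>, <tr>, <th>, <td>)
# is matched by an end tag (</table>, </tr>, </th>, </td>).
html_snippet <- '
<html>
  <body>
    <table class="record">
      <tr><th>Year</th><th>Win</th></tr>
      <tr><td>2010</td><td>8</td></tr>
    </table>
  </body>
</html>'

# read_html() accepts a literal HTML string as well as a URL or file path.
page <- read_html(html_snippet)

# Collect the tag name of every element in the parsed document.
tags <- page %>% html_nodes("*") %>% html_name()
table(tags)
```

Counting the tag names this way is a quick check that the parser saw the nesting you expected before you start writing selectors.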


    Tools for web scraping

    Basics of rvest

    • read_html is the most basic function
    • html_node or html_nodes
      • These functions need css selectors or xpath
      • SelectorGadget is the easiest way to get this
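
The difference between `html_node` and `html_nodes`, and between CSS selectors and XPath, can be sketched on a small inline page. The HTML and the class names `coach` and `team` are made up for illustration.

```r
library(rvest)

# Illustrative page with two paragraphs (not taken from a real site).
page <- read_html('<html><body>
  <p class="coach">Kirk Ferentz</p>
  <p class="team">Iowa</p>
</body></html>')

# html_nodes() returns every match; html_node() returns only the first.
all_p   <- page %>% html_nodes("p")
first_p <- page %>% html_node("p")

# The same element selected once with a CSS selector and once with XPath.
team_css   <- page %>% html_node(css = ".team") %>% html_text()
team_xpath <- page %>% html_node(xpath = "//p[@class='team']") %>% html_text()
```

Both selector styles reach the same node here; CSS selectors are usually shorter, while XPath can express conditions that CSS cannot.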

    SelectorGadget

    Combine SelectorGadget with rvest

    library(rvest)
    wiki_kirk <- read_html("https://en.wikipedia.org/wiki/Kirk_Ferentz")
    wiki_kirk_extract <- wiki_kirk %>%
        html_nodes(".vcard td , .vcard th")
    head(wiki_kirk_extract)

    ## {xml_nodeset (6)}
    ## [1] <td colspan="2" style="text-align:center"><a href="/wiki/File:Kirk_p ...
    ## [2] <th scope="row">Sport(s)</th>
    ## [3] <td class="category">\n  <a href="/wiki/American_football" title="Am ...
    ## [4] <th colspan="2" style="text-align:center;background-color: lightgray ...
    ## [5] <th scope="row">Title</th>
    ## [6] <td>\n  <a href="/wiki/Head_coach" title="Head coach">Head coach</a> ...

    Extract text

    • Use the html_text function

    wiki_kirk_extract <- wiki_kirk %>%
      html_nodes(".vcard td , .vcard th") %>%
      html_text()
    head(wiki_kirk_extract)

    ## [1] "\nFerentz at the 2010 Orange Bowl\n"
    ## [2] "Sport(s)"                           
    ## [3] "Football"                           
    ## [4] "Current position"                   
    ## [5] "Title"                              
    ## [6] "Head coach"
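
Extracted text often carries leading and trailing whitespace such as the newlines in the first element above. A minimal sketch of removing it with the `trim` argument of `html_text`, using an inline table cell invented for illustration:

```r
library(rvest)

# Illustrative cell whose text is padded with newlines and spaces.
cell <- read_html("<table><tr><td>\n  Head coach\n</td></tr></table>") %>%
  html_node("td")

raw_text   <- html_text(cell)               # keeps surrounding whitespace
clean_text <- html_text(cell, trim = TRUE)  # strips it
```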

    Encoding problems

    • Two functions help with encoding problems
      • guess_encoding: suggests likely encodings for the text
      • repair_encoding: attempts to fix encoding problems

    wiki_kirk %>%
      html_nodes(".vcard td , .vcard th") %>%
      html_text() %>%
      guess_encoding()

    ##       encoding language confidence
    ## 1        UTF-8                1.00
    ## 2 windows-1252       en       0.36
    ## 3 windows-1250       ro       0.18
    ## 4 windows-1254       tr       0.13
    ## 5     UTF-16BE                0.10
    ## 6     UTF-16LE                0.10

    Fix Encoding Problems

    • Best practice is to reload the page with the correct encoding

    wiki_kirk <- read_html("https://en.wikipedia.org/wiki/Kirk_Ferentz", 
                           encoding = 'UTF-8')

    • Can also repair encoding after the fact

    wiki_kirk_extract <- wiki_kirk %>%
      html_nodes(".vcard td , .vcard th") %>%
      html_text() %>% 
      repair_encoding()
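
To see why declaring the encoding at read time matters, the sketch below builds the raw bytes of a small Latin-1 encoded page and parses them with the matching `encoding` argument. The page content and object names are assumptions made for illustration; the Wikipedia page above is already UTF-8.

```r
library(rvest)

# Bytes of a Latin-1 encoded page (illustrative content, not from Wikipedia).
latin1_bytes <- charToRaw(
  iconv("<html><body><p>Caf\u00e9</p></body></html>",
        from = "UTF-8", to = "latin1")
)

# Tell the parser which encoding the bytes use when reading them.
page <- read_html(latin1_bytes, encoding = "ISO-8859-1")
txt  <- page %>% html_node("p") %>% html_text()
```

With the encoding declared up front, the accented character survives parsing and no repair step is needed afterwards.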

    Extract html tags

    • Use the html_name function

    wiki_kirk_extract <- wiki_kirk %>%
      html_nodes(".vcard td , .vcard th") %>%
      html_name()
    head(wiki_kirk_extract)

    ## [1] "td" "th" "td" "th" "th" "td"

    Extract html attributes

    • Use the html_attrs function

    wiki_kirk_extract <- wiki_kirk %>%
      html_nodes(".vcard td , .vcard th") %>%
      html_attrs()
    head(wiki_kirk_extract)

    ## [[1]]
    ##             colspan               style 
    ##                 "2" "text-align:center" 
    ## 
    ## [[2]]
    ## scope 
    ## "row" 
    ## 
    ## [[3]]
    ##      class 
    ## "category" 
    ## 
    ## [[4]]
    ##                                          colspan 
    ##                                              "2" 
    ##                                            style 
    ## "text-align:center;background-color: lightgray;" 
    ## 
    ## [[5]]
    ## scope 
    ## "row" 
    ## 
    ## [[6]]
    ## named character(0)

    Extract links

    • Use the html_attr function to extract a single attribute

    wiki_kirk_extract <- wiki_kirk %>%
      html_nodes(".vcard a") %>%
      html_attr('href')
    head(wiki_kirk_extract)

    ## [1] "/wiki/File:Kirk_pressconference_orangebowl2010.JPG"
    ## [2] "/wiki/American_football"                           
    ## [3] "/wiki/Head_coach"                                  
    ## [4] "/wiki/Iowa_Hawkeyes_football"                      
    ## [5] "/wiki/Big_Ten_Conference"                          
    ## [6] "/wiki/Iowa_City,_Iowa"

    Valid Links

    • The paste0 function is helpful for this

    valid_links <- paste0('https://www.wikipedia.org', wiki_kirk_extract)
    head(valid_links)

    ## [1] "https://www.wikipedia.org/wiki/File:Kirk_pressconference_orangebowl2010.JPG"
    ## [2] "https://www.wikipedia.org/wiki/American_football"                           
    ## [3] "https://www.wikipedia.org/wiki/Head_coach"                                  
    ## [4] "https://www.wikipedia.org/wiki/Iowa_Hawkeyes_football"                      
    ## [5] "https://www.wikipedia.org/wiki/Big_Ten_Conference"                          
    ## [6] "https://www.wikipedia.org/wiki/Iowa_City,_Iowa"
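
An alternative to pasting a prefix by hand is to resolve each relative link against the URL of the page it came from. The sketch below uses `url_absolute` from the xml2 package on two of the links shown above; the object names are illustrative.

```r
library(xml2)

# Two of the relative links extracted above.
relative_links <- c("/wiki/American_football", "/wiki/Head_coach")

# The page the links were scraped from serves as the base URL.
page_url <- "https://en.wikipedia.org/wiki/Kirk_Ferentz"

absolute_links <- url_absolute(relative_links, base = page_url)
absolute_links
```

Resolving against the page URL keeps the host consistent with the page that was scraped, and it also handles links that are relative to the current directory rather than to the site root.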

    Extract Tables

    • The html_table function is useful for scraping well-formatted tables

    record_kirk <- wiki_kirk %>%
      html_nodes(".wikitable") %>%
      .[[1]] %>%
      html_table(fill = TRUE)
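
What `html_table` returns can be sketched offline with a small inline table. The two rows reuse values from the coach data shown earlier; the table markup itself is an assumption for illustration and is much simpler than the real Wikipedia table.

```r
library(rvest)

# Small, well-formed table with a header row (illustrative markup).
page <- read_html('<table class="wikitable">
  <tr><th>Year</th><th>Team</th><th>Win</th><th>Loss</th></tr>
  <tr><td>2010</td><td>Iowa</td><td>8</td><td>5</td></tr>
  <tr><td>2011</td><td>Iowa</td><td>7</td><td>6</td></tr>
</table>')

# The header cells become column names and numeric columns are converted.
record <- page %>%
  html_node(".wikitable") %>%
  html_table()
str(record)
```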

    Caveats to Web Scraping

    • Keep in mind that scraping uses the website's bandwidth
      • Avoid repeating bandwidth-expensive operations
      • Better to scrape once, then rerun only to update the data
    • Some websites prohibit scraping through copyright or terms of use, so check before scraping
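
One way to scrape once and reuse the result is to cache each page on disk. The helper below is a hypothetical sketch: `fetch_cached`, its arguments, and the one-second pause are assumptions for illustration, not part of rvest or of the original project.

```r
library(rvest)

# Hypothetical helper: download a page only if no local copy exists yet,
# then parse the local copy. `path` is where the cached HTML is stored.
fetch_cached <- function(url, path, pause = 1) {
  if (!file.exists(path)) {
    download.file(url, destfile = path, quiet = TRUE)
    Sys.sleep(pause)  # wait briefly so repeated calls do not hammer the site
  }
  read_html(path)
}

# Usage (needs network access on the first call only):
# wiki_kirk <- fetch_cached("https://en.wikipedia.org/wiki/Kirk_Ferentz",
#                           "kirk_ferentz.html")
```

Deleting the cached file forces a fresh download, which gives a simple way to update the data on demand.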

    Data Modeling

    • Research Questions
      1. Who is the next great coach?
      2. What characteristics do these coaches have in common?

    IRT modeling

    • So far we have explored the win/loss records of teams in the BCS era with item response theory (IRT)
    • IRT is commonly used to model assessment data to estimate item parameters and person ‘ability’
    • We recode the Win/Loss/Tie game by game results
      • 1 = Win
      • 0 = Otherwise
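
The recoding can be sketched on three games from the game-by-game example above. The column name `wingbg` matches the model call used later in this post; the data frame name `gbg` is an assumption.

```r
# Three games from the 2014 Iowa example shown earlier.
gbg <- data.frame(
  Opponent = c("Northern Iowa", "Iowa St.", "Maryland"),
  WL       = c("W", "L", "L"),
  stringsAsFactors = FALSE
)

# 1 = Win, 0 = Otherwise (a loss or a tie).
gbg$wingbg <- as.integer(gbg$WL == "W")
gbg
```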

    Example code with lme4

    • A one-parameter multilevel IRT model can be fitted using glmer in the lme4 package

    library(lme4)
    fm1a <- glmer(wingbg ~ 0 + (1|coach) + (1|Team), 
                  data = yby_coach, family = binomial)
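
Read literally, the formula above has no fixed effects and two crossed random intercepts. One way to write the implied model is sketched below; the notation ($y_g$, $\theta$, $\sigma$) is an assumption introduced here, not the authors' own.

```latex
% y_g = 1 if game record g is a win, 0 otherwise;
% c[g] and t[g] index the coach and team attached to record g.
\operatorname{logit} \Pr(y_g = 1) = \theta^{\text{coach}}_{c[g]} + \theta^{\text{team}}_{t[g]},
\qquad
\theta^{\text{coach}}_{c} \sim N\!\left(0, \sigma^2_{\text{coach}}\right),
\quad
\theta^{\text{team}}_{t} \sim N\!\left(0, \sigma^2_{\text{team}}\right)
```

Under this reading, the predicted random intercepts play the role of the 'ability' estimates for coaches and teams.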

    Plot Showing Team Ability
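
The plot from the slides is not reproduced here. As a rough sketch of how such a plot could be produced, the code below fits a team-only version of the model to simulated data and plots the predicted random intercepts. Everything in it (team names, number of games, the simulated abilities) is made up for illustration and is not the authors' data or result.

```r
library(lme4)

set.seed(2015)
n_teams <- 20
n_games <- 60
teams <- sprintf("Team%02d", seq_len(n_teams))
true_ability <- rnorm(n_teams, mean = 0, sd = 1)  # simulated, logit scale

# One row per simulated game; win probability depends only on the team.
sim <- data.frame(Team = rep(teams, each = n_games))
sim$wingbg <- rbinom(nrow(sim), size = 1,
                     prob = plogis(rep(true_ability, each = n_games)))

# Same structure as the model above, with a team effect only.
fit <- glmer(wingbg ~ 0 + (1 | Team), data = sim, family = binomial)

# Conditional modes of the random intercepts serve as 'ability' estimates.
team_ability <- ranef(fit)$Team
team_ability <- team_ability[order(team_ability[["(Intercept)"]]), ,
                             drop = FALSE]

dotchart(team_ability[["(Intercept)"]],
         labels = rownames(team_ability),
         xlab = "Estimated team ability (logit scale)")
```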

    Connect
