How to use nflfastR with Google BigQuery?

[This article was first published on rstats-tips.net, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Lately I wanted to play around with nflfastR. That’s a great package giving you access to NFL’s play-by-play data since 1999. It let’s you download all the data and store it in several different databases.

Unfortunately I ran into trouble when I tried to import the data to Google’s BigQuery.

Two problems

I’ve identified to problems.

My solutions

The open source way

The solution for the first bug was relative simple: Check if the table exists. If not don’t set the append parameter to TRUE when calling DBI::dbWriteTable(). (See here).

The other one was a little bit more tricky. As it turned out the new table is created by using data consisting of

  • the first season loaded (usually 1999) and
  • a default_play. Both are binded together (see here) and then given to DBI::dbWriteTable().

The default_play is somehow created by the package author.

So DBI (or the database) guesses what column type should be created for each column by this data. And that’s where the trouble starts:

The column weather is empty for seasons 1999 and 2000. So the only value provided for this column is in default_play. The value is a very long string of repeating “NA”.

As it turned out DBI::dbWriteTable() or BigQuery uses column type BYTES as the best fitting type for this value. (Why this happens is not quite clear. It depends on the length of the string. See my question here.) Unfortunately this column types can’t handle Unicode values which are used in the weather-column in seasons 2001 and younger.

So I provided two ways for fixing this issue:

  • Change the value of default_play$weather, see commit
  • Load the data in reverse order so you provide at creation time correct data, see commit

Both solutions were rejected by the package maintainers. They argue that the DBI-implementation for BigQuery is the reason for the problems. This might be true. But it doesn’t help someone who wants to use nflfastR with BigQuery.

What to do else?

So if you want to use nflfastR with BigQuery you can either use my fork of this package. I’ll try to update it when the upstream package is updated.

Pre-Import Script

But I looked for another solution tackling the main reason why the original packages fails.

The main reason the import of the data fails is that the database guesses what the best column types are for our data. I think that’s bad habit. I think it’s much better to be precise. So tell the database the correct column types for the new table. If this would be done in the original package all problems would be gone.

So here’s my script you can run before importing the data:

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
library(tidyverse)
library(DBI)


project <- "bigquery-project-name"
dataset <- "bigquery-dataset-name"
billing <- "bigquery-project-name"

table_name <- "bigquery-table-name"

# Connection for DBI
con <- DBI::dbConnect(
  bigrquery::bigquery(),
  project = project,
  dataset = dataset,
  billing = project
)



field_types <- structure(list(name = c("play_id", "game_id", "old_game_id", 
                                       "home_team", "away_team", "season_type", "week", "posteam", "posteam_type", 
                                       "defteam", "side_of_field", "yardline_100", "game_date", "quarter_seconds_remaining", 
                                       "half_seconds_remaining", "game_seconds_remaining", "game_half", 
                                       "quarter_end", "drive", "sp", "qtr", "down", "goal_to_go", "time", 
                                       "yrdln", "ydstogo", "ydsnet", "desc", "play_type", "yards_gained", 
                                       "shotgun", "no_huddle", "qb_dropback", "qb_kneel", "qb_spike", 
                                       "qb_scramble", "pass_length", "pass_location", "air_yards", "yards_after_catch", 
                                       "run_location", "run_gap", "field_goal_result", "kick_distance", 
                                       "extra_point_result", "two_point_conv_result", "home_timeouts_remaining", 
                                       "away_timeouts_remaining", "timeout", "timeout_team", "td_team", 
                                       "td_player_name", "td_player_id", "posteam_timeouts_remaining", 
                                       "defteam_timeouts_remaining", "total_home_score", "total_away_score", 
                                       "posteam_score", "defteam_score", "score_differential", "posteam_score_post", 
                                       "defteam_score_post", "score_differential_post", "no_score_prob", 
                                       "opp_fg_prob", "opp_safety_prob", "opp_td_prob", "fg_prob", "safety_prob", 
                                       "td_prob", "extra_point_prob", "two_point_conversion_prob", "ep", 
                                       "epa", "total_home_epa", "total_away_epa", "total_home_rush_epa", 
                                       "total_away_rush_epa", "total_home_pass_epa", "total_away_pass_epa", 
                                       "air_epa", "yac_epa", "comp_air_epa", "comp_yac_epa", "total_home_comp_air_epa", 
                                       "total_away_comp_air_epa", "total_home_comp_yac_epa", "total_away_comp_yac_epa", 
                                       "total_home_raw_air_epa", "total_away_raw_air_epa", "total_home_raw_yac_epa", 
                                       "total_away_raw_yac_epa", "wp", "def_wp", "home_wp", "away_wp", 
                                       "wpa", "vegas_wpa", "vegas_home_wpa", "home_wp_post", "away_wp_post", 
                                       "vegas_wp", "vegas_home_wp", "total_home_rush_wpa", "total_away_rush_wpa", 
                                       "total_home_pass_wpa", "total_away_pass_wpa", "air_wpa", "yac_wpa", 
                                       "comp_air_wpa", "comp_yac_wpa", "total_home_comp_air_wpa", "total_away_comp_air_wpa", 
                                       "total_home_comp_yac_wpa", "total_away_comp_yac_wpa", "total_home_raw_air_wpa", 
                                       "total_away_raw_air_wpa", "total_home_raw_yac_wpa", "total_away_raw_yac_wpa", 
                                       "punt_blocked", "first_down_rush", "first_down_pass", "first_down_penalty", 
                                       "third_down_converted", "third_down_failed", "fourth_down_converted", 
                                       "fourth_down_failed", "incomplete_pass", "touchback", "interception", 
                                       "punt_inside_twenty", "punt_in_endzone", "punt_out_of_bounds", 
                                       "punt_downed", "punt_fair_catch", "kickoff_inside_twenty", "kickoff_in_endzone", 
                                       "kickoff_out_of_bounds", "kickoff_downed", "kickoff_fair_catch", 
                                       "fumble_forced", "fumble_not_forced", "fumble_out_of_bounds", 
                                       "solo_tackle", "safety", "penalty", "tackled_for_loss", "fumble_lost", 
                                       "own_kickoff_recovery", "own_kickoff_recovery_td", "qb_hit", 
                                       "rush_attempt", "pass_attempt", "sack", "touchdown", "pass_touchdown", 
                                       "rush_touchdown", "return_touchdown", "extra_point_attempt", 
                                       "two_point_attempt", "field_goal_attempt", "kickoff_attempt", 
                                       "punt_attempt", "fumble", "complete_pass", "assist_tackle", "lateral_reception", 
                                       "lateral_rush", "lateral_return", "lateral_recovery", "passer_player_id", 
                                       "passer_player_name", "passing_yards", "receiver_player_id", 
                                       "receiver_player_name", "receiving_yards", "rusher_player_id", 
                                       "rusher_player_name", "rushing_yards", "lateral_receiver_player_id", 
                                       "lateral_receiver_player_name", "lateral_receiving_yards", "lateral_rusher_player_id", 
                                       "lateral_rusher_player_name", "lateral_rushing_yards", "lateral_sack_player_id", 
                                       "lateral_sack_player_name", "interception_player_id", "interception_player_name", 
                                       "lateral_interception_player_id", "lateral_interception_player_name", 
                                       "punt_returner_player_id", "punt_returner_player_name", "lateral_punt_returner_player_id", 
                                       "lateral_punt_returner_player_name", "kickoff_returner_player_name", 
                                       "kickoff_returner_player_id", "lateral_kickoff_returner_player_id", 
                                       "lateral_kickoff_returner_player_name", "punter_player_id", "punter_player_name", 
                                       "kicker_player_name", "kicker_player_id", "own_kickoff_recovery_player_id", 
                                       "own_kickoff_recovery_player_name", "blocked_player_id", "blocked_player_name", 
                                       "tackle_for_loss_1_player_id", "tackle_for_loss_1_player_name", 
                                       "tackle_for_loss_2_player_id", "tackle_for_loss_2_player_name", 
                                       "qb_hit_1_player_id", "qb_hit_1_player_name", "qb_hit_2_player_id", 
                                       "qb_hit_2_player_name", "forced_fumble_player_1_team", "forced_fumble_player_1_player_id", 
                                       "forced_fumble_player_1_player_name", "forced_fumble_player_2_team", 
                                       "forced_fumble_player_2_player_id", "forced_fumble_player_2_player_name", 
                                       "solo_tackle_1_team", "solo_tackle_2_team", "solo_tackle_1_player_id", 
                                       "solo_tackle_2_player_id", "solo_tackle_1_player_name", "solo_tackle_2_player_name", 
                                       "assist_tackle_1_player_id", "assist_tackle_1_player_name", "assist_tackle_1_team", 
                                       "assist_tackle_2_player_id", "assist_tackle_2_player_name", "assist_tackle_2_team", 
                                       "assist_tackle_3_player_id", "assist_tackle_3_player_name", "assist_tackle_3_team", 
                                       "assist_tackle_4_player_id", "assist_tackle_4_player_name", "assist_tackle_4_team", 
                                       "tackle_with_assist", "tackle_with_assist_1_player_id", "tackle_with_assist_1_player_name", 
                                       "tackle_with_assist_1_team", "tackle_with_assist_2_player_id", 
                                       "tackle_with_assist_2_player_name", "tackle_with_assist_2_team", 
                                       "pass_defense_1_player_id", "pass_defense_1_player_name", "pass_defense_2_player_id", 
                                       "pass_defense_2_player_name", "fumbled_1_team", "fumbled_1_player_id", 
                                       "fumbled_1_player_name", "fumbled_2_player_id", "fumbled_2_player_name", 
                                       "fumbled_2_team", "fumble_recovery_1_team", "fumble_recovery_1_yards", 
                                       "fumble_recovery_1_player_id", "fumble_recovery_1_player_name", 
                                       "fumble_recovery_2_team", "fumble_recovery_2_yards", "fumble_recovery_2_player_id", 
                                       "fumble_recovery_2_player_name", "sack_player_id", "sack_player_name", 
                                       "half_sack_1_player_id", "half_sack_1_player_name", "half_sack_2_player_id", 
                                       "half_sack_2_player_name", "return_team", "return_yards", "penalty_team", 
                                       "penalty_player_id", "penalty_player_name", "penalty_yards", 
                                       "replay_or_challenge", "replay_or_challenge_result", "penalty_type", 
                                       "defensive_two_point_attempt", "defensive_two_point_conv", "defensive_extra_point_attempt", 
                                       "defensive_extra_point_conv", "safety_player_name", "safety_player_id", 
                                       "season", "cp", "cpoe", "series", "series_success", "series_result", 
                                       "order_sequence", "start_time", "time_of_day", "stadium", "weather", 
                                       "nfl_api_id", "play_clock", "play_deleted", "play_type_nfl", 
                                       "special_teams_play", "st_play_type", "end_clock_time", "end_yard_line", 
                                       "fixed_drive", "fixed_drive_result", "drive_real_start_time", 
                                       "drive_play_count", "drive_time_of_possession", "drive_first_downs", 
                                       "drive_inside20", "drive_ended_with_score", "drive_quarter_start", 
                                       "drive_quarter_end", "drive_yards_penalized", "drive_start_transition", 
                                       "drive_end_transition", "drive_game_clock_start", "drive_game_clock_end", 
                                       "drive_start_yard_line", "drive_end_yard_line", "drive_play_id_started", 
                                       "drive_play_id_ended", "away_score", "home_score", "location", 
                                       "result", "total", "spread_line", "total_line", "div_game", "roof", 
                                       "surface", "temp", "wind", "home_coach", "away_coach", "stadium_id", 
                                       "game_stadium", "aborted_play", "success", "passer", "passer_jersey_number", 
                                       "rusher", "rusher_jersey_number", "receiver", "receiver_jersey_number", 
                                       "pass", "rush", "first_down", "special", "play", "passer_id", 
                                       "rusher_id", "receiver_id", "name", "jersey_number", "id", "fantasy_player_name", 
                                       "fantasy_player_id", "fantasy", "fantasy_id", "out_of_bounds", 
                                       "home_opening_kickoff", "qb_epa", "xyac_epa", "xyac_mean_yardage", 
                                       "xyac_median_yardage", "xyac_success", "xyac_fd", "xpass", "pass_oe"
), type = c("FLOAT", "STRING", "STRING", "STRING", "STRING", 
            "STRING", "INTEGER", "STRING", "STRING", "STRING", "STRING", 
            "FLOAT", "STRING", "FLOAT", "FLOAT", "FLOAT", "STRING", "FLOAT", 
            "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "STRING", "STRING", 
            "FLOAT", "FLOAT", "STRING", "STRING", "FLOAT", "FLOAT", "FLOAT", 
            "FLOAT", "FLOAT", "FLOAT", "FLOAT", "STRING", "STRING", "FLOAT", 
            "FLOAT", "STRING", "STRING", "STRING", "FLOAT", "STRING", "STRING", 
            "FLOAT", "FLOAT", "FLOAT", "STRING", "STRING", "STRING", "STRING", 
            "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", 
            "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", 
            "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", 
            "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", 
            "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", 
            "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", 
            "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", 
            "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", 
            "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", 
            "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", 
            "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", 
            "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", 
            "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", 
            "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", 
            "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", 
            "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", 
            "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "STRING", "STRING", 
            "FLOAT", "STRING", "STRING", "FLOAT", "STRING", "STRING", "FLOAT", 
            "STRING", "STRING", "FLOAT", "STRING", "STRING", "FLOAT", "STRING", 
            "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", 
            "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", 
            "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", 
            "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", 
            "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", 
            "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", 
            "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", 
            "STRING", "STRING", "STRING", "STRING", "FLOAT", "STRING", "STRING", 
            "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", 
            "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", 
            "STRING", "FLOAT", "STRING", "STRING", "STRING", "FLOAT", "STRING", 
            "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", 
            "STRING", "FLOAT", "STRING", "STRING", "STRING", "FLOAT", "FLOAT", 
            "STRING", "STRING", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "STRING", 
            "STRING", "INTEGER", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "STRING", 
            "FLOAT", "STRING", "STRING", "STRING", "STRING", "STRING", "STRING", 
            "FLOAT", "STRING", "FLOAT", "STRING", "STRING", "STRING", "FLOAT", 
            "STRING", "STRING", "FLOAT", "STRING", "FLOAT", "FLOAT", "FLOAT", 
            "FLOAT", "FLOAT", "FLOAT", "STRING", "STRING", "STRING", "STRING", 
            "STRING", "STRING", "FLOAT", "FLOAT", "INTEGER", "INTEGER", "STRING", 
            "INTEGER", "INTEGER", "FLOAT", "FLOAT", "INTEGER", "STRING", 
            "STRING", "INTEGER", "INTEGER", "STRING", "STRING", "STRING", 
            "STRING", "FLOAT", "FLOAT", "STRING", "INTEGER", "STRING", "INTEGER", 
            "STRING", "INTEGER", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", 
            "STRING", "STRING", "STRING", "STRING", "INTEGER", "STRING", 
            "STRING", "STRING", "STRING", "STRING", "FLOAT", "FLOAT", "FLOAT", 
            "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT", "FLOAT"
)), class = "data.frame", row.names = c(NA, -372L))

data_source <- bigrquery::bq_table(project, dataset, table_name)

# create table in BigQuery
bigrquery::bq_table_create(data_source, 
                           fields = bigrquery::as_bq_fields(
                             field_types %>% purrr::pmap(list)
                             )
                           )

This might be a little bit tedious but it’s precise. The table is created in the way we want it to be.

Now we can use the original package and import all the data.

1
2
3
4
5
# Load data in BigQuery using nflfastR
nflfastR::update_db(
  tblname = table_name,
  db_connection = con
)
To leave a comment for the author, please follow the link and comment on their blog: rstats-tips.net.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)