How I won an In-Class Kaggle Competition
During my 4th year as an undergraduate student in statistics, I took a Big Data class where we had to compete in a Kaggle competition.
This competition was about imputing missing values in a time series weather data set acquired from 76 different weather stations in the United Kingdom. The data set consisted of 1,880,892 rows and 11 columns. In total, there were 2,074,873 missing values, of which we only had to predict 29,000. Submissions were scored by mean absolute error. Our approaches to missing data imputation included linear interpolation, random forests, and linear regression. In the end, however, the crucial part of this competition was figuring out which NA values to fill by linear interpolation and which ones to predict with a model.
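For reference, the metric is easy to compute yourself. Here is a minimal sketch in R, where `actual` and `predicted` are hypothetical vectors standing in for the hidden ground truth and a submission:

```r
# mean absolute error, the competition metric
mae <- function(actual, predicted) {
  mean(abs(actual - predicted))
}

# toy example with made-up numbers
mae(actual = c(0.5, -1.2, 0.3), predicted = c(0.4, -1.0, 0.5))
## [1] 0.1666667
```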
Data Preparation
```r
# reading in training set
train <- readr::read_csv(here::here("data/train.csv")) %>%
  # create new column with row number
  # important for looking up missing values
  # in the training set that are in the test set
  dplyr::mutate(ID = dplyr::row_number()) %>%
  # some values are abnormally large. Hence,
  # we decided to impute NA for outliers
  dplyr::mutate_at(vars(WINDDIR, WINDSPEED, TEMPERATURE, DEWPOINT, PRESSURE),
    .funs = ~ base::ifelse((. > 4 | . < -4), NA, .)
  )

# reading in testing set
test <- readr::read_csv(here::here("data/test.csv"))

head(train)
## # A tibble: 6 × 12
##     USAF  YEAR MONTH DAY   HOUR  MINUTE WINDDIR WINDSPEED TEMPERATURE DEWPOINT
##    <dbl> <dbl> <chr> <chr> <chr> <chr>    <dbl>     <dbl>       <dbl>    <dbl>
## 1 992700  2008 01    00    00    00      NA        0.0555       0.441    0.874
## 2 992700  2008 01    00    00    00      -0.246   NA            0.441    0.760
## 3 992700  2008 01    00    00    00      -0.246   -0.0529      NA        0.760
## 4 992700  2008 01    00    00    00      -0.246    0.597        0.305   NA
## 5 992700  2008 01    00    00    00      -0.142    0.489        0.305    0.532
## 6 992700  2008 01    00    00    00      -0.142    1.14         0.168    0.418
## # … with 2 more variables: PRESSURE <dbl>, ID <int>

head(test)
## # A tibble: 6 × 1
##   ID
##   <chr>
## 1 374076-5
## 2 1327550-2
## 3 450121-4
## 4 1697534-3
## 5 952016-5
## 6 1772604-1
```
In the code above, we replace all outliers, i.e., values outside the range -4 to 4, with NA.
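As a quick sanity check (not part of the original submission pipeline), you could count how many NA values each weather variable now contains:

```r
# count NA values per weather variable after outlier removal
# (a diagnostic sketch, not part of the competition code)
train %>%
  dplyr::summarise_at(
    dplyr::vars(WINDDIR, WINDSPEED, TEMPERATURE, DEWPOINT, PRESSURE),
    .funs = ~ sum(is.na(.))
  )
```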
```r
# interpolation beats any model for small gaps of NA values
# therefore, we are imputing values by interpolation for gaps of 5 or smaller
train %>%
  dplyr::mutate_at(vars(WINDDIR, WINDSPEED, TEMPERATURE, DEWPOINT, PRESSURE),
    .funs = ~ imputeTS::na_interpolation(., maxgap = 5)
  ) -> train

head(train)
## # A tibble: 6 × 12
##     USAF  YEAR MONTH DAY   HOUR  MINUTE WINDDIR WINDSPEED TEMPERATURE DEWPOINT
##    <dbl> <dbl> <chr> <chr> <chr> <chr>    <dbl>     <dbl>       <dbl>    <dbl>
## 1 992700  2008 01    00    00    00      -0.246   0.0555        0.441    0.874
## 2 992700  2008 01    00    00    00      -0.246   0.00131       0.441    0.760
## 3 992700  2008 01    00    00    00      -0.246  -0.0529        0.373    0.760
## 4 992700  2008 01    00    00    00      -0.246   0.597         0.305    0.646
## 5 992700  2008 01    00    00    00      -0.142   0.489         0.305    0.532
## 6 992700  2008 01    00    00    00      -0.142   1.14          0.168    0.418
## # … with 2 more variables: PRESSURE <dbl>, ID <int>
```
In the code above, we apply linear interpolation wherever a gap contains no more than 5 consecutive NA values.
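To illustrate how `maxgap` behaves, here is a small sketch on a made-up vector: gaps of at most 2 consecutive NA values are interpolated, while the longer gap is left untouched.

```r
library(imputeTS)

x <- c(1, NA, NA, 4, NA, NA, NA, NA, NA, NA, 11)

# only gaps with at most 2 consecutive NAs get interpolated;
# the long 6-NA gap stays as it is
imputeTS::na_interpolation(x, maxgap = 2)
## [1]  1  2  3  4 NA NA NA NA NA NA 11
```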
Building the Test Set
```r
test %>%
  tidyr::separate(col = ID, into = c("row", "col"), sep = "-") %>%
  dplyr::mutate_all(.funs = as.numeric) %>%
  dplyr::mutate(column = dplyr::case_when(
    col == 1 ~ "WINDDIR",
    col == 2 ~ "WINDSPEED",
    col == 3 ~ "TEMPERATURE",
    col == 4 ~ "DEWPOINT",
    col == 5 ~ "PRESSURE"
  )) -> to_be_predicted

to_be_predicted %>%
  dplyr::mutate(value = NA) -> to_be_predicted

head(to_be_predicted)
## # A tibble: 6 × 4
##       row   col column      value
##     <dbl> <dbl> <chr>       <lgl>
## 1  374076     5 PRESSURE    NA
## 2 1327550     2 WINDSPEED   NA
## 3  450121     4 DEWPOINT    NA
## 4 1697534     3 TEMPERATURE NA
## 5  952016     5 PRESSURE    NA
## 6 1772604     1 WINDDIR     NA
```
We separated the ID column into a row index and a column index, and added the actual column name to the data set.
Now we will fill the to_be_predicted data frame with the interpolated values from the previous step. We use a for loop that goes over every row of the to_be_predicted data frame and then uses the row and column values to look up the corresponding value in the train data set.
```r
# fill out missing values in df to_be_predicted
# by the interpolated values in the training data set
for (i in 1:nrow(to_be_predicted)) {
  to_be_predicted$value[i] <- train[to_be_predicted$row[i], to_be_predicted$column[i]] %>%
    dplyr::pull()
}

head(to_be_predicted)
## # A tibble: 6 × 4
##       row   col column       value
##     <dbl> <dbl> <chr>        <dbl>
## 1  374076     5 PRESSURE     0.293
## 2 1327550     2 WINDSPEED   -0.775
## 3  450121     4 DEWPOINT     1.16
## 4 1697534     3 TEMPERATURE  0.168
## 5  952016     5 PRESSURE     0.466
## 6 1772604     1 WINDDIR     -1.69

# missing values left
sum(is.na(to_be_predicted$value))
## [1] 3461
```
After pulling the interpolated values into the test data set, we still have 3,461 values left to fill.
```r
# breakdown of missing values
to_be_predicted %>%
  dplyr::filter(is.na(value)) %>%
  dplyr::pull(column) %>%
  table()
## .
##    DEWPOINT    PRESSURE TEMPERATURE     WINDDIR   WINDSPEED
##         517         692         868         595         789
```
Next, we create two columns in the to_be_predicted data set that record how many consecutive NA values sit directly above, and how many sit directly below, each missing value we need to predict.
Linear Interpolation Versus Regression
```r
df <- to_be_predicted %>%
  dplyr::filter(is.na(value)) %>%
  dplyr::mutate(
    above = NA,
    below = NA
  )

### Functions implemented in C++ to reduce run time
### by about 10 min compared to R loops
Rcpp::sourceCpp(here::here("scripts/missing_gap.cpp"))

# count how many missing values there are above and below
# each target missing value in the test set
for (i in c("WINDDIR", "TEMPERATURE", "DEWPOINT", "PRESSURE", "WINDSPEED")) {
  missing_below(
    df$row,
    df$column,
    train[, i] %>% dplyr::pull(),
    df$below,
    i
  ) -> df$below
}

for (i in c("WINDDIR", "TEMPERATURE", "DEWPOINT", "PRESSURE", "WINDSPEED")) {
  missing_above(
    df$row,
    df$column,
    train[, i] %>% dplyr::pull(),
    df$above,
    i
  ) -> df$above
}

head(df)
## # A tibble: 6 × 6
##       row   col column      value above below
##     <dbl> <dbl> <chr>       <dbl> <dbl> <dbl>
## 1 1044015     2 WINDSPEED      NA   100   100
## 2  779897     5 PRESSURE       NA   100   100
## 3  123439     1 WINDDIR        NA   100   100
## 4 1124760     3 TEMPERATURE    NA   100   100
## 5 1240458     5 PRESSURE       NA   100   100
## 6  995863     1 WINDDIR        NA   100   100
```
If the above or the below count is 6 or fewer, we use linear interpolation, because the NA value is still very close in time to a known value.
```r
df %>%
  dplyr::filter(is.na(value)) %>%
  dplyr::filter(above > 6 & below > 6) -> df

nrow(df)
## [1] 3161
```
Here are the C++ functions we used.
```cpp
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector missing_below(Rcpp::NumericVector row_number,
                            Rcpp::CharacterVector column_name,
                            Rcpp::NumericVector train,
                            Rcpp::NumericVector below,
                            Rcpp::String var) {
  int n = row_number.length();
  int counter;

  for(int i = 0; i <= n - 1; i++) {
    if(column_name[i] == var) {
      counter = 0;
      while(Rcpp::NumericVector::is_na(train[row_number[i] - counter - 1])) {
        counter = counter + 1;
        if(counter > 100) {
          break;
        }
      }
      below[i] = counter - 1;
    }
  }
  return(below);
}

// [[Rcpp::export]]
NumericVector missing_above(Rcpp::NumericVector row_number,
                            Rcpp::CharacterVector column_name,
                            Rcpp::NumericVector train,
                            Rcpp::NumericVector above,
                            Rcpp::String var) {
  int n = row_number.length();
  int counter;

  for(int i = 0; i <= n - 1; i++) {
    if(column_name[i] == var) {
      counter = 0;
      while(Rcpp::NumericVector::is_na(train[row_number[i] + counter - 1])) {
        counter = counter + 1;
        if(counter > 100) {
          break;
        }
      }
      above[i] = counter - 1;
    }
  }
  return(above);
}
```
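For intuition, the same counting can be done in plain R with `rle()`, which is also why the Rcpp version was worth writing: the R version is far slower over millions of rows. A minimal sketch of the idea (the function name and toy vector are mine, not from the original repository):

```r
# length of the uninterrupted NA run directly before position `pos`
# (a plain-R sketch of what missing_below computes per target value)
consecutive_na_before <- function(x, pos) {
  runs <- rle(is.na(x[seq_len(pos - 1)]))
  n <- length(runs$values)
  if (n > 0 && runs$values[n]) runs$lengths[n] else 0
}

x <- c(1, NA, NA, NA, 2, NA, 3)
consecutive_na_before(x, 5) # three NAs sit directly above position 5
## [1] 3
```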
Linear Regression Criteria
There are still 3,161 missing values that we need to predict with regression. For the other 300 values, we can use linear interpolation. First we do the regression, then the linear interpolation.
The linear regression is a little tricky because there might be multiple missing values within a row. If we want to predict an NA value for WINDSPEED, for example, and all four other columns have NA values in that row, then we cannot run a regression because there are no predictors. Therefore, we need to identify which columns are missing in the row where we need to predict a missing value.
The code below takes care of that. We are fitting around 3,000 separate linear models:
– First, we identify the row of the missing value we need to predict.
– Next, we check whether the predictors also have NA values in that row. If a predictor does, we remove it from the regression.
– If all predictors have NA values, we fall back to mean imputation using the data from the same station and the same month as the NA value.
– I decided to use only data from the same station in the regression. So next, I check the percentage of missing values within the column of each predictor. If a column has more than 40% missing values, it is removed as a predictor, because otherwise we would lose too many observations for the regression.
– If all of the predictor columns have more than 40% missing values, we simply use the mean of the response column as the prediction.
– Finally, if we are left with more than 500 observations, we fit a linear regression.
The thresholds of 500 observations and 40% missing values were chosen arbitrarily.
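The core trick in the loop below is that the regression formula is built dynamically from whatever predictors survive these checks. A stripped-down sketch of that single step (the variable values are illustrative):

```r
# hypothetical example: TEMPERATURE is the response and only
# DEWPOINT and PRESSURE survived the missingness checks
response <- "TEMPERATURE"
predictor_vars <- "DEWPOINT + PRESSURE"

f <- as.formula(paste0(response, " ~ MONTH + ", predictor_vars))
f
## TEMPERATURE ~ MONTH + DEWPOINT + PRESSURE
```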
```r
###################################
### Fitting ~3000 Linear Models ###
###################################

# keep rows for which regression model failed
error_row <- c()

# keep r squared
r_squared <- c()

# for each target missing value, determine which other predictors in the row
# of the target missing value are missing as well, so we can exclude
# these predictors from our regression; this also ensures we choose
# as many predictors as possible
for (i in 1:nrow(df)) {

  # pull out predictors which are not missing in the response variable row
  train[df$row[[i]], ] %>%
    dplyr::select(WINDDIR:PRESSURE, -c(df$column[[i]], which(is.na(.)))) %>%
    names() -> names_pred

  # if all predictors in the row are missing, use mean imputation
  if (length(names_pred) == 0) {
    train %>%
      dplyr::filter(USAF == train[df$row[[i]], ]$USAF[[1]] &
        MONTH == train[df$row[[i]], ]$MONTH[[1]]) %>%
      {
        mean(.[, df$column[[i]]] %>%
          dplyr::pull(),
        na.rm = TRUE
        )
      } -> df[i, "value"]
    next
  }

  # check if predictors have more than 40% missing values in their column
  # if they do, discard that particular predictor
  train %>%
    dplyr::filter(USAF == train$USAF[[df$row[[i]]]]) %>%
    dplyr::select(names_pred) %>%
    dplyr::summarise_all(.funs = ~ (sum(is.na(.)) / length(.))) %>%
    dplyr::distinct() %>%
    tidyr::gather() %>%
    dplyr::filter(value < 0.40) %>%
    dplyr::pull(key) %>%
    paste(collapse = " + ") -> predictor_vars

  # if there are no predictor_vars because there are more than 40%
  # missing values in each predictor column, then just impute the
  # average of the response column by station
  if (predictor_vars == "") {
    train %>%
      dplyr::filter(USAF == train[df$row[[i]], ]$USAF[[1]] &
        MONTH == train[df$row[[i]], ]$MONTH[[1]]) %>%
      {
        mean(.[, df$column[[i]]] %>%
          dplyr::pull(),
        na.rm = TRUE
        )
      } -> df[i, "value"]
    next
  }

  # if there are enough observations (500 or more) for a particular station,
  # we do a linear regression by station
  if (train %>%
    dplyr::select(c(
      USAF, df$column[[i]],
      stringr::str_split(predictor_vars, pattern = "\\+") %>%
        unlist() %>%
        stringi::stri_trim_both()
    )) %>%
    dplyr::filter(USAF == train$USAF[[df$row[[i]]]]) %>%
    na.omit() %>%
    nrow() > 500) {

    # linear regression formula
    formula <- paste0(df$column[[i]], " ~ MONTH + ", predictor_vars)
    formula <- as.formula(formula)

    # linear model
    lm(formula,
      data = train %>%
        dplyr::filter(USAF == train$USAF[[df$row[[i]]]])
    ) -> lm_model

    # tryCatch for when the MONTH factor level in the prediction row
    # does not appear in the training data
    tryCatch(
      expr = {
        predicted_value <- predict(lm_model, newdata = train[df$row[[i]], ])
        df[i, "value"] <- predicted_value
      },
      error = function(error) {
        error_row <<- c(error_row, i)
        df[i, "value"] <<- NA
        message(paste("Caught an error in row", i))
      }
    )
  }

  # if there are not enough observations for a particular station,
  # we use the entire training data set with the station as a predictor
  else {
    formula <- paste0(df$column[[i]], " ~ MONTH + as.factor(USAF) + ", predictor_vars)
    formula <- as.formula(formula)

    lm(formula,
      data = train
    ) -> lm_model

    # tryCatch for when the MONTH factor level in the prediction row
    # does not appear in the training data
    tryCatch(
      expr = {
        predicted_value <- predict(lm_model, newdata = train[df$row[[i]], ])
        df[i, "value"] <- predicted_value
      },
      error = function(error) {
        error_row <<- c(error_row, i)
        df[i, "value"] <<- NA
        message(paste("Caught an error in row", i))
      }
    )
  }

  # save r squared in case we want to deal with missing values for
  # low r squared models separately
  r_squared[[i]] <- summary(lm_model)$r.squared
  names(r_squared)[i] <- i
}

train %>%
  dplyr::select(WINDDIR:PRESSURE) %>%
  na.omit() %>%
  cor()
##                 WINDDIR   WINDSPEED TEMPERATURE    DEWPOINT    PRESSURE
## WINDDIR      1.00000000  0.07615423 -0.01390792 -0.03722569 -0.02001083
## WINDSPEED    0.07615423  1.00000000 -0.13940574 -0.09789144 -0.31549858
## TEMPERATURE -0.01390792 -0.13940574  1.00000000  0.89136749  0.05715878
## DEWPOINT    -0.03722569 -0.09789144  0.89136749  1.00000000  0.01802579
## PRESSURE    -0.02001083 -0.31549858  0.05715878  0.01802579  1.00000000
```
There is a strong linear correlation between TEMPERATURE and DEWPOINT (about 0.89). Therefore, for the rows where the previous model failed, we fit a linear regression between just these two columns.
```r
df[error_row, ] %>%
  dplyr::filter(column == "DEWPOINT" | column == "TEMPERATURE") -> temp_dew_df

r <- c()
for (i in 1:nrow(temp_dew_df)) {
  if (temp_dew_df$column[[i]] == "TEMPERATURE") {
    lm(TEMPERATURE ~ DEWPOINT + as.factor(MONTH),
      data = train %>%
        dplyr::filter(USAF == train[temp_dew_df$row[[i]], ]$USAF)
    ) -> lm_model

    predicted_value <- predict(lm_model, newdata = train[temp_dew_df$row[[i]], ])
    temp_dew_df[i, "value"] <- predicted_value
    r[[i]] <- summary(lm_model)$r.squared
  } else {
    lm(DEWPOINT ~ TEMPERATURE + as.factor(MONTH),
      data = train %>%
        dplyr::filter(USAF == train[temp_dew_df$row[[i]], ]$USAF)
    ) -> lm_model

    predicted_value <- predict(lm_model, newdata = train[temp_dew_df$row[[i]], ])
    temp_dew_df[i, "value"] <- predicted_value
    r[[i]] <- summary(lm_model)$r.squared
  }
}

# join imputed values with test data set
df %>%
  tidyr::unite(together, c("row", "column")) %>%
  dplyr::left_join(
    temp_dew_df %>%
      tidyr::unite(together, c("row", "column")) %>%
      dplyr::select(together, value),
    by = "together"
  ) %>%
  dplyr::mutate(value.x = ifelse(!(is.na(value.y)), value.y, value.x)) %>%
  dplyr::rename(value = value.x) %>%
  tidyr::separate(together, into = c("row", "column"), remove = FALSE) %>%
  dplyr::select(-c(value.y, together)) -> df

to_be_predicted %>%
  tidyr::unite(together, c("row", "column")) %>%
  dplyr::left_join(df %>%
    tidyr::unite(together, c("row", "column")) %>%
    dplyr::select(together, value),
  by = "together"
  ) %>%
  dplyr::mutate(value.x = ifelse(!(is.na(value.y)), value.y, value.x)) %>%
  dplyr::rename(value = value.x) %>%
  tidyr::separate(together, into = c("row", "column"), remove = FALSE) %>%
  dplyr::select(-c(value.y, together)) -> to_be_predicted

# interpolation for the rest of the missing NAs
train %>%
  dplyr::mutate_at(vars(WINDDIR, WINDSPEED, TEMPERATURE, DEWPOINT, PRESSURE),
    .funs = ~ imputeTS::na_interpolation(.)
  ) -> train

for (i in 1:nrow(to_be_predicted)) {
  if (is.na(to_be_predicted$value[[i]])) {
    to_be_predicted[i, "value"] <- train[to_be_predicted$row[i], to_be_predicted$column[i]] %>%
      dplyr::pull()
  }
}

to_be_predicted %>%
  dplyr::rename(ID = row, Value = value) %>%
  tidyr::unite(col = ID, ID, col, sep = "-", remove = FALSE) %>%
  dplyr::select(-column, -col) -> predicted_values

readr::write_csv(predicted_values, "predictions/interpolation_lm.csv")
```
The interpolation_lm.csv file gives the lowest error on the leaderboard.
Final Thoughts
- Here is a link to the GitHub repository: Weather Kaggle Competition
- There is also a Python implementation of the code, and a link to the Kaggle competition here.