# Preprocessing and resampling using #TidyTuesday college data

March 9, 2020
By Julia Silge


I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first getting started to how to tune machine learning models. Today, I’m using this week’s `#TidyTuesday` dataset on college tuition and diversity at US colleges to show some data preprocessing steps and how to use resampling!

Here is the code I used in the video, for those who prefer reading instead of or in addition to video.

## Explore the data

Our modeling goal here is to predict which US colleges have higher proportions of minority students based on college data such as tuition from the #TidyTuesday dataset. There are several related datasets this week, and this modeling analysis uses two of them.

```r
library(tidyverse)

# Data from the rfordatascience/tidytuesday repository (week of 2020-03-10)
tuition_cost <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-10/tuition_cost.csv")

diversity_raw <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-10/diversity_school.csv")
```

What is the distribution of total minority student population?

```r
diversity_school <- diversity_raw %>%
  filter(category == "Total Minority") %>%
  mutate(TotalMinority = enrollment / total_enrollment)

diversity_school %>%
  ggplot(aes(TotalMinority)) +
  geom_histogram(alpha = 0.7, fill = "midnightblue") +
  scale_x_continuous(labels = scales::percent_format()) +
  labs(x = "% of student population who identifies as any minority")
```

The median proportion of minority students for this dataset is 30%.

Let’s build a dataset for modeling, joining the two dataframes we have. Let’s also move from individual states in the US to US regions, as found in `state.region`.
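The `state.name` and `state.region` vectors are built into R and line up element by element, so pairing them gives a lookup table from state to US Census region. A quick sketch:

```r
library(tibble)

# state.name and state.region are built-in R datasets that line up
# element by element, giving a state-to-region lookup table
region_lookup <- tibble(state = state.name, region = state.region)

head(region_lookup, 3)
#>   state   region
#> 1 Alabama South
#> 2 Alaska  West
#> 3 Arizona West
```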

```r
university_df <- diversity_school %>%
  transmute(
    diversity = case_when(
      TotalMinority > 0.3 ~ "high",
      TRUE ~ "low"
    ),
    name, state,
    total_enrollment
  ) %>%
  inner_join(tuition_cost %>%
    select(
      name, type, degree_length,
      in_state_tuition:out_of_state_total
    )) %>%
  left_join(tibble(state = state.name, region = state.region)) %>%
  select(-state, -name) %>%
  mutate_if(is.character, factor)

skimr::skim(university_df)
```
```
## ── Data Summary ────────────────────────
##                            Values
## Name                       university_df
## Number of rows             2159
## Number of columns          9
## _______________________
## Column type frequency:
##   factor                   4
##   numeric                  5
## ________________________
## Group variables            None
##
## ── Variable type: factor ───────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique
## 1 diversity             0             1 FALSE          2
## 2 type                  0             1 FALSE          3
## 3 degree_length         0             1 FALSE          2
## 4 region                0             1 FALSE          4
##   top_counts
## 1 low: 1241, hig: 918
## 2 Pub: 1145, Pri: 955, For: 59
## 3 4 Y: 1296, 2 Y: 863
## 4 Sou: 774, Nor: 543, Nor: 443, Wes: 399
##
## ── Variable type: numeric ──────────────────────────────────────────────────────
##   skim_variable        n_missing complete_rate   mean     sd    p0   p25   p50
## 1 total_enrollment             0             1  6184.  8264.    15  1352  3133
## 2 in_state_tuition             0             1 17044. 15461.   480  4695 10161
## 3 in_state_total               0             1 23545. 19782.   962  5552 17749
## 4 out_of_state_tuition         0             1 20798. 13725.   480  9298 17045
## 5 out_of_state_total           0             1 27299. 18221.  1376 11018 23036
##      p75  p100 hist
## 1  7644. 81459 ▇▁▁▁▁
## 2 28780  59985 ▇▂▂▁▁
## 3 38519  75003 ▇▅▂▂▁
## 4 29865  59985 ▇▆▅▂▁
## 5 40154  75003 ▇▅▅▂▁
```

How are some of these quantities related to the proportion of minority students at a college?

```r
university_df %>%
  ggplot(aes(type, in_state_tuition, fill = diversity)) +
  geom_boxplot(alpha = 0.8) +
  scale_y_continuous(labels = scales::dollar_format()) +
  labs(x = NULL, y = "In-State Tuition", fill = "Diversity")
```

## Build models with recipes

Now it is time for modeling! First, we split our data into training and testing sets. Then, we build a recipe for data preprocessing.
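A side note on the `strata` argument used below: stratified splitting keeps the outcome's class balance roughly the same in the training and testing sets. A minimal sketch on made-up data (the `toy` dataframe here is hypothetical, not part of this analysis):

```r
library(rsample)

# Hypothetical toy data with a 30/70 class split
toy <- data.frame(class = factor(rep(c("high", "low"), times = c(30, 70))))

set.seed(1234)
toy_split <- initial_split(toy, strata = class)

# Both sets keep roughly the 30/70 balance of the original data
prop.table(table(training(toy_split)$class))
prop.table(table(testing(toy_split)$class))
```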

• First, we must tell the `recipe()` what our model is going to be (using a formula here) and what our training data is.
• We then filter out variables that are too correlated with each other. We had several different ways of measuring the tuition in our dataset that are correlated with each other, and this step shows how to handle a situation like that.
• We then convert the factor columns into (one or more) numeric binary (0 and 1) variables for the levels of the training data.
• Next, we remove any numeric variables that have zero variance.
• As a last step, we normalize (center and scale) the numeric variables. We need to do this because some of them are on very different scales from each other and some models we want to train (such as k-nearest neighbors) are sensitive to this.
• Finally, we `prep()` the `recipe()`. This means we actually do something with the steps and our training data; we estimate the required parameters from `uni_train` to implement these steps so this whole sequence can be applied later to another dataset.
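The correlation filter is maybe the least familiar of these steps, so here is a small sketch of its behavior on made-up data (the `toy` dataframe and its columns are hypothetical): with two nearly identical predictors, `step_corr()` drops one of the pair.

```r
library(recipes)

# Hypothetical data: x2 is almost perfectly correlated with x1
set.seed(42)
toy <- data.frame(y = rnorm(50), x1 = rnorm(50))
toy$x2 <- toy$x1 + rnorm(50, sd = 0.01)

corr_rec <- recipe(y ~ ., data = toy) %>%
  step_corr(all_numeric(), -all_outcomes(), threshold = 0.9) %>%
  prep()

# One of the correlated pair has been removed
names(bake(corr_rec, new_data = NULL))
```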
```r
library(tidymodels)

set.seed(1234)
uni_split <- initial_split(university_df, strata = diversity)
uni_train <- training(uni_split)
uni_test <- testing(uni_split)

uni_rec <- recipe(diversity ~ ., data = uni_train) %>%
  step_corr(all_numeric()) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_numeric()) %>%
  step_normalize(all_numeric()) %>%
  prep()

uni_rec
```
```
## Data Recipe
##
## Inputs:
##
##       role #variables
##    outcome          1
##  predictor          8
##
## Training data contained 1620 data points and no missing data.
##
## Operations:
##
## Correlation filter removed in_state_tuition, ... [trained]
## Dummy variables from type, degree_length, region [trained]
## Zero variance filter removed no terms [trained]
## Centering and scaling for total_enrollment, ... [trained]
```

Now it’s time to specify and then fit our models. Here, we specify and fit three models:

• logistic regression
• k-nearest neighbor
• decision tree

Check out what data we are training these models on: `juice(uni_rec)`. The recipe `uni_rec` contains all our transformations for data preprocessing and feature engineering, as well as the data these transformations were estimated from. When we `juice()` the recipe, we squeeze that training data back out, transformed in all the ways we specified.
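To get an intuition for what `juice()` returns, here is a minimal sketch on the built-in `mtcars` data: after prepping a recipe, juicing it gives back the training data with all the preprocessing already applied (here, predictors centered and scaled).

```r
library(recipes)

# A small prepped recipe on built-in data
toy_rec <- recipe(mpg ~ disp + hp, data = mtcars) %>%
  step_normalize(all_numeric(), -all_outcomes()) %>%
  prep()

# juice() squeezes out the preprocessed training data: the predictors
# now have mean 0 and standard deviation 1
juiced <- juice(toy_rec)
round(mean(juiced$disp), 10)  # ~0
round(sd(juiced$disp), 10)    # 1
```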

```r
uni_juiced <- juice(uni_rec)

glm_spec <- logistic_reg() %>%
  set_engine("glm")

glm_fit <- glm_spec %>%
  fit(diversity ~ ., data = uni_juiced)

glm_fit
```
```
## parsnip model object
##
## Fit time:  23ms
##
## Call:  stats::glm(formula = formula, family = stats::binomial, data = data)
##
## Coefficients:
##           (Intercept)       total_enrollment     out_of_state_total
##                0.3704                -0.4581                 0.5074
##          type_Private            type_Public  degree_length_X4.Year
##               -0.1656                 0.2058                 0.2082
##          region_South   region_North.Central            region_West
##               -0.5175                 0.3004                -0.5363
##
## Degrees of Freedom: 1619 Total (i.e. Null);  1611 Residual
## Null Deviance:       2210
## Residual Deviance: 1859  AIC: 1877
```
```r
knn_spec <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("classification")

knn_fit <- knn_spec %>%
  fit(diversity ~ ., data = uni_juiced)

knn_fit
```
```
## parsnip model object
##
## Fit time:  65ms
##
## Call:
## kknn::train.kknn(formula = formula, data = data, ks = 5)
##
## Type of response variable: nominal
## Minimal misclassification: 0.3277778
## Best kernel: optimal
## Best k: 5
```
```r
tree_spec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

tree_fit <- tree_spec %>%
  fit(diversity ~ ., data = uni_juiced)

tree_fit
```
```
## parsnip model object
##
## Fit time:  21ms
## n= 1620
##
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
##
##  1) root 1620 689 low (0.4253086 0.5746914)
##    2) region_North.Central< 0.5346496 1192 586 high (0.5083893 0.4916107)
##      4) out_of_state_total< -0.7087237 418 130 high (0.6889952 0.3110048) *
##      5) out_of_state_total>=-0.7087237 774 318 low (0.4108527 0.5891473)
##       10) out_of_state_total< 0.35164 362 180 low (0.4972376 0.5027624)
##         20) region_South>=0.3002561 212  86 high (0.5943396 0.4056604)
##           40) degree_length_X4.Year>=-0.2001293 172  62 high (0.6395349 0.3604651) *
##           41) degree_length_X4.Year< -0.2001293 40  16 low (0.4000000 0.6000000) *
##         21) region_South< 0.3002561 150  54 low (0.3600000 0.6400000)
##           42) region_West>=0.8128302 64  28 high (0.5625000 0.4375000) *
##           43) region_West< 0.8128302 86  18 low (0.2093023 0.7906977) *
##       11) out_of_state_total>=0.35164 412 138 low (0.3349515 0.6650485)
##         22) region_West>=0.8128302 88  38 high (0.5681818 0.4318182)
##           44) out_of_state_total>=1.547681 30   5 high (0.8333333 0.1666667) *
##           45) out_of_state_total< 1.547681 58  25 low (0.4310345 0.5689655) *
##         23) region_West< 0.8128302 324  88 low (0.2716049 0.7283951) *
##    3) region_North.Central>=0.5346496 428  83 low (0.1939252 0.8060748)
##      6) out_of_state_total< -1.19287 17   5 high (0.7058824 0.2941176) *
##      7) out_of_state_total>=-1.19287 411  71 low (0.1727494 0.8272506) *
```

Models! 🎉

## Evaluate models with resampling

Well, we fit models, but how do we evaluate them? We can use resampling to compute performance metrics across some set of resamples, like the cross-validation splits we create here. The function `fit_resamples()` fits a model such as `glm_spec` to the analysis subset of each resample and evaluates on the heldout bit (the assessment subset) from each resample. We can use `metrics = metric_set()` to specify which metrics we want to compute if we don’t want to only use the default ones; here let’s check out sensitivity and specificity.
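To make the analysis/assessment vocabulary concrete, here is a small sketch on the built-in `mtcars` data showing how each resample splits into those two subsets:

```r
library(rsample)

# 4-fold cross-validation on built-in data (32 rows)
set.seed(123)
toy_folds <- vfold_cv(mtcars, v = 4)

# Each resample holds an analysis set (for fitting) and an
# assessment set (held out for computing metrics)
first_fold <- toy_folds$splits[[1]]
nrow(analysis(first_fold))    # 24 rows, used for fitting
nrow(assessment(first_fold))  # 8 rows, held out for evaluation
```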

```r
set.seed(123)
folds <- vfold_cv(uni_juiced, strata = diversity)

set.seed(234)
glm_rs <- glm_spec %>%
  fit_resamples(diversity ~ .,
    folds,
    metrics = metric_set(roc_auc, sens, spec),
    control = control_resamples(save_pred = TRUE)
  )

set.seed(234)
knn_rs <- knn_spec %>%
  fit_resamples(diversity ~ .,
    folds,
    metrics = metric_set(roc_auc, sens, spec),
    control = control_resamples(save_pred = TRUE)
  )

set.seed(234)
tree_rs <- tree_spec %>%
  fit_resamples(diversity ~ .,
    folds,
    metrics = metric_set(roc_auc, sens, spec),
    control = control_resamples(save_pred = TRUE)
  )
```

What do these results look like?

```r
tree_rs
```

```
## #  10-fold cross-validation using stratification
## # A tibble: 10 x 5
##    splits             id     .metrics         .notes           .predictions
##    (list-columns; one row per resample, Fold01 through Fold10)
```
We can use `collect_metrics()` to see the summarized performance metrics for each set of resamples.

```r
glm_rs %>%
  collect_metrics()
```

```
## # A tibble: 3 x 5
##   .metric .estimator  mean     n std_err
##   <chr>   <chr>      <dbl> <int>   <dbl>
## 1 roc_auc binary     0.758    10 0.00916
## 2 sens    binary     0.607    10 0.0209
## 3 spec    binary     0.737    10 0.00638
```

```r
knn_rs %>%
  collect_metrics()
```

```
## # A tibble: 3 x 5
##   .metric .estimator  mean     n std_err
##   <chr>   <chr>      <dbl> <int>   <dbl>
## 1 roc_auc binary     0.732    10 0.00639
## 2 sens    binary     0.599    10 0.0115
## 3 spec    binary     0.730    10 0.0100
```

```r
tree_rs %>%
  collect_metrics()
```

```
## # A tibble: 3 x 5
##   .metric .estimator  mean     n std_err
##   <chr>   <chr>      <dbl> <int>   <dbl>
## 1 roc_auc binary     0.722    10 0.00599
## 2 sens    binary     0.636    10 0.0175
## 3 spec    binary     0.730    10 0.00901
```

In realistic situations, we often care more about one of sensitivity or specificity than overall accuracy. What does the ROC curve look like for these models?

```r
glm_rs %>%
  unnest(.predictions) %>%
  mutate(model = "glm") %>%
  bind_rows(knn_rs %>%
    unnest(.predictions) %>%
    mutate(model = "knn")) %>%
  bind_rows(tree_rs %>%
    unnest(.predictions) %>%
    mutate(model = "rpart")) %>%
  group_by(model) %>%
  roc_curve(diversity, .pred_high) %>%
  ggplot(aes(x = 1 - specificity, y = sensitivity, color = model)) +
  geom_line(size = 1.5) +
  geom_abline(
    lty = 2, alpha = 0.5,
    color = "gray50",
    size = 1.2
  )
```

If we decide the logistic regression model is the best fit for our purposes, we can look at the parameters in detail.

```r
glm_fit %>%
  tidy() %>%
  arrange(-estimate)
```

```
## # A tibble: 9 x 5
##   term                  estimate std.error statistic  p.value
##   <chr>                    <dbl>     <dbl>     <dbl>    <dbl>
## 1 out_of_state_total       0.507    0.0934     5.43  5.58e- 8
## 2 (Intercept)              0.370    0.0572     6.47  9.72e-11
## 3 region_North.Central     0.300    0.0825     3.64  2.72e- 4
## 4 degree_length_X4.Year    0.208    0.0863     2.41  1.58e- 2
## 5 type_Public              0.206    0.193      1.07  2.86e- 1
## 6 type_Private            -0.166    0.197     -0.838 4.02e- 1
## 7 total_enrollment        -0.458    0.0737    -6.22  5.07e-10
## 8 region_South            -0.517    0.0782    -6.62  3.64e-11
## 9 region_West             -0.536    0.0724    -7.41  1.27e-13
```

Larger, less expensive schools in the South and West are more likely to have higher proportions of minority students.

Finally, we can return to our test data as a last, unbiased check on how we can expect this model to perform on new data. We `bake()` our recipe using the testing set to apply the same preprocessing steps that we used on the training data.

```r
glm_fit %>%
  predict(
    new_data = bake(uni_rec, uni_test),
    type = "prob"
  ) %>%
  mutate(truth = uni_test$diversity) %>%
  roc_auc(truth, .pred_high)
```

```
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.756
```

We can also explore other metrics with the test set, such as specificity.

```r
glm_fit %>%
  predict(
    new_data = bake(uni_rec, new_data = uni_test),
    type = "class"
  ) %>%
  mutate(truth = uni_test$diversity) %>%
  spec(truth, .pred_class)
```

```
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 spec    binary         0.719
```

Our metrics for the test set agree pretty well with what we found from resampling, indicating we had a good estimate of how the model will perform on new data.
To leave a comment for the author, please follow the link and comment on their blog: Rstats on Julia Silge.