Predicting links for network data
Can we predict if two nodes in the graph are connected or not?
But let’s make it very practical:
Let’s say you work at a social media company and your boss asks you to create a model to predict who will become friends, so you can feed those recommendations back to the website and serve them to users.
You are tasked to create a model that predicts, once a day for all users, who is likely to connect to whom.
This is part 2 of a series; see my previous post about rectangling network data for how the training set was built.
Build a model with tidymodels
Loading packages and data
I’m using {tidymodels}, a meta-package that also loads {broom}, {recipes}, {dials}, {rsample}, {dplyr}, {tibble}, {ggplot2}, {tidyr}, {infer}, {tune}, {workflows}, {modeldata}, {parsnip}, {yardstick}, and {purrr}. I also use {themis}, {vip}, {readr}, {dplyr}, and {ggplot2}, and for the models {glmnet} and {ranger}.
```r
library(tidymodels)
── Attaching packages ──────────────────────────────────── tidymodels 0.1.2 ──
✓ broom     0.7.2      ✓ recipes   0.1.15
✓ dials     0.0.9      ✓ rsample   0.0.8
✓ dplyr     1.0.2      ✓ tibble    3.0.4
✓ ggplot2   3.3.2      ✓ tidyr     1.1.2
✓ infer     0.5.3      ✓ tune      0.1.2
✓ modeldata 0.1.0      ✓ workflows 0.2.1
✓ parsnip   0.1.4      ✓ yardstick 0.0.7
✓ purrr     0.3.4
── Conflicts ──────────────────────────────────── tidymodels_conflicts() ──
x purrr::discard() masks scales::discard()
x dplyr::filter()  masks stats::filter()
x dplyr::lag()     masks stats::lag()
x recipes::step()  masks stats::step()
```
Load the data. This enriched training set also contains the more advanced features (the common-neighbor counts) from the previous post.
```r
enriched_trainingset <- readr::read_rds(file = "data/enriched_trainingset2.Rds") %>%
  mutate(target = as.factor(target))
names(enriched_trainingset)
 [1] "from"              "to"                "target"
 [4] "degree"            "betweenness"       "pg_rank"
 [7] "eigen"             "closeness"         "br_score"
[10] "coreness"          "degree_to"         "betweenness_to"
[13] "pg_rank_to"        "eigen_to"          "closeness_to"
[16] "br_score_to"       "coreness_to"       "commonneighbors_1"
[19] "commonneighbors_2" "unique_neighbors"
```
Feature information
Is there enough information in this dataset to predict links at all? Note that there are many more negative than positive examples:
```r
enriched_trainingset %>% count(target)
# A tibble: 2 x 2
  target     n
  <fct>  <int>
1 0      34306
2 1       1606
```
I take a balanced subsample to make visualisation easier:
```r
smpl_trainingset <- enriched_trainingset %>%
  group_by(target) %>%
  sample_n(1000) %>%
  mutate(label = ifelse(target == 1, "link", "no-link")) %>%
  ungroup()

smpl_trainingset %>% count(label)
# A tibble: 2 x 2
  label       n
  <chr>   <int>
1 link     1000
2 no-link  1000
```
First an overview of all variables. This is not a best practice, and I’m not even sure the summary below tells us much, but it gives a first impression.
Every example has the shape (node)-[edge]-(node): both sides of a potential edge have properties. We don’t care about direction, so (a)-[edge]-(b) is more or less equivalent to (b)-[edge]-(a). Maybe the combination of the two node values is more informative than either one alone?
```r
smpl_trainingset %>%
  mutate(
    degree2      = degree * degree_to,
    eigen2       = eigen * eigen_to,
    pg_rank2     = pg_rank * pg_rank_to,
    betweenness2 = betweenness * betweenness_to,
    br_score2    = br_score * br_score_to,
    coreness2    = coreness * coreness_to,
    closeness2   = closeness * closeness_to
  ) %>%
  select(label, degree2:closeness2) %>%
  group_by(label) %>%
  summarise(across(.fns = c(mean = mean, sd = sd))) %>%
  pivot_longer(-label) %>%
  tidyr::separate(name, into = c("metric", "summary"), sep = "2_") %>%
  pivot_wider(names_from = summary, values_from = value) %>%
  ggplot(aes(label, color = label)) +
  geom_point(aes(label, y = mean), shape = 22, fill = "grey50") +
  geom_point(aes(label, y = mean + sd), shape = 2) +
  geom_point(aes(label, y = mean - sd), shape = 6) +
  geom_linerange(aes(label, ymin = mean - sd, ymax = mean + sd)) +
  facet_wrap(~metric, scales = "free") +
  labs(
    title = "Small differences in features for link vs no-link",
    subtitle = "mean (+/-) 1 sd",
    x = NULL, y = "feature"
  )
`summarise()` ungrouping output (override with `.groups` argument)
```
Visualise individual features
Betweenness
```r
smpl_trainingset %>%
  select(betweenness, betweenness_to, label) %>%
  pivot_longer(-label) %>%
  ggplot(aes(value, color = label)) +
  geom_density() +
  facet_wrap(~name) +
  scale_x_continuous(trans = scales::log1p_trans())
```
Degree
```r
smpl_trainingset %>%
  ggplot(aes(degree, degree_to, color = label)) +
  geom_point() +
  scale_x_continuous(trans = scales::log1p_trans()) +
  scale_y_continuous(trans = scales::log1p_trans())
```
Page rank
```r
smpl_trainingset %>%
  ggplot(aes(pg_rank, pg_rank_to, color = label)) +
  geom_point(alpha = 1/2)
```
Eigenvector centrality (doesn’t really seem to be different between the classes)
```r
smpl_trainingset %>%
  ggplot(aes(eigen, eigen_to, color = label)) +
  geom_point(alpha = 1/2)
```
Et cetera; the remaining features look similar.
Advanced features
```r
smpl_trainingset %>%
  ggplot(aes(commonneighbors_1, commonneighbors_2, color = label)) +
  geom_point(alpha = 1/2) +
  labs(
    title = "Neighbors in common between two nodes",
    x = "Neighbors at distance 1",
    y = "Neighbors at distance 2"
  )
```
So there is some information in these features; I see some clustering, but the boundaries are quite vague.
Actual feature engineering (recipe)
I decided to create some interactions between the page rank of the two nodes, the degree of the two nodes, and so on; drop the identifiers `to` and `from`; and make the target a factor. Furthermore I drop correlated features and normalize and center all features (there are no nominal variables in this dataset).
This recipe is only a plan of action; nothing has happened yet.
```r
# make it very simple first.
ntwrk_recipe <- recipe(enriched_trainingset, formula = target ~ .) %>%
  recipes::update_role(to, new_role = "other") %>%
  recipes::update_role(from, new_role = "other") %>%
  step_interact(terms = ~ pg_rank:pg_rank_to) %>%
  step_interact(terms = ~ degree:degree_to) %>%
  step_interact(terms = ~ eigen:eigen_to) %>%
  step_interact(terms = ~ betweenness:betweenness_to) %>%
  step_interact(terms = ~ closeness:closeness_to) %>%
  step_interact(terms = ~ coreness:coreness_to) %>%
  step_interact(terms = ~ br_score:br_score_to) %>%
  step_corr(all_numeric()) %>%
  step_nzv(all_predictors()) %>%
  step_normalize(all_predictors(), -all_nominal()) %>%
  step_mutate(target = as.factor(target))
```
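The recipe only runs once you prep it. To sanity-check what it will feed the model, you can prep and bake it yourself; a minimal sketch (the object names `prepped` and `baked` are mine, not from the original post):

```r
# Sketch: prep() estimates the steps (correlation filter, normalization
# means/sds) on the data; bake() then applies them. Nothing in
# ntwrk_recipe is executed before this point.
prepped <- prep(ntwrk_recipe, training = enriched_trainingset)
baked   <- bake(prepped, new_data = NULL)  # the processed training data
glimpse(baked)
```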
Model
Let’s start with a simple model: a generalized linear model. I’m using a logistic regression from {glmnet} and capture the steps of data preparation and modeling in one workflow object.
```r
ntwrk_spec <- logistic_reg(penalty = tune(), mixture = 1) %>% # pure lasso
  set_engine("glmnet")

ntwrk_workflow <- workflow() %>%
  add_recipe(ntwrk_recipe) %>%
  add_model(ntwrk_spec)
```
Train and test sets
I split the data into a train and test set, making sure the proportion of targets is the same in both.
```r
### split into training and test set
set.seed(2345)
tr_te_split <- initial_split(enriched_trainingset, strata = target)
val_set <- validation_split(training(tr_te_split), strata = target, prop = .8)
```
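As a quick sanity check (my addition, not part of the original workflow), you can confirm that the stratified split kept the class balance in both pieces:

```r
# Sketch: the proportion of target == 1 should be roughly equal
# in the train and test portions of the split.
bind_rows(
  training(tr_te_split) %>% count(target) %>% mutate(set = "train"),
  testing(tr_te_split)  %>% count(target) %>% mutate(set = "test")
) %>%
  group_by(set) %>%
  mutate(prop = n / sum(n))
```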
Model tuning
I don’t know what the best penalty is for this model and data, so we have to test different versions and choose the best one.
```r
## Setting up tune grid manually, because it is just one column
lr_reg_grid <- tibble(penalty = 10^seq(-5, -1, length.out = 30))

ntwrk_res <- ntwrk_workflow %>%
  tune_grid(val_set,
            grid = lr_reg_grid,
            control = control_grid(save_pred = TRUE),
            metrics = metric_set(roc_auc))
Loading required package: Matrix
Loaded glmnet 4.0-2
```
Visualise the results. (From a deeper dive: any penalty smaller than 1e-05 leads to the same values; ~0.0000240 seems to be the boundary.)
```r
ntwrk_res %>%
  collect_metrics() %>%
  ggplot(aes(x = penalty, y = mean)) +
  geom_point() +
  geom_line() +
  labs(
    subtitle = "Ideal penalty is larger than 1.610e-05, but certainly less than 0.04",
    title = "Penalty values",
    y = "Area under the ROC Curve",
    x = "Penalty values for this GLM"
  ) +
  scale_x_log10(labels = scales::label_number()) +
  geom_vline(xintercept = 0.0386, color = "tomato3") +
  geom_vline(xintercept = 1.610e-05, color = "tomato3") +
  theme_minimal()
```
What are the best models?
```r
## show best models
top_models <- ntwrk_res %>%
  show_best("roc_auc", n = 5) %>%
  arrange(penalty)

lr_best <- ntwrk_res %>%
  collect_metrics() %>%
  arrange(penalty) %>%
  slice(5)

pred_auc <- ntwrk_res %>%
  collect_predictions(parameters = lr_best) %>%
  roc_curve(target, .pred_0) %>%
  mutate(model = "Logistic Regression")

autoplot(pred_auc) +
  ggtitle("ROC curve of GLM")
```
Let’s use the best performing model and modify the current workflow by replacing the penalty value in the model with one of the best values.
(This model is still untrained; we only used the validation split to find the best parameter value.)
```r
best_penalty <- top_models %>% pull(penalty) %>% .[[3]]

ntwrk_spec_1 <- logistic_reg(penalty = best_penalty, mixture = 1) %>%
  set_engine("glmnet")

## change model
updated_workflow <- ntwrk_workflow %>%
  update_model(ntwrk_spec_1)
```
`last_fit()` is a special function from {tune} that fits the model on the training set and evaluates it on the test set.
```r
ntwrk_fit <- updated_workflow %>%
  last_fit(tr_te_split)

ntwrk_fit %>% pull(.metrics)
[[1]]
# A tibble: 2 x 4
  .metric  .estimator .estimate .config
  <chr>    <chr>          <dbl> <chr>
1 accuracy binary         0.957 Preprocessor1_Model1
2 roc_auc  binary         0.821 Preprocessor1_Model1
```
At first glance it performs really well! But unpacking the predictions shows that we never predicted a link when there actually was one. So the score looks good, but the results are not that useful.
```r
ntwrk_fit$.predictions[[1]] %>%
  group_by(target, .pred_class) %>%
  summarize(
    count = n(),
    avg_prob1 = mean(.pred_1)
  )
`summarise()` regrouping output by 'target' (override with `.groups` argument)
# A tibble: 3 x 4
# Groups:   target [2]
  target .pred_class count avg_prob1
  <fct>  <fct>       <int>     <dbl>
1 0      0            8591    0.0417
2 0      1               4    0.690
3 1      0             383    0.129

library(vip)

prediction_model_glm <- fit(
  ntwrk_fit$.workflow[[1]],
  enriched_trainingset
)

prediction_model_glm %>%
  pull_workflow_fit() %>%
  vip(geom = "point") +
  ggtitle("Variable importance of Generalized Linear Model", subtitle = "Top 10")
```
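One cheap way to trade some of those false negatives for false positives is to lower the classification threshold instead of retraining anything. A hedged sketch; the 0.2 cut-off is an arbitrary illustration, not a tuned value:

```r
# Sketch: re-derive the predicted class from .pred_1 with a lower
# cut-off than the default 0.5, then inspect the confusion matrix.
ntwrk_fit$.predictions[[1]] %>%
  mutate(.pred_class_low = factor(ifelse(.pred_1 > 0.2, 1, 0),
                                  levels = levels(target))) %>%
  conf_mat(truth = target, estimate = .pred_class_low)
```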
Undersampling for better performance.
Why undersampling? The classes are heavily imbalanced, so reducing the number of negative examples should make the positives carry more weight during training. I also tried tuning only the mixture, which doesn’t really help, and as we’ll see, undersampling doesn’t really help either; the regularized GLM doesn’t seem to care.
```r
ntwrk_recipe_undersample <- ntwrk_recipe %>%
  themis::step_downsample(target, under_ratio = 1.5)

# ntwrk_spec2 <-
#   logistic_reg(penalty = tune(), mixture = tune()) %>%
#   set_engine("glmnet")

ntwrk_workflow2 <- ntwrk_workflow %>%
  update_recipe(ntwrk_recipe_undersample)

crossvalidation_sets <- vfold_cv(training(tr_te_split), v = 3, strata = target)

## set up parallel processing
all_cores <- parallel::detectCores(logical = TRUE) - 1
library(doParallel)
cl <- makePSOCKcluster(all_cores)
registerDoParallel(cl)
```
Heavy computation ahead:
```r
ntwrk_res2 <- ntwrk_workflow2 %>%
  tune_grid(crossvalidation_sets,
            grid = lr_reg_grid,
            control = control_grid(save_pred = TRUE, allow_par = TRUE),
            metrics = metric_set(roc_auc))

ntwrk_res2 %>%
  collect_metrics() %>%
  select(mean, penalty) %>%
  pivot_longer(penalty,
               values_to = "value",
               names_to = "parameter") %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~parameter, scales = "free_x") +
  geom_hline(yintercept = 0.8206, color = "tomato3") +
  labs(x = NULL, y = "AUC")
```
Performance is equivalent to before (logical, perhaps, because the regularized GLM already deals with the imbalance?). It would be nice to overlay this on the previous results.
Let’s try a random forest.
```r
rf_recipe <- ntwrk_recipe

rf_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
  set_mode("classification") %>%
  set_engine("ranger", importance = "impurity")

rf_workflow <- workflow() %>%
  add_recipe(rf_recipe) %>%
  add_model(rf_spec)

set.seed(88708)
rf_grid <- grid_max_entropy(
  mtry(range = c(3, 15)),
  min_n(),
  size = 15
)

rf_tune <- tune_grid(rf_workflow,
                     resamples = crossvalidation_sets,
                     grid = rf_grid,
                     control = control_grid(save_pred = TRUE, allow_par = TRUE),
                     metrics = metric_set(roc_auc))
```
So what is the best parameter set?
```r
rf_tune %>%
  collect_metrics() %>%
  select(mean, mtry:min_n) %>%
  pivot_longer(mtry:min_n,
               values_to = "value",
               names_to = "parameter") %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(alpha = 0.8, show.legend = FALSE) +
  geom_hline(yintercept = 0.8206, color = "tomato3") +
  facet_wrap(~parameter, scales = "free_x") +
  labs(
    x = NULL, y = "AUC",
    title = "Random Forest approach is way better",
    subtitle = "Quite some variation, but always better than glm (red line)"
  )
```
```r
top_models_rf <- rf_tune %>%
  show_best("roc_auc", n = 5)

rf_best <- rf_tune %>%
  collect_metrics() %>%
  arrange(mtry) %>%
  slice(5)

pred_auc_rf <- rf_tune %>%
  collect_predictions(parameters = rf_best) %>%
  roc_curve(target, .pred_0) %>%
  mutate(model = "Random Forest")
```
Overlay the ROC curves of both models:
```r
bind_rows(
  pred_auc_rf,
  pred_auc
) %>%
  ggplot(aes(x = 1 - specificity, y = sensitivity, color = model)) +
  geom_line() +
  geom_abline(lty = 2, alpha = 0.5, color = "gray50", size = 1.2) +
  theme_minimal() +
  labs(title = "Overall better performance in the Random Forest model")
```
```r
best_auc_rf <- select_best(rf_tune, "roc_auc")

final_workflow_rf <- finalize_workflow(
  rf_workflow,
  best_auc_rf
)

final_res_rf <- last_fit(final_workflow_rf, tr_te_split)
collect_metrics(final_res_rf)
# A tibble: 2 x 4
  .metric  .estimator .estimate .config
  <chr>    <chr>          <dbl> <chr>
1 accuracy binary         0.969 Preprocessor1_Model1
2 roc_auc  binary         0.926 Preprocessor1_Model1
```
Compare this area under the ROC curve (0.926) with the previous value of 0.821.
Investigate feature importance.
```r
library(vip)

prediction_model2 <- fit(
  final_res_rf$.workflow[[1]],
  enriched_trainingset
)

prediction_model2 %>%
  pull_workflow_fit() %>%
  vip(geom = "point") +
  ggtitle("Variable importance of Random Forest model", subtitle = "Top 10")
```
Conclusion
So the random forest model was better at predicting links than the GLM. But you should always wonder what good enough is. Maybe a score over 0.80 is enough? In that case, why bother using a more complicated model that takes longer to run? GLMs are usually easier to explain and faster to run, provided they actually predict both classes.
I started this project with the question:
Can we predict if two nodes in the graph are connected or not?
And the practical task was actually:
your boss asks you to create a model to predict who will be friends, so you can feed those recommendations back to the website and serve those to users.
You are tasked to create a model that predicts, once a day for all users, who is likely to connect to whom.
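To make that daily task concrete, here is a hedged sketch of what the scoring step could look like, assuming the fitted workflow prediction_model2 from above; `todays_candidates` is a hypothetical data frame of candidate user pairs with the same feature columns as the training set.

```r
# Sketch: daily batch scoring of candidate pairs.
# `todays_candidates` is hypothetical: one row per candidate user pair,
# with the same feature columns as enriched_trainingset.
daily_scores <- predict(prediction_model2, new_data = todays_candidates, type = "prob") %>%
  bind_cols(todays_candidates %>% select(from, to)) %>%
  arrange(desc(.pred_1))  # most likely new connections first
```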
The stakes in this case are not that high. False positives (predicting a link where there is none) are preferable to false negatives (predicting no link where there is one).
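Given that asymmetry, it helps to track recall for the link class explicitly, next to precision. A sketch with {yardstick}, using the GLM test-set predictions from earlier (`event_level = "second"` because the positive class "1" is the second factor level):

```r
# Sketch: recall = how many true links we actually flagged,
# precision = how many flagged links were real.
ntwrk_fit %>%
  collect_predictions() %>%
  recall(target, .pred_class, event_level = "second")

ntwrk_fit %>%
  collect_predictions() %>%
  precision(target, .pred_class, event_level = "second")
```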
Bringing it to production
- use renv to capture the dependencies (see the sketch after this list)
- set up pipeline of data from system to dataset
- see if you can minimize the number of features necessary
- checks on data quality and features
- predict ‘no connection’ for everyone first and check whether the flow from model to website works
- predict actual data
- keep track of metrics
- retrain when problematic
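For the first item on the list, capturing dependencies with {renv} takes only a few calls. A minimal sketch; the functions are standard {renv}, the surrounding workflow is my assumption:

```r
# Sketch: record the exact package versions this model depends on.
install.packages("renv")
renv::init()      # set up a project-local library plus renv.lock
# ...develop, install whatever the model needs...
renv::snapshot()  # write current package versions to renv.lock
renv::restore()   # on the production machine: rebuild from renv.lock
```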
State of the machine
At the moment of creation (when I knitted this document) this was the state of my machine:
```r
sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.0.2 (2020-06-22)
 os       macOS Catalina 10.15.7
 system   x86_64, darwin17.0
 ui       RStudio
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/Amsterdam
 date     2020-11-25

─ Packages ───────────────────────────────────────────────────────────────────
 package      * version    date       lib source
 assertthat     0.2.1      2019-03-21 [1] CRAN (R 4.0.2)
 backports      1.2.0      2020-11-02 [1] CRAN (R 4.0.2)
 BBmisc         1.11       2017-03-10 [1] CRAN (R 4.0.2)
 blogdown       0.21       2020-10-11 [1] CRAN (R 4.0.2)
 bookdown       0.21       2020-10-13 [1] CRAN (R 4.0.2)
 broom        * 0.7.2      2020-10-20 [1] CRAN (R 4.0.2)
 checkmate      2.0.0      2020-02-06 [1] CRAN (R 4.0.2)
 class          7.3-17     2020-04-26 [1] CRAN (R 4.0.2)
 cli            2.2.0      2020-11-20 [1] CRAN (R 4.0.2)
 codetools      0.2-18     2020-11-04 [1] CRAN (R 4.0.2)
 colorspace     2.0-0      2020-11-11 [1] CRAN (R 4.0.2)
 crayon         1.3.4      2017-09-16 [1] CRAN (R 4.0.2)
 data.table     1.13.2     2020-10-19 [1] CRAN (R 4.0.2)
 dials        * 0.0.9      2020-09-16 [1] CRAN (R 4.0.2)
 DiceDesign     1.8-1      2019-07-31 [1] CRAN (R 4.0.2)
 digest         0.6.27     2020-10-24 [1] CRAN (R 4.0.2)
 doParallel   * 1.0.16     2020-10-16 [1] CRAN (R 4.0.2)
 dplyr        * 1.0.2      2020-08-18 [1] CRAN (R 4.0.2)
 ellipsis       0.3.1      2020-05-15 [1] CRAN (R 4.0.2)
 evaluate       0.14       2019-05-28 [1] CRAN (R 4.0.1)
 fansi          0.4.1      2020-01-08 [1] CRAN (R 4.0.2)
 farver         2.0.3      2020-01-16 [1] CRAN (R 4.0.2)
 fastmatch      1.1-0      2017-01-28 [1] CRAN (R 4.0.2)
 FNN            1.1.3      2019-02-15 [1] CRAN (R 4.0.2)
 foreach      * 1.5.1      2020-10-15 [1] CRAN (R 4.0.2)
 furrr          0.2.1      2020-10-21 [1] CRAN (R 4.0.2)
 future         1.20.1     2020-11-03 [1] CRAN (R 4.0.2)
 generics       0.1.0      2020-10-31 [1] CRAN (R 4.0.2)
 ggplot2      * 3.3.2      2020-06-19 [1] CRAN (R 4.0.2)
 glmnet       * 4.0-2      2020-06-16 [1] CRAN (R 4.0.2)
 globals        0.14.0     2020-11-22 [1] CRAN (R 4.0.2)
 glue           1.4.2      2020-08-27 [1] CRAN (R 4.0.2)
 gower          0.2.2      2020-06-23 [1] CRAN (R 4.0.2)
 GPfit          1.0-8      2019-02-08 [1] CRAN (R 4.0.2)
 gridExtra      2.3        2017-09-09 [1] CRAN (R 4.0.2)
 gtable         0.3.0      2019-03-25 [1] CRAN (R 4.0.2)
 hardhat        0.1.5      2020-11-09 [1] CRAN (R 4.0.2)
 hms            0.5.3      2020-01-08 [1] CRAN (R 4.0.2)
 htmltools      0.5.0      2020-06-16 [1] CRAN (R 4.0.2)
 httpuv         1.5.4      2020-06-06 [1] CRAN (R 4.0.2)
 infer        * 0.5.3      2020-07-14 [1] CRAN (R 4.0.2)
 ipred          0.9-9      2019-04-28 [1] CRAN (R 4.0.2)
 iterators    * 1.0.13     2020-10-15 [1] CRAN (R 4.0.2)
 jsonlite       1.7.1      2020-09-07 [1] CRAN (R 4.0.2)
 knitr          1.30       2020-09-22 [1] CRAN (R 4.0.2)
 labeling       0.4.2      2020-10-20 [1] CRAN (R 4.0.2)
 later          1.1.0.1    2020-06-05 [1] CRAN (R 4.0.2)
 lattice        0.20-41    2020-04-02 [1] CRAN (R 4.0.2)
 lava           1.6.8.1    2020-11-04 [1] CRAN (R 4.0.2)
 lhs            1.1.1      2020-10-05 [1] CRAN (R 4.0.2)
 lifecycle      0.2.0      2020-03-06 [1] CRAN (R 4.0.2)
 listenv        0.8.0      2019-12-05 [1] CRAN (R 4.0.2)
 lubridate      1.7.9.2    2020-11-13 [1] CRAN (R 4.0.2)
 magrittr       2.0.1      2020-11-17 [1] CRAN (R 4.0.2)
 MASS           7.3-53     2020-09-09 [1] CRAN (R 4.0.2)
 Matrix       * 1.2-18     2019-11-27 [1] CRAN (R 4.0.2)
 mlr            2.18.0     2020-10-05 [1] CRAN (R 4.0.2)
 modeldata    * 0.1.0      2020-10-22 [1] CRAN (R 4.0.2)
 munsell        0.5.0      2018-06-12 [1] CRAN (R 4.0.2)
 nnet           7.3-14     2020-04-26 [1] CRAN (R 4.0.2)
 parallelly     1.21.0     2020-10-27 [1] CRAN (R 4.0.2)
 parallelMap    1.5.0      2020-03-26 [1] CRAN (R 4.0.2)
 ParamHelpers   1.14       2020-03-24 [1] CRAN (R 4.0.2)
 parsnip      * 0.1.4      2020-10-27 [1] CRAN (R 4.0.2)
 pillar         1.4.7      2020-11-20 [1] CRAN (R 4.0.2)
 pkgconfig      2.0.3      2019-09-22 [1] CRAN (R 4.0.2)
 plyr           1.8.6      2020-03-03 [1] CRAN (R 4.0.2)
 pROC           1.16.2     2020-03-19 [1] CRAN (R 4.0.2)
 processx       3.4.4      2020-09-03 [1] CRAN (R 4.0.2)
 prodlim        2019.11.13 2019-11-17 [1] CRAN (R 4.0.2)
 promises       1.1.1      2020-06-09 [1] CRAN (R 4.0.2)
 ps             1.4.0      2020-10-07 [1] CRAN (R 4.0.2)
 purrr        * 0.3.4      2020-04-17 [1] CRAN (R 4.0.2)
 R6             2.5.0      2020-10-28 [1] CRAN (R 4.0.2)
 ranger         0.12.1     2020-01-10 [1] CRAN (R 4.0.2)
 RANN           2.6.1      2019-01-08 [1] CRAN (R 4.0.2)
 Rcpp           1.0.5      2020-07-06 [1] CRAN (R 4.0.2)
 readr          1.4.0      2020-10-05 [1] CRAN (R 4.0.2)
 recipes      * 0.1.15     2020-11-11 [1] CRAN (R 4.0.2)
 rlang        * 0.4.8      2020-10-08 [1] CRAN (R 4.0.2)
 rmarkdown      2.5        2020-10-21 [1] CRAN (R 4.0.2)
 ROSE           0.0-3      2014-07-15 [1] CRAN (R 4.0.2)
 rpart          4.1-15     2019-04-12 [1] CRAN (R 4.0.2)
 rsample      * 0.0.8      2020-09-23 [1] CRAN (R 4.0.2)
 rstudioapi     0.13       2020-11-12 [1] CRAN (R 4.0.2)
 scales       * 1.1.1      2020-05-11 [1] CRAN (R 4.0.2)
 servr          0.20       2020-10-19 [1] CRAN (R 4.0.2)
 sessioninfo    1.1.1      2018-11-05 [1] CRAN (R 4.0.2)
 shape          1.4.5      2020-09-13 [1] CRAN (R 4.0.2)
 stringi        1.5.3      2020-09-09 [1] CRAN (R 4.0.2)
 stringr        1.4.0      2019-02-10 [1] CRAN (R 4.0.2)
 survival       3.2-7      2020-09-28 [1] CRAN (R 4.0.2)
 themis         0.1.3      2020-11-12 [1] CRAN (R 4.0.2)
 tibble       * 3.0.4      2020-10-12 [1] CRAN (R 4.0.2)
 tidymodels   * 0.1.2      2020-11-22 [1] CRAN (R 4.0.2)
 tidyr        * 1.1.2      2020-08-27 [1] CRAN (R 4.0.2)
 tidyselect     1.1.0      2020-05-11 [1] CRAN (R 4.0.2)
 timeDate       3043.102   2018-02-21 [1] CRAN (R 4.0.2)
 tune         * 0.1.2      2020-11-17 [1] CRAN (R 4.0.2)
 unbalanced     2.0        2015-06-26 [1] CRAN (R 4.0.2)
 utf8           1.1.4      2018-05-24 [1] CRAN (R 4.0.2)
 vctrs        * 0.3.5      2020-11-17 [1] CRAN (R 4.0.2)
 vip          * 0.2.2      2020-04-06 [1] CRAN (R 4.0.2)
 withr          2.3.0      2020-09-22 [1] CRAN (R 4.0.2)
 workflows    * 0.2.1      2020-10-08 [1] CRAN (R 4.0.2)
 xfun           0.19       2020-10-30 [1] CRAN (R 4.0.2)
 yaml           2.2.1      2020-02-01 [1] CRAN (R 4.0.2)
 yardstick    * 0.0.7      2020-07-13 [1] CRAN (R 4.0.2)

[1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library
```
Notes
- I used this example from Julia Silge as a template: https://juliasilge.com/blog/xgboost-tune-volleyball/