Extracting a Reference Grid of your Data for Machine Learning Models Visualization

[This article was first published on Dominique Makowski, and kindly contributed to R-bloggers.]

Sometimes, for visualization purposes, we want to extract a reference grid from our dataset. This reference grid usually contains equally spaced values of a “target” variable, with all other variables “fixed” at their mean, median or reference level. The refdata() function of the psycho package was built to do just that.
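To make the idea concrete, here is a minimal base R sketch of what such a grid looks like. It is not the psycho implementation: `make_grid()` is a hypothetical helper written for this illustration, it uses the built-in `iris` data, and it assumes every non-numeric column is a factor.

```r
# Hypothetical helper (not part of psycho): build a reference grid by hand.
make_grid <- function(data, target, length_out = 10) {
  columns <- lapply(names(data), function(name) {
    x <- data[[name]]
    if (name == target) {
      # Target: spread numerics evenly from min to max, keep all factor levels
      if (is.numeric(x)) {
        seq(min(x, na.rm = TRUE), max(x, na.rm = TRUE), length.out = length_out)
      } else {
        factor(levels(x), levels = levels(x))
      }
    } else {
      # Other variables: numerics at their mean, factors at their reference level
      if (is.numeric(x)) {
        mean(x, na.rm = TRUE)
      } else {
        factor(levels(x)[1], levels = levels(x))
      }
    }
  })
  names(columns) <- names(data)
  expand.grid(columns, KEEP.OUT.ATTRS = FALSE)
}

# Five evenly spaced values of Sepal.Length, everything else held constant
grid <- make_grid(iris, target = "Sepal.Length", length_out = 5)
grid
```

Each row differs only in the target variable, which is what lets a plot of the predictions be read as "the effect of the target, all else being equal".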

The Model

Let’s build a complex machine learning model (a neural network) that predicts the Sex of our participants (the probability of being a man, as women are the reference level here) from all the other variables of the dataframe.

<span class="c1"># devtools::install_github("neuropsychology/psycho.R")  # Install the latest psycho version if needed</span><span class="w">

</span><span class="c1"># Load packages</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">caret</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">psycho</span><span class="p">)</span><span class="w">

</span><span class="c1"># Import data</span><span class="w">
</span><span class="n">df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">psycho</span><span class="o">::</span><span class="n">affective</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">standardize</span><span class="p">()</span><span class="w"> </span><span class="o">%>%</span><span class="w">  </span><span class="c1"># Standardize</span><span class="w">
  </span><span class="n">na.omit</span><span class="p">()</span><span class="w">  </span><span class="c1"># Remove missing values</span><span class="w">

</span><span class="c1"># Fit the model</span><span class="w">
</span><span class="n">model</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">caret</span><span class="o">::</span><span class="n">train</span><span class="p">(</span><span class="n">Sex</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">.</span><span class="p">,</span><span class="w">
                      </span><span class="n">data</span><span class="o">=</span><span class="n">df</span><span class="p">,</span><span class="w">
                      </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"nnet"</span><span class="p">)</span><span class="w">
</span>
<span class="n">varImp</span><span class="p">(</span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">scale</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span>
## nnet variable importance
## 
##                    Overall
## Salary2000+        100.000
## Concealing          48.761
## Adjusting           46.198
## Birth_SeasonSpring  39.289
## Life_Satisfaction   22.567
## Salary<2000          9.176
## Birth_SeasonSummer   8.863
## Birth_SeasonWinter   6.624
## Tolerating           5.686
## Age                  0.000

It seems that the upper salary category (> 2000€ / month) is the most important variable of the model, followed by the concealing and adjusting personality traits. Interesting, but what does it say about the actual relationship between those variables and our outcome?

Simple

To visualize the effect of Salary, we can extract a reference grid containing all the salary levels, with all other variables fixed at their mean (numerics) or reference level (factors).

<span class="n">newdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">Sex</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">  </span><span class="c1"># Remove Sex, as it is the variable to predict</span><span class="w">
  </span><span class="n">refdata</span><span class="p">(</span><span class="s2">"Salary"</span><span class="p">)</span><span class="w">
</span><span class="n">newdata</span><span class="w">
</span>
<span class="n">knitr</span><span class="o">::</span><span class="n">kable</span><span class="p">(</span><span class="n">newdata</span><span class="p">,</span><span class="w"> </span><span class="n">digits</span><span class="o">=</span><span class="m">2</span><span class="p">)</span><span class="w">
</span>
Salary  Age   Birth_Season  Life_Satisfaction  Concealing  Adjusting  Tolerating
<1000   0.11  Fall          -0.01              0           0.03       -0.02
<2000   0.11  Fall          -0.01              0           0.03       -0.02
2000+   0.11  Fall          -0.01              0           0.03       -0.02

We can make predictions from the model on this minimal dataset and visualize them.

<span class="n">predicted</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"prob"</span><span class="p">)</span><span class="w">
</span><span class="n">newdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">newdata</span><span class="p">,</span><span class="w"> </span><span class="n">predicted</span><span class="p">)</span><span class="w">

</span><span class="n">newdata</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">Salary</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="n">M</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="o">=</span><span class="m">1</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_line</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme_classic</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">ylab</span><span class="p">(</span><span class="s2">"Probability of being a man"</span><span class="p">)</span><span class="w">
</span>

Well, it seems that males are more represented in the lower and upper salary classes (at least, that’s what the model learnt).
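The same "predict on a grid" workflow works with any model that has a predict() method. As a dependency-free sketch, here it is with a plain logistic regression standing in for the neural network, fitted on the built-in `mtcars` data (the model, the variables and the name `prob_manual` are illustrative choices, not part of the analysis above).

```r
# Stand-in model: logistic regression predicting manual transmission (am = 1)
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial())

# Reference grid: weight spread evenly over its range, horsepower fixed at its mean
grid <- data.frame(
  wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 20),
  hp = mean(mtcars$hp)
)

# type = "response" returns predicted probabilities for a binomial glm
grid$prob_manual <- predict(fit, newdata = grid, type = "response")
head(grid)
```

Because `hp` is constant across the grid, any change in `prob_manual` along the rows comes from `wt` alone.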

Multiple Targets

How does this interact with the concealing personality trait?

<span class="n">newdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">Sex</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">refdata</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="s2">"Salary"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Concealing"</span><span class="p">))</span><span class="w">  </span><span class="c1"># We can specify multiple targets</span><span class="w">
</span><span class="n">newdata</span><span class="w">
</span>
<span class="n">knitr</span><span class="o">::</span><span class="n">kable</span><span class="p">(</span><span class="n">head</span><span class="p">(</span><span class="n">newdata</span><span class="p">,</span><span class="w"> </span><span class="m">5</span><span class="p">),</span><span class="w"> </span><span class="n">digits</span><span class="o">=</span><span class="m">2</span><span class="p">)</span><span class="w">
</span>
Salary  Concealing  Age   Birth_Season  Life_Satisfaction  Adjusting  Tolerating
<1000   -2.52       0.11  Fall          -0.01              0.03       -0.02
<2000   -2.52       0.11  Fall          -0.01              0.03       -0.02
2000+   -2.52       0.11  Fall          -0.01              0.03       -0.02
<1000   -1.99       0.11  Fall          -0.01              0.03       -0.02
<2000   -1.99       0.11  Fall          -0.01              0.03       -0.02

This created 10 evenly spread values of Concealing (from min to max) and crossed them with all the levels of Salary, giving 3 × 10 = 30 rows.
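This crossing of targets is what base R's expand.grid() does. A small sketch, with a made-up range for Concealing (the real one comes from the data):

```r
# Illustrative values only: the actual min and max of Concealing come from df
salary_levels <- c("<1000", "<2000", "2000+")
concealing_values <- seq(-2.5, 2.5, length.out = 10)

# Every salary level is paired with every Concealing value
grid <- expand.grid(
  Salary = factor(salary_levels, levels = salary_levels),
  Concealing = concealing_values
)

nrow(grid)     # 3 levels x 10 values
head(grid, 5)  # Salary cycles fastest, as in the table above
```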

<span class="n">predicted</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"prob"</span><span class="p">)</span><span class="w">
</span><span class="n">newdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">newdata</span><span class="p">,</span><span class="w"> </span><span class="n">predicted</span><span class="p">)</span><span class="w">

</span><span class="n">newdata</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">Concealing</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="n">M</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="o">=</span><span class="n">Salary</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_line</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme_classic</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">ylab</span><span class="p">(</span><span class="s2">"Probability of being a man"</span><span class="p">)</span><span class="w">
</span>

This plot is rather ugly: with only 10 values of Concealing per salary level, the lines are coarse.

Increase Length

<span class="n">newdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">Sex</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">refdata</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="s2">"Salary"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Concealing"</span><span class="p">),</span><span class="w"> </span><span class="n">length.out</span><span class="o">=</span><span class="m">500</span><span class="p">)</span><span class="w">  </span><span class="c1"># Set the length by which to spread numeric targets</span><span class="w">

</span><span class="n">predicted</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"prob"</span><span class="p">)</span><span class="w">
</span><span class="n">newdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">newdata</span><span class="p">,</span><span class="w"> </span><span class="n">predicted</span><span class="p">)</span><span class="w">

</span><span class="n">newdata</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">Concealing</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="n">M</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="o">=</span><span class="n">Salary</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_line</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme_classic</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">ylab</span><span class="p">(</span><span class="s2">"Probability of being a man"</span><span class="p">)</span><span class="w">
</span>

It seems that for richer people, the concealing threshold above which the probability of being a man increases is lower.

How to Fix (Maintain) Numeric Variables?

So far, all other variables were fixed at their mean level. But the relationship might look different when the other variables are low or high.

<span class="n">newdata_min</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">Sex</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">refdata</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="s2">"Salary"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Concealing"</span><span class="p">),</span><span class="w"> </span><span class="n">length.out</span><span class="o">=</span><span class="m">500</span><span class="p">,</span><span class="w"> </span><span class="n">numerics</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"min"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">  </span><span class="c1"># Set the other numeric variables to their minimum </span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">Fixed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Minimum"</span><span class="p">)</span><span class="w">
</span><span class="n">newdata_max</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">Sex</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">refdata</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="s2">"Salary"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Concealing"</span><span class="p">),</span><span class="w"> </span><span class="n">length.out</span><span class="o">=</span><span class="m">500</span><span class="p">,</span><span class="w"> </span><span class="n">numerics</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"max"</span><span class="p">)</span><span class="o">%>%</span><span class="w">  </span><span class="c1"># Set the other numeric variables to their maximum </span><span class="w">
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">Fixed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"Maximum"</span><span class="p">)</span><span class="w">
</span><span class="n">newdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rbind</span><span class="p">(</span><span class="n">newdata_min</span><span class="p">,</span><span class="w"> </span><span class="n">newdata_max</span><span class="p">)</span><span class="w">

</span><span class="n">predicted</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"prob"</span><span class="p">)</span><span class="w">
</span><span class="n">newdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">newdata</span><span class="p">,</span><span class="w"> </span><span class="n">predicted</span><span class="p">)</span><span class="w">

</span><span class="n">newdata</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">Concealing</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="n">M</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="o">=</span><span class="n">Salary</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_line</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme_classic</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">ylab</span><span class="p">(</span><span class="s2">"Probability of being a man"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">facet_wrap</span><span class="p">(</span><span class="o">~</span><span class="n">Fixed</span><span class="p">)</span><span class="w">
</span>

When all other variables are high, concealing is not related to sex for richer people. When they are set to their minimum, the concealing threshold for the two lower salary classes is higher (around 1.5).
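For readers who want to see what fixing the other numerics at their minimum or maximum amounts to, here is a rough base R equivalent. `grid_fixed_at()` is a hypothetical helper for this sketch (it is not a psycho function) and it only handles numeric columns, using the built-in `iris` data.

```r
# Hypothetical helper: spread `target`, fix every other numeric column with `fun`
grid_fixed_at <- function(data, target, fun, label, length_out = 50) {
  x <- data[[target]]
  grid <- data.frame(seq(min(x), max(x), length.out = length_out))
  names(grid) <- target
  for (name in setdiff(names(data), target)) {
    grid[[name]] <- fun(data[[name]])  # a single value, recycled over all rows
  }
  grid$Fixed <- label  # keep track of how the other variables were fixed
  grid
}

numeric_data <- iris[, 1:4]  # built-in data, numeric columns only
grid <- rbind(
  grid_fixed_at(numeric_data, "Sepal.Length", min, "Minimum"),
  grid_fixed_at(numeric_data, "Sepal.Length", max, "Maximum")
)
table(grid$Fixed)
```

Since `fun` is an argument, the same helper could fix the other variables at a quantile instead, for example by passing `function(v) quantile(v, 0.25)`.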

Chains of refdata

Let’s say we want one target of length 500 and another of length 10. To do this, we can chain refdata() calls.

<span class="n">newdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">Sex</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">refdata</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="s2">"Adjusting"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Concealing"</span><span class="p">),</span><span class="w"> </span><span class="n">length.out</span><span class="o">=</span><span class="m">500</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">refdata</span><span class="p">(</span><span class="s2">"Adjusting"</span><span class="p">,</span><span class="w"> </span><span class="n">length.out</span><span class="o">=</span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">numerics</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"combination"</span><span class="p">)</span><span class="w">

</span><span class="n">predicted</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"prob"</span><span class="p">)</span><span class="w">
</span><span class="n">newdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">newdata</span><span class="p">,</span><span class="w"> </span><span class="n">predicted</span><span class="p">)</span><span class="w">

</span><span class="n">newdata</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">mutate</span><span class="p">(</span><span class="n">Adjusting</span><span class="o">=</span><span class="n">as.factor</span><span class="p">(</span><span class="nf">round</span><span class="p">(</span><span class="n">Adjusting</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)))</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">Concealing</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="n">M</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="o">=</span><span class="n">Adjusting</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_line</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme_classic</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">ylab</span><span class="p">(</span><span class="s2">"Probability of being a man"</span><span class="p">)</span><span class="w">
</span>

The concealing threshold depends strongly on adjusting. The higher the adjusting (darker lines), the less concealing is needed to increase the probability of being a man.
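The grid behind this plot is simply two numeric axes with different resolutions. A sketch of that idea in base R, with made-up ranges (the names `ranges` and `n_points` are only for illustration):

```r
# Illustrative ranges: in practice these would be the observed min and max
ranges   <- list(Concealing = c(-2.5, 2.5), Adjusting = c(-2, 2))
n_points <- c(Concealing = 500, Adjusting = 10)  # one resolution per target

# Build one evenly spaced axis per target, each with its own length
axes <- Map(function(r, n) seq(r[1], r[2], length.out = n), ranges, n_points)

# Cross the axes: 500 x 10 rows
grid <- expand.grid(axes)
dim(grid)
```

A fine axis is useful for the variable on the x-axis (smooth lines), while a coarse one suits a variable mapped to colour or transparency, where only a handful of distinct lines can be told apart.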

Combinations of Observed Values

Let’s observe the link with Adjusting by generating a reference grid with all combinations of factors (salary, birth season, etc.) and fixing numerics at their median (we could also choose “combination” for numerics, but it would generate a very, very big dataframe with all possible combinations of values).

<span class="n">newdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">Sex</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">refdata</span><span class="p">(</span><span class="s2">"Adjusting"</span><span class="p">,</span><span class="w"> </span><span class="n">length.out</span><span class="o">=</span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">factors</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"combination"</span><span class="p">,</span><span class="w"> </span><span class="n">numerics</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"median"</span><span class="p">)</span><span class="w"> 
 
</span><span class="n">predicted</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"prob"</span><span class="p">)</span><span class="w">
</span><span class="n">newdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">newdata</span><span class="p">,</span><span class="w"> </span><span class="n">predicted</span><span class="p">)</span><span class="w">

</span><span class="n">newdata</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">Adjusting</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="n">M</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_jitter</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">width</span><span class="o">=</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">height</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.01</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_smooth</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">se</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme_classic</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">ylab</span><span class="p">(</span><span class="s2">"Probability of being a man"</span><span class="p">)</span><span class="w">
</span>

The higher the adjusting, the higher the probability of being a man. But let’s now generate many more observations.

<span class="n">newdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">df</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">select</span><span class="p">(</span><span class="o">-</span><span class="n">Sex</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">refdata</span><span class="p">(</span><span class="s2">"Adjusting"</span><span class="p">,</span><span class="w"> </span><span class="n">length.out</span><span class="o">=</span><span class="m">10000</span><span class="p">,</span><span class="w"> </span><span class="n">factors</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"combination"</span><span class="p">,</span><span class="w"> </span><span class="n">numerics</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"median"</span><span class="p">)</span><span class="w"> 
 
</span><span class="n">predicted</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">predict</span><span class="p">(</span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">newdata</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"prob"</span><span class="p">)</span><span class="w">
</span><span class="n">newdata</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">newdata</span><span class="p">,</span><span class="w"> </span><span class="n">predicted</span><span class="p">)</span><span class="w">

</span><span class="n">newdata</span><span class="w"> </span><span class="o">%>%</span><span class="w"> 
  </span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">Adjusting</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="n">M</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_jitter</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">width</span><span class="o">=</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">height</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.01</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="o">=</span><span class="m">0.2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">geom_smooth</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">se</span><span class="o">=</span><span class="kc">FALSE</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">theme_classic</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
  </span><span class="n">ylab</span><span class="p">(</span><span class="s2">"Probability of being a man"</span><span class="p">)</span><span class="w">
</span>

We can still see, “behind the scenes”, how different factors influence this relationship.
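The vertical spread of points at each value of Adjusting comes from the factor combinations in the grid. A base R sketch of such a grid, using the built-in `warpbreaks` data (two factors and one numeric) as a stand-in for our dataframe:

```r
# Collect the levels of every factor column (here: wool and tension)
factor_levels <- lapply(Filter(is.factor, warpbreaks), levels)

# Numeric target spread over 10 evenly spaced values
target <- list(
  breaks = seq(min(warpbreaks$breaks), max(warpbreaks$breaks), length.out = 10)
)

# Cross the target with all combinations of factor levels: 10 x 2 x 3 rows
grid <- expand.grid(c(target, factor_levels))
nrow(grid)
```

The number of rows is the product of the target length and the number of levels of each factor, which is why doing the same with numerics quickly gets out of hand.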

Credits

Did this package help you? Don’t forget to cite the various packages you used 🙂

You can cite psycho as follows:

  • Makowski, D. (2018). The psycho Package: An Efficient and Publishing-Oriented Workflow for Psychological Science. Journal of Open Source Software, 3(22), 470. https://doi.org/10.21105/joss.00470

Contribute

psycho is a young package and still needs some love. Therefore, if you have any advice, opinions or suggestions, we encourage you to either let us know by opening an issue or, even better, try to implement them yourself by contributing to the code.
