‘Happiness in intelligent people is the rarest thing I know’
A character in Ernest Hemingway’s novel “The Garden of Eden”
Greetings, humanists, social and data scientists!
In this lesson, we will learn how to evaluate the relationship between two variables with R. Check out the video below for a short introduction.
The Guerry dataset is provided by the R package HistData. To know more about this package, please refer to our lesson ‘Uncovering History with R – A Look at the HistData Package’.
Coding the past: the relationship between literacy and suicides in 1830s France
1. Exploring Andre-Michel Guerry’s Pioneering Data: Moral Statistics of 1830s France
Andre-Michel Guerry was a French lawyer who was passionate about statistics. He is considered to be the founder of moral statistics and had a major influence on the development of modern social science. His work “Essay on the Moral Statistics of France” includes data on several social variables of 86 French departments in the 1830s.
To access this data, we need to load the HistData package. After doing so, we can use the command `help(Guerry)` to see the description of the dataset and the details about each of the 23 variables. Variables include information such as population, crime, literacy, suicide, wealth, and location of the 86 French departments.
You can use `df <- Guerry` to load the data. Feel free to explore the dataset and check the structure of the dataframe with `str(df)`.
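The loading steps described above can be sketched as follows. This is a minimal example that assumes the HistData package is already installed; the `install.packages` call is left commented out for that reason.

```r
# install.packages("HistData")  # run once if the package is missing
library(HistData)

# Copy the Guerry dataset into a dataframe called df
df <- Guerry

# Inspect the structure: variable names, types, and first values
str(df)

# Open the dataset documentation (uncomment in an interactive session)
# help(Guerry)
```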
2. Add a new column to a dataframe in R
In the documentation of the dataset, the author states “Note that most of the variables (e.g., Crime_pers) are scaled so that ‘more is better’ morally.” Thus, suicide, for example, is expressed as the population divided by the number of suicides. In this way, the fewer the suicides, the larger the value in the `Suicides` variable.
To make our analysis easier to interpret, we can calculate the inverse of `Suicides`, that is, instead of having population/suicides, we will consider suicides/population (suicides per inhabitant). Moreover, to avoid very small numbers, let us multiply this by 100,000 so that we have suicides per 100,000 population. The code below creates this new variable.
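A minimal sketch of this transformation is shown here. The column name `Suicides_Pop` follows the name used later in this lesson; the exact expression is an assumption based on the description above.

```r
library(HistData)

df <- Guerry

# Suicides is population per suicide, so its inverse is suicides per inhabitant.
# Multiplying by 100,000 gives suicides per 100,000 population.
df$Suicides_Pop <- (1 / df$Suicides) * 100000

# Quick look at the new variable
summary(df$Suicides_Pop)
```

Base R assignment with `$` is used here to avoid extra dependencies; `dplyr::mutate` would work equally well.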
3. Use geom_point to create a scatter plot
Now, we’ll examine the relationship between `Suicides_Pop` and `Literacy` using a scatter plot. As per the documentation, `Literacy` represents the “percentage of military conscripts who can read and write” in a department. Keep in mind that the relationships studied in this lesson apply only to this subgroup, which is not representative of the whole population. The code below leverages `geom_point` to visualize this relationship.
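A scatter plot along these lines could be sketched as follows. The axis labels, title, and point transparency are illustrative assumptions, and `theme_minimal()` is used only as a stand-in because the custom theme `theme_coding_the_past` is defined in another lesson.

```r
library(HistData)
library(ggplot2)

df <- Guerry
df$Suicides_Pop <- (1 / df$Suicides) * 100000

# Each point is one French department
scatter_plot <- ggplot(df, aes(x = Literacy, y = Suicides_Pop)) +
  geom_point(alpha = 0.7) +
  labs(
    x = "Literacy (% of conscripts who can read and write)",
    y = "Suicides per 100,000 population",
    title = "Literacy and suicides in 1830s France"
  ) +
  theme_minimal()  # stand-in for theme_coding_the_past()

scatter_plot
```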
Please note, the code above incorporates the function `theme_coding_the_past` to style the plot. You can access this theme in the lesson ‘Climate Data Visualization’.
The plot suggests that as literacy percentages rise, suicide rates tend to increase. In the distribution of literacy rates below, we also see that the majority of the French departments recorded literacy rates lower than 50% (indicated by the dashed line). If you count the departments to the right of the dashed line, you will find 24 departments, which is only 24/86 ≈ 28% of the total. Notably, the highest suicide rates are in this subgroup.
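The distribution of literacy rates and the count of departments above the 50% mark can be explored with a sketch like the one below. The bin width and the use of `theme_minimal()` are assumptions made for illustration.

```r
library(HistData)
library(ggplot2)

df <- Guerry

# Histogram of literacy rates with a dashed reference line at 50%
literacy_hist <- ggplot(df, aes(x = Literacy)) +
  geom_histogram(binwidth = 5) +
  geom_vline(xintercept = 50, linetype = "dashed") +
  labs(x = "Literacy (%)", y = "Number of departments") +
  theme_minimal()

literacy_hist

# Number and share of departments with literacy above 50%
n_above_50 <- sum(df$Literacy > 50)
share_above_50 <- n_above_50 / nrow(df)
n_above_50
share_above_50
```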
4. cor.test in R
Having observed a graphical association between `Literacy` and `Suicides`, let’s use `cor.test` to find this association analytically. This function takes two arguments, `x` and `y`, and returns a Pearson correlation coefficient (by default) and its statistical significance. As explained in the lesson R programming for climate data analysis and visualization, “correlation measures how much two variables change together. It ranges from 1 to -1, where 1 means perfect positive correlation, 0 means no correlation at all and -1 means perfect negative correlation”.
Using `cor.test(x = df$Literacy, df$Suicides_Pop)` we obtain a correlation coefficient of 0.4, which means a moderate positive correlation. As literacy increases, so does the suicide rate. The p-value is less than 0.01, meaning there is a statistically significant association between `Literacy` and `Suicides_Pop`. Framed differently, under the null hypothesis that there is no correlation between the two variables, the probability of observing a coefficient at least as extreme as 0.4 would be less than 1%. So we can reject the null hypothesis.
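The test described above can be run as sketched below. The object name `cor_result` is an assumption introduced for illustration; storing the result makes it easy to extract the coefficient and p-value separately.

```r
library(HistData)

df <- Guerry
df$Suicides_Pop <- (1 / df$Suicides) * 100000

# Pearson correlation test (the default method of cor.test)
cor_result <- cor.test(x = df$Literacy, y = df$Suicides_Pop)

# Full test output
cor_result

# Individual components of the result
cor_result$estimate  # correlation coefficient
cor_result$p.value   # p-value of the two-sided test
```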
5. Linear models with R
To further study the relationship between these two variables, let’s fit three linear regression models. To know more about linear regression, check out the lesson R programming for climate data analysis and visualization.
The first model will only include `Suicides_Pop` as the dependent variable and `Literacy` as the independent variable. Use `summary(lm(Suicides_Pop ~ Literacy, data = df))` to see the results of this model. The literacy coefficient tells us that a one percentage point increase in the literacy rate is associated with an increase of 0.11 suicides per 100,000 population. Put differently, a 10 percentage point increase in literacy is associated with around 1 more suicide per 100,000 population. This estimate is statistically significant.
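As a minimal sketch, the first model can be stored in an object so that its coefficients can be inspected separately. The name `model1` is an assumption introduced here for illustration.

```r
library(HistData)

df <- Guerry
df$Suicides_Pop <- (1 / df$Suicides) * 100000

# Simple linear regression: suicides per 100,000 explained by literacy
model1 <- lm(Suicides_Pop ~ Literacy, data = df)

# Coefficients, standard errors, p-values, and fit statistics
summary(model1)

# Slope only: change in suicides per 100,000 for a one-point rise in literacy
coef(model1)["Literacy"]
```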
In the code below, we use `geom_smooth` to plot the regression line describing the positive link between literacy and suicides. The `method` argument tells ggplot to use a linear model (`lm`) to depict the relationship.
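A sketch of such a plot follows. The labels and `theme_minimal()` are assumptions made for illustration, the latter standing in for the custom theme used elsewhere in these lessons.

```r
library(HistData)
library(ggplot2)

df <- Guerry
df$Suicides_Pop <- (1 / df$Suicides) * 100000

# Scatter plot with a fitted regression line and its confidence band
regression_plot <- ggplot(df, aes(x = Literacy, y = Suicides_Pop)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm") +  # fit and draw a linear model
  labs(
    x = "Literacy (%)",
    y = "Suicides per 100,000 population"
  ) +
  theme_minimal()

regression_plot
```

By default `geom_smooth` also draws a shaded confidence interval around the line; setting `se = FALSE` hides it.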
Note that we cannot say that higher literacy rates directly cause more suicides, as factors beyond literacy rates might influence suicide rates. In the next section, we will check whether wealth and the distance to Paris influence suicides as well. Moreover, we will determine if the association between literacy and suicides holds even after controlling for these variables. To show the results, we will use stargazer, a very handy package designed for displaying linear model results.
6. How to use stargazer in R
The `stargazer` package offers a very neat and practical way of presenting the results of several linear models. Users can set it up to produce LaTeX or HTML outputs using the `type` argument. In the code that follows, we configure it to generate HTML, making it suitable for this blog post. First, we create three models adding variables indicating the wealth and distance to Paris of each department. Second, we pass these models to stargazer.
The `stargazer` table can be seen below. Note in model 2 that `Wealth` appears to be negatively associated with the suicide rate, meaning that richer areas are associated with fewer suicides. The coefficient of `Literacy` decreases a bit but remains statistically significant. Finally, model 3 includes the distance to Paris as an additional variable. The coefficient of `Literacy` decreases again but remains statistically significant. Moreover, being close to Paris is associated with more suicides.
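The two steps described above could look like the sketch below. The model names are assumptions introduced for illustration, and `Wealth` and `Distance` are assumed to be the Guerry columns holding wealth and distance to Paris; check `help(Guerry)` for how each is measured.

```r
library(HistData)
library(stargazer)

df <- Guerry
df$Suicides_Pop <- (1 / df$Suicides) * 100000

# Step 1: fit three models, adding one control variable at a time
model1 <- lm(Suicides_Pop ~ Literacy, data = df)
model2 <- lm(Suicides_Pop ~ Literacy + Wealth, data = df)
model3 <- lm(Suicides_Pop ~ Literacy + Wealth + Distance, data = df)

# Step 2: print the three models side by side as an HTML table
stargazer(model1, model2, model3, type = "html")
```

For a quick look in the console, `type = "text"` prints a plain-text version of the same table.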
| | Model 1 | Model 2 | Model 3 |
|---|---|---|---|
| Residual Std. Error | 4.360 (df = 84) | 3.938 (df = 83) | 3.720 (df = 82) |
| F Statistic | 16.826*** (df = 1; 84) | 20.321*** (df = 2; 83) | 18.841*** (df = 3; 82) |

Note: *p<0.1; **p<0.05; ***p<0.01
Like all social phenomena, the incidence of suicide is shaped by a multitude of factors. While we cannot definitively claim that literacy directly caused suicides in 19th-century France, our analysis above does indicate an association between these variables. Delving deeper into the contextual nuances of France in the 1830s might shed light on whether literacy indeed influenced the decision to commit suicide. For instance, check the article by Lisa Lieberman, “Romanticism and the Culture of Suicide in Nineteenth-Century France”.
If you are interested in this topic, The Sorrows of Young Werther, by Johann Wolfgang Goethe, is a literary representation of a particular view on suicide that would influence the Romantic movement in 19th-century Europe.
If you have any questions or would like to share your thoughts on this topic, please feel free to ask in the comments below.
- An association between two variables can be identified with a scatter plot;
- It can also be explored analytically with `cor.test`;
- Linear regression helps us further understand the relationship between two variables while controlling for other relevant variables.