Use R to explore the link between literacy and suicide in 1830s France

[This article was first published on coding-the-past, and kindly contributed to R-bloggers].


Introduction


‘Happiness in intelligent people is the rarest thing I know’

A character in Ernest Hemingway’s novel “The Garden of Eden”


Greetings, humanists, social and data scientists!


In this lesson, we will learn how to evaluate the relationship between two variables with R.





Data source

The Guerry dataset is provided by the R package HistData. To learn more about this package, please refer to our lesson ‘Uncovering History with R – A Look at the HistData Package’.




Coding the past: the relationship between literacy and suicides in 1830s France


1. Exploring Andre-Michel Guerry’s Pioneering Data: Moral Statistics of 1830s France

Andre-Michel Guerry was a French lawyer who was passionate about statistics. He is considered to be the founder of moral statistics and had a major influence on the development of modern social science. His work “Essay on the Moral Statistics of France” includes data on several social variables of 86 French departments in the 1830s.


To access this data, we need to load the HistData package. After doing so, we can use the command help(Guerry) to see the description of the dataset and the details about each of the 23 variables. Variables include information such as population, crime, literacy, suicide, wealth, and location of the 86 French departments.


You can use df <- Guerry to load the data. Feel free to explore the dataset and check the structure of the dataframe with str(df).



library(HistData)
library(ggplot2)

help(Guerry)

df <- Guerry

str(df)
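
As a quick sanity check after loading the data, the sketch below confirms the dimensions mentioned above and previews the columns this lesson relies on. Department is the department-name column in Guerry; the object name cols is only illustrative.

```r
library(HistData)

df <- Guerry

# Dimensions: the lesson describes 86 departments and 23 variables
dim(df)

# Preview the columns used later in this lesson
# ('cols' is just an illustrative helper name)
cols <- c("Department", "Literacy", "Suicides", "Wealth", "Distance")
head(df[, cols])
```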




2. Add a new column to a dataframe in R

In the documentation of the dataset, the author states: “Note that most of the variables (e.g., Crime_pers) are scaled so that ‘more is better’ morally.” Thus, suicide, for example, is expressed as the population divided by the number of suicides. In this way, the fewer the suicides, the larger the value in the Suicides column.


To make our analysis easier to interpret, we can calculate the inverse of Suicides. That is, instead of population/suicides, we will consider suicides/population (suicides per inhabitant). Moreover, to avoid very small numbers, let us multiply this by 100,000 so that we have suicides per 100,000 population. The code below creates this new variable.



df$Suicides_Pop <- (1/df$Suicides)*100000
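
To see what this transformation does, here is a small worked example. The three input values are made up for illustration and are not taken from the Guerry dataset; they only mimic the "population per suicide" form of the Suicides column.

```r
# Made-up values in the same form as Guerry's Suicides column:
# inhabitants per recorded suicide (larger means fewer suicides)
pop_per_suicide <- c(5000, 20000, 100000)

# Invert and rescale to suicides per 100,000 inhabitants
suicides_per_100k <- (1 / pop_per_suicide) * 100000

# 20, 5 and 1 suicides per 100,000 inhabitants, respectively
suicides_per_100k
```

Note how the ordering flips: the smallest original value becomes the largest rate, so higher numbers now mean more suicides.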


Tip: Pop1831 gives the population of each French department in 1831, in thousands. Running summary(df$Pop1831) shows that the least populated department had about 129,000 inhabitants and the most populated had around 990,000.




3. Use geom_point to create a scatter plot

Now, we’ll examine the relationship between Suicides_Pop and Literacy using a scatter plot. As per the documentation, Literacy represents the “percentage of military conscripts who can read and write” in a department. Keep in mind that the relationships studied in this lesson apply only to this subgroup, which is not representative of the whole population. The code below uses geom_point to visualize this relationship.



ggplot(data = df, aes(x = Literacy, y = Suicides_Pop))+
  geom_point(color = "#FF6885", size = 2)+
  geom_vline(xintercept = 50, linetype = "dashed", color = "white")+
  labs(title = "Relationship between Suicides and Literacy",
       x = "Percentage of literate conscripts",
       y = "Suicides (per 100,000 population)")+
  theme_coding_the_past()


Please note that the code above uses the function theme_coding_the_past to style the plot. This function is not part of ggplot2; you can find it in the lesson ‘Climate Data Visualization’.
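
If you do not have that lesson at hand, the plots in this post will fail because theme_coding_the_past is undefined. The sketch below is a hypothetical stand-in, not the theme from the original lesson: it simply assumes a dark background (suggested by the white dashed line used in the plots) so that the code in this post runs. The colors are arbitrary choices.

```r
library(ggplot2)

# Hypothetical stand-in for the lesson's theme (NOT the original definition).
# Assumption: a dark background, so that white lines and labels are visible.
theme_coding_the_past <- function() {
  theme_minimal() +
    theme(
      plot.background  = element_rect(fill = "#2E3031", color = NA),
      panel.background = element_rect(fill = "#2E3031", color = NA),
      panel.grid.major = element_line(color = "#4A4C4D"),
      panel.grid.minor = element_blank(),
      text             = element_text(color = "white"),
      axis.text        = element_text(color = "white")
    )
}
```

Alternatively, you can replace theme_coding_the_past() with a built-in theme such as theme_dark(), or drop the theme line entirely and change the dashed line color to something visible on a light background.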


Percentage of literate conscripts vs Suicides per inhabitants


The plot suggests that as literacy percentages rise, suicide rates tend to increase. In the distribution of literacy rates below, we also see that the majority of the French departments recorded literacy rates lower than 50% (indicated by the dashed line). If you count the departments to the right of the dashed line, you will find 24 departments, which represents only 24/86 = 28% of the total departments. Notably, the highest suicide rates are in this subgroup.



ggplot(data = df, aes(x = Literacy))+
  geom_histogram(color = "#FF6885", fill = "#FF6885",  alpha = 0.2, bins = 25)+
  geom_vline(xintercept = 50, linetype = "dashed", color = "white")+
  labs(title = "Distribution of literacy percentages",
       x = "Literacy",
       y = "Count")+
  theme_coding_the_past()



Distribution of literacy percentages
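
Rather than counting points by eye, the share of departments above the 50% line can be computed directly. This is a minimal sketch; the names n_above and share_above are illustrative, and it treats "to the right of the dashed line" as strictly greater than 50.

```r
library(HistData)

df <- Guerry

# Number of departments where more than 50% of conscripts were literate
n_above <- sum(df$Literacy > 50)

# The same count as a share of all departments
share_above <- n_above / nrow(df)

n_above
round(100 * share_above)  # percentage of departments above the line
```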




4. cor.test in R

Having observed a graphical association between Literacy and Suicides_Pop, let’s use cor.test to quantify it. This function takes two arguments, x and y, and returns the Pearson correlation coefficient (by default) together with a test of its statistical significance. As explained in the lesson ‘R programming for climate data analysis and visualization’, “correlation measures how much two variables change together. It ranges from 1 to -1, where 1 means perfect positive correlation, 0 means no correlation at all and -1 means perfect negative correlation”.


Using cor.test(x = df$Literacy, y = df$Suicides_Pop), we obtain a correlation coefficient of about 0.4, which indicates a moderate positive correlation: as literacy increases, so does the suicide rate. The p-value is less than 0.01, meaning the association between Literacy and Suicides_Pop is statistically significant at the 1% level. Framed differently, if there were truly no correlation between the two variables, the probability of observing a coefficient at least as far from zero as 0.4 would be less than 1%. So we can reject the null hypothesis of no correlation.
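
The printed output of cor.test can be long, so it is often convenient to store the result and pull out individual pieces. A minimal sketch, assuming the Suicides_Pop column is created as in section 2 (ct is just an illustrative name):

```r
library(HistData)

df <- Guerry
df$Suicides_Pop <- (1 / df$Suicides) * 100000

# Pearson correlation test (the default method of cor.test)
ct <- cor.test(x = df$Literacy, y = df$Suicides_Pop)

ct$estimate   # correlation coefficient
ct$p.value    # two-sided p-value
ct$conf.int   # 95% confidence interval for the coefficient
```

Passing method = "spearman" to cor.test would give a rank-based alternative if you are worried about outliers, although that is not the approach used in this lesson.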




5. Linear models with R

To further study the relationship between these two variables, let’s fit three linear regression models. To learn more about linear regression, check out the lesson ‘R programming for climate data analysis and visualization’.


The first model has Suicides_Pop as the dependent variable and Literacy as the only independent variable. Use summary(lm(Suicides_Pop ~ Literacy, data = df)) to see the results of this model. The Literacy coefficient tells us that a one-percentage-point increase in the literacy rate is associated with about 0.11 more suicides per 100,000 population. Put differently, a 10-percentage-point increase in literacy is associated with around 1 more suicide per 100,000 population. This estimate is statistically significant.
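
The interpretation above can be reproduced by extracting the slope from the fitted model instead of reading it off the summary. This is a small sketch; model and slope are illustrative names.

```r
library(HistData)

df <- Guerry
df$Suicides_Pop <- (1 / df$Suicides) * 100000

# Simple linear regression of the suicide rate on literacy
model <- lm(Suicides_Pop ~ Literacy, data = df)

# Estimated change in suicides per 100,000 for one additional
# percentage point of literacy
slope <- unname(coef(model)["Literacy"])
slope

# Estimated change for a 10-percentage-point difference in literacy
10 * slope
```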


In the code below, we use geom_smooth to plot the regression line describing the positive link between literacy and suicides. The method argument tells ggplot to use a linear model (lm) to depict the relationship.



ggplot(data = df, aes(x = Literacy, y = Suicides_Pop))+
  geom_point(color = "#FF6885", size = 2)+
  geom_smooth(method = "lm", color = "white", se = FALSE)+
  labs(title = "Relationship between Suicides and Literacy",
       x = "Percentage of literate conscripts",
       y = "Suicides (per 100,000 population)")+
  theme_coding_the_past()


Regression line of Suicides_Pop on Literacy, drawn with geom_smooth


Note that we cannot say that higher literacy rates directly cause more suicides, as factors beyond literacy rates might influence suicide rates. In the next section, we will check whether wealth and the distance to Paris influence suicides as well. Moreover, we will determine if the association between literacy and suicides holds even after controlling for these variables. To show the results, we will use stargazer, a very handy package designed for displaying linear model results.




6. How to use stargazer in R

The stargazer package offers a very neat and practical way of presenting the results of several linear models. Users can set it up to produce LaTeX or HTML outputs using the type argument. In the code that follows, we configure it to generate HTML, making it suitable for this blog post. First, we create three models adding variables indicating the wealth and distance to Paris of each department. Second, we pass these models to stargazer.



linear_model_01 <- lm(Suicides_Pop ~ Literacy, data = df)

linear_model_02 <- lm(Suicides_Pop ~ Literacy + Wealth, data = df)

linear_model_03 <- lm(Suicides_Pop ~ Literacy + Wealth + Distance, data = df)

library(stargazer)

stargazer(linear_model_01, linear_model_02, linear_model_03, type = "html")


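The stargazer table can be seen below. Note in model 2 that the Wealth coefficient is negative and statistically significant. Keep in mind that, according to the dataset documentation, Wealth is a ranked index based on per-capita taxes on personal property, so the sign has to be read against the direction of that ranking rather than as a change in a monetary amount. The Literacy coefficient decreases a bit but remains statistically significant. Finally, model 3 adds the distance to Paris as a further variable. The Literacy coefficient decreases again but remains statistically significant at the 5% level. Moreover, the negative Distance coefficient indicates that being close to Paris is associated with more suicides.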


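----------------------------------------------------------------------------------------------
                      Dependent variable: Suicides_Pop
                      (1)                     (2)                     (3)
----------------------------------------------------------------------------------------------
Literacy              0.112***                0.080***                0.064**
                      (0.027)                 (0.026)                 (0.025)
Wealth                                        -0.080***               -0.059***
                                              (0.018)                 (0.018)
Distance                                                              -0.014***
                                                                      (0.004)
Constant              0.645                   5.347***                7.901***
                      (1.168)                 (1.489)                 (1.604)
----------------------------------------------------------------------------------------------
Observations          86                      86                      86
R2                    0.167                   0.329                   0.408
Adjusted R2           0.157                   0.313                   0.386
Residual Std. Error   4.360 (df = 84)         3.938 (df = 83)         3.720 (df = 82)
F Statistic           16.826*** (df = 1; 84)  20.321*** (df = 2; 83)  18.841*** (df = 3; 82)
----------------------------------------------------------------------------------------------
Note: *p<0.1; **p<0.05; ***p<0.01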
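
When working interactively rather than writing a blog post, HTML output is hard to read in the console. A minimal sketch of the same comparison with type = "text", which prints a plain-text table instead (the model names mirror those used above):

```r
library(HistData)
library(stargazer)

df <- Guerry
df$Suicides_Pop <- (1 / df$Suicides) * 100000

# The same three nested models as in the lesson
linear_model_01 <- lm(Suicides_Pop ~ Literacy, data = df)
linear_model_02 <- lm(Suicides_Pop ~ Literacy + Wealth, data = df)
linear_model_03 <- lm(Suicides_Pop ~ Literacy + Wealth + Distance, data = df)

# type = "text" prints a readable table directly in the R console
stargazer(linear_model_01, linear_model_02, linear_model_03, type = "text")
```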




Like all social phenomena, the incidence of suicide is shaped by a multitude of factors. While we cannot definitively claim that literacy directly caused suicides in 19th-century France, our analysis above does indicate an association between these variables. Delving deeper into the contextual nuances of France in the 1830s might shed light on whether literacy indeed influenced the decision to commit suicide. For instance, see the article by Lisa Lieberman, “Romanticism and the Culture of Suicide in Nineteenth-Century France”.


If you are interested in this topic, The Sorrows of Young Werther, by Johann Wolfgang Goethe, is a literary representation of a particular view on suicide that would influence the Romantic movement in 19th-century Europe.


Figure: The Sorrows of Young Werther. Daniel Chodowiecki, Goethe’s Werther in his bedroom, with him lying dead on his bed. Public Domain.



If you have any questions or would like to share your thoughts on this topic, please feel free to ask in the comments below.




Conclusions


  • An association between two variables can be identified visually with a scatter plot;
  • It can also be quantified and tested with cor.test;
  • Linear regression helps us further understand the relationship between two variables while controlling for other relevant variables.


