
# Introduction

‘Happiness in intelligent people is the rarest thing I know’

A character in Ernest Hemingway’s novel “The Garden of Eden”

Greetings, humanists, social and data scientists!

In this lesson, we will learn how to evaluate the relationship between two variables with R.

# Data source

The Guerry dataset is provided by the R package HistData. To know more about this package, please refer to our lesson ‘Uncovering History with R – A Look at the HistData Package’.

# Coding the past: the relationship between literacy and suicides in 1830s France

## 1. Exploring Andre-Michel Guerry’s Pioneering Data: Moral Statistics of 1830s France

Andre-Michel Guerry was a French lawyer who was passionate about statistics. He is considered to be the founder of moral statistics and had a major influence on the development of modern social science. His work “Essay on the Moral Statistics of France” includes data on several social variables of 86 French departments in the 1830s.

To access this data, we need to load the HistData package. After doing so, we can use the command `help(Guerry)` to see the description of the dataset and the details about each of the 23 variables. Variables include information such as population, crime, literacy, suicide, wealth, and location of the 86 French departments.

You can use `df <- Guerry` to load the data. Feel free to explore the dataset and check the structure of the dataframe with `str(df)`.
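As a minimal sketch of this step, assuming the HistData package is already installed, the data can be loaded and inspected as follows. The name `df` follows the text.

```r
# install.packages("HistData")  # run once if the package is missing
library(HistData)

# Copy the Guerry dataset into a working data frame
df <- Guerry

# Inspect the structure: one row per department, one column per variable
str(df)
dim(df)
```

`dim(df)` should report the 86 departments and 23 variables mentioned above.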


## 2. Add a new column to a dataframe in R

In the documentation of the dataset, the author states: “Note that most of the variables (e.g., Crime_pers) are scaled so that ‘more is better’ morally.” Thus, suicide, for example, is expressed as the population divided by the number of suicides. In this way, the fewer the suicides, the larger the value in the `Suicides` column.

To make our analysis easier to interpret, we can calculate the inverse of `Suicides`: instead of population/suicides, we will consider suicides/population (suicides per inhabitant). Moreover, to avoid very small numbers, let us multiply this by 100,000 so that we have suicides per 100,000 inhabitants. The code below creates this new variable, `Suicides_Pop`.
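A minimal sketch of this transformation, assuming that `Suicides_Pop` is simply 100,000 divided by `Suicides` (this follows the description above, but the exact expression is an assumption):

```r
library(HistData)
df <- Guerry

# Suicides holds population per suicide, so its reciprocal is suicides per
# inhabitant. Multiplying by 100,000 gives suicides per 100,000 inhabitants.
# Assumption: this reproduces the Suicides_Pop variable used in the lesson.
df$Suicides_Pop <- 100000 / df$Suicides

# Quick look at the new variable
summary(df$Suicides_Pop)
```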


Note that `Pop1831` gives the population of each French department in 1831, in thousands. `summary(df$Pop1831)` tells us that the least populated department had about 129,000 inhabitants and the most populated had around 990,000.

## 3. Use geom_point to create a scatter plot

Now, we’ll examine the relationship between `Suicides_Pop` and `Literacy` using a scatter plot. As per the documentation, `Literacy` represents the “percentage of military conscripts who can read and write” in a department. Keep in mind that the relationships studied in this lesson apply only to this subgroup, which is not representative of the whole population. The code below uses `geom_point` to visualize this relationship.
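A minimal sketch of such a scatter plot, assuming ggplot2 is installed and `Suicides_Pop` is computed as 100,000 / `Suicides`. Here ggplot2's built-in `theme_minimal()` stands in for the lesson's custom theme, which is defined in another lesson. The object name `p_scatter` and the labels are illustrative.

```r
library(HistData)
library(ggplot2)

df <- Guerry
df$Suicides_Pop <- 100000 / df$Suicides  # assumption: suicides per 100,000

# One point per department: literacy on x, suicide rate on y
p_scatter <- ggplot(df, aes(x = Literacy, y = Suicides_Pop)) +
  geom_point(alpha = 0.7) +
  labs(
    x = "Literacy (% of conscripts who can read and write)",
    y = "Suicides per 100,000 inhabitants",
    title = "Literacy and suicides in 1830s France"
  ) +
  theme_minimal()  # stand-in for the lesson's custom theme

p_scatter
```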

Please note that the code above uses the function `theme_coding_the_past` to style the plot. You can access this theme in the lesson ‘Climate Data Visualization’.

The plot suggests that as literacy percentages rise, suicide rates tend to increase. In the distribution of literacy rates below, we also see that the majority of the French departments recorded literacy rates lower than 50% (indicated by the dashed line). If you count the departments to the right of the dashed line, you will find 24 departments, which represents only 24/86 = 28% of the total departments. Notably, the highest suicide rates are in this subgroup.
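The distribution described here can be sketched with a histogram and a dashed vertical line at 50%. The bin width, labels, and object names below are illustrative choices, not taken from the lesson.

```r
library(HistData)
library(ggplot2)

df <- Guerry

# Histogram of literacy rates with a dashed reference line at 50%
p_hist <- ggplot(df, aes(x = Literacy)) +
  geom_histogram(binwidth = 5, boundary = 0, color = "white") +
  geom_vline(xintercept = 50, linetype = "dashed") +
  labs(x = "Literacy (%)", y = "Number of departments") +
  theme_minimal()

p_hist

# Count and share of departments with literacy above 50%
n_above <- sum(df$Literacy > 50)
n_above
n_above / nrow(df)
```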


## 4. cor.test in R

Having observed a graphical association between `Literacy` and `Suicides_Pop`, let’s use `cor.test` to quantify this association. This function takes two arguments, `x` and `y`, and returns the Pearson correlation coefficient (by default) together with a test of its statistical significance. As explained in the lesson ‘R programming for climate data analysis and visualization’, “correlation measures how much two variables change together. It ranges from 1 to -1, where 1 means perfect positive correlation, 0 means no correlation at all and -1 means perfect negative correlation”.

Using `cor.test(x = df$Literacy, y = df$Suicides_Pop)` we obtain a correlation coefficient of about 0.4, which indicates a moderate positive correlation: as literacy increases, so does the suicide rate. The p-value is less than 0.01, meaning there is a statistically significant association between `Literacy` and `Suicides_Pop`. Framed differently, under the null hypothesis that there is no correlation between the two variables, the probability of finding a coefficient at least as extreme as 0.4 would be less than 1%. So we can reject the null hypothesis.
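A minimal sketch of this test, assuming `Suicides_Pop` is computed as 100,000 / `Suicides`. The object name `ct` is illustrative.

```r
library(HistData)

df <- Guerry
df$Suicides_Pop <- 100000 / df$Suicides  # assumption: suicides per 100,000

# Pearson correlation test (the default method of cor.test)
ct <- cor.test(x = df$Literacy, y = df$Suicides_Pop)

ct$estimate  # correlation coefficient
ct$p.value   # p-value of the two-sided test
ct$conf.int  # 95% confidence interval for the coefficient
```

Storing the result in an object makes it easy to pull out individual pieces instead of reading them from the printed summary.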

## 5. Linear models with R

To further study the relationship between these two variables, let’s fit three linear regression models. To learn more about linear regression, check out the lesson ‘R programming for climate data analysis and visualization’.

The first model includes only `Suicides_Pop` as the dependent variable and `Literacy` as the independent variable. Use `summary(lm(Suicides_Pop ~ Literacy, data = df))` to see the results of this model. The literacy coefficient tells us that a one-percentage-point increase in the literacy rate is associated with about 0.11 more suicides per 100,000 inhabitants. Put differently, a 10-percentage-point increase in literacy is associated with around 1 more suicide per 100,000 inhabitants. This estimate is statistically significant.
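A minimal sketch of this first model, again assuming `Suicides_Pop` is 100,000 / `Suicides`. The object name `model1` is illustrative.

```r
library(HistData)

df <- Guerry
df$Suicides_Pop <- 100000 / df$Suicides  # assumption: suicides per 100,000

# Simple linear regression: suicide rate explained by literacy
model1 <- lm(Suicides_Pop ~ Literacy, data = df)
summary(model1)

# Change in suicides per 100,000 associated with a
# 10-percentage-point increase in literacy
10 * coef(model1)[["Literacy"]]
```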

In the code below, we use `geom_smooth` to plot the regression line describing the positive link between literacy and suicides. The `method` argument tells ggplot to use a linear model (lm) to depict the relationship.
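A sketch of the plot with a fitted regression line, assuming ggplot2 is installed. `theme_minimal()` stands in for the lesson's custom theme, and `p_lm` is an illustrative name.

```r
library(HistData)
library(ggplot2)

df <- Guerry
df$Suicides_Pop <- 100000 / df$Suicides  # assumption: suicides per 100,000

# Scatter plot plus a straight line fitted by lm();
# se = TRUE draws the confidence band around the line
p_lm <- ggplot(df, aes(x = Literacy, y = Suicides_Pop)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  labs(x = "Literacy (%)", y = "Suicides per 100,000 inhabitants") +
  theme_minimal()

p_lm
```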


Note that we cannot say that higher literacy rates directly cause more suicides, as factors beyond literacy rates might influence suicide rates. In the next section, we will check whether wealth and the distance to Paris influence suicides as well. Moreover, we will determine if the association between literacy and suicides holds even after controlling for these variables. To show the results, we will use stargazer, a very handy package designed for displaying linear model results.

## 6. How to use stargazer in R

The `stargazer` package offers a very neat and practical way of presenting the results of several linear models. Users can set it up to produce LaTeX or HTML outputs using the `type` argument. In the code that follows, we configure it to generate HTML, making it suitable for this blog post. First, we create three models adding variables indicating the wealth and distance to Paris of each department. Second, we pass these models to stargazer.
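The two steps can be sketched as follows, assuming the stargazer package is installed and `Suicides_Pop` is 100,000 / `Suicides`. The model names are illustrative. `Wealth` and `Distance` are columns of the Guerry dataset.

```r
library(HistData)
library(stargazer)

df <- Guerry
df$Suicides_Pop <- 100000 / df$Suicides  # assumption: suicides per 100,000

# Step 1: three nested models, adding one control variable at a time
model1 <- lm(Suicides_Pop ~ Literacy, data = df)
model2 <- lm(Suicides_Pop ~ Literacy + Wealth, data = df)
model3 <- lm(Suicides_Pop ~ Literacy + Wealth + Distance, data = df)

# Step 2: pass the models to stargazer; type = "html" prints HTML markup
stargazer(model1, model2, model3, type = "html")
```

For a quick look in the console, `type = "text"` prints a plain-text version of the same table.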


The `stargazer` table can be seen below. Note in model 2 that `Wealth` is negatively associated with `Suicides_Pop`, meaning that richer areas are associated with fewer suicides. The coefficient of `Literacy` decreases a bit but remains statistically significant. Finally, model 3 includes the distance to Paris as an additional variable. The coefficient of `Literacy` decreases again but remains statistically significant. Moreover, being close to Paris is associated with more suicides.

| Dependent variable: Suicides_Pop | (1) | (2) | (3) |
|---|---|---|---|
| Literacy | 0.112*** (0.027) | 0.080*** (0.026) | 0.064** (0.025) |
| Wealth | | -0.080*** (0.018) | -0.059*** (0.018) |
| Distance | | | -0.014*** (0.004) |
| Constant | 0.645 (1.168) | 5.347*** (1.489) | 7.901*** (1.604) |
| Observations | 86 | 86 | 86 |
| R2 | 0.167 | 0.329 | 0.408 |
| Adjusted R2 | 0.157 | 0.313 | 0.386 |
| Residual Std. Error | 4.360 (df = 84) | 3.938 (df = 83) | 3.720 (df = 82) |
| F Statistic | 16.826*** (df = 1; 84) | 20.321*** (df = 2; 83) | 18.841*** (df = 3; 82) |

Note: standard errors in parentheses; *p<0.1; **p<0.05; ***p<0.01

Like all social phenomena, the incidence of suicide is shaped by a multitude of factors. While we cannot definitively claim that literacy directly caused suicides in 19th-century France, our analysis above does indicate an association between these variables. Delving deeper into the contextual nuances of France in the 1830s might shed light on whether literacy indeed influenced the decision to commit suicide. For instance, see Lisa Lieberman’s article “Romanticism and the Culture of Suicide in Nineteenth-Century France”.

If you are interested in this topic, The Sorrows of Young Werther, by Johann Wolfgang Goethe, is a literary representation of a particular view on suicide that would influence the Romantic movement in 19th-century Europe.

Daniel Chodowiecki. Goethe’s Werther in his bedroom, with him lying dead on his bed. Public Domain.

If you have any questions or would like to share your thoughts on this topic, please feel free to ask in the comments below.

# Conclusions

• An association between two variables can be identified with a scatter plot;
• It can also be explored analytically with `cor.test`;
• Linear regression helps us further understand the relationship between two variables while controlling for other relevant variables.