# Introduction to statistical thinking

This article was first published on **R on francojc** and kindly contributed to R-bloggers.

Before we begin working on the specifics of our data project, it is important to have a clear understanding of some of the basic concepts that need to be in place to guide our work. In this post I will cover some of these topics including the importance of identifying a research question, how different statistical approaches relate to different types of research, and understanding data from a sampling and organizational standpoint. I will also provide some examples of linking research questions with variables in a toy dataset as we begin to discuss how to approach data analysis, primarily through visualization techniques.

## Research aims

Before jumping into the code, every researcher must come to a project with a clear idea about the purpose of the analysis. This means doing your homework in order to understand what it is exactly that you want to achieve; that is, you need to identify a **research question**. The first step is to become versed in the previous literature on the topic. What has been written? What are the main findings? Secondly, it is important to become familiar with the standard methods for approaching the topic of interest. How has the topic been approached methodologically? What are the types, sources, and quality of data employed? What have been the statistical approaches employed? What particular statistical tests have been chosen? Getting an overview not only of the domain-specific findings in the literature but also of the methodological choices will help you identify a promising plan for carrying out your research.

## Choosing a statistical approach

With a research question in hand and a sense of how similar studies have approached the topic methodologically, it’s time to make a more refined decision about how the data is to be analyzed. This decision will dictate all other methodological choices from data collection to interpreting results.

There are three main statistical approaches:

### Inference

Also commonly known as hypothesis testing or confirmation, statistical inference aims to establish whether there is a reliable and generalizable relationship given patterns in the data. The approach makes the starting assumption that there is no relationship, that is, that the null hypothesis (\(H_0\)) is true. A relationship is only considered reliable, or *significant*, if the probability of observing data at least as extreme as ours under the null hypothesis is less than some predetermined threshold, in which case we reject the null hypothesis in favor of the alternative hypothesis (\(H_1\)). The standard threshold used in the Social Sciences, Linguistics included, is the famous \(p < .05\). Without digging into the deeper meaning of a p-value, in a nutshell a small p-value suggests that the relationship you are investigating would be unlikely to appear in the data by chance alone. In an inference approach all the data is used and is used *only* once. This is not the case for the other two statistical approaches we will cover, Exploration and Prediction. For this reason it is vital to identify your statistical approach from the beginning. In the case of inference tests, failing to make a clear hypothesis often leads to p-hacking: the practice of running multiple tests and/or parameter settings on the same data (i.e. reusing the data) until evidence for the alternative hypothesis appears.

### Exploration

One of two statistical learning approaches, this statistical method is used to uncover potential relationships in the data and gain new insight in an area where predictions and hypotheses cannot be clearly made. In statistical learning, exploration is a type of **unsupervised learning**. Supervision here, and for Prediction, refers to the presence or absence of an outcome variable. By choosing exploration as our approach we make no assumptions (or hypotheses) about the relationships between any of the particular variables in the data. Rather we hope to investigate the extent to which we can induce meaningful patterns wherever they may lie. Findings from exploratory analyses can provide valuable insight for future study but they cannot be safely used to generalize to the larger population, which is why exploratory analyses are often known as hypothesis generating analyses (rather than hypothesis confirming). Given that our generalizing power is curtailed, the data *can* be reused multiple times to try out various tests. While it is not strictly required, data for exploratory analysis is often partitioned into two sets, training and validation, at roughly an 80%/20% split. The training set is used for refining statistical measures and the validation set is used to evaluate the refined measures. Although the evaluation results still cannot be used to generalize, the insight can be taken as stronger evidence that there is a potential relationship, or set of relationships, worthy of further study.

### Prediction

The other statistical learning approach, Prediction, aims to uncover relationships in our data as they pertain to a particular outcome variable. This approach is known as **supervised learning**. Similar to Exploration in many ways, this approach also makes no assumptions about the potential relationships between variables in our data, and the data can be used multiple times to refine our statistical tests in order to tease out the most effective method for our goals. Where an exploratory analysis aims to uncover meaningful patterns of any sort, prediction is more focused: the main aim is to ascertain the extent to which the variables in the data pattern, individually or together, in such a way as to make reliable associations to a particular outcome variable in unseen data. To evaluate the robustness of a prediction model the data is partitioned into training and validation sets. Depending on the application and the amount of available data, a third ‘development’ set is sometimes created as a pseudo test set to facilitate the testing of multiple approaches before the final evaluation. The proportions vary, but a good rule of thumb is to reserve 60% of the data for training, 20% for development, and 20% for validation.
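The 60/20/20 rule of thumb can be sketched in a few lines. The example below is in Python for illustration (the series itself works in R); the function name `split_dataset`, the fixed seed, and the 100 stand-in observations are assumptions made for this sketch, not part of the original analysis.

```python
import random

def split_dataset(rows, train=0.6, dev=0.2, seed=42):
    """Shuffle rows and split them into training, development, and validation sets."""
    shuffled = list(rows)                    # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)    # fixed seed makes the split reproducible
    n_train = round(len(shuffled) * train)
    n_dev = round(len(shuffled) * dev)
    training = shuffled[:n_train]
    development = shuffled[n_train:n_train + n_dev]
    validation = shuffled[n_train + n_dev:]  # whatever remains (about 20%)
    return training, development, validation

observations = list(range(100))  # hypothetical stand-in for 100 observations
training, development, validation = split_dataset(observations)
print(len(training), len(development), len(validation))
```

Shuffling before slicing matters: if the rows are stored in some meaningful order (by participant or by date, say), slicing without shuffling would put systematically different observations in each set.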

## Understanding data

Knowing the statistical approach to take then frames the next conceptual steps: **data sampling** and **organization of data**. But what is data anyway? Abstractly it is some set of empirical observations about the world. There are innumerable types of observations, as you can imagine, which can be used to describe objects and events. Our scientific aim is to systematically attempt to relate these observations and deduce the nature of their relationships to gain a better understanding of how our world works.

Language research aims to understand a subset of these observations, namely those that concern linguistic behavior. The psycholinguist may observe the reaction times in a lexical decision task, eye-gaze in a visual world paradigm, or electrical brain activity in an ERP study. A sociolinguist may conduct interviews with members of a community, solicit responses to a language attitude survey, or ethnographically record face-to-face encounters. A syntactician may solicit acceptability ratings, calculate the frequency of a syntactic structure in a corpus, or document the permutations of subject-verb-object order in the world’s languages. As language is a defining characteristic of our species, language-related observations feature in many other disciplines as well, such as Anthropology, History, Neurology, Mathematics, and Biology. Linguistic inquiry, then, is not isolated to linguistic form, but rather concerns the connection between linguistic form and other non-linguistic objects and events in the world at large, wherever that may take us.

### Sampling

One major limitation inherent to most data sampling, and a primary reason why statistics are so important to doing and interpreting science, is the fact that our vantage point for observing the world is restricted. We can only work with the data at our disposal, a **sample**, even when it is clear that there is a much larger existing world, or **population**. Ideally we would have access to the entire population of interest, but in most cases it is either not physically possible to obtain (or even store) the data or conceptually impossible to ever observe the entire population. As an example, say we wanted to catalog all the words in the English language. From a logistics point of view, where would we start? Any given dictionary only catalogs a subset of the words in a language; many words that are used in English-speaking communities, especially those from spoken language, will not appear. A corpus may capture linguistic diversity that does not appear in a dictionary, but it too will fall short of our lofty goal. But for argument’s sake, let’s imagine we could somehow capture all the words. What happens to our population in a day, a week, or a month from now? It quickly becomes a sample because new words are created all the time and some words are lost. Our population of words in the English language, then, is a moving target.

This transitory property of populations is well-known, and methods for obtaining reliable, or externally valid, samples are an area of study in their own right. In short, we aim for a sample to be balanced and representative of the idealized population. *Representativeness* is the extent to which a sample reflects the total diversity in the population. *Balance* is concerned with modeling the proportions of that diversity. An ideal sample combines both.

The first strategy most often applied to obtaining a valid sample is to increase *sample size*. This is an intuitive technique whose logic appeals to the notion that more is better. More is better, clearly. But more data alone does not always ensure an externally valid sample. For example, say we want to know something about the frequencies of words in written Spanish. Our target population is, then, words written in Spanish. It occurs to us that we can access a lot of written Spanish online via Project Gutenberg. We download works from many authors over a span of many years. After doing some calculations our sample contains around 100 million words. That’s a lot of words, and surely more indicative of the population than say 100 thousand words. But we have potentially overlooked something very important: all of the data in our sample comes from literature, specifically literature in the public domain. In other words, our sample is not random.

A *random sample* will help increase the potential diversity in any sample. In our sample this means drawing data from a number of written sources of Spanish at random. This strategy will increase our chances to capture written Spanish from other genres and registers. Now a 100 million word sample randomly selected from genres and registers of written Spanish is bound to be more representative of the population, but we run into another conceptual snag. Our sample is large and randomly selected from the population, but does it reflect the proportion each subgroup (genres and registers) contributes to the idealized population?

There is no absolute way of knowing if the proportions of each subgroup are balanced, or even what all the subgroups may be for that matter, but in most cases we can make an educated guess on both these fronts that will allow us to increase the validity of our sample. For example, the literary genre ‘self-help’ intuitively constitutes a smaller portion of our target population than say ‘news’. Ideally we would want to reflect this understanding in our sample. Applying this logic is known as *stratified sampling*. A large, stratified random sample is generally at least as valid as an equally sized simple random sample, with the added benefit that we are safeguarded from the large skews that a simple random sample may potentially produce. Now it is important to keep in mind that stratified sampling has its limitations as well. The difficulties posed in obtaining a valid sample from the macro view (i.e. the total population is never observable) are present at the micro view as well (i.e. sub- and sub-substrata are equally elusive). Again, there are no absolutes in sampling. The key is to keep the aim of the research question clear during the sampling process and strive for sizable, stratified random samples to minimize sampling error to the extent that it is feasible, and then work from there.
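To make the idea of stratified sampling concrete, here is a minimal Python sketch (the series itself works in R). Everything specific in it is an assumption made for illustration: the three genres, the pool sizes, and above all the proportions, which stand in for the kind of educated guess described above.

```python
import random
from collections import Counter

def stratified_sample(items_by_stratum, proportions, sample_size, seed=1):
    """Draw a random sample whose strata follow the given proportions."""
    rng = random.Random(seed)
    sample = []
    for stratum, share in proportions.items():
        k = round(sample_size * share)  # how many items to draw from this stratum
        sample.extend(rng.sample(items_by_stratum[stratum], k))
    return sample

# Hypothetical pools of documents, tagged by genre
pools = {
    "news": [("news", i) for i in range(500)],
    "literature": [("literature", i) for i in range(300)],
    "self-help": [("self-help", i) for i in range(100)],
}
# Hypothetical educated guess at each genre's share of the population
proportions = {"news": 0.6, "literature": 0.3, "self-help": 0.1}

sample = stratified_sample(pools, proportions, sample_size=50)
genre_counts = Counter(genre for genre, _ in sample)
print(genre_counts)
```

Within each stratum the draw is still random; the stratification only fixes how many items each genre contributes.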

This lack of certainty in sampling may seem troublesome. Sampling uncertainty, however, does not mean we cannot gain insight into the essence of the objects and events in the world we aim to understand. It just means we need to be aware of any given sample’s limitations, document these limitations, and always approach statistical findings based on this data with caution, suspending generalizations of an absolute nature. This is why science, contrary to popular belief, does not ‘prove’ anything. Rather science aims to collect evidence for or against a hypothesis. Since the data is always changing there are no absolute conclusions. As the evidence grows, so does the case for a particular view of how the world works. It is this systematic approach which makes science so powerful.

### Organization

Identifying and capturing a data sample moves us one step closer to performing our data analysis, but the raw or original data is often not in a format conducive to visualization or statistical tests. The hypothetical written Spanish data we identified to sample in the previous section would most likely take the form of documents of running text with potentially some meta-data about the text (author, title of the work, date published, genre, etc.) in the header of the file and/or the name of each file.

```
Title: Cuando los robots tomen el mando y hagan la guerra
Date: 3 AGO 2015 - 00:00 CEST
Genre: News
Source: El País
Tags: Científicos, Isaac Asimov, Robótica, Gente, Tecnología, Informática, Ciencia, Sociedad, Industria
La primera reflexión abarcadora sobre la coexistencia entre los robots y los humanos no fue obra de un científico de la computación ni de un filósofo ético, sino de un novelista. Isaac Asimov formuló las tres “leyes de la robótica” que deberían incorporarse en la programación de cualquier autómata lo bastante avanzado como para suponer un peligro: “No dañar a los humanos, obedecerles salvo conflicto con lo anterior y autoprotegerse salvo conflicto con todo lo anterior”. Las tres leyes de Asimov configuran una propuesta sólida y autoconsistente, y cuentan con apoyo entre la comunidad de la inteligencia artificial, que reconoce, por ejemplo, que cualquier sistema autónomo funcional debe ser capaz de autoprotegerse.
...
```

As raw data this format is fine, but to gain insight from this data, we will need to explicitly organize the attributes of our data that are key to our analysis. Our data should be in tabular, or ‘tidy’, format where each row is an observation, or **case**, and each column, or **variable**, is a list of attributes of the observation. Each cell, then, is a particular attribute of a particular observation, or **data point**. Say our objective is to perform an exploratory analysis to evaluate the potential similarities and differences in word frequencies between genres. For this particular analysis we will want to extract and organize the title of each document (`doc_id`), the genre it is from (`genre`), and each word (`word`) as a single row, or observation, in our tidy dataset.

doc_id | genre | word |
---|---|---|
Cuando los robots tomen el mando y hagan la guerra | News | sociales |
Cuando los robots tomen el mando y hagan la guerra | News | anterior |
Cuando los robots tomen el mando y hagan la guerra | News | de |
Cuando los robots tomen el mando y hagan la guerra | News | el |
Cuando los robots tomen el mando y hagan la guerra | News | sus |
Cosmografía | Astronomy | también |
Cosmografía | Astronomy | aun |
Cosmografía | Astronomy | así |
Cosmografía | Astronomy | nieves |
Cosmografía | Astronomy | de |
Heath’s Modern Language Series: El trovador | Opera | encerrar |
Heath’s Modern Language Series: El trovador | Opera | to |
Heath’s Modern Language Series: El trovador | Opera | adiós |
Heath’s Modern Language Series: El trovador | Opera | de |
Heath’s Modern Language Series: El trovador | Opera | por |

This tidy organization may seem somewhat redundant; a single value for `doc_id` is repeated for each value of `word` and a single value of `genre` is repeated for each value of `doc_id`. However, tidy data, although visually redundant, is an explicit description of the relationship between our variables. Each row corresponds to all of the necessary attributes to describe a particular observation. In this data, the occurrence of a word is associated with the file it appears in and the genre that file is associated with.

Our objective in this toy example is to explore the relationship between word frequencies and genres, yet at this point there is no explicit variable for the frequencies of words. The information we need, however, is in the data and since we have an organized, tidy dataset, calculating `word_freq` is a matter of tabulating the occurrences of each word. This can be done easily with R, as we will see in detail in future posts, but for our discussion on data organization let’s skip the details and jump to the new dataset with a column for `word_freq`.

doc_id | genre | word | word_freq |
---|---|---|---|
Cosmografía | Astronomy | de | 747 |
Cosmografía | Astronomy | la | 559 |
Cosmografía | Astronomy | el | 376 |
Cosmografía | Astronomy | en | 309 |
Cosmografía | Astronomy | que | 306 |
Cuando los robots tomen el mando y hagan la guerra | News | de | 33 |
Cuando los robots tomen el mando y hagan la guerra | News | la | 30 |
Cuando los robots tomen el mando y hagan la guerra | News | y | 20 |
Cuando los robots tomen el mando y hagan la guerra | News | los | 17 |
Cuando los robots tomen el mando y hagan la guerra | News | que | 14 |
Heath’s Modern Language Series: El trovador | Opera | to | 314 |
Heath’s Modern Language Series: El trovador | Opera | de | 217 |
Heath’s Modern Language Series: El trovador | Opera | a | 206 |
Heath’s Modern Language Series: El trovador | Opera | que | 175 |
Heath’s Modern Language Series: El trovador | Opera | _m | 172 |
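For readers curious about the tabulation step, the following is a rough Python sketch of the idea (the series itself does this in R, and the details are left for future posts). The two documents are short invented stand-ins, and splitting on whitespace is a deliberately naive assumption about tokenization.

```python
from collections import Counter

# Toy documents as (doc_id, genre, text); the texts are invented stand-ins
documents = [
    ("doc_a", "News", "la guerra de los robots y la ley de la guerra"),
    ("doc_b", "Astronomy", "el sol y la luna y el cielo"),
]

# Tidy format: one row per word occurrence
tidy_rows = [
    (doc_id, genre, word)
    for doc_id, genre, text in documents
    for word in text.lower().split()  # naive whitespace tokenization
]

# word_freq: how many times each (doc_id, genre, word) row occurs
word_freq = Counter(tidy_rows)

for (doc_id, genre, word), freq in word_freq.most_common(3):
    print(doc_id, genre, word, freq)
```

Because every row already carries its `doc_id` and `genre`, counting identical rows is all it takes to get a per-document frequency for each word.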

Other measures and/or attributes can be added as necessary to this tabular format, and in some cases we may convert our tidy tabular dataset to other data formats that may be required for some particular statistical approaches, but at all times the relationship between the variables should be maintained in line with our research purpose. We will touch on examples of other types of data formats when we dive into particular statistical approaches that require them later in the series.

### Informational value

Let’s turn now to the informational nature of our variables as it will set up how we implement our data analysis. Taking our variable `word_freq` as an example, it is important to point out there are many ways to define ‘frequency’. Some frequency measures are more appropriate than others given the statistical approach we intend to apply to our data. Our current dataset contains raw frequency scores; that is, the frequency is measured in observed counts for each word in each file of our data. We could, for example, instead bin our frequency scores under the labels “high” and “low” frequency, converting frequency from counts to labels. In this case we change the **informational value** of `word_freq`. Some variables in our dataset, on the other hand, cannot be converted. Take, for example, `genre`. The values for `genre` label the genre of the file from which the word was observed. We could of course summarize the genres under meta-genres, but we maintain labeled data; the same informational value as before.

Understanding the informational value of variables is key to organizing and preparing your data for analysis as it has implications for what insight we can gain from the data and what visualization techniques and statistical measures we can use to interrogate the data. There are four potential informational values for all data: nominal, ordinal, interval, and ratio.

- *Nominal variables* contain attributes which are labels denoting the membership in a class in which there is no relationship between the labels. Examples of nominal data include part-of-speech labels, the sex of a participant, the genre of a text, etc.
- *Ordinal variables* also contain labels of classes, but in contrast to nominal variables, there is a relationship between the classes, namely one of precedence or rank. Our frequency conversion from scores to high- and low-frequency bins is a type of ordinal data: there is an explicit ordering of these two categories. Grouping participants in a study as “young”, “middle-aged”, and “old” would also be ordinal values; again, each value can be interpreted in relationship to the other values.
- *Interval variables* are like ordinal variables in that there is an explicit precedence relationship, but in addition the values describe precise intervals between each value. So take our earlier operationalization of age as “young”, “middle-aged”, and “old”. As an ordinal variable no assumption is made that the differences in age between young and middle-aged are the same as between middle-aged and old, only that one class is ordered before or after another. If our criterion to code our values of age, however, were based on regular intervals between age groups, not some non-regular assignment, then our values of age would be interval-valued.
- *Ratio variables* have all the properties of interval variables but also include a non-arbitrary definition of zero. Frequency counts are ratio variables as it is clear that there is a potential value for 0 and any value greater can be interpreted in reference to this anchor. A word with a frequency of 100 is twice as frequent as a word with a frequency of 50. By the same token, a participant that is 20 years old is half the age of a 40 year old participant.

These informational types are often described in macro terms, grouping nominal and ordinal variables as **categorical variables** and interval and ratio variables as **continuous variables**. All continuous variables can be converted to categorical variables, but the reverse is not true. In most cases it is preferred to cast your data as continuous, if the nature of the variable permits it, as recasting continuous data as categorical data results in a loss of information, which will result in a loss of statistical power and may lead to results that obscure meaningful patterns in the data (Baayen 2004).

### Dependent and independent variables

The last step before we move to visualization and statistical tests is to identify our **dependent variable** and/or **independent variables**. A dependent variable is the outcome variable that is used in inference and prediction analyses and that reflects the observations of the behavior we want to gain understanding about. The identification of a dependent variable should be guided by your research question; it is the measure of the phenomenon in question. An independent variable is a predictor variable, or a variable which we assume will be related to the values of the dependent variable in some systematic way. There is typically only one dependent variable in an analysis, but there can be multiple independent variables. In an exploratory analysis, however, all the variables are independent variables as this approach assumes no particular relationship between the variables; the goal in this approach, remember, is to uncover patterns that may suggest a relationship between a particular set of variables.

## Data analysis

The primary goal of a data analysis is to reduce the observed data to a human-interpretable summary that best approximates the nature of the phenomenon we are investigating. With well-sampled data in a tidy dataset in hand where observations and variables are explicitly related, identified for their informational value, and the dependent and/or independent variables are clear, we can now proceed to visualizing and applying the appropriate statistical tests to the data to come to some more concrete, actionable insight.

### Visualization

It is always key to gain insight into the behavior of the data visually before jumping in to the statistical analysis. Using our research aim as our guide, we will choose the most appropriate visualization to use given the number and informational value of our target variables. To get a sense of how this looks, let’s work with an example dataset and pose different questions to the data with an eye towards seeing how various combinations of variables are visualized.

The dataset we will use here is from the TalkBank repository which provides data from various language learning contexts.^{1} The specific data we will use is the ‘narratives’ section of the BELC (Barcelona English Language Corpus) (Muñoz 2006). It is a corpus of writing samples from second language learners of English at different ages. Participants were given the task of writing for 15 minutes on the topic of “Me: my past, present and future”. Data was collected for many (but not all) participants up to four times over the course of seven years. The entire dataset includes 123 observations from 54 participants. Below I’ve included the first 10 observations from the dataset which reflects some data cleaning I’ve done so we start with a tidy dataset.

participant_id | sex | learner_group | age | tokens |
---|---|---|---|---|
01 | female | 1 | 12 | 120 |
01 | female | 2 | 14 | 78 |
02 | female | 1 | 10 | 11 |
02 | female | 2 | 12 | 43 |
02 | female | 3 | 16 | 80 |
02 | female | 4 | 17 | 26 |
03 | male | 1 | 10 | 16 |
03 | male | 2 | 12 | 28 |
04 | male | 1 | 10 | 32 |
04 | male | 2 | 12 | 73 |

The variables `participant_id`, `sex`, and `age` should be self-explanatory. `learner_group` contains the values 1-4 which record the stage for each participant formally learning English. The number of words written in each sample is listed for each participant at each stage in the variable `tokens`. We should also note the informational value of these variables. `participant_id`, `sex`, and `learner_group` are categorical variables; both `participant_id` and `sex` are nominal and `learner_group` is ordinal. `age` and `tokens` are continuous variables; both of the ratio type as they are scaled in relation to a non-arbitrary value for zero.

With a general understanding of the data, let’s run through various data analysis scenarios and their corresponding visualizations, grouping them by the informational value of the dependent variable.

**Categorical dependent variable**

- No independent variable

Starting basic, let’s say we are interested in investigating the difference in the number of `males` and `females` in our study. This is not a particularly interesting question, but it allows us to illustrate a scenario in which we have a single dependent variable, `sex`, which is categorical. When summarizing categorical data we produce counts of each of the levels of that variable. We can visualize this summary in one of two ways, textually and graphically. A text summary would look like this:

```
sex
female   male
    67     56
```

A graphic display does not necessarily facilitate a better understanding, in such a simple case, but let’s graphically visualize this scenario anyway. The type of plot we want to use is a ‘bar plot’, which simply plots the dependent variable on the x-axis and the counts on the y-axis.

Inspecting these visualizations it is clear that there is a numeric difference between the number of writing samples in the data written by women and the number written by men. At this point, however, we only have a trend. To decide whether this is a reliable contrast is the purpose of our statistical tests, but we’ll leave the details of statistical testing for this scenario, and those that follow, for subsequent posts.
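As a rough illustration of how such a summary is produced, the Python sketch below (the series itself uses R) counts the levels of `sex` for only the ten observations listed in the table above, so its counts differ from the full-dataset totals of 67 and 56. The text bars are a stand-in for a real bar plot.

```python
from collections import Counter

# `sex` for the ten observations shown earlier (six female rows, four male rows)
sex = ["female"] * 6 + ["male"] * 4

counts = Counter(sex)  # counts of each level of the categorical variable

# A crude text bar plot: one '#' per observation
for level, n in counts.items():
    print(f"{level:<7} {'#' * n} {n}")
```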

- One categorical independent variable

A more common scenario is one in which we have a categorical dependent variable and a categorical independent variable. With our data we can investigate the relationship between `sex` and the `learner_group`. Are there more males than females in a particular learner group? In this case both variables are categorical and the dimensions are such that we can textually represent them and gain some insight.

```
        learner_group
sex       1  2  3  4
  female 20 24 13 10
  male   15 23 13  5
```

It’s more difficult to see the pattern here than in the basic single dependent variable scenario for two reasons: 1) as the number of independent variables and/or the levels within an independent variable increase, our ability to interpret the results decreases, and 2) the relationship between `sex` and `learner_group` does not take into account that there are more female samples than male samples, and therefore the raw counts here can be misleading.

A graphic representation of this contrast will be a bit easier to interpret, although it is important to be aware that more variables and levels always lead to interpretability problems. The bar plot below reflects the raw counts from the cross-tabulation of the variables `sex` and `learner_group`.

Adjusting the bar plot to account for the proportions of males to females in each group provides a clearer picture of the relationship between `sex` and `learner_group`.

From this visualization it appears that there are more females in the first and last learner groups.
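The proportions behind that adjusted view can be worked out directly from the cross-tabulation shown earlier. This is a small Python sketch (the series itself uses R); the variable names are chosen for illustration.

```python
# Raw counts from the sex by learner_group cross-tabulation
counts = {
    "female": {1: 20, 2: 24, 3: 13, 4: 10},
    "male":   {1: 15, 2: 23, 3: 13, 4: 5},
}

female_share = {}
for group in (1, 2, 3, 4):
    total = counts["female"][group] + counts["male"][group]
    female_share[group] = counts["female"][group] / total  # proportion within the group
    print(f"group {group}: female {female_share[group]:.2f}, "
          f"male {1 - female_share[group]:.2f}")
```

Dividing by each group's total puts the four groups on the same scale, so a group with few samples (group 4 has only 15) can be compared with a larger one (group 2 has 47).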

- Two categorical independent variables

Let’s look at a more complex case in which we have two categorical independent variables. Now the dataset, as is, does not have a third categorical variable for us to explore but we can recast the continuous `tokens` variable as a categorical variable if we bin the scores into groups. I’ve binned `tokens` into three score groups with equal ranges in a new variable called `token_bins`.

participant_id | sex | learner_group | age | tokens | token_bins |
---|---|---|---|---|---|
01 | female | 1 | 12 | 120 | mid |
01 | female | 2 | 14 | 78 | mid |
02 | female | 1 | 10 | 11 | low |
02 | female | 2 | 12 | 43 | low |
02 | female | 3 | 16 | 80 | mid |
02 | female | 4 | 17 | 26 | low |
03 | male | 1 | 10 | 16 | low |
03 | male | 2 | 12 | 28 | low |
04 | male | 1 | 10 | 32 | low |
04 | male | 2 | 12 | 73 | mid |
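Equal-range binning of this kind can be sketched as follows, in Python for illustration (the series itself uses R). The function `equal_width_bins` is a hypothetical helper, and it is applied here to only the ten `tokens` values listed above. Because the cut points depend on the minimum and maximum of whatever data is passed in, the labels need not match the table, which was binned over all 123 observations: with these ten values alone, 120 is the maximum and so lands in “high” rather than “mid”.

```python
def equal_width_bins(values, labels=("low", "mid", "high")):
    """Assign each value to one of len(labels) bins of equal range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(labels)
    binned = []
    for v in values:
        # Clamp the index so the maximum value falls in the top bin
        index = min(int((v - lo) / width), len(labels) - 1)
        binned.append(labels[index])
    return binned

tokens = [120, 78, 11, 43, 80, 26, 16, 28, 32, 73]  # the ten rows shown above
token_bins = equal_width_bins(tokens)
print(list(zip(tokens, token_bins)))
```

Note what is lost in the conversion: 11 and 43 both become “low”, so the binned variable can no longer tell them apart.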

Adding a second categorical independent variable ups the complexity of our analysis and as a result our visualization strategy will change. As text our data will include individual two-way cross-tabulations for each of the levels for the third variable. In this case it is often best to use the variable with the fewest levels as the third variable.

```
, , sex = female

             token_bins
learner_group low mid high
            1  18   2    0
            2  18   6    0
            3   9   4    0
            4   7   2    1

, , sex = male

             token_bins
learner_group low mid high
            1  13   2    0
            2  15   7    1
            3   8   5    0
            4   2   3    0
```

To graphically visualize three categorical variables we turn to a mosaic plot.

From these visualizations we can see there is a general trend for the tokens from writing samples to increase in higher learner groups. There are some apparent divergent scores from this trend to be cautious of, however.

**Continuous dependent variable**

- No independent variable

Working with a single continuous dependent variable means that the only practical way to summarize the data is graphically, as a textual visualization would be very verbose and by and large uninterpretable. Plotting a single continuous variable often takes the form of a histogram which summarizes the frequency of the values of the dependent variable. So from our dataset, we may want to know what the distribution of `tokens` scores looks like. That is, are they normally distributed (‘bell-shaped’), skewed to the right or left (a longer tail of high or low values), or some other type of distribution (e.g. bimodal)?

The plot on the left is a standard histogram and the plot on the right is the same histogram with a density line added to highlight the distribution. From these plots we see that token counts are slightly right skewed: most scores fall in the lower range, with a longer tail to the right for higher token scores. This tail also shows some evidence of outliers, that is, scores that are uncharacteristic of the general data distribution. For many analyses plotting a histogram is a key first step in identifying the type of statistical test to use on the data, as certain tests have assumptions about how the data should be distributed for their results to be reliable. For example, a class of tests called *parametric* assume that continuous data is normally distributed (*non-parametric* tests do not make this assumption). Having plotted the data we can see that it is probably not normally distributed. All is not lost, however. There are methods for *transforming* the data that can often take mildly skewed data and bring it closer to a normal distribution.
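The following sketch shows one way to draw such a pair of histograms in base R and to check how a log transformation affects skew. The `tokens` values are simulated from a log-normal distribution (an assumption chosen because it produces a right tail), and `skew()` is a small helper defined here for illustration, not a base R function.

```r
set.seed(42)

# Simulated right-skewed token counts (not the data from this post)
tokens <- round(rlnorm(120, meanlog = 4, sdlog = 0.5))

op <- par(mfrow = c(1, 2))
hist(tokens, main = "Histogram", xlab = "tokens")
# freq = FALSE rescales the bars to densities so the density line fits
hist(tokens, freq = FALSE, main = "Histogram with density", xlab = "tokens")
lines(density(tokens))
par(op)

# Hypothetical helper: a simple moment-based skewness estimate
# (positive values indicate a longer right tail)
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skew(tokens)       # skew of the raw counts
skew_log <- skew(log(tokens))  # skew after a log transformation
c(raw = skew_raw, log = skew_log)
```

A log transformation is only one of several options, and whether it is appropriate depends on the data and the test to be applied.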

- One categorical independent variable

Another common data analysis scenario is one in which we have a continuous dependent variable and one categorical independent variable. Say we wanted to know the average number of `tokens` used by men and by women. We would use the `tokens` variable as our dependent variable and `sex` as the independent variable. We can visualize these means textually, as we are calculating the mean for each level of `sex`,

```
female male
57.14925 60.80357
```

or graphically with a box plot, in which we get a host of information about the distribution.

If you take a look at the results in the box plot you will see that the medians (the **bold horizontal lines**) are not that different. Note that the mean and the median measure different things: the mean is more sensitive to outliers, and the plot shows that there are some outliers in both the male and female data. When we statistically analyze the data these types of outliers contribute to unexplained variation, or ‘noise’ as it is often called. Noise has the effect of reducing our confidence that the differences between populations are real and not due to chance. We will see much more on this later in the series.
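Group means, group medians, and a box plot can all be obtained with a few base R calls. The sketch below runs on simulated data (the `learners` data frame and the added value of 1000 are assumptions for illustration) and also shows why the mean reacts more strongly to a single extreme score than the median does.

```r
set.seed(7)

# Simulated stand-in data: 40 female and 40 male participants
learners <- data.frame(
  sex    = rep(c("female", "male"), each = 40),
  tokens = round(rlnorm(80, meanlog = 4, sdlog = 0.5))
)

# One summary value per level of sex
group_means   <- tapply(learners$tokens, learners$sex, mean)
group_medians <- tapply(learners$tokens, learners$sex, median)
rbind(mean = group_means, median = group_medians)

# Box plot: the bold line is the median, points beyond the whiskers
# are flagged as potential outliers
boxplot(tokens ~ sex, data = learners, ylab = "tokens")

# Add one extreme (made-up) score and compare how far each summary moves
female_tokens <- learners$tokens[learners$sex == "female"]
with_outlier  <- c(female_tokens, 1000)
mean_shift    <- mean(with_outlier) - mean(female_tokens)
median_shift  <- median(with_outlier) - median(female_tokens)
c(mean_shift = mean_shift, median_shift = median_shift)
```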

- One continuous independent variable

The behavior of two continuous variables, one dependent and the other independent, is represented using a scatter plot. The value for each variable for each observation is plotted as a coordinate pair. From this mapping we evaluate the relationship, or correlation, between the variables. Sticking with `tokens` as our measure, let’s explore the extent to which `age` conditions the number of tokens used by a participant.

From a visual inspection it appears that there is a slight effect for `age` on `tokens`, namely that the older the participant the more tokens they tend to use.^{2} It is often helpful, and sometimes necessary, to add a trend line to the plot to help see the relationship more clearly. Note that the ribbon (in grey) surrounding the trend line is based on the ‘standard error’, or SE. It marks a confidence interval suggesting that the trend line could have been drawn within any part of this space. You will note that a larger ribbon width corresponds with more uncertainty about the trend, typically because the data are sparser or more variable there. This is clearly the case for the tokens used by participants at age 14.
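A scatter plot with a smoothed trend line and a ribbon of this kind can be approximated in base R with `loess()`. This is a sketch under several assumptions: the data are simulated, ages are drawn as continuous values, and the band is computed as the fitted value plus or minus 1.96 standard errors, which only approximates a 95% confidence band and may differ from how the figure in this post was drawn.

```r
set.seed(3)

# Simulated data: tokens rise with age, plus random noise
age    <- runif(100, min = 10, max = 17)
tokens <- pmax(0, round(40 + 5 * (age - 10) + rnorm(100, sd = 12)))

# Local regression gives a flexible (non-linear) trend line
fit  <- loess(tokens ~ age)
grid <- data.frame(age = seq(min(age), max(age), length.out = 50))
pred <- predict(fit, newdata = grid, se = TRUE)

# Approximate 95% band around the fitted trend
upper <- pred$fit + 1.96 * pred$se.fit
lower <- pred$fit - 1.96 * pred$se.fit

plot(age, tokens, type = "n")
polygon(c(grid$age, rev(grid$age)), c(upper, rev(lower)),
        col = "grey85", border = NA)   # the ribbon
points(age, tokens)                    # the observations
lines(grid$age, pred$fit)              # the trend line
```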

Another note on the trend line. The trend line drawn here is non-linear, which is why the line is bendy. Often when we are doing hypothesis testing we will be making the assumption that the relationship between two or more variables is linear. We can add a linear trend line to get a better understanding of the linear relationship between our variables.

Interpreting the linear representation we see there is an apparent trend for increasing values for `tokens` as `age` increases.
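A linear trend line corresponds to a simple linear regression, which base R fits with `lm()`. In this sketch the data are simulated (the slope of 5 tokens per year is an assumption built into the simulation), so the estimates say nothing about the real dataset.

```r
set.seed(3)

# Simulated data with a built-in linear increase in tokens by age
age    <- runif(100, min = 10, max = 17)
tokens <- 40 + 5 * (age - 10) + rnorm(100, sd = 12)

fit_lm <- lm(tokens ~ age)

# Estimated change in tokens per additional year of age
slope <- unname(coef(fit_lm)["age"])
slope

# Fitted line with a 95% confidence band at evenly spaced ages
grid <- data.frame(age = seq(10, 17, length.out = 50))
band <- predict(fit_lm, newdata = grid, interval = "confidence")

plot(age, tokens)
lines(grid$age, band[, "fit"])
lines(grid$age, band[, "lwr"], lty = 2)
lines(grid$age, band[, "upr"], lty = 2)
```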

Say our aim was to understand the relationship between `age` and `tokens` as a potential function of `sex`. We can incorporate `sex` as a categorical variable. The result provides us a scatter plot with two trend lines, one for each level of `sex`.

From this visualization we see there is an apparent difference between males and females. Namely, males appear to increase their token production more than females of the same age. The SE ribbons here, however, are telling. Since they overlap we should be very cautious in interpreting the difference the trend lines show. Overlapping SE ribbons suggest that either trend line could have been drawn in the shared space, and therefore it is quite possible that the visual difference will not turn out to be a statistically significant difference. Again, this is another example of why visualization is such an integral first step in data analysis.
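One way to draw a separate trend line per group, and to look at the uncertainty in the difference between the slopes, is sketched below in base R. The data are simulated with a slightly steeper slope for males (an assumption made purely for illustration), and the interaction model shown is one reasonable choice, not necessarily the analysis the series will use.

```r
set.seed(11)
n <- 120

# Simulated data: both groups increase with age, males slightly faster
learners <- data.frame(
  age = runif(n, min = 10, max = 17),
  sex = factor(rep(c("female", "male"), each = n / 2))
)
learners$tokens <- 40 + 4 * (learners$age - 10) +
  ifelse(learners$sex == "male", 1.5 * (learners$age - 10), 0) +
  rnorm(n, sd = 15)

cols <- c(female = "tomato", male = "steelblue")
plot(learners$age, learners$tokens, col = cols[as.character(learners$sex)],
     xlab = "age", ylab = "tokens")

# One linear trend line per level of sex
for (s in levels(learners$sex)) {
  abline(lm(tokens ~ age, data = subset(learners, sex == s)), col = cols[s])
}

# The age:sex interaction term estimates the difference in slopes;
# an interval that includes 0 means the difference is not well supported
fit_int <- lm(tokens ~ age * sex, data = learners)
slope_diff_ci <- confint(fit_int)["age:sexmale", ]
slope_diff_ci
```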

### Statistical tests

In the previous section various visualization strategies were illustrated through the lens of typical data analysis scenarios. To gain confidence that the trends in the data that we observe are reliable we submit the data to statistical tests. There are numerous tests available, too many to discuss at this point. But it is important to understand that, much like our visualization choices, the test we choose depends on the number of variables we are investigating and the informational value (categorical or continuous) of these variables. In addition, particular statistical tests may require a number of test-specific assumptions to be met (such as whether the data is normally distributed). We will cover these on a case-by-case basis later on in the series.
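To make the link between variable types and test choice concrete, here is a small sketch pairing a few of the scenarios above with commonly used base R tests. The data are simulated and the pairings are illustrative assumptions, not a recommendation for any particular analysis; the assumptions behind each test still need to be checked.

```r
set.seed(5)
n <- 200

# Simulated variables mirroring the types discussed in this post
sex           <- factor(sample(c("female", "male"), n, replace = TRUE))
learner_group <- factor(sample(1:4, n, replace = TRUE))
age           <- runif(n, min = 10, max = 17)
tokens        <- 40 + 4 * (age - 10) + rnorm(n, sd = 15)

# Categorical dependent, categorical independent: chi-squared test
cat_cat <- chisq.test(table(sex, learner_group))

# Continuous dependent, two-level categorical independent:
# a parametric option (t-test) and a non-parametric one (Wilcoxon)
cont_cat_param    <- t.test(tokens ~ sex)
cont_cat_nonparam <- wilcox.test(tokens ~ sex)

# Continuous dependent, continuous independent: correlation test
cont_cont <- cor.test(tokens, age)

# Collect the p-values side by side
p_values <- c(
  chisq    = cat_cat$p.value,
  t_test   = cont_cat_param$p.value,
  wilcoxon = cont_cat_nonparam$p.value,
  cor_test = cont_cont$p.value
)
p_values
```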

## Round up

In this post we covered some foundational topics for any data analysis project. Guiding the entire process is a clear research question. From there we can proceed to acquire data that is relevant and reliable on the phenomenon we hope to understand better, choose a statistical approach which matches our analysis goals, and organize our data into a format conducive for achieving these goals. Only at this point can we confidently move to analyzing our data, first beginning with appropriate visualization techniques to get a feel for the trends and then moving to performing the relevant statistical tests to provide confidence that the trends captured in our visualizations are in fact reliable.

There is still much left to discuss, in particular what statistical tests to apply to a given data analysis scenario and the assumptions behind statistical tests. We will address these topics later in the series when we have established a stronger, practical understanding of the preceding project steps and increased our proficiency programming in R. The next steps in this journey will be to learn about data types and sources and the specifics of acquiring data for language research.

## References

Baayen, R. Harald. 2004. “Statistics in Psycholinguistics: A critique of some current gold standards.” *Mental Lexicon Working Papers* 1 (1):1–47.

Gries, ST. 2013. *Statistics for Linguistics with R. A Practical Introduction*. 2nd revise.

Johnson, K. 2008. *Quantitative methods in linguistics*. Blackwell Pub.

Muñoz, Carme, ed. 2006. *Age and the rate of foreign language learning*. Clevedon: Multilingual Matters.

Wickham, Hadley, and Garrett Grolemund. 2017. *R for Data Science*. First edit. O’Reilly Media. http://r4ds.had.co.nz/.
