Data for language research: types and sources


In this Recipe you will learn about the types of data available for language research and where to find them. The goal is to introduce you to the landscape of available language data, give a general overview of the characteristics of language data from a variety of sources, and point you to resources for beginning your own quantitative investigations.

Data for language research

Language research can draw on data from a variety of sources, linguistic and non-linguistic, that record observations about the world. A typical type of data used in quantitative language research is a corpus. In short, a corpus is a set of machine-readable texts that have been compiled with an eye towards linguistic research. Not all corpora are created equal, however, in content or format. A corpus may aim to represent a wide swath of language behavior or very specific aspects of it. It can be language-specific (English or French), target a particular modality (spoken or written), or approximate domains of language use (medicine, business, etc.). A corpus that aims to represent a language as a whole (including its modalities, registers, and sub-domains) is known as a generalized corpus. Corpora that aim to capture a snapshot of a particular modality or sub-domain of language use are known as specialized corpora. Each corpus has an underlying target population, and the sampling process reflects the authors' best attempt (given the conditions at the time the corpus was compiled) to represent that target population. Between these extremes, whether a corpus is generalized or specialized can be difficult to nail down. As such, it is key to be clear about the scope of a particular corpus in order to ascertain its potential applications and gauge the extent to which those applications align with the research goals of your particular project.

A corpus will often include various types of non-linguistic attributes, or meta-data, as well. Ideally this will include information regarding the source(s) of the data, the dates when it was acquired or published, and other author or speaker information. It may also include any number of other attributes identified as potentially important for documenting the target population appropriately. Again, it is key to match the available meta-data with the goals of your research. In some cases a corpus may be ideal in some respects but not contain all the information needed to address your research question. If fundamental attributes are missing, you may need to compile your own corpus. Before you do, however, it is worth investigating the possibility of augmenting an available corpus to bring it in line with your particular goals. This may include adding new language sources, harnessing software for linguistic annotation (part-of-speech tags, syntactic structure, named entities, etc.), or linking available corpus meta-data to other resources, linguistic or non-linguistic.

Corpora come in various formats, the main three being running text, structured documents, and databases. The format of a corpus is often influenced by characteristics of the data but may also reflect an author's individual preferences. It is typical for corpora with few meta-data attributes to take the form of running text. In corpora with more meta-data, a header may be prepended to each running-text document, or the meta-data may be kept in a separate file with appropriate coding to coordinate meta-data attributes with each text in the corpus. When meta-data increases in complexity, it is common to structure each corpus document more explicitly with a markup language such as XML (Extensible Markup Language), or to organize the relationships between language and meta-data attributes in a database. Although there has been a push towards standardization of corpus formats, most available resources display some degree of idiosyncrasy. Being able to parse the structure of a corpus is a skill that develops with time. With more experience working with corpora you will become more adept at identifying how the data is stored and whether its content and format will serve the needs of your analysis.
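To make the XML case concrete, here is a minimal sketch of reading a structured corpus document in R with the xml2 package. The file name, element name, and attribute name are hypothetical placeholders; an actual corpus will document its own markup scheme.

library(xml2)

# Parse a hypothetical XML corpus file in which each <text> element
# wraps one document and meta-data is stored as attributes
doc <- read_xml("corpus_doc.xml")
texts <- xml_find_all(doc, "//text")      # one node per corpus text
registers <- xml_attr(texts, "register")  # a hypothetical meta-data attribute
contents <- xml_text(texts)               # the running text itself

With a few lines like these, the language data and its meta-data can be pulled into parallel vectors and assembled into a data frame for analysis.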

Sources of language data

The most common source of data used in contemporary quantitative research is the internet. On the web an investigator can access corpora published for research purposes as well as language used in natural settings that can be coerced into a corpus. Many organizations around the globe provide access to corpora in browsable catalogs, or repositories. There are repositories dedicated to language research in general, such as the Linguistic Data Consortium, and repositories for specific language domains, such as the language acquisition repository TalkBank. It is advisable to start your search for language data in a repository. The advantage of beginning there is that a repository, especially one geared towards the linguistic community, will make identifying language corpora faster than a general web search. Furthermore, repositories often require certain standards of corpus format and documentation for publication. A standardized resource will often be easier to interpret and to evaluate for its appropriateness for a particular research project.

In the table below I’ve compiled a list of some corpus repositories to help you get started.

Table 1: A list of some language corpora repositories.

BYU corpora: A repository of corpora that includes billions of words of data.
COW (COrpora from the Web): A collection of linguistically processed gigatoken web corpora.
Leipzig Corpora Collection: Corpora in different languages using the same format and comparable sources.
Linguistic Data Consortium: A repository of language corpora.
LRE Map: A repository of language resources collected during the submission process for the Language Resources and Evaluation Conference (LREC).
NLTK language data: A repository of corpora and language datasets included with the Python package NLTK.
OPUS – an open source parallel corpus: A repository of translated texts from the web.
TalkBank: A repository of language collections dealing with conversation, acquisition, multilingualism, and clinical contexts.
The Language Archive: Various corpora and language datasets.
The Oxford Text Archive (OTA): A collection of thousands of texts in more than 25 different languages.

Repositories are by no means the only source of corpora on the web. Researchers from around the world provide access to corpora and other data sources on their own sites or through data sharing platforms. Corpora of various sizes and scopes will often be accessible on a dedicated homepage or appear on the homepage of a sponsoring institution. Finding these resources is a matter of doing a web search with the word 'corpus' and a list of desired attributes, including language, modality, register, etc. As part of a general movement towards reproducible research, more corpora are available on the web than ever before. Data sharing platforms supporting reproducible research, such as GitHub, Zenodo, and Re3data, are therefore a good place to look as well if repository searches and targeted web searches do not yield results.

In the table below you will find a list of corpus resources and datasets.

Table 2: Corpora and language datasets.

CHILDES Treebank: Syntactically parsed child language transcripts from the CHILDES database.
The Switchboard Dialog Act Corpus: A corpus of 1155 5-minute conversations in American English, comprising 205,000 utterances and 1.4 million words, from the Switchboard corpus of telephone conversations.
Google Ngram Viewer: Google web corpus.
Enron Email Dataset: Enron email data from about 150 users, mostly senior management.
Corpus of Spanish in Southern Arizona: Spanish varieties spoken in Arizona.
OpenSubtitles2011: A collection of documents from http://www.opensubtitles.org/.
Europarl Parallel Corpus: A parallel corpus based on the proceedings of the European Parliament.
Corpus Argentino: A corpus of Argentine Spanish.
Russian National Corpus: A corpus of the modern Russian language incorporating over 300 million words.
Cornell Movie-Dialogs Corpus: A large, metadata-rich collection of fictional conversations extracted from raw movie scripts.

It is important to note that there can be access and use restrictions on data from particular sources. Compiling, hosting, and maintaining corpus resources can be costly, and to offset these costs some repositories and homepages of larger corpora require a fee for full access. In other cases, resources may require individual license agreements to ensure that the data is not used in unintended ways or that potentially sensitive participant information is treated appropriately. You can take a look at a license agreement for the BYU corpora as an example. If you are a member of an academic institution and aim to conduct research for scholarly purposes, a license is often easily obtained. Fees, on the other hand, may present a more challenging obstacle. If you are affiliated with an academic institution it is worth checking with your library to see whether there are funds for acquiring a license for you as an individual, for a research group or lab, or for the institution.

If your corpus search ends in a dead end, either because a suitable resource does not appear to exist or because an existing resource is unattainable given licensing restrictions or fees, it may be time to compile your own corpus. Turning to machine-readable texts on the internet is usually the logical first step in gathering language for a new corpus. Language texts may be found on sites as uploaded files, such as PDF or DOC (Word) documents, or displayed as the primary text of a site. Given the wide variety of documents uploaded and the amount of language behavior recorded daily on social media, news sites, blogs, and the like, compiling a corpus has never been easier. That said, how the data is structured and how much data needs to be retrieved can pose practical obstacles to collecting data from the web, particularly if the approach is to acquire the data by hand rather than automating the task. Our approach here, however, will be to automate the process as much as possible, whether that means leveraging R package interfaces to language data, converting hundreds of PDF documents to plain text, or scraping content from web documents.
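As a taste of what that automation can look like, here is a minimal sketch in R of the latter two tasks: scraping paragraph text from a web page with rvest, and converting a PDF to plain text with pdftools. The URL and file names are hypothetical placeholders.

library(rvest)     # scraping content from web documents
library(pdftools)  # converting PDF documents to plain text

# Scrape the paragraph text from a (hypothetical) web page
page <- read_html("https://example.com/article.html")
paragraphs <- html_text2(html_elements(page, "p"))

# Convert a (hypothetical) local PDF to plain text, one string per page,
# and write the result out as a plain-text file
pdf_pages <- pdf_text("article.pdf")
writeLines(pdf_pages, "article.txt")

Real sites vary in structure, so the CSS selector ("p" here) will need to be adapted to the page you are scraping; we will return to these tasks in detail in later posts.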

The table below lists some R packages that provide interfaces to language data directly from R.

Table 3: R package interfaces to language corpora and datasets.

crminer: R package interface focused on getting the user full text via the Crossref search API.
aRxiv: R package interface to query arXiv, a repository of electronic preprints for computer science, mathematics, physics, quantitative biology, quantitative finance, and statistics.
internetarchive: R package interface to query the Internet Archive.
dvn: R package interface to access the Dataverse Network APIs.
gutenbergr: R package interface to download and process public domain works from the Project Gutenberg collection.
fulltext: R package interface to query open access journals, such as PLOS.
newsflash: R package interface to query the Internet Archive and GDELT Television Explorer.
oai: R package interface to query any OAI-PMH repository, including Zenodo.
rfigshare: R package interface to query the data sharing platform FigShare.
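
To illustrate, here is a minimal sketch using gutenbergr, one of the packages above. The author filter and the work ID are just examples: gutenberg_works() searches the Project Gutenberg catalog, and gutenberg_download() retrieves a text by its Gutenberg ID.

library(gutenbergr)

# Search the Project Gutenberg catalog for works by a given author
austen_works <- gutenberg_works(author == "Austen, Jane")

# Download one work by its Gutenberg ID (1342 is Pride and Prejudice)
pride <- gutenberg_download(1342)

The result is a data frame with one row per line of text, ready for cleaning and analysis without any manual downloading or file conversion.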

Data for language research is not limited to (primary) text sources. Other sources may include processed data from previous research: word lists, linguistic features, and the like. Alone or in combination with text sources, such data can be a rich and viable basis for a research project.

Below I’ve included some processed language resources.

Table 4: Language data from previous research and meta-studies.

lingtypology: R package interface that connects to the Glottolog database and provides additional functionality for linguistic mapping.
English Lexicon Project: Access to a large set of lexical characteristics, along with behavioral data from visual lexical decision and naming studies.
The Moby lexicon project: Language wordlists and resources from the Moby project.
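
For example, here is a minimal sketch with lingtypology. The language names are arbitrary choices; map.feature() looks up their coordinates in Glottolog and plots them on an interactive map.

library(lingtypology)

# Plot a few languages on an interactive map using Glottolog coordinates
map.feature(c("Basque", "Georgian", "Korean"))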

The list of data available for language research is constantly growing, and I've documented only a few of the wide variety of resources. Below I've included attempts by others to summarize the corpus data and language resources available.

Table 5: Lists of corpus resources.

Learner corpora around the world: A listing of learner corpora around the world.
Where can you find language data on the web? A listing of various corpora and language datasets.
Stanford NLP corpora: A listing of corpora and language resources aimed at the NLP community.

Round up

In this post we have covered some of the basics of data for language research, including types of data and sources, to get you started on the path to identifying a viable source for your data analysis project. In the next post we will begin working directly with R code to access and acquire data through R. Along the way I will introduce fundamental programming concepts of the language that you will use throughout your project.

