Databases for text analysis: archive and access texts using SQL

November 7, 2013
By

(This article was first published on Quantifying Memory, and kindly contributed to R-bloggers)

This post is a collection of scripts I've found useful for integrating a SQL database into more complex applications. SQL allows quickish access to largish repositories of text (I wrote about this at some length here), and are a good starting point for taking textual analysis beyond thousands of texts.



I timed Python to be thirteen times quicker than R when reading and performing basic operations on text. Consequently I decided to create a database in SQL which I accessed through a Python script. Finally I put a wrapper around all of this so I can study metadata about the texts in R. This process, then, shields R from having to see large quantities of text, and keeps both me and my computer relatively sane

All the code is available in an unseemly github dump here

We start with a folder of text files. If you need some, try these, 'borrowed' from the R Text Mining package.

First the setup: we need to be able to read the files. We do that with a handy function, which conveniently smooths over many text-encoding woes. Also we load in the required libraries (sqlite3 loads SQL, codecs allows us to work with UTF-8 characters, while os allows us to browse the operating system's file tree)



The next step is to initialise the database. This is quite straightforward, and described in wonderful detail here
The code does what it says on the box: the first line establishes a connection, which we name 'conn', and the second a cursor - which allows us to access individual records within the database. The bigger block of code is a SQL statement wrapped in a try/except phrase: this attempts to create a table if it doesn't already exist.



The syntax of the sql statement is not complicated: the first line creates a table called texts. Within this table we create three columns - id, file_name, and content. After specifying the name of the column, 'integer' and 'text' specify column type. Finally we specify that this information must not be left empty.

Now we load the files in to Python, and save them to the database



This script iterates over all the files in the folder named in the first line. It then takes each file_name, attempts to read in the corresponding file, and creates a SQL entry made up of three pieces of data:
1 - the ID, here taken from the counter.
2 - the file name, taken from the list of names
3 - the text we read in.

All this gets saved as a tuple (the name for Python's most basic data type) called entry. This gets passed to the database by using the connection we opened previously, and the SQL statement:
INSERT OF IGNORE INTO [table name] VALUES([three place holders]), [our data]

It is worth familiarising oneself with the different types of SQL operation. Here we are telling the database to add the entry unless it already exists (insert or ignore). We could also just use 'insert' - which would throw an error in the case of duplicates. Similarly, we might want to overwrite previous entries using 'replace'.
Finally we run conn.commit() which actually executes the command and writes the entry to the database.

And that is it. All done, the data is now archived away.

The eagle eyed will have noted that this is a pretty inefficient method - we are entering texts into the database individually. We can use a list of entries instead; this gives a performance boost:



Here we do exactly the same thing, except we add each entry to a list called 'holder'. Then we commit all the entries in one go using 'executemany'.

Now, we may have many files to archive, so let's not assume they would all fit into memory; instead we will set the code to archive every 100 entries, and then reset the holder. To ensure that any residual texts are also entered, we keep the final executemany command:



Using this method we can archive thousands of texts relatively efficiently, and without needing to worry about the data fitting in memory.



So, we filed away the data. How do we get it out again?

To get the first entry in the database, we run this code:


The SELECT statement allows us to retrieve one or more fields - here I am retrieving the whole entry of id, file_name, and texts. 'From' specifies that we want to fetch this data from the table named 'texts', and the fetchone() method selects the first entry in the database. Using this code we might cycle through the database's content to perform some calculation. Python receives the entry as a list; the statement above immediately unpacks it into its constitutent parts.

To get a specific entry we add a condition to the SQL statement using 'WHERE'

This allows us to name a target file_name and find the corresponding entry

Similarly, we can use the fetchall() method to get all the entries in the database (or those matching a particular condition, such as id range, date of publication, author, etc.):



WHERE statements can be used to access manageable chunks of data, but in some situations we might want to see all the texts in the archive, even though the archive doesn't fit into memory. Instead, let's use a generator to yield a batch of files.
This means we load x (here:100) number of files, do something to these, load the next 100 files, etc. Consequently, we never have more than 100 files in memory at a given time:



And that's that - now you have the necessary tools to archive a folder of files, and to retrieve them later. Next up I'll show how we can wrap all of this in a function, allowing retrieval of data from within R.

To leave a comment for the author, please follow the link and comment on his blog: Quantifying Memory.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.