Data Wrangling for Text Mining: Extract Individual Elements from a Book

[This article was first published on Posts | SERDAR KORUR, and kindly contributed to R-bloggers.]

My ambitious goal is to write a machine learning algorithm that predicts authors. But let’s start with something simpler. An important part of a data science workflow is data preparation: cleaning the data, reformatting it, and making it usable for further analysis.

Figure 1: Photo by Patrick Tomasso on Unsplash

I will work on a poetry book called “New Poems” by D. H. Lawrence. You can download it from the Project Gutenberg website, a library of over 60,000 free eBooks.

The goal is to isolate each poem individually for text mining analysis.

Let’s figure out a solution.

I will use the table of contents section to fish out each poem separately by using two for loops.

Load required packages

library(dplyr)
library(stringr)
library(stringi)

Copy the book: D. H. Lawrence’s “New Poems”

The Gutenberg website offers the book in a couple of slightly different formats. Since the .txt file contained some mistakes, I used the HTML version here: I copied the text, pasted it into a text editor, and saved it to my working directory.

lawrence <- readLines("posts_data/lawrence_new_poems.txt")

Our file contains 2181 lines. With square brackets [ ] we can view the lines we want. Let’s look at the first few lines:

lawrence[1:5]
## [1] ""                                                                    
## [2] "The Project Gutenberg EBook of New Poems, by D. H. Lawrence"         
## [3] ""                                                                    
## [4] "This eBook is for the use of anyone anywhere at no cost and with"    
## [5] "almost no restrictions whatsoever.  You may copy it, give it away or"

Figure 2: Photo by Thiebaud Faix on Unsplash

The book has 42 poems in total. The table of contents (TOC) starts with the line “CONTENTS” and ends with the line “ON THAT DAY”.

I will use those lines to extract the TOC. The stringr package comes in handy here: its str_which() function returns the indices of the lines that match a given pattern.

start <- str_which(lawrence, pattern = fixed("CONTENTS"))
start
## [1] 53
lawrence[start]
## [1] "CONTENTS"
# "ON THAT DAY" also appears later as a poem title, so [1] keeps only
# its first appearance, the one in the TOC.
end <- str_which(lawrence, pattern = fixed("ON THAT DAY"))[1]
end
## [1] 137
lawrence[end]
## [1] "ON THAT DAY"

Slicing the lines from 54 to 137 will give us the TOC.

TOC <- lawrence[(start+1):(end)]
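
To see the slicing logic in isolation, here is a minimal sketch on a made-up character vector. The object toy_book and its contents are invented for illustration and are not the Gutenberg file. It follows the same steps as above: find the two marker lines, then keep everything after the first marker up to and including the second.

```r
library(stringr)

# Invented stand-in for the book: a header, a table of contents, then the body
toy_book <- c("Some header", "CONTENTS", "", "FIRST POEM", "", "LAST POEM",
              "", "FIRST POEM", "line a", "", "LAST POEM", "line b")

# Line of the "CONTENTS" marker
toc_start <- str_which(toy_book, fixed("CONTENTS"))

# "LAST POEM" matches twice (TOC and body); [1] keeps the TOC occurrence
toc_end <- str_which(toy_book, fixed("LAST POEM"))[1]

# Everything after the marker, up to and including the last title
toy_toc <- toy_book[(toc_start + 1):toc_end]
toy_toc
```

The result still contains the empty lines between the titles, which is why a cleaning step is needed next.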

To remove the empty lines I will use the stri_remove_empty() function from the stringi package.

TOC <- stri_remove_empty(TOC)
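
One thing worth knowing, shown here on an invented vector: stri_remove_empty() drops only zero-length strings. If a copied text contains lines made up of spaces, a trim-based filter is one possible alternative. Whether the real file has such lines is an assumption I have not checked.

```r
library(stringi)
library(stringr)

# Invented TOC with one truly empty line and one whitespace-only line
toy_toc <- c("", "FIRST POEM", "   ", "LAST POEM", "")

# Removes "" but keeps "   ", because that string is not zero-length
no_empty <- stri_remove_empty(toy_toc)
no_empty

# Alternative: keep only lines that are non-empty after trimming whitespace
no_blank <- toy_toc[str_trim(toy_toc) != ""]
no_blank
```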

Let’s see what the clean TOC looks like.

TOC 
##  [1] "APPREHENSION"               "COMING AWAKE"              
##  [3] "FROM A COLLEGE WINDOW"      "FLAPPER"                   
##  [5] "BIRDCAGE WALK"              "LETTER FROM TOWN: THE"     
##  [7] "FLAT SUBURBS, S.W., IN THE" "THIEF IN THE NIGHT"        
##  [9] "LETTER FROM TOWN: ON A"     "SUBURBS ON A HAZY DAY"     
## [11] "HYDE PARK AT NIGHT, BEFORE" "GIPSY"                     
## [13] "TWO-FOLD"                   "UNDER THE OAK"             
## [15] "SIGH NO MORE"               "LOVE STORM"                
## [17] "PARLIAMENT HILL IN THE"     "PICCADILLY CIRCUS AT NIGHT"
## [19] "TARANTELLA"                 "IN CHURCH"                 
## [21] "PIANO"                      "EMBANKMENT AT NIGHT,"      
## [23] "PHANTASMAGORIA"             "NEXT MORNING"              
## [25] "PALIMPSEST OF TWILIGHT"     "EMBANKMENT AT NIGHT,"      
## [27] "WINTER IN THE BOULEVARD"    "SCHOOL ON THE OUTSKIRTS"   
## [29] "SICKNESS"                   "EVERLASTING FLOWERS"       
## [31] "THE NORTH COUNTRY"          "BITTERNESS OF DEATH"       
## [33] "SEVEN SEALS"                "READING A LETTER"          
## [35] "TWENTY YEARS AGO"           "INTIME"                    
## [37] "TWO WIVES"                  "HEIMWEH"                   
## [39] "DEBACLE"                    "NARCISSUS"                 
## [41] "AUTUMN SUNSHINE"            "ON THAT DAY"

Next, we will extract the main text, containing only the poems, without the TOC and other metadata. We need to slice the document from the line after the end of the contents (end) to the end of the last poem.

# After the last poem some metadata starts with "End of the Project..."
# We will slice until this line
end_main <- str_which(lawrence, "End of the Project Gutenberg EBook of New Poems, by D. H. Lawrence")
# Capture main text
lawrence_body <- lawrence[(end+1):(end_main -1)]
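
A small sketch of what this slicing does to the line numbering, using an invented toy_book: positions in the body are shifted relative to the full file by the number of lines cut from the top. This matters whenever an index found in lawrence is later used on lawrence_body.

```r
# Invented stand-in for the book (not the real file)
toy_book <- c("header", "CONTENTS", "A", "B",   # lines 1-4: header and TOC
              "A", "text 1", "B", "text 2",     # lines 5-8: body
              "End of book", "license")         # lines 9-10: metadata

toc_end  <- 4                                   # last line of the TOC
end_main <- which(toy_book == "End of book")    # a position in toy_book

# Same slicing pattern as above
toy_body <- toy_book[(toc_end + 1):(end_main - 1)]
length(toy_body)

# Line k of toy_book is line k - toc_end of toy_body
same_line <- toy_book[7] == toy_body[7 - toc_end]
same_line
```

In this toy case end_main - 1 is 8, while toy_body has only 4 lines, so the last line of the body is length(toy_body), not end_main - 1.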

Now we have the TOC and the main body of the book as two separate objects.

First for loop

We will use the TOC and a for loop to get the line index of each poem’s title.

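# First initiate an empty list
index <- list()
# For each title in the TOC, find the line(s) of the body that contain it
for (i in seq_along(TOC)) {
  index[[i]] <- str_which(lawrence_body, pattern = TOC[i])
}

# Flatten the list into a single integer vector
index <- unlist(index)
index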
##  [1]    9   37   59   82  110  126  164  192  209  253  276  314  332  347
## [15]  387  428  473  496  536  570  593  621  768  664  707  745  621  768
## [29]  901  933  958  990 1057 1100 1193 1253 1286 1313 1376 1502 1527 1571
## [43] 1606 1644

The for loop uses each title in TOC as a pattern in str_which() and stores the index of every line of lawrence_body that matches it.

For example, TOC[1] uses the title of the first poem as the pattern and returns the line number where that poem starts. At the end, we have the starting line of each poem.

TOC[1]
## [1] "APPREHENSION"
str_which(lawrence_body, pattern = TOC[1])
## [1] 9
# e.g. The poem Apprehension starts at line index number 9
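
By default str_which() treats the pattern as a regular expression, so a character like the dot in a title matches any character. On this book that probably makes no difference, but as a hedged illustration on invented lines, wrapping the title in fixed() forces a literal match:

```r
library(stringr)

# Invented lines: the second one only resembles the title
toy_lines <- c("FLAT SUBURBS, S.W., IN THE",
               "FLAT SUBURBS, SxWy, IN THE")

# As a regular expression, each "." matches any single character
regex_hits <- str_which(toy_lines, "S.W.")
regex_hits

# With fixed(), the dots are matched literally
fixed_hits <- str_which(toy_lines, fixed("S.W."))
fixed_hits
```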

Selecting the lines from the beginning of the first poem up to the beginning of the second poem gives us the first poem. Repeating this for each consecutive pair of indices captures all 42 poems.

Since EMBANKMENT AT NIGHT is the first line of two different poem titles, each of its two TOC entries matched both poems, so the indices 621 and 768 appear twice. To correct this, I will remove the first appearance of 768 (position 23) and the second appearance of 621 (position 27).

index <- index[-c(23,27)]
index
##  [1]    9   37   59   82  110  126  164  192  209  253  276  314  332  347
## [15]  387  428  473  496  536  570  593  621  664  707  745  768  901  933
## [29]  958  990 1057 1100 1193 1253 1286 1313 1376 1502 1527 1571 1606 1644
length(index)
## [1] 42
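# Not to miss the last poem, I have to add one more index marking the
# end of the main text. end_main is a position in `lawrence`, not in
# `lawrence_body`, so using end_main - 1 would run past the end of the body
# and pad the last poem with NA values. The + 1 offsets the one-line shift
# applied when slicing in the second loop.
index[43] <- length(lawrence_body) + 1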
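
As a possible alternative to removing positions by hand, duplicated matches can be dropped with unique() and the result sorted. The sketch below uses invented titles and lines (toy_toc, toy_body); it assumes that no title also occurs inside an ordinary line of a poem.

```r
library(stringr)

# Invented body: two poems share the same first title line
toy_body <- c("", "FIRST POEM", "line a",
              "", "TWIN TITLE,", "PART ONE", "line b",
              "", "TWIN TITLE,", "PART TWO", "line c")

# Invented TOC: the shared title line is listed twice
toy_toc <- c("FIRST POEM", "TWIN TITLE,", "TWIN TITLE,")

# Each "TWIN TITLE," entry matches both poems, so 5 and 9 appear twice
hits <- unlist(lapply(toy_toc, function(title) {
  str_which(toy_body, fixed(title))
}))
hits

# Drop the repeats and restore document order
title_lines <- sort(unique(hits))
title_lines
```

On a vector like the one above, this would avoid having to look up which positions to delete.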

Now we have 42 index numbers matching the title of each poem, plus 1 index number marking the end of the main text. We will use those to extract the poems separately.

Second for loop

It’s time for the trick. Finally, we can capture each of the 42 poems separately in a list by using a second for loop.

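# Create an empty list: poems
poems <- list()
for (i in 1:42) {
  # `:` binds tighter than `-`, so this is (index[i]:index[i + 1]) - 1:
  # each slice runs from the blank line above a title
  # to the line above the next title
  poems[[i]] <- lawrence_body[(index[i]:index[i + 1]) - 1]
}
# Print the first poem
writeLines(poems[[1]])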
## 
## APPREHENSION
## AND all hours long, the town
##        Roars like a beast in a cave
##      That is wounded there
##      And like to drown;
##        While days rush, wave after wave
##      On its lair.
## 
##      An invisible woe unseals
##        The flood, so it passes beyond
##      All bounds: the great old city
##      Recumbent roars as it feels
##        The foamy paw of the pond
##      Reach from immensity.
## 
##      But all that it can do
##        Now, as the tide rises,
##      Is to listen and hear the grim
##      Waves crash like thunder through
##        The splintered streets, hear noises
##      Roll hollow in the interim.
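
The empty first line in the printed poem is a consequence of operator precedence in the slicing expression: in R, `:` is evaluated before `-`, so an expression of the form a:b - 1 shifts the whole range down by one instead of shortening it by one. A tiny sketch with invented numbers:

```r
# Invented title positions (not the real index vector)
starts <- c(2, 5, 9)

# Parsed as (2:5) - 1: the whole range moves down by one
shifted <- starts[1]:starts[2] - 1

# Parentheses around the end point: the range stops one line earlier
shortened <- starts[1]:(starts[2] - 1)

shifted
shortened
```

Either form can work as long as the final end-of-text index is chosen to match it. The second form would make each slice start at the title line itself.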

Let’s check if we got what we wanted.

str(poems)
## List of 42
##  $ : chr [1:29] "" "APPREHENSION" "AND all hours long, the town" "       Roars like a beast in a cave" ...
##  $ : chr [1:23] "" "COMING AWAKE" "WHEN I woke, the lake-lights were quivering on the" "          wall," ...
##  $ : chr [1:24] "" "FROM A COLLEGE WINDOW" "THE glimmer of the limes, sun-heavy, sleeping," "        Goes trembling past me up the College wall." ...
##  $ : chr [1:29] "" "FLAPPER" "LOVE has crept out of her sealéd heart" "       As a field-bee, black and amber," ...
##  $ : chr [1:17] "" "BIRDCAGE WALK" "WHEN the wind blows her veil" "       And uncovers her laughter" ...
##  $ : chr [1:39] "" "LETTER FROM TOWN: THE" "ALMOND TREE" "YOU promised to send me some violets. Did you" ...
##  $ : chr [1:29] "" "FLAT SUBURBS, S.W., IN THE" "MORNING" "THE new red houses spring like plants" ...
##  $ : chr [1:18] "" "THIEF IN THE NIGHT" "LAST night a thief came to me" "       And struck at me with something dark." ...
##  $ : chr [1:45] "" "LETTER FROM TOWN: ON A" "GREY EVENING IN MARCH" "THE clouds are pushing in grey reluctance slowly" ...
##  $ : chr [1:24] "" "SUBURBS ON A HAZY DAY" "     O STIFFLY shapen houses that change not," "       What conjuror's cloth was thrown across you," ...
##  $ : chr [1:39] "" "HYDE PARK AT NIGHT, BEFORE" "THE WAR" "     Clerks." ...
##  $ : chr [1:19] "" "GIPSY" "     I, THE man with the red scarf," "        Will give thee what I have, this last week's earn-" ...
##  $ : chr [1:16] "" "TWO-FOLD" "     How gorgeous that shock of red lilies, and larkspur" "         cleaving" ...
##  $ : chr [1:41] "" "UNDER THE OAK" "     You, if you were sensible," "     When I tell you the stars flash signals, each one" ...
##  $ : chr [1:42] "" "SIGH NO MORE" "THE cuckoo and the coo-dove's ceaseless calling," "                    Calling," ...
##  $ : chr [1:46] "" "LOVE STORM" "MANY roses in the wind" "     Are tapping at the window-sash." ...
##  $ : chr [1:24] "" "PARLIAMENT HILL IN THE" "EVENING" "THE houses fade in a melt of mist" ...
##  $ : chr [1:41] "" "PICCADILLY CIRCUS AT NIGHT" "     Street-Walkers." "WHEN into the night the yellow light is roused like" ...
##  $ : chr [1:35] "" "TARANTELLA" "SAD as he sits on the white sea-stone" "     And the suave sea chuckles, and turns to the moon," ...
##  $ : chr [1:24] "" "IN CHURCH" "IN the choir the boys are singing the hymn." "             The morning light on their lips" ...
##  $ : chr [1:29] "" "PIANO" "     Softly, in the dusk, a woman is singing to me;" "     Taking me back down the vista of years, till I see" ...
##  $ : chr [1:44] "" "EMBANKMENT AT NIGHT," "BEFORE THE WAR" "     Charity." ...
##  $ : chr [1:44] "" "PHANTASMAGORIA" "RIGID sleeps the house in darkness, I alone" "     Like a thing unwarrantable cross the hall" ...
##  $ : chr [1:39] "" "NEXT MORNING" "     How have I wandered here to this vaulted room" "     In the house of life?—the floor was ruffled with gold" ...
##  $ : chr [1:24] "" "PALIMPSEST OF TWILIGHT" "DARKNESS comes out of the earth" "       And swallows dip into the pallor of the west;" ...
##  $ : chr [1:134] "" "EMBANKMENT AT NIGHT," "BEFORE THE WAR" "     Outcasts." ...
##  $ : chr [1:33] "" "WINTER IN THE BOULEVARD" "THE frost has settled down upon the trees" "     And ruthlessly strangled off the fantasies" ...
##  $ : chr [1:26] "" "SCHOOL ON THE OUTSKIRTS" "     How different, in the middle of snows, the great" "          school rises red!" ...
##  $ : chr [1:33] "" "SICKNESS" "WAVING slowly before me, pushed into the dark," "     Unseen my hands explore the silence, drawing the" ...
##  $ : chr [1:68] "" "EVERLASTING FLOWERS" "WHO do you think stands watching" "       The snow-tops shining rosy" ...
##  $ : chr [1:44] "" "THE NORTH COUNTRY" "IN another country, black poplars shake them-" "         selves over a pond," ...
##  $ : chr [1:94] "" "BITTERNESS OF DEATH" "     I" "AH, stern, cold man," ...
##  $ : chr [1:61] "" "SEVEN SEALS" "SINCE this is the last night I keep you home," "     Come, I will consecrate you for the journey." ...
##  $ : chr [1:34] "" "READING A LETTER" "SHE sits on the recreation ground" "       Under an oak whose yellow buds dot the pale" ...
##  $ : chr [1:28] "" "TWENTY YEARS AGO" "ROUND the house were lilacs and strawberries" "       And foal-foots spangling the paths," ...
##  $ : chr [1:64] "" "INTIME" "RETURNING, I find her just the same," "     At just the same old delicate game." ...
##  $ : chr [1:127] "" "TWO WIVES" "     I" "INTO the shadow-white chamber silts the white" ...
##  $ : chr [1:26] "" "HEIMWEH" "FAR-OFF the lily-statues stand white-ranked in the" "         garden at home." ...
##  $ : chr [1:45] "" "DEBACLE" "THE trees in trouble because of autumn," "       And scarlet berries falling from the bush," ...
##  $ : chr [1:36] "" "NARCISSUS" "WHERE the minnows trace" "     A glinting web quick hid in the gloom of the brook," ...
##  $ : chr [1:39] "" "AUTUMN SUNSHINE" "THE sun sets out the autumn crocuses" "       And fills them up a pouring measure" ...
##  $ : chr [1:173] "" "ON THAT DAY" "   ON that day" "     I shall put roses on roses, and cover your grave" ...
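
str() shows only the first few elements of each poem, so it can hide problems further down, such as NA values picked up when a slice runs past the end of the text. Here is a minimal sketch of some extra checks, on an invented list toy_poems that mimics the structure above (blank line first, title second):

```r
# Invented list with the same layout as `poems`; the second element
# ends in NA to imitate a slice that ran past the end of the text
toy_poems <- list(
  c("", "FIRST POEM", "line a"),
  c("", "LAST POEM", "line b", NA)
)

# Name each element after its title (assumed to be the second line)
titles <- vapply(toy_poems, function(p) p[2], character(1))
names(toy_poems) <- titles

# Number of lines per poem
poem_lengths <- lengths(toy_poems)
poem_lengths

# Flag poems that contain missing lines
has_na <- vapply(toy_poems, anyNA, logical(1))
has_na
```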

Final Thoughts

Data preparation is a crucial step in data science, as data rarely comes ready to use.

Here, starting from a poetry book, I isolated each poem as a separate element of a list. The hard part is done. Now I can examine how many rhymes each poem contains, how word usage varies across poems, how similar the poems are to each other, and much more, to gain insights about the author.
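
As a first taste of that kind of analysis, here is a rough sketch that counts words per poem with stringr. The list toy_poems and the helper count_words() are invented for this example, and the pattern [a-z']+ is a deliberately simple tokenizer that ignores accented letters.

```r
library(stringr)

# Invented named list standing in for the real `poems`
toy_poems <- list(
  "FIRST POEM" = c("the town roars", "the town sleeps"),
  "LAST POEM"  = c("roses on roses")
)

# Hypothetical helper: lower-case the lines, pull out the words,
# and tabulate them from most to least frequent
count_words <- function(lines) {
  words <- unlist(str_extract_all(str_to_lower(lines), "[a-z']+"))
  sort(table(words), decreasing = TRUE)
}

# One frequency table per poem
word_counts <- lapply(toy_poems, count_words)
word_counts
```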

I could also analyze the whole book as a single document, but isolating each poem lets me compare them with one another and gain deeper insight from the data.

Do you apply similar techniques to isolate chapters or sections of books or documents in order to compare and contrast their parts?

Thank you for reading this post. Please feel free to comment below with your thoughts/feedback.
