Turning Kindle notes into tidy data

[This article was first published on Clean Code, and kindly contributed to R-bloggers].

It is my dream to do everything with R. And we aRe almost there. We can write blogs in blogdown and books in bookdown, write reports in RMarkdown (thank you Yihui Xie!), create interactive webpages with Shiny (thank you Winston Chang), control our lifx lights with lifxr (great work Carl!) and use emoticons everywhere with the emo package.

There is even a novel about my vision! I recently found chapter 40 of A Dr. Primestein Adventure™ The Day the Priming Stopped. One scene in there reads:

“This Fortress is a monumental technological achievement,” explained Professor Power. “Every aspect of the Fortress’s security is run by R.” As they arrived at the metal doors, the Professor pressed a small button on the wall to the right. “This is an elevatoR, run by its own R package.” They waited for the doors to open, but nothing happened. After a few minutes of alternately waiting and then mashing the elevatoR button, Professor Power called someone on his mobile phone. “The elevatoR is not working…what? Why would they do that?…call Hadley Wickham!…doesn’t anyone around here check packages against the development version of R before upgrading?…yes, we’ll wait.” “Someone upgraded R without permission. Should be fixed soon,” Professor Power explained.

But enough about jokeRs and jesteRs. As it is my lifelong mission to do everything in R, and preferably in the tidyverse, I found something that wasn’t tidy! Kindle notes!

Kindle notes and highlights

I have a 2010 Kindle to read e-books on, and once in a while I write a note or highlight some text in a book. If you connect your Kindle to a computer, you can extract the highlights by copying the file `My Clippings.txt` to your computer.

This is great: it’s a text file, which means you can open it on every computer and search through the contents. However…

It’s not tidy.

Let’s change that. The general procedure is thus:

  1. Create a new project in RStudio
  2. Create a new folder called data (or don’t, but really, this is neat, isn’t it?)
  3. Copy the My Clippings.txt file to that data folder
  4. Load the tidyverse: `library(tidyverse)`
  5. Hammer away until the txt file is a data frame
  6. Profit?

What is in this text file?

First we do some exploratory work on the file. I’ve found that the text file is structured in a particular way:

title  (author)
- Highlight on Page 128 | Loc. 1962-68  | Added on Sunday, December 27, 2015, 03:09 PM
<empty line>
highlighted text
==========
title of the next highlighted book (author)
etc.

So how do we force this into a data frame?

Recognize the structure (we will create functions for that):

  • Chunks end with ten `=` signs; we can split on that
  • The first line is the title and (author)
  • We can separate the author and the title
  • The next line of information is divided by ‘|’ signs:
  • type, page, location, added date and time (in American format, of course…)
  • highlighted text (or, if it is a bookmark, nothing)
```r
library(tidyverse)
```
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr

## Conflicts with tidy packages ----------------------------------------------

## filter(): dplyr, stats
## lag():    dplyr, stats
```r
raw_text <- read_file("data/My Clippings.txt")  # read in the text file
per_chunk <- unlist(strsplit(raw_text, "==========", fixed = TRUE))  # separate into chunks
per_chunk[4]
```
## [1] "\r\nThe Clean Coder_ A Code of Conduct For Professional Programmers - Robert C. Martin (Robert C. Martin)\r\n- Highlight on Page 90 | Added on Monday, January 25, 2016, 04:06 PM\r\n\r\nK ATA In martial arts, a kata is a precise set of choreographed movements that simulates one side of a combat. The goal, which is asymptotically approached, is perfection. The artist strives to teach his body to make each movement perfectly and to assemble those movements into fluid enactment. Well-executed kata are beautiful to watch. Beautiful though they are, the purpose of learning a kata is not to perform it on stage. The purpose is to train your mind and body how to react in a particular combat situation. The goal is to make the perfected movements automatic and instinctive so that they are there when you need them. A programming kata is a precise set of choreographed keystrokes and mouse movements that simulates the solving of some programming problem. You aren’t actually solving the problem because you already know the solution. Rather, you are practicing the movements and decisions involved in solving the problem.\r\n"

Above I have created separate chunks that represent separate highlights, and printed an example so you can see what I see.
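Two details of this split are easy to miss: the piece after the final `==========` contains only a line break, and the very first chunk does not start with a line break the way the example chunk above does. A minimal sketch of a split that evens this out, using a made-up two-note sample instead of the real file (the helper name `split_clippings` is hypothetical and not part of my script):

```r
# Made-up sample in the same layout as 'My Clippings.txt' (not real notes).
sample_text <- paste0(
  "Book One (Author A)\r\n",
  "- Highlight on Page 1 | Loc. 10-12  | Added on Sunday, December 27, 2015, 03:09 PM\r\n",
  "\r\n",
  "first highlight\r\n",
  "==========\r\n",
  "Book Two (Author B)\r\n",
  "- Bookmark on Page 5 | Added on Monday, January 25, 2016, 04:06 PM\r\n",
  "\r\n",
  "\r\n",
  "==========\r\n"
)

# Hypothetical helper: split on the separator and normalise the pieces.
split_clippings <- function(text) {
  chunks <- strsplit(text, "==========", fixed = TRUE)[[1]]
  # drop pieces that contain only whitespace (the bit after the last separator)
  chunks <- chunks[trimws(chunks) != ""]
  # the very first chunk has no leading line break; add one so that
  # the title sits on line 2 in every chunk
  needs_break <- !startsWith(chunks, "\r\n")
  chunks[needs_break] <- paste0("\r\n", chunks[needs_break])
  chunks
}

chunks <- split_clippings(sample_text)
length(chunks)  # 2 notes remain
```

With this in place, the functions below can keep assuming that the title is always the second line of a chunk.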

Now for extracting the separate elements. I create functions that each do one thing.

```r
# This function takes a chunk of character information
# and separates it into lines.
seperate_into_lines <- function(chunk){
    result <- stringr::str_split(chunk, "\r\n")
    unlist(result)
}
# result <- seperate_into_lines(per_chunk[100])  # testing if this works
## you should put this into a formal test framework such as testthat if you
## build a package.

# Extract the title and remove the author.
# This function presumes that you already split the chunk into lines
# with seperate_into_lines().
extract_title <- function(linechunk){
    # the title is on the second line (the first line is empty)
    titleline <- linechunk[2]
    title <- gsub("\\(.*\\)", "", titleline)  # it took me some
            # time to work this regular expression out.
    stringr::str_trim(title, side = "both")  # remove whitespace at ends
}
# extract_title(result)  # test case to see if it works for me.

# Extract the author from the chunk. This function looks
# very much like the one above; it uses the same logic.
extract_author <- function(linechunk){
    # the author is on the second line as well
    titleline <- linechunk[2]  # identical
    author <- stringr::str_extract(titleline, "\\(.*\\)")  # extract the part in parentheses
    author <- gsub("\\(|\\)", "", author)  # drop the parentheses themselves
    stringr::str_trim(author, side = "both")
}
# extract_author(result)
```
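One caveat with the pattern `\\(.*\\)`: it is greedy, so when a title line contains more than one pair of parentheses (a series name, say), everything from the first `(` to the last `)` ends up in the author. That may be why some authors in the output further down carry a series prefix. A minimal sketch of an alternative that only looks at the last pair of parentheses; the title line is made up and the `_last` helper names are hypothetical:

```r
library(stringr)

# Hypothetical title line with a series name in parentheses (made up).
titleline <- "Some Novel (Some Series #1) (Jane Doe)"

# The greedy pattern spans from the first "(" to the last ")".
greedy <- str_extract(titleline, "\\(.*\\)")

# Alternative sketch: only the last parenthesised group, anchored at the end.
extract_author_last <- function(titleline) {
  m <- str_match(titleline, "\\(([^()]*)\\)\\s*$")
  m[, 2]  # first capture group: the text inside the last parentheses
}
extract_title_last <- function(titleline) {
  str_trim(str_remove(titleline, "\\s*\\([^()]*\\)\\s*$"))
}

greedy                          # "(Some Series #1) (Jane Doe)"
extract_author_last(titleline)  # "Jane Doe"
extract_title_last(titleline)   # "Some Novel (Some Series #1)"
```

This assumes the author is always the final parenthesised group on the line, which matches the structure described earlier but is still an assumption about the file format.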

Let’s see if this works on a subset of the data. I usually take multiple notes in one book before I open another, so the first 20 notes are really boring and all from the same book. To spice this up I take a random subset of notes. I use a simple for-loop here; it works much the same as a functional approach, but is more explicit.

Some people will tell you that for-loops are slow in R, or that ‘loops are bad’ but they don’t know what they are talking about.[1]
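What usually makes a loop slow in R is growing an object inside it, not the loop itself. A small sketch, with made-up data, of a loop that fills a pre-allocated vector next to its `vapply()` equivalent; both give the same result:

```r
words <- c("kindle", "notes", "tidy", "data")  # made-up input

# for-loop writing into a pre-allocated vector
n_chars_loop <- integer(length(words))
for (i in seq_along(words)) {
  n_chars_loop[i] <- nchar(words[i])
}

# the same computation in functional style
n_chars_apply <- vapply(words, nchar, integer(1), USE.NAMES = FALSE)

identical(n_chars_loop, n_chars_apply)  # TRUE
```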

I first create a data_frame [2] and pre-populate it.

```r
# testset <- per_chunk[1:20]  # You would use this if you want the first 20 pieces.
set.seed(4579)  # if you do random stuff, it is wise to
# set the seed so that others can reproduce your work.
testset <- per_chunk[base::sample(x = seq_along(per_chunk), size = 20)]
# dplyr has similarly named helpers (sample_n(), sample_frac()). To make
# explicit that I want the 'normal' one, I specify the name of the package
# followed by two ':'.
testingframe <- tibble(  # tibble() is the current name for data_frame()
    author = character(length = length(testset)),
    title = character(length(testset)))
for (i in seq_along(testset)){
    hold <- testset[i] %>% seperate_into_lines()
    testingframe$author[i] <- hold %>% extract_author()
    testingframe$title[i] <- hold %>% extract_title()  # [i], otherwise every row is overwritten
}
testingframe
```
## # A tibble: 20 × 2
##                                     author
##                                      <chr>
## 1             Andrew Hunt and David Thomas
## 2             Andrew Hunt and David Thomas
## 3                            Alex Reinhart
## 4                              David Price
## 5            City Watch #1 Terry Pratchett
## 6            City Watch #1 Terry Pratchett
## 7                              Mark Manson
## 8                     Kim Stanley Robinson
## 9                     Kim Stanley Robinson
## 10                    Kim Stanley Robinson
## 11         Douglas DeCarlo, James P. Lewis
## 12                             David Price
## 13           City Watch #2 Terry Pratchett
## 14 Kenneth Knoblauch & Laurence T. Maloney
## 15                           Alex Reinhart
## 16            Andrew Hunt and David Thomas
## 17                        Robert C. Martin
## 18            Andrew Hunt and David Thomas
## 19            Andrew Hunt and David Thomas
## 20                             David Price
## # ... with 1 more variables: title <chr>

The author and title functions seem to work, so let’s extract some more information. The third line of each chunk contains multiple pieces of information, for example:

- Highlight on Page 132 | Loc. 2017-20  | Added on Saturday, August 20, 2016, 09:37 AM

As in the first functions, we first select the correct line [3] and then apply some magic.

```r
# This function splits the metadata line into its pieces;
# subsequent functions deal with the separate pieces.
extract_type_location_date <- function(linechunk){
    meta_row <- linechunk[3]
    pieces <- stringr::str_split(meta_row, "\\|")  # the literal character,
    # the '|' has a special meaning in regexp.
    unlist(pieces)
}
# extract_type_location_date(result)  # test function

# Extract the type from the combined result.
# Here the use of the pipe `%>%` operator
# makes the steps clear.
extract_type <- function(pieces){
    pieces[1] %>%  # select the first piece
        stringr::str_extract("- [[:alnum:]]{1,} ") %>%  # a dash followed by one word
        gsub("-", "", .) %>%  # replace - with nothing, removing it
        stringr::str_trim(side = "both")  # remove whitespace at both sides
}
# extract_type_location_date(result) %>%
#     extract_type()

# Extract the page number by selecting the first piece,
# trimming off whitespace on the right, and
# selecting one or more digits followed by the end of the line.
extract_pagenumber <- function(pieces){
    pieces[1] %>%
        stringr::str_trim(side = "right") %>%  # remove whitespace at the right end
        stringr::str_extract("[0-9]{1,}$") %>%
        as.numeric()
}
# extract_type_location_date(result) %>%
#     extract_pagenumber()

# Extract locations. Just like above.
extract_locations <- function(pieces){
    pieces[2] %>%
        stringr::str_trim(side = "both") %>%
        stringr::str_extract("[0-9]{1,}-[0-9]{1,}$")
}
# extract_type_location_date(result) %>%
#     extract_locations()

# Extract the date and convert it to a standard date-time, not US-centric.
# I use strptime from the base package here. The time is
# US-centric, but structured, so we can use the formatting from strptime.
# For example: %B is the full month name in the current locale
# (so English month names need an English or C locale)
# and %I:%M %p means hours, minutes, am/pm.
extract_date <- function(pieces){
    pieces[length(pieces)] %>%  # the date is the last piece, also when 'Loc.' is missing
        stringr::str_trim(side = "both") %>%
        stringr::str_extract("[A-Za-z]{3,} [0-9]{1,2}, [0-9]{4}, [0-9]{2}:[0-9]{2} [A-Z]{2}") %>%
        strptime(format = "%B %e, %Y, %I:%M %p")
}

# Extract the highlight part.
extract_highlights <- function(linechunk){
    linechunk[5]
}
# extract_highlights(result)
```
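Because `%B` and `%p` are matched against the current locale, `strptime()` may return `NA` for these English timestamps on a machine with a non-English locale. A minimal sketch of one workaround, assuming it is acceptable to switch `LC_TIME` to the `"C"` locale for the duration of the call and to treat the timestamps as UTC (the file itself carries no time zone). The helper name `parse_kindle_date` is hypothetical. It returns `POSIXct`, which fits into a data frame column more easily than the `POSIXlt` that `strptime()` produces:

```r
# Hypothetical helper, not part of the original script.
# x: a single metadata line (or just the 'Added on ...' piece).
parse_kindle_date <- function(x) {
  old <- Sys.getlocale("LC_TIME")
  on.exit(Sys.setlocale("LC_TIME", old), add = TRUE)  # restore the user's locale
  Sys.setlocale("LC_TIME", "C")                       # English month names and AM/PM
  pattern <- "[A-Za-z]+ [0-9]{1,2}, [0-9]{4}, [0-9]{2}:[0-9]{2} [AP]M"
  stamp <- regmatches(x, regexpr(pattern, x))         # e.g. "August 20, 2016, 09:37 AM"
  # tz = "UTC" is an assumption made for this sketch
  as.POSIXct(strptime(stamp, format = "%B %d, %Y, %I:%M %p", tz = "UTC"))
}

meta <- "- Highlight on Page 132 | Loc. 2017-20  | Added on Saturday, August 20, 2016, 09:37 AM"
parsed <- parse_kindle_date(meta)
format(parsed, "%Y-%m-%d %H:%M")  # "2016-08-20 09:37"
```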

In general:

  • Split into chunks (already did that: per_chunk)
  • Create a data frame
  • Apply extractors per chunk into data_frame

  • I would really love it if someone showed me how to do this with purrr
```r
finalframe <- tibble(  # tibble() is the current name for data_frame()
    author = character(length = length(testset)),
    title = character(length(testset)),
    location = character(length(testset)),
    pagenr = numeric(length(testset)),
    type = character(length(testset)),
    highlight = character(length(testset))
    )
# loop through all values
for (i in seq_along(testset)){
    hold <- testset[i] %>% seperate_into_lines()
    finalframe$author[i] <- hold %>% extract_author()
    finalframe$title[i] <- hold %>% extract_title()
    finalframe$location[i] <- hold %>% extract_type_location_date() %>% extract_locations()
    finalframe$pagenr[i] <- hold %>% extract_type_location_date() %>% extract_pagenumber()
    finalframe$type[i] <- hold %>% extract_type_location_date() %>% extract_type()
    finalframe$highlight[i] <- hold %>% extract_highlights()
}
finalframe
```
## # A tibble: 20 × 6
##                                     author
##                                      <chr>
## 1             Andrew Hunt and David Thomas
## 2             Andrew Hunt and David Thomas
## 3                            Alex Reinhart
## 4                              David Price
## 5            City Watch #1 Terry Pratchett
## 6            City Watch #1 Terry Pratchett
## 7                              Mark Manson
## 8                     Kim Stanley Robinson
## 9                     Kim Stanley Robinson
## 10                    Kim Stanley Robinson
## 11         Douglas DeCarlo, James P. Lewis
## 12                             David Price
## 13           City Watch #2 Terry Pratchett
## 14 Kenneth Knoblauch & Laurence T. Maloney
## 15                           Alex Reinhart
## 16            Andrew Hunt and David Thomas
## 17                        Robert C. Martin
## 18            Andrew Hunt and David Thomas
## 19            Andrew Hunt and David Thomas
## 20                             David Price
## # ... with 5 more variables: title <chr>, location <chr>, pagenr <dbl>,
## #   type <chr>, highlight <chr>
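For the purrr version asked for above, one possible sketch: write a function that turns a single chunk into a one-row tibble, map it over all chunks, and bind the rows. To keep the example self-contained it uses two made-up chunks and a compact parser (`parse_chunk`, a hypothetical name) instead of the extractor functions from this post; you could call those inside the function just as well.

```r
library(purrr)
library(tibble)
library(dplyr)
library(stringr)

# Made-up chunks in the shape that the split on "==========" produces.
chunks <- c(
  "\r\nBook One (Author A)\r\n- Highlight on Page 1 | Loc. 10-12  | Added on Sunday, December 27, 2015, 03:09 PM\r\n\r\nfirst highlight\r\n",
  "\r\nBook Two (Author B)\r\n- Highlight on Page 90 | Added on Monday, January 25, 2016, 04:06 PM\r\n\r\nsecond highlight\r\n"
)

# Hypothetical compact parser: one chunk in, one-row tibble out.
parse_chunk <- function(chunk) {
  lines <- str_split(chunk, "\r\n")[[1]]
  tibble(
    author    = str_match(lines[2], "\\(([^()]*)\\)\\s*$")[, 2],
    title     = str_trim(str_remove(lines[2], "\\([^()]*\\)\\s*$")),
    type      = str_match(lines[3], "^- ([[:alnum:]]+) ")[, 2],
    pagenr    = as.numeric(str_match(lines[3], "Page ([0-9]+)")[, 2]),
    location  = str_match(lines[3], "Loc\\. ([0-9]+(-[0-9]+)?)")[, 2],
    highlight = lines[5]
  )
}

# map() returns a list of one-row tibbles; bind_rows() stacks them.
finalframe <- chunks %>%
  map(parse_chunk) %>%
  bind_rows()

finalframe
```

There is no pre-allocated data frame and no index `i` to keep track of: each chunk is handled on its own, and a missing piece (such as the location in the second chunk) simply becomes `NA`.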

Cool, right? Find this specific project on my GitHub page. (I will also add a script-only version shortly.)

State of the machine

```r
sessioninfo::session_info()
```

## – Session info ———————————————————-
## setting value
## version R version 3.3.3 (2017-03-06)
## os Windows 10 x64
## system x86_64, mingw32
## ui RTerm
## language (EN)
## collate Dutch_Netherlands.1252
## tz Europe/Berlin
## date 2017-05-08
##
## – Packages ————————————————————–
## package * version date source
## assertthat 0.1 2013-12-06 CRAN (R 3.3.0)
## backports 1.0.5 2017-01-18 CRAN (R 3.3.2)
## broom 0.4.2 2017-02-13 CRAN (R 3.3.2)
## clisymbols 1.1.0 2017-01-27 CRAN (R 3.3.3)
## colorspace 1.3-2 2016-12-14 CRAN (R 3.3.2)
## DBI 0.6-1 2017-04-01 CRAN (R 3.3.3)
## digest 0.6.12 2017-01-27 CRAN (R 3.3.3)
## dplyr * 0.5.0 2016-06-24 CRAN (R 3.3.1)
## emo 0.0.0.9000 2017-04-27 Github (hadley/[email protected])
## evaluate 0.10 2016-10-11 CRAN (R 3.3.3)
## forcats 0.2.0 2017-01-23 CRAN (R 3.3.2)
## foreign 0.8-67 2016-09-13 CRAN (R 3.3.3)
## ggplot2 * 2.2.1 2016-12-30 CRAN (R 3.3.2)
## gtable 0.2.0 2016-02-26 CRAN (R 3.3.0)
## haven 1.0.0 2016-09-23 CRAN (R 3.3.1)
## hms 0.3 2016-11-22 CRAN (R 3.3.2)
## htmltools 0.3.5 2016-03-21 CRAN (R 3.3.0)
## httr 1.2.1 2016-07-03 CRAN (R 3.3.1)
## jsonlite 1.3 2017-02-28 CRAN (R 3.3.3)
## knitr 1.15.1 2016-11-22 CRAN (R 3.3.2)
## lattice 0.20-35 2017-03-25 CRAN (R 3.3.3)
## lazyeval 0.2.0 2016-06-12 CRAN (R 3.3.0)
## lubridate 1.6.0 2016-09-13 CRAN (R 3.3.1)
## magrittr 1.5 2014-11-22 CRAN (R 3.3.0)
## mnormt 1.5-5 2016-10-15 CRAN (R 3.3.2)
## modelr 0.1.0 2016-08-31 CRAN (R 3.3.2)
## munsell 0.4.3 2016-02-13 CRAN (R 3.3.0)
## nlme 3.1-131 2017-02-06 CRAN (R 3.3.3)
## plyr 1.8.4 2016-06-08 CRAN (R 3.3.0)
## psych 1.7.3.21 2017-03-22 CRAN (R 3.3.3)
## purrr * 0.2.2 2016-06-18 CRAN (R 3.3.1)
## R6 2.2.0 2016-10-05 CRAN (R 3.3.1)
## Rcpp 0.12.10 2017-03-19 CRAN (R 3.3.3)
## readr * 1.1.0 2017-03-22 CRAN (R 3.3.3)
## readxl 0.1.1 2016-03-28 CRAN (R 3.3.0)
## reshape2 1.4.2 2016-10-22 CRAN (R 3.3.2)
## rmarkdown 1.4 2017-03-24 CRAN (R 3.3.3)
## rprojroot 1.2 2017-01-16 CRAN (R 3.3.2)
## rvest 0.3.2 2016-06-17 CRAN (R 3.3.1)
## scales 0.4.1 2016-11-09 CRAN (R 3.3.2)
## sessioninfo 0.0.0.9000 2017-04-25 Github (r-pkgs/[email protected])
## stringi 1.1.5 2017-04-07 CRAN (R 3.3.3)
## stringr 1.2.0 2017-02-18 CRAN (R 3.3.3)
## tibble * 1.3.0 2017-04-01 CRAN (R 3.3.3)
## tidyr * 0.6.1 2017-01-10 CRAN (R 3.3.2)
## tidyverse * 1.1.1 2017-01-27 CRAN (R 3.3.2)
## withr 1.0.2 2016-06-20 CRAN (R 3.3.1)
## xml2 1.1.1 2017-01-24 CRAN (R 3.3.2)
## yaml 2.1.14 2016-11-12 CRAN (R 3.3.2)

Notes

[1] ^1

[2] I use the tidyverse form of a data.frame, called a tibble (or data_frame). It is like a data.frame, but it never converts character to factor and never adds row names. See more at ?tibble::tibble.

[3] This is absolutely not a robust way of programming: if the format ever changes, all my functions are screwed.

Turning Kindle notes into tidy data was originally published at Clean Code on May 08, 2017.
