#RObservations #24: Using Tesseract-OCR to Scan Bank Documents and Extract Relevant Data

[This article was first published on r – bensstats, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction

I have clearly been out of the loop because I have only recently learned about the tesseract library in R. If I knew about it earlier I would have wrote about it much sooner!

The tesseract library is a package which has bindings to the Tesseract-OCR engine: a powerful optical character recognition (OCR) engine that supports over 100 languages and enables users to scan and extract text from pictures, which has direct applications in any field that deals with with high amounts of manual data processing- like accounting, mortgages/real estate, insurance and archival work. In this blog I am going to explore how its possible to parse a JPMorgan Chase bank statement using the tesseract,pdftools, stringr, and tidyverse libraries.

Disclaimer: The bank statement I am using is from a sample template and is not personal information. The statement sample can be accessed here; the document being parsed is shown below.

First- converting the .pdf file to .png files.

From what I’ve seen in the CRAN documentation we need to first convert the .pdf file into .png files. This can be done with the pdf_convert() function available in the pdftools package. To ensure that text will be read accurately, setting the dpi argument to a large number is recommended.

bank_statement <- pdftools::pdf_convert("sample-bank-statement.pdf", 
                                        dpi = 1000)


## Converting page 1 to sample-bank-statement_1.png... done!
## Converting page 2 to sample-bank-statement_2.png... done!
## Converting page 3 to sample-bank-statement_3.png... done!
## Converting page 4 to sample-bank-statement_4.png... done!

Now that the bank statement has been converted to .png files it can now be read with tesseract. Right now the data is very unstructured and it needs to be parsed. I’ll save the output for you, but you can see it for yourself if you output the raw_text vector on your machine.

library(tesseract)
raw_text <- ocr(bank_statement,
                engine = tesseract("eng") )

Parsing the data

From a bank statement, businesses are interested in data from the fields listed in document. Namely the:

  • Deposits and additions,
  • Checks paid,
  • Other withdrawals, fees and charges; and
  • Daily ending balances.

While this bank statement is relatively small and only consists of 4 pages, to create a general method for extracting data from JPMorgan Chase bank statements which could be larger, the scanned text will need to be combined into a single text file and then parsed accordingly.

# Bind raw text together
raw_text_combined<- paste(raw_text,collapse="")

To get the data for the desired fields, we can use the field titles as anchors for parsing the data. It was with this and with the help of regex101.com this cleaning script was constructed. Beyond the particular regular expressions involved, the cleaning script relies heavily on the natural anchors in the text relating encompassing the values where they begin (around the title of the table they belong) and where they end (just before the total amount).

library(tidyverse)
library(stringr)

deposits_and_additions<- raw_text_combined %>% 
                         # Need to replace \n
                         str_replace_all('\\n',';') %>%
                         str_extract("(?<=DEPOSITS AND ADDITIONS).*(?=Total Deposits and Additions)") %>% 
                         str_extract('\\d{2}\\/\\d{2}.*\\.\\d{2}') %>%
                         str_split(';') %>% 
                         as_tibble(.name_repair = "unique") %>% 
                         transmute(date= ...1 %>% str_extract('\\d{2}\\/\\d{2}'),
                                   transaction = ...1 %>% str_extract('(?<=\\d{2} ).*(?= (\\$|\\d))'),
                                   amount = ...1 %>% str_extract("(?<=[A-z] ).*")%>% str_extract('\\d{1,}.*'))

checks_paid <- raw_text_combined %>%
                # Need to replace \n
                str_replace_all('\\n',';') %>%
                str_extract('(?<=CHECKS PAID).*(?=Total Checks Paid)') %>% 
                # Would have to change regex to get check numbers and description but its not relevant here
                str_extract('\\d{2}\\/\\d{2}.*\\.\\d{2}') %>% 
                str_split(';') %>% 
                as_tibble(.name_repair = "unique") %>% 
                transmute(date= ...1 %>% str_extract('\\d{2}\\/\\d{2}'),
                       amount = ...1 %>% str_extract("(?<=\\d{2}\\/\\d{2} ).*") %>% str_extract('\\d{1,}.*'))

others <-  raw_text_combined %>%
           # Need to replace \n
           str_replace_all('\\n',';') %>% 
           str_extract('(?=OTHER WITHDRAWALS, FEES & CHARGES).*(?=Total Other Withdrawals, Fees & Charges)') %>% 
           str_extract('\\d{2}\\/\\d{2}.*\\.\\d{2}') %>% 
           str_split(';') %>% 
           as_tibble(.name_repair = "unique") %>% 
                transmute(date= ...1 %>% str_extract('\\d{2}\\/\\d{2}'),
                          description =  ...1 %>% str_extract('(?<=\\d{2} ).*(?= (\\$|\\d))'),
                          amount = ...1 %>% str_extract("(?<=\\d{2}\\/\\d{2} ).*") %>% str_extract('\\d{1,}.*'))

daily_ending_balances<- raw_text_combined %>%
                        # Need to replace \n
                        str_replace_all('\\n',';') %>% 
                        str_extract('(?<=DAILY ENDING BALANCE).*(?=SERVICE CHARGE SUMMARY)') %>% 
                        str_extract('\\d{2}\\/\\d{2}.*\\.\\d{2}') %>% 
                        str_split(';') %>% 
                        lapply(function(x) { x %>% str_split('(?= \\d{2}\\/\\d{2} )')}) %>% 
                        unlist() %>% 
                        as_tibble(.name_repair = "unique") %>% 
                        transmute(date= value %>% str_extract('\\d{2}\\/\\d{2}'),
                                  amount =  value %>% str_extract("(?<=\\d{2}\\/\\d{2} ).*") %>% str_extract('\\d{1,}.*'))

The cleaned data extracted from the bank statement is:

statement_data <- list("DEPOSITS AND ADDITIONS"=deposits_and_additions,
                       "CHECKS PAID"=checks_paid,
                       "OTHER WITHDRAWALS, FEES & CHARGES"=others,
                       "DAILY ENDING BALANCE"=daily_ending_balances)

statement_data

## $`DEPOSITS AND ADDITIONS`
## # A tibble: 10 x 3
##    date  transaction amount   
##    <chr> <chr>       <chr>    
##  1 07/02 Deposit     17,120.00
##  2 07/09 Deposit     24,610.00
##  3 07/14 Deposit     11,424.00
##  4 07/15 Deposit     1,349.00 
##  5 07/21 Deposit     5,000.00 
##  6 07/21 Deposit     3,120.00 
##  7 07/23 Deposit     33,138.00
##  8 07/28 Deposit     18,114.00
##  9 07/30 Deposit     6,908.63 
## 10 07/30 Deposit     5,100.00 
## 
## $`CHECKS PAID`
## # A tibble: 2 x 2
##   date  amount  
##   <chr> <chr>   
## 1 07/14 1,471.99
## 2 07/08 1,697.05
## 
## $`OTHER WITHDRAWALS, FEES & CHARGES`
## # A tibble: 4 x 3
##   date  description                    amount  
##   <chr> <chr>                          <chr>   
## 1 07/11 Online Payment XXXXX To Vendor 8,928.00
## 2 07/11 Online Payment XXXXX To Vendor 2,960.00
## 3 07/25 Online Payment XXXXX To Vendor 250.00  
## 4 07/30 ADP TX/Fincl Svc ADP           2,887.68
## 
## $`DAILY ENDING BALANCE`
## # A tibble: 11 x 2
##    date  amount    
##    <chr> <chr>     
##  1 07/02 98,727.40 
##  2 07/21 129,173.36
##  3 07/08 97,030.35 
##  4 07/23 162,311.36
##  5 07/09 121,640.35
##  6 07/25 162,061.36
##  7 07/11 109,752.35
##  8 07/28 180,175.36
##  9 07/14 108,280.36
## 10 07/30 189,296.31
## 11 07/16 121,053.36

Conclusion

There you have it! The tesseract package has opened up a world of data processing tools that I now have at my disposal and I hope I was able to show it in this blog. While this blog only focused on JPMorgan Chase bank statements, its possible to apply the same techniques to other bank statements by having the cleaning script tweaked accordingly.

Thank you for reading!

Want to see more of my content?

Be sure to subscribe and never miss an update!

To leave a comment for the author, please follow the link and comment on their blog: r – bensstats.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)