Site icon R-bloggers

Tabulizer and pdftools Together as Super-powers – Part 2

[This article was first published on R on Redwall Analytics, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
  • # Libraries
    packages <- 
      c("data.table",
        "stringr",
        "rlist",
        "tabulizer",
        "pdftools",
        "parallel",
        "DT"
        )
    
    if (length(setdiff(packages,rownames(installed.packages()))) > 0) {
      install.packages(setdiff(packages, rownames(installed.packages())))  
    }
    
    invisible(lapply(packages, library, character.only = TRUE))
    
    knitr::opts_chunk$set(comment=NA, fig.width=12, fig.height=8, out.width = '100%')

    Introduction

    This post will be a continuation of Parsing of Mass Municipal PDF CAFR’s with Tabulizer, pdftools and AWS Textract – Part 1 dealing with extracting data from PDFs using R. When Redwall discovered pdftools, and its pdf_data() function, which maps out every word on a pdf page by x-y coordinate, we thought that was interesting, but didn’t really know how to use it. We also didn’t have the regular expression skills, and were much more befuddled by the nested list structures than we are now.

    As for tabulizer, it took about a year before rJava magically started working properly, but even then we it wasn’t possible to consistently read a large number of tables of different sizes without cutting off fields in unexpected ways. Only in this Mass pdf scraping project have we realized that, by combining these two packages, it becomes possible to access data in reliable way, from a large number of varied pdf formats.

    Our Massachusset’s municipal CAFR project provided a perfect opportunity to put all these pieces together. This blog post will consist of a step-by-step walk through which will hopefully help others avoid some of the pain that we experienced in getting to this point.

    Walk Through Plan

    To begin with, we had to download the pdfs from the CAFR Library at the Center for Municipal Finance. We won’t show the code to do the download here, but it can be found at reason_pdf_parser.R. In order to do this on the scale that we plan for this project, we had to build nested lists with the pdf metadata of 150 Massachussett’s CAFR pdfs. For now, we will just walk through a few key points using a single statment from the Abington, MA 2018 Statement of Net Position from the CAFR downloaded here.

    # Set up pdf and pdf_path to directory
    dir <- "/Users/davidlucey/Desktop/David/Projects/mass_munis/data/pdf_cafr/"
    city <- "abington"
    pdf <- paste0(city, "_2018.pdf", collapse="")
    pdf_path <- paste0(dir, pdf, collapse = "")
    
    # Run pdf_data on Abington CAFR
    abington <- pdftools::pdf_data(pdf_path)
    
    # Name each page of list for page index in pdf
    names(abington) <- 1:length(abington)
    
    # Look at structure of 2n element in 92-page nested list
    str(abington[[2]])
    Classes 'tbl_df', 'tbl' and 'data.frame':   266 obs. of  6 variables:
     $ width : int  42 19 76 124 58 20 96 19 41 73 ...
     $ height: int  15 15 15 15 15 15 15 15 15 15 ...
     $ x     : int  168 215 238 319 92 154 179 279 302 348 ...
     $ y     : int  72 72 72 72 102 102 102 102 102 102 ...
     $ space : logi  TRUE TRUE TRUE FALSE TRUE TRUE ...
     $ text  : chr  "TOWN" "OF" "ABINGTON," "MASSACHUSETTS" ...

    PDF Tools pdf_data Functionality

    The above is a list of data.frames containing metadata of the location of every word on every one of the 92 pages of the Abington 2018 CAFR. The structure of the second page is shown above. But, we only need the key financial statements, so would like to drop the majority of pages which don’t have what we need. For example, we know that page 16 has the Statement of Net Position. We could search for that page using regular expressions as shown for variable sonp below.

    See how we extract only the head unique 5 lines of each page by y, paste those lines back together, then match our regular expression on the text of those lines looking for the phrase “STATEMENT OF NET POSITION”. However, there are several pages meeting this criteria, such as “Proprietary Funds” Statement of Net Position on page 20 (which we don’t want). Most of these other pages can be eliminated by choosing NOT to match the word “FUNDS” by the same process, so notice that we negate our second match with “!”. Hence, our sonp_index comes back as 16, which can be used to filter out the remaining pages.

    # Convert elements to data.table
    abington <- mclapply(abington, setDT)
    
    # Get index of Abington Statement of Net Position
    sonp_index <- 
      which(
        unlist(
          mclapply(abington, function(page){
            (str_detect(
              paste(
                # Reformat top 5 lines by y and look for match to "STATEMENT OF NET POSITION"
                  page$text[
                    page$y %in% head(unique(page$y), 5)
                    ],
                  collapse = " "
                  ),
              "STATEMENT OF NET POSITION"
            ) & 
              # And requires both statements to be TRUE
              !str_detect(
                paste(
                  page$text[
                   # Reformat top 5 lines by y and look for non match to "FUNDS" 
                    page$y %in% head(unique(page$y), 5)
                    ],
                  collapse = " "
                  ),
                "FUNDS"
                )
            )
            }
            )
        )
        )
        
    # Extract and View Statement of Net Position pdftools pdf_data  metadata
    sonp <- abington[sonp_index][[1]]
    sonp
         width height   x   y space          text
      1:    32      6 264  73  TRUE     STATEMENT
      2:     7      6 299  73  TRUE            OF
      3:    10      6 308  73  TRUE           NET
      4:    25      6 320  73 FALSE      POSITION
      5:    14      6 287  87  TRUE          JUNE
     ---                                         
    337:    71      9 144 745 FALSE Massachusetts
    338:    11     11 300 743 FALSE            13
    339:    26      9 426 745  TRUE         Basic
    340:    43      9 456 745  TRUE     Financial
    341:    53      9 502 745 FALSE    Statements

    Above is the text grid of Abington’s Statement of Net Position as taken by pdftools. Using this metadata, we can begin to put together exact area parameters for Tabula. Even better, we could programically do it for a large number of tabless. In our experience, this is important because the tabulizer default lattice method for tabular data can be unpredictable cutting off fields unexpectedly.

    Tabulizer Area Coordinates

    Tabulizer specifies pages in blocks of 72 * inches, so a typical 8.5 x 11 verticle page would have dimensions of 612 x 720. This coordinate grid is used to specify the area parameter (top, left, bottom and right). All of of Massachusett’s financial statement tables have a "$" sign in the first and last rows, so those could be used to locate the top or bottom paramenters. In addition, all pages including financial statements have language referring users to the “notes to the financial statements” usually on the second to last line, which could be the “bottom”, or the midpoint between the bottom and the "$" (if more room was needed).

     # Determine if page is verticle or horizontal
        x <- 8.5 * 72
        y <- 11 * 72
        max_x <- max(sonp$x)
        max_y <- max(sonp$y)
        orientation <- 
          ifelse(x < max_x, "horizontal", "verticle")
        
        # TOP
        
        # Keys on the first instance of the year "2018"
        table_top <-
          min(sonp$y[str_detect(sonp$text, "2018") & sonp$space==FALSE])
        # Find the height at in the table_top row
        height_top <- unique(sonp$height[sonp$y == table_top])
        # Add table_top and height_top to avoid slicing row
        top <- table_top + height_top 
        
        # BOTTOM
        
        # Table Bottom marked by last instance of character "$"
        table_bottom <-
          max(sonp$y[str_detect(sonp$text, "\\$")])
        # Height at bottom row of table 
        height_bottom <- unique(sonp$height[sonp$y == table_bottom])
        # Bottom of table
        bottom <- table_bottom + height_bottom
        
        # LEFT
        
        # Add some space to leftmost x coordinate to avoid slicing
        left <-     
          ifelse( min(sonp$x) - 30 > 0,
                  min(sonp$x) - 30, 1 )
        
        # RIGHT
        
        # Find width at maximum "x" coordinate
        width_max_x <- max(sonp$width[sonp$x == max_x])
        # Add width at maximum "x" plus more space wether verticle or horizontal
        right <- 
          max_x + width_max_x + ifelse(orientation == "verticle", 30, 50)
        
        # FINAL AREA PARAMETER FOR TABULIZER AS INTEGER VECTOR
        # Note the specification as an integer vector
        a <- c(top, left, bottom, right)
        
        # Show coordinates 
        a
    [1]  93  24 681 585

    We give an example for Abington’s Statement of Net Position above, starting with the maximum x and y, and determining the page orientation (ie: verticle or horizontal). After finding the location of the date line at the top, and walk down a little from there to set a table_top variable. Typically, it is best to leave a little margin between the page header and the top of the table. The bottom of the table is set adding the height to the bottom line of the table, and left parameter is set by taking the smallest x coordinate and reducing by a little to margin for error. We leave a larger margin for the right-most coordinate because this is where we have found that the most errors occur, often when the algorithm seems to try to squish the table into the available columns.

    In our experience, the most problems come with missetting the top and right parameters. Indentation can also confuse the algorithm. Columns can be split in the middle into two columns, often at the far-rightmost, for example. In the end, we chose parameters of 93 (top), 24 (left), 681 (bottom) and 585 (right).

    Tabulizer extract_table() Function

    Below we run our area parameters we derived above through tabulizer. Note that the area parameter, itself an integer vector, is further wrapped as a list because not having this structure throws an error. In addition, avoid the half day of wheel spinning we experienced by specifying guess as “F” to over-ride the default lattice, otherwise your area parameter is ignored with no warning. Also, we use the sonp_index integer to specify the page of the pdf. There are several options for output which all work as expected, but data.frame seems most natural.

    # Tabulizer extract_tables output is a list
    abington_sonp <-
      extract_tables(
        pdf_path, 
        pages = sonp_index,
        area = list(a), 
        guess = F,
        output = "data.frame")
    
    # Extract and print single element from list
    abington_sonp <- abington_sonp[[1]]
    abington_sonp
                                                                                             X
    1                                                                                         
    2                                                                                         
    3                                                                                   ASSETS
    4                                                                                 CURRENT:
    5              Cash and cash equivalents................................................ $
    6                                        Receivables, net of allowance for uncollectibles:
    7                       Real estate and personal property taxes...........................
    8           Tax liens.....................................................................
    9                      Community preservation fund surtax.................................
    10                   Motor vehicle and other excise taxes.................................
    11             User charges...............................................................
    12                  Departmental and other................................................
    13              Intergovernmental.........................................................
    14                     Community preservation state share.................................
    15               Special assessments......................................................
    16         Tax foreclosures...............................................................
    17                 Total current assets...................................................
    18                                                                             NONCURRENT:
    19                                       Receivables, net of allowance for uncollectibles:
    20               Special assessments......................................................
    21             Capital assets, nondepreciable.............................................
    22                    Capital assets, net of accumulated depreciation.....................
    23                    Total noncurrent assets.............................................
    24          TOTAL ASSETS..................................................................
    25                                                                                        
    26                                                          DEFERRED OUTFLOWS OF RESOURCES
    27               Deferred outflows related to pensions....................................
    28                        Deferred outflows related to other postemployment benefits......
    29                                  TOTAL DEFERRED OUTFLOWS OF RESOURCES..................
    30                                                                             LIABILITIES
    31                                                                                CURRENT:
    32            Warrants payable............................................................
    33          Accrued payroll...............................................................
    34          Health claims payable.........................................................
    35         Accrued interest...............................................................
    36      Accrued liabilities...............................................................
    37            Capital lease obligations...................................................
    38      Landfill closure..................................................................
    39              Compensated absences......................................................
    40         Notes payable..................................................................
    41         Bonds payable..................................................................
    42               Total current liabilities................................................
    43                                                                             NONCURRENT:
    44      Landfill closure..................................................................
    45              Compensated absences......................................................
    46       Net pension liability............................................................
    47               Net other postemployment benefits liability..............................
    48         Bonds payable..................................................................
    49                  Total noncurrent liabilities..........................................
    50        TOTAL LIABILITIES...............................................................
    51                                                           DEFERRED INFLOWS OF RESOURCES
    52                Deferred inflows related to pensions....................................
    53                                                                            NET POSITION
    54         Net investment in capital assets...............................................
    55                                                                         Restricted for:
    56                                                                        Permanent funds:
    57              Expendable................................................................
    58               Nonexpendable............................................................
    59         Gifts and grants...............................................................
    60               Community preservation...................................................
    61 Unrestricted...........................................................................
    62           TOTAL NET POSITION......................................................... $
       X.1          X.2 Primary.Government X.3          X.4
    1   NA Governmental      Business-type                 
    2   NA   Activities         Activities            Total
    3   NA                                                 
    4   NA                                                 
    5   NA   10,392,587        $ 7,449,193   $   17,841,780
    6   NA                                                 
    7   NA      313,316                  -          313,316
    8   NA      882,182             34,598          916,780
    9   NA       6 ,245                  -            6,245
    10  NA      387,455                  -          387,455
    11  NA            -          1,930,158        1,930,158
    12  NA            -            149,296          149,296
    13  NA    1,715,882                  -        1,715,882
    14  NA       70,735                  -           70,735
    15  NA            -             32,137           32,137
    16  NA      663,449                  -          663,449
    17  NA   14,431,851          9,595,382       24,027,233
    18  NA                                                 
    19  NA                                                 
    20  NA            -             10,712           10,712
    21  NA  101,526,106          1,614,044      103,140,150
    22  NA   26,998,272         31,944,596       58,942,868
    23  NA  128,524,378         33,569,352      162,093,730
    24  NA  142,956,229         43,164,734      186,120,963
    25  NA                                                 
    26  NA                                                 
    27  NA      417,711              7,062          424,773
    28  NA    1,982,740             19,982        2,002,722
    29  NA    2,400,451             27,044        2,427,495
    30  NA                                                 
    31  NA                                                 
    32  NA      660,037            403,389        1,063,426
    33  NA      206,897             75,704          282,601
    34  NA      311,064                  -          311,064
    35  NA      278,817             71,281          350,098
    36  NA       50,638          2,247,039        2,297,677
    37  NA            -             53,845           53,845
    38  NA      139,000                  -          139,000
    39  NA      346,271             26,896          373,167
    40  NA       53,168                  -           53,168
    41  NA    2,462,040            491,136        2,953,176
    42  NA    4,507,932          3,369,290        7,877,222
    43  NA                                                 
    44  NA    4,080,000                  -        4,080,000
    45  NA      881,952             20,919          902,871
    46  NA   19,188,882            324,438       19,513,320
    47  NA   67,618,712            681,460       68,300,172
    48  NA   45,761,763          5,764,229       51,525,992
    49  NA  137,531,309          6,791,046      144,322,355
    50  NA  142,039,241         10,160,336      152,199,577
    51  NA                                                 
    52  NA    1,399,576             23,663        1,423,239
    53  NA                                                 
    54  NA   82,168,482         27,395,220      109,563,702
    55  NA                                                 
    56  NA                                                 
    57  NA       99,189                  -           99,189
    58  NA       69,778                  -           69,778
    59  NA    1,088,568                  -        1,088,568
    60  NA      740,211                  -          740,211
    61  NA (82,248,365)          5,612,559     (76,635,806)
    62  NA    1,917,863       $ 33,007,779   $   34,925,642

    Clean up

    The tabulizer output is still in a raw form with colums sometimes determined by indentations and x values, such as the “$” signs. The numbers are in character form with commas and sometimes negative numbers are shown in parenthesis, and need to be parsed into numeric. The item names often have a long series of periods which need to be stripped. The biggest challenge is the column names which often include the first row of the full column name, and need to be rebuilt. This is not a small task and not what we were hoping to illustrate in this post, so we are just showing the output below. Please refer to our Github code for the a more complete explanation and solutions to many of these issues.

    Final Product

    Though there is still work to be done, the final product of this post is shown above. Single elements could be extracted to form a database, or the output could be saved to csv. The headers such as ASSETS or LIABILITIES could be nested. The main point is that short of XBRL, the data has been set free from the PDF in a machine readable form. Not only that, this general process can be repeated for a large number of slightly differing PDFs with a relatively high low error rate as we will show in the next post Evaluating Mass Muni CAFR Tabulizer Results – Part 3. In cases where errors do occur, a second layer can be used to run the more challenging PDFs through AWS Textract SDK. We will show how this is done in our next post.

    To leave a comment for the author, please follow the link and comment on their blog: R on Redwall Analytics.

    R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
    Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.