Articles by R on Redwall Analytics

Evaluating Mass Muni CAFR Textract Results – Part 5

April 23, 2020 | R on Redwall Analytics

# Libraries
packages <- 
  c("data.table",
    "reticulate",
    "paws.machine.learning",
    "paws.common",
    "keyring",
    "pdftools",
    "listviewer",
    "readxl"
    )

if (length(setdiff(packages,rownames(installed.packages()))) > 0) {
  install.packages(setdiff(packages, rownames(installed.packages())))  
}

invisible(lapply(packages, library, character.only = TRUE))

knitr::opts_chunk$set(comment=NA, fig.width=12, fig.height=8, out.width = '100%')
Introduction In Evaluating Mass Muni CAFR Tabulizer Results - Part 3, we showed how to use pdftools and tabulizer to subset a group of PDFs, the AWS paws SDK package to store the PDFs in S3, and Textract machine learning to extract a block response object using its “asynchronous” process. ... [Read more...]

Scraping Failed Tabulizer PDFs with AWS Textract – Part 4

April 13, 2020 | R on Redwall Analytics

# Libraries
packages <- 
  c("data.table",
    "stringr",
    "rlist",
    "paws.machine.learning",
    "paws.storage",
    "paws.common",
    "tabulizer",
    "pdftools",
    "keyring",
    "listviewer"
    )

if (length(setdiff(packages,rownames(installed.packages()))) > 0) {
  install.packages(setdiff(packages, rownames(installed.packages())))  
}

invisible(lapply(packages, library, character.only = TRUE))

knitr::opts_chunk$set(comment=NA, fig.width=12, fig.height=8, out.width = '100%')
Introduction In Evaluating Mass Muni CAFR Tabulizer Results - Part 3, we discovered that we were able to accurately extract ~95% of targeted data using tabulizer, but that might not have been good enough for some applications. In this post, we will show how to subset specific pages of PDFs using ... [Read more...]

Evaluating Mass Muni CAFR Tabulizer Results – Part 3

April 13, 2020 | R on Redwall Analytics

# Libraries
packages <- 
  c("data.table",
    "rlist",
    "stringr",
    "pdftools",
    "readxl"
    )

if (length(setdiff(packages,rownames(installed.packages()))) > 0) {
  install.packages(setdiff(packages, rownames(installed.packages())))  
}

invisible(lapply(packages, library, character.only = TRUE))

knitr::opts_chunk$set(comment=NA, fig.width=12, fig.height=8, out.width = '100%')
Introduction This post is a continuation of Tabulizer and pdftools Together as Super-powers - Part 2, where we showed how combining pdftools and tabulizer could lead to better, more scalable data extraction on a large number of slightly varying PDFs. Although the full process used to extract data from all ... [Read more...]

Tabulizer and pdftools Together as Super-powers – Part 2

April 5, 2020 | R on Redwall Analytics

# Libraries
packages <- 
  c("data.table",
    "stringr",
    "rlist",
    "tabulizer",
    "pdftools",
    "parallel",
    "DT"
    )

if (length(setdiff(packages,rownames(installed.packages()))) > 0) {
  install.packages(setdiff(packages, rownames(installed.packages())))  
}

invisible(lapply(packages, library, character.only = TRUE))

knitr::opts_chunk$set(comment=NA, fig.width=12, fig.height=8, out.width = '100%')
Introduction This post will be a continuation of Parsing of Mass Municipal PDF CAFR’s with Tabulizer, pdftools and AWS Textract - Part 1 dealing with extracting data from PDFs using R. When Redwall discovered pdftools, and its pdf_data() function, which maps out every word on a pdf page ... [Read more...]
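As a rough illustration of what pdf_data() returns, the sketch below builds a throwaway one-page PDF with base R's pdf() device and lists each word with its position. The file and its text are invented for the example and are not taken from the CAFR set; it also assumes pdftools is installed against a poppler version that supports pdf_data().

```r
library(pdftools)

# Create a small one-page PDF to inspect (hypothetical content)
tmp <- tempfile(fileext = ".pdf")
pdf(tmp)
plot.new()
text(0.5, 0.5, "Total Assets 1234")
invisible(dev.off())

# pdf_data() returns a list with one data frame per page, one row per word
words <- pdf_data(tmp)[[1]]

# Bounding-box position and size of each word, plus the word itself
words[, c("x", "y", "width", "height", "text")]
```

Because every word carries its own coordinates, rows can be filtered by position or by neighbouring text before handing a page to a table extractor.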

Parsing Mass Municipal PDF CAFRs with Tabulizer, pdftools and AWS Textract – Part 1

March 30, 2020 | R on Redwall Analytics

# Libraries
packages <- 
  c("data.table",
    "rlist",
    "stringr",
    "DT",
    "janitor",
    "readxl",
    "xlsx"
    )

if (length(setdiff(packages,rownames(installed.packages()))) > 0) {
  install.packages(setdiff(packages, rownames(installed.packages())))  
}

invisible(lapply(packages, library, character.only = TRUE))

knitr::opts_chunk$set(comment=NA, fig.width=12, fig.height=8, out.width = '100%')
Introduction Redwall Analytics had the pleasure of collaborating with Marc Joffe, of Reason Foundation, in its October 2018 post Replicating Yankee Institute Risk Score Over 15 Years for 150 Connecticut towns. This involved taking a well organized public dataset from the State’s website, and analyzing and building an application to view ... [Read more...]

Tracking R&D spending by 700 Listed US Pharma Companies – Part 2

February 17, 2020 | R on Redwall Analytics

# Re-load data previously stored for purposes of this blog post
pharma <- 
  fread("~/Desktop/David/Projects/xbrl_investment/data/pharma_inc.csv")
Introduction In A Walk Though of Accessing Financial Statements with XBRL in R - Part 1, we went through the first steps of pulling XBRL data for a single company from Edgar into R. Although an improvement over manually plugging numbers into Excel, there is still a way ...
[Read more...]

A Through the Cycle Geo-Spatial Analysis of CT Town Finances

February 10, 2019 | R on Redwall Analytics

Introduction In an earlier post, Reviewing Fairfield County Municipal Fiscal Indicators Since 2001, we used 17 years of individual Town Comprehensive Annual Financial Reports (CAFR) aggregated in Connecticut’s Municipal Fiscal Indicators to compare 15 Fairfield County towns. The challenge was that the graphs became crowded even with that small number of ... [Read more...]

Analysis of Connecticut Tax Load by Income Bracket

January 8, 2019 | R on Redwall Analytics

Introduction This brief study finds that Connecticut residents pay $62-63 billion annually in total taxes (including: Federal, State, Municipal Real Estate, Sales, FICA, Medicare) on adjusted gross income of $165-167 billion (an effective tax rate of 37-38%). Some taxes, such as FICA and Medicare, might be considered forms of savings ...
[Read more...]
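The effective rate quoted in the excerpt follows directly from the two ranges. A quick check in base R, using only the rounded figures given there (the variable names are illustrative and not from the original post):

```r
# Rounded figures from the excerpt, in billions of US dollars
total_taxes <- c(low = 62, high = 63)
agi         <- c(low = 165, high = 167)

# Widest bounds: lowest taxes over highest income, and the reverse
rate_low  <- total_taxes[["low"]]  / agi[["high"]]
rate_high <- total_taxes[["high"]] / agi[["low"]]

# Effective tax rate range, in percent
round(100 * c(rate_low, rate_high), 1)
```

The bounds come out at roughly 37.1% and 38.2%, consistent with the 37-38% stated.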