
# Libraries
packages <-
  c(
    "data.table",
    "stringr",
    "rlist",
    "paws.machine.learning",
    "paws.storage",
    "paws.common",
    "tabulizer",
    "pdftools",
    "keyring",
    "listviewer"
  )

if (length(setdiff(packages, rownames(installed.packages()))) > 0) {
  install.packages(setdiff(packages, rownames(installed.packages())))
}

invisible(lapply(packages, library, character.only = TRUE))

In this case, we will show how to subset five tables from the Attleboro CAFR, where we failed to scrape three of the five desired fields, and from Hudson, MA, where the PDF couldn't be found on the town's website and is probably an image. In our full project, we aggregated five pages from every CAFR in Massachusetts (a 30MB file of 700 pages) for a total cost of $11.

# Problem PDFs
path <- "/Users/davidlucey/Desktop/David/Projects/mass_munis/"
bad_cases <- c("attleboro", "hudson")
pdfs <- paste0(path, "data/pdf_cafr/", bad_cases, "_2018.pdf")

# Extract 5 pages from the Attleboro and Hudson CAFR PDFs with pdftools
pages_attleboro <-
  as.integer(names(readRDS(paste0(path, "mass.RDS"))[[3]][["attleboro"]]))
attleboro <-
  pdf_subset(
    pdfs[1],
    pages = pages_attleboro,
    output = paste0(path, "attleboro.pdf")
  )
pages_hudson <-
  as.integer(names(readRDS(paste0(path, "mass.RDS"))[[3]][["hudson"]]))
hudson <-
  pdf_subset(
    pdfs[2],
    pages = pages_hudson,
    output = paste0(path, "hudson.pdf")
  )

# Merge pdfs with tabulizer
merge_pdfs(
  c(paste0(path, "attleboro.pdf"), paste0(path, "hudson.pdf")),
  outfile = "joined.pdf"
)

[1] "joined.pdf"

# Setting up an S3 Bucket and Uploading a PDF

We then input our AWS credentials and establish an S3 response object (s3 below), which we use to instruct AWS to create an S3 bucket, and then upload our subset PDF to S3. When setting the bucket name, it is important not to include punctuation, because such names will be rejected. Another mistake we made was uploading the PDF from outside our current working directory: S3 created a directory structure matching our disk, and Textract seemed unable to navigate that structure to find the document. We show the "munisubset" bucket at the bottom of the code below.
# Save file to S3 Bucket
s3 <-
  s3(
    config = list(
      credentials = list(
        creds = list(
          access_key_id = key_get("AWS_ACCESS_KEY_ID"),
          secret_access_key = key_get("AWS_SECRET_ACCESS_KEY")
        )
      ),
      region = "us-east-1"
    )
  )

# Create bucket
bucket_name <- "munisubset"
s3$create_bucket(
  Bucket = bucket_name
)

$Location
[1] "/munisubset"

# Load the file as a raw binary
file_name <- "joined.pdf"
read_file <- file(file_name, "rb")
s3_object <- readBin(read_file, "raw", n = file.size(file_name))

# Put object in bucket
s3$put_object(
  Body = s3_object,
  Bucket = bucket_name,
  Key = file_name
)
$Expiration
character(0)

$ETag
[1] "\"e311e09f66a669056ba956c7e77b9ae1\""

$ServerSideEncryption
character(0)

$VersionId
character(0)

$SSECustomerAlgorithm
character(0)

$SSECustomerKeyMD5
character(0)

$SSEKMSKeyId
character(0)

$SSEKMSEncryptionContext
character(0)

$RequestCharged
character(0)

buckets <- s3$list_buckets()
buckets$Buckets[[1]][1:3]

$Name
[1] "munisubset"

$CreationDate
[1] "2020-04-30 12:32:00 GMT"

NULL

# Setting up Textract Object and Calling Start Document Analysis

Next, we set up a Textract response object (svc below) and use start_document_analysis() to process the pages in the code below. Note that we select TABLES, but the other options are FORMS or FORMS | TABLES. It is possible to get help with ?textract or ?start_document_analysis, just like any other function in R. The docs say that start_document_analysis() uses asynchronous analysis to look for relationships between key-value pairs. Running Textract on our 700 pages took more than an hour, so another step would be to figure out how to be notified of completion with AWS SNS. Once completed, Textract returns the JobId (shown below), which is required to get the analysis in the next step.

# Set up Amazon Textract object
svc <-
  textract(
    config = list(
      credentials = list(
        creds = list(
          access_key_id = key_get("AWS_ACCESS_KEY_ID"),
          secret_access_key = key_get("AWS_SECRET_ACCESS_KEY")
        )
      ),
      region = "us-east-1"
    )
  )

# Run Textract on "joined.pdf" in the "munisubset" S3 bucket
# The Textract function "start_document_analysis" runs asynchronously for PDFs
# Output is the JobId used for "get_document_analysis"
# Feature type is set to "TABLES"
JobId <-
  svc$start_document_analysis(
    Document = list(
      S3Object = list(
        Bucket = "munisubset",
        Name = "joined.pdf"
      )
    ),
    FeatureTypes = list("TABLES")
  )

# Textract job identifier
JobId$JobId
[1] "fee4fb4042e7b5d21949d17b211e6bdbc6a3441939d28480f0e858ac98f1e0a5"
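As mentioned above, a job this long invites an SNS completion notification rather than polling. The Textract API accepts a NotificationChannel argument to start_document_analysis() for this purpose; below is a sketch only, assuming a pre-created SNS topic and an IAM role that Textract is allowed to assume (both ARNs are placeholders, not values from this project).

```r
# Sketch: ask Textract to publish a message to an SNS topic when the job
# finishes. Both ARNs below are hypothetical placeholders.
JobId <-
  svc$start_document_analysis(
    Document = list(
      S3Object = list(Bucket = "munisubset", Name = "joined.pdf")
    ),
    FeatureTypes = list("TABLES"),
    NotificationChannel = list(
      SNSTopicArn = "arn:aws:sns:us-east-1:123456789012:textract-done",
      RoleArn = "arn:aws:iam::123456789012:role/textract-sns-role"
    )
  )
```

A subscriber on that topic (email, SQS, or Lambda) would then receive the JobId and status when the analysis completes.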

# Recalling the Blocks from Textract

Below, we show our call to the paws get_document_analysis() using the JobId we received back from Textract above. A few things to mention: Textract stores pages from all of the documents together in "Blocks" when called in bulk on a PDF. We searched around, but it doesn't seem possible to download the whole job with all of the pages at one time. The only way we could figure out to get the data back into our environment was to while loop over get_document_analysis() in the maximum increments of 1,000 blocks. This also took time, and our 700 pages of tables came back as over 400,000 blocks commingled in a json object. To give an idea of the sizes involved, the full aggregated PDF resulted in a 210 MB json once we had called for all of the Blocks.
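The looping described above could be sketched roughly as follows, paging with NextToken until the job is exhausted; this is untested against a live job, and the parameter names follow the paws documentation.

```r
# Sketch: page through a finished Textract job 1,000 blocks at a time.
# "svc" and "JobId" come from the earlier chunks.
blocks <- list()
next_token <- NULL
repeat {
  resp <-
    svc$get_document_analysis(
      JobId = JobId$JobId,
      MaxResults = 1000,
      NextToken = next_token
    )
  blocks <- c(blocks, resp$Blocks)
  # An empty NextToken means the last page of Blocks has been returned
  if (length(resp$NextToken) == 0) break
  next_token <- resp$NextToken
}
```

Each iteration appends one page of Blocks, so after the loop, blocks holds the full commingled set for later reassembly.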

In the next step, we have our work cut out for us to extract the key elements from the json. A json is a common object for API calls, but when we were introduced to json a few years ago, it seemed a hopelessly impenetrable data structure, and one to be avoided if at all possible. Fortunately, time has moved on, and like many things, it now seems manageable. In the next post, we will attempt to reconstitute the Blocks into their original pages, and then parse out the desired elements for comparison. A lot is at stake, since we have invested quite a bit of time to get to this point.

# Get 1st 10 blocks
a <-
  svc$get_document_analysis(JobId = unlist(JobId))

listviewer::jsonedit(
  a[["Blocks"]][1:10]
)

# Conclusion

That concludes this section on setting up S3 and calling Textract, which seems like a complete segment. In addition to Textract, the paws link to the AWS SDK opens up many other options, including the obvious choices of EC2 and ECS, but also Rekognition for images, Polly for text to speech, Translate for languages, and Lambda, among others. It seemed like a good place to stop to keep the series in digestible increments. In the next post, Evaluating Mass Muni CAFR Textract Results - Part 5, we will show how to parse the complicated json response object back into a table, extract the desired elements for the cases where we failed to match, and evaluate how well Textract did on those difficult cases.