# Libraries
packages <- 
  c("data.table",
    "rlist",
    "stringr",
    "pdftools",
    "readxl"
    )

if (length(setdiff(packages,rownames(installed.packages()))) > 0) {
  install.packages(setdiff(packages, rownames(installed.packages())))  
}

invisible(lapply(packages, library, character.only = TRUE))

knitr::opts_chunk$set(comment=NA, fig.width=12, fig.height=8, out.width = '100%')

Introduction

This post is a continuation Tabulizer and pdftools Together as Super-powers - Part 2 where we showed how combining pdftools and tabulizer together could lead to better, more scaleable data extraction on a large number of slightly varying pdfs. Although the full process used to extract data from all our 149 PDFs will not shown in this series, to review the steps that Redwall followed for the portion using pdftools and tabulizer:

Load all the pdf_data (metadata) for every page of the 148 PDFs into a list of lists with the metadata data.frame from pdftool’s pdf_data.
Filter the nested list to keep only the key tables (ie: Statement of Activities, Balance Sheet, Statement of Net Position) using regular expression matching of sub-elements.
Find the coordinates for the key Tabulizer area parameters using regular expressions and the pdf_data metadata.
Extract those tables with Tabulizer into data.frames.
Clean up the raw output especially column headers.
Extract out key fields (ie: Assigned Net Balance, Unassigned Net Balance and Total Expenditures).
Compare to manually extracted data from Marc Joffe.

Please see our Github for the code. In this post, we will evaluate how we did with with this first method.

# Load comparison of scraped vs Reason spreadsheet
mass_compare <- 
  readxl::read_excel("~/Desktop/David/Projects/pdf_cafr_parse/intermediate_data/mass_compare.xls", 
    col_types = c("skip", "text", "numeric", 
        "numeric", "numeric", "numeric", 
        "numeric", "numeric", "numeric", 
        "numeric", "numeric", "numeric", 
        "numeric", "numeric", "numeric", 
        "numeric", "numeric"))
mass_compare <- setDT(mass_compare)

Tabulizer Results Compared to Manual Spreadsheet

In the code below, we show the PDFs where all fields did not match the manually-compiled spreadsheet. It is encouraging that there are only 5 (Boston didn’t match only because its CAFR numbers were rounded to the nearest thousand). We also discovered that our CAFR library didn’t have a 2018 CAFR in some cases, so that was the reason some didn’t return any scraped results. In addition, some CAFR’s (Clinton and Montague) are released as scanned images of printed documents, which don’t work with pdftools, so we couldn’t get page indices to send to tabulizer.

# Filter diff != 0 (cases where there wasn't a match)
diff <- 
  names(mass_compare)[str_detect(names(mass_compare), "diff_")]

# All missing
all_missing <- mass_compare[, 
  problem := as.logical(apply(.SD, 1, function(x) all(x != 0))), 
  .SDcols = diff][
    ][problem == 1]$muni

all_missing

[1] "boston"      "clinton"     "montague"    "new_bedford" "sandwich"   
[6] "weston"

The next table shows cases where one or more variable was not being extracted properly (difference not equal to zero). Of the 720 potential items in the spreadsheets which were successfully parsed, 43 did not match. 94% accuracy seems pretty good at first glance, but an analyst probably couldn’t rely on this without some manual back-up checking. Another possible objective would be to know which formats would have a higher likelihood of failure, and just check those. For example, some tables run over onto a second page, and others have sections with larger indentations. In the next step, we will slice the problematic tables out of the relevant PDF, and run these through textract to see what happens.

Figure 1: 43 Elements had differences with manually extracted data

Conclusion

The combination of pdftools and tabulizer gave encouraging results, and we are confident that further fine tuning could further reduce the error rate. We also have some ideas about which tables fail most often, so the risk of an unexpected mismatch might be mitigated by manual intervention or getting a back up from Textract in these cases. In the next post Scraping Failed Tabulizer PDFs with AWS Textract - Part 4, we use the AWS paws SDK to save an aggregated PDF in S3 and use their machine-learning based Textract solution to extract the tables to see how their solution does for the difficult cases.

Evaluating Mass Muni CAFR Tabulizer Results - Part 3

Introduction

Tabulizer Results Compared to Manual Spreadsheet

Conclusion