
Scraping Failed Tabulizer PDFs with AWS Textract - Part 4

# Libraries
packages <- 
  c("data.table",
    "stringr",
    "rlist",
    "paws.machine.learning",
    "paws.storage",
    "paws.common",
    "tabulizer",
    "pdftools",
    "keyring",
    "listviewer"
    )

# Install any packages not already present
if (length(setdiff(packages, rownames(installed.packages()))) > 0) {
  install.packages(setdiff(packages, rownames(installed.packages())))
}

# Load all packages quietly
invisible(lapply(packages, library, character.only = TRUE))

knitr::opts_chunk$set(comment=NA, fig.width=12, fig.height=8, out.width = '100%')

Introduction

In Evaluating Mass Muni CAFR Tabulizer Results - Part 3, we found that we were able to accurately extract ~95% of the targeted data using tabulizer, but that might not be good enough for some applications. In this post, we will show how to subset specific pages of PDFs with pdftools' pdf_subset() function, merge those pages with pages from other municipalities using tabulizer's merge_pdfs(), and then upload the aggregated document to an AWS S3 bucket with paws, the R interface to the AWS SDK. Once the document is in an S3 bucket, we will show how to use paws to call AWS Textract, which uses OCR and machine learning to try to parse text and tables accurately.

AWS Textract Using the paws Package

Textract offers a number of alternatives for using OCR to extract structured text, forms and tabular data. The API allows you to manually upload up to 10 pages at a time and get back a response, and a second option offers up to 1,000 pages per month in PNG format, free for the first three months. This option also doesn't require an upload to an S3 bucket.
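
For the route that skips S3, paws also exposes Textract's synchronous analyze_document() operation, which takes the image bytes directly. A minimal sketch, assuming a hypothetical single-page PNG called "balance_sheet.png" and a Textract client (svc) configured the same way as shown later in this post:

# Sketch of the synchronous option: send the image bytes directly, no S3 bucket
# "balance_sheet.png" is a hypothetical single-page image
img <- readBin("balance_sheet.png", "raw", n = file.size("balance_sheet.png"))
resp <- 
  svc$analyze_document(
    Document = list(Bytes = img),
    FeatureTypes = list("TABLES")
  )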

Extracting from bulk PDFs, which is what we used, costs $0.015 per page for up to 1 million pages using the asynchronous API on documents stored in an S3 bucket. A logical workflow seemed to be to try tabulizer for free where possible, and then pay for the cases where the document can't be extracted with tabulizer or the error rate is expected to be high.

In this case, we will show how to subset five tables from the Attleboro CAFR, where three of the five desired fields failed to scrape, and from Hudson, MA, where the PDF couldn't be found on the town's website and is probably a scanned image. In our full project, we aggregated five pages from every CAFR in Massachusetts (a 30MB file of 700 pages) for a total cost of $11.
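
As a quick back-of-the-envelope check on that cost, using the per-page price quoted above:

# Rough cost estimate for the full 700-page project at the asynchronous rate
pages <- 700
price_per_page <- 0.015
pages * price_per_page  # ~$10.50, in line with the ~$11 billed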

# Problem PDFs
path <- "/Users/davidlucey/Desktop/David/Projects/mass_munis/"
bad_cases <- c("attleboro", "hudson")
pdfs <- 
  paste0(path, "data/pdf_cafr/", bad_cases, "_2018.pdf")

# Extract 5 pages from the Attleboro and Hudson CAFR PDFs with pdftools
pages_attleboro <- 
  as.integer(names(readRDS(paste0(path, "mass.RDS"))[[3]][["attleboro"]]))
attleboro <-  
  pdf_subset(
    pdfs[1],
    pages = pages_attleboro,
    output = paste0(path, "attleboro.pdf")
    )

pages_hudson <- 
  as.integer(names(readRDS(paste0(path, "mass.RDS"))[[3]][["hudson"]]))
hudson <-  
  pdf_subset(
    pdfs[2],
    pages = pages_hudson,
    output = paste0(path, "hudson.pdf")
    )

# Merge PDFs with tabulizer
merge_pdfs(
  c(paste0(path, "attleboro.pdf"), 
    paste0(path, "hudson.pdf")), 
  outfile = "joined.pdf"
  )
[1] "joined.pdf"

Setting up an S3 Bucket and Uploading a PDF

We then input our AWS credentials and set up an S3 client object (s3 below), which we use to instruct AWS to create an S3 bucket and then upload our subset PDF to S3. When setting the bucket name, it is important not to include punctuation, because such names will be rejected. Another mistake we made was uploading the PDF from outside our current working directory: S3 created a key structure to match our local disk, and Textract seemed unable to navigate that structure to find the document. We show the "munisubset" bucket at the bottom of the output below.

# Save file to S3 Bucket
s3 <- 
  s3( 
    config = list(
      credentials = list(
        creds = list(
          access_key_id = key_get("AWS_ACCESS_KEY_ID"),
          secret_access_key = key_get("AWS_SECRET_ACCESS_KEY")
        )
      ),
      region = "us-east-1"
    )
  )

# Create bucket
bucket_name <- "munisubset"
s3$create_bucket(
  Bucket = bucket_name
)
$Location
[1] "/munisubset"
# Load the file as a raw binary
file_name <- "joined.pdf"
read_file <- file(file_name, "rb")
s3_object <- 
  readBin(read_file, "raw", n = file.size(file_name))

# Put object in bucket
s3$put_object(
  Body = s3_object,
  Bucket = bucket_name,
  Key = file_name
)
$Expiration
character(0)

$ETag
[1] "\"e311e09f66a669056ba956c7e77b9ae1\""

$ServerSideEncryption
character(0)

$VersionId
character(0)

$SSECustomerAlgorithm
character(0)

$SSECustomerKeyMD5
character(0)

$SSEKMSKeyId
character(0)

$SSEKMSEncryptionContext
character(0)

$RequestCharged
character(0)
buckets <- s3$list_buckets()
buckets$Buckets[[1]][1:3]
$Name
[1] "munisubset"

$CreationDate
[1] "2020-04-30 12:32:00 GMT"

$<NA>
NULL
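
Given the working-directory issue mentioned above, a quick sanity check is to list the bucket's keys and confirm that the PDF sits at the top level rather than under a nested path. A sketch using the same s3 client:

# Sanity check (sketch): confirm "joined.pdf" is a top-level key in the bucket
objects <- s3$list_objects_v2(Bucket = bucket_name)
sapply(objects$Contents, `[[`, "Key")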

Setting up Textract Object and Calling Start Document Analysis

Next, we set up a Textract client object (svc below) and use start_document_analysis() to process the pages in the code below. Note that we select TABLES, but the other FeatureTypes options are FORMS, or FORMS and TABLES together. It is possible to get help with ?textract or ?start_document_analysis, just like any other function in R. The docs say that start_document_analysis() runs asynchronous analysis to look for relationships between detected items such as key-value pairs and tables. Running Textract on our 700 pages took more than an hour, so another step would be to figure out how to be notified of completion with AWS SNS. start_document_analysis() returns a JobId (shown below), which is required to get the analysis in the next step.

# Set up Amazon Textract object
svc <- 
  textract( 
    config = list(
      credentials = list(
        creds = list(
          access_key_id = key_get("AWS_ACCESS_KEY_ID"),
          secret_access_key = key_get("AWS_SECRET_ACCESS_KEY")
        )
      ),
      region = "us-east-1"
    )
  )

# Run Textract on "joined.pdf" in the "munisubset" S3 bucket
# The Textract function is "start_document_analysis", which runs asynchronously on PDFs
# Output is the JobId used for "get_document_analysis"
# Feature type is set to "TABLES"
JobId <- 
  svc$start_document_analysis(
    Document = list(
      S3Object = list(
        Bucket = "munisubset",
        Name = "joined.pdf"
    )
  ),
  FeatureTypes = list(
    "TABLES"
  )
)

# Textract job identifier
JobId
$JobId
[1] "fee4fb4042e7b5d21949d17b211e6bdbc6a3441939d28480f0e858ac98f1e0a5"

Recalling the Blocks from Textract

Below, we show our call to paws' get_document_analysis() using the JobId we received back from Textract above. A few things to mention: when called in bulk on a PDF, Textract stores the pages from all of the documents together as "Blocks". We searched around, but it doesn't seem possible to download the whole job with all of the pages at one time. The only way we could find to get the data back into our environment was to while loop over get_document_analysis() in increments of the 1,000-block maximum. This also took time, and our 700 pages of tables came back as over 400,000 blocks commingled in a JSON object. To give an idea of the sizes involved, the full aggregated PDF resulted in a 210 MB JSON once we had called for all of the Blocks.

In the next step, we have our work cut out for us to extract the key elements from the JSON. JSON is a common format for API responses, but when we were first introduced to it a few years ago, it seemed like a hopelessly impenetrable data structure, one to be avoided if at all possible. Fortunately, time has moved on, and like many things, it now seems manageable. In the next post, we will attempt to reconstitute the Blocks into their original pages, and then parse out the desired elements for comparison. A lot is at stake, since we have invested quite a bit of time to get to this point.

# Get document analysis and view the first 10 Blocks
a <- 
  svc$get_document_analysis(JobId = unlist(JobId))

listviewer::jsonedit(
  a[["Blocks"]][1:10]
)
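
A sketch of the loop described above, paging through the results with NextToken in increments of the 1,000-block maximum:

# Sketch of the retrieval loop: collect all Blocks 1,000 at a time
blocks <- list()
next_token <- NULL
repeat {
  resp <- 
    svc$get_document_analysis(
      JobId = unlist(JobId),
      MaxResults = 1000L,
      NextToken = next_token
    )
  blocks <- c(blocks, resp$Blocks)
  next_token <- resp$NextToken
  # Textract returns an empty NextToken when no Blocks remain
  if (length(next_token) == 0) break
}
length(blocks)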

Conclusion

That concludes this segment on setting up S3 and calling Textract. In addition to Textract, the paws link to the AWS SDK opens up many other options, including the obvious ones like EC2 and ECS, but also Rekognition for images, Polly for text to speech, Translate for languages and Lambda, among others. This seemed like a good place to stop to keep the series in digestible increments. In the next post, Evaluating Mass Muni CAFR Textract Results - Part 5, we will show how to parse the complicated JSON response object back into tables, extract the desired elements for the cases where tabulizer failed to match, and evaluate how well Textract did on those difficult cases.