# Libraries
packages <-
  c("data.table",
    "reticulate",
    "paws.machine.learning",
    "paws.common",
    "keyring",
    "pdftools",
    "listviewer",
    "readxl"
  )
if (length(setdiff(packages, rownames(installed.packages()))) > 0) {
  install.packages(setdiff(packages, rownames(installed.packages())))
}
invisible(lapply(packages, library, character.only = TRUE))
knitr::opts_chunk$set(comment=NA, fig.width=12, fig.height=8, out.width = '100%')
# Introduction
In Evaluating Mass Muni CAFR Tabulizer Results - Part 3, we showed how to use pdftools and tabulizer to subset a group of PDFs, the AWS paws SDK package to store the PDFs in s3, and Textract machine learning to extract a block response object using its "asynchronous" process. Subsequently, we discovered an alternate route: save the desired pages as PNGs and send those page-by-page to AWS to get the same result. As mentioned in the last post, Scraping Failed Tabulizer PDFs with AWS Textract - Part 4, this method has the added advantage of being free to free-tier AWS users.
In this post, we will show how to do this, and also how to parse the response blocks (shown in the listviewer below), which took us a few days to figure out. As mentioned previously, the response blocks are complicated, with one nested list pertaining to text and another to the coordinates on the page. We attempted to write R code to extract the required data, but in the end decided to modify the AWS code and call it within RStudio using the reticulate package. This had the added benefit of bringing together several pieces we had been working on.
# Convert Selected Page to PNG with pdftools
Below, we show the steps to extract the Balance Sheet (page 27) from the Attleboro CAFR as a PNG with the pdftools pdf_convert() function. At first, we used the default dpi setting of 75, but we found that the resolution was too fuzzy and led to frequent OCR errors on particular letters. These errors were not eliminated, but were significantly reduced, once we switched to a dpi of 200, and it might be beneficial to go even higher because, for example, the "l" was commonly dropped when it occurred at the end of a word (e.g., "Governmenta").
# Point to particular muni pdf on disc
pdf <-
  paste0("/Users/davidlucey/Desktop/David/Projects/mass_munis/data/pdf_cafr/attleboro_2018.pdf")

# Create png name for muni
png <-
  paste0("/Users/davidlucey/Desktop/David/Projects/pdf_cafr_parse/attleboro.png")

# Convert report to png
pdf_convert(
  pdf,
  format = "png",
  pages = 27,
  filenames = png,
  dpi = 200
)
Converting page 27 to /Users/davidlucey/Desktop/David/Projects/pdf_cafr_parse/attleboro.png... done!
[1] "/Users/davidlucey/Desktop/David/Projects/pdf_cafr_parse/attleboro.png"
# Extract Page with AWS Textract Synchronous Operation
We input our AWS credentials and set up a Textract response object in the same way as in the previous post. One difference between the synchronous and asynchronous (demonstrated in the last post) operations is that paws sends the request and gets the response with just the analyze_document() function, instead of start_document_analysis() followed by a second call to get_document_analysis(). We also did not need to loop, because each page fit within the maximum of 1,000 blocks and was returned immediately without a second function call. Lastly, we used the Bytes argument to reference the PNG from our local environment, instead of pointing Textract to an s3 bucket.
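As a reminder, the svc client used below was set up with paws in the previous post. A minimal sketch follows; the keyring key names and the region are our assumptions, so substitute your own credentials.

# Minimal sketch of the Textract client set-up from the previous post;
# the keyring key names and region are assumptions -- use your own
svc <- paws.machine.learning::textract(
  config = list(
    credentials = list(
      creds = list(
        access_key_id = keyring::key_get("AWS_ACCESS_KEY_ID"),
        secret_access_key = keyring::key_get("AWS_SECRET_ACCESS_KEY")
      )
    ),
    region = "us-east-1"
  )
)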
# Call Textract for particular muni pdf page
response <-
  svc$analyze_document(
    Document = list(
      Bytes = png
    ),
    FeatureTypes = list(
      "TABLES"
    )
  )
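One caveat: we are passing the PNG file path to Bytes above. If your version of paws instead requires the document bytes themselves, a hedged alternative is to read the file into a raw vector first:

# Alternative sketch: read the PNG into a raw vector, in case your
# paws version does not accept a file path for the Bytes argument
png_bytes <- readBin(png, what = "raw", n = file.size(png))
response <- svc$analyze_document(
  Document = list(Bytes = png_bytes),
  FeatureTypes = list("TABLES")
)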
# Description and List View of Textract Response Object
In the chunk below, we show the anatomy of the response object in a subset of blocks. The first 10 tabs in the listviewer below have the PAGE (element 0) and some LINE elements (1-10), which are the parents of WORD and CELL. The TABLE block is shown in element 21, with its 168 children. WORD and CELL blocks are shown in lines 11-20 and 22-30, respectively. CELL and WORD blocks have no children, but can be used to find the location or row-column coordinates of a WORD. We spent a lot of time trying to understand the relationships between the different kinds of objects in order to parse the response in R, but in the end, it seemed easier just to use AWS's pre-built Python parser via reticulate.
# Sample of blocks
listviewer::jsonedit(
  response[["Blocks"]][c(1:10, 161:170, 421:430)]
)
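As a quick cross-check on this anatomy, the block types can also be tallied directly in R. This is a small sketch using the response object from above:

# Tally the BlockType of each block in the response (quick sketch)
block_types <- sapply(response[["Blocks"]], function(b) b[["BlockType"]])
table(block_types)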
# Parsing the Response Object with Reticulated Python
An increasing amount is being written about how to use reticulate to run Python code from RStudio, so we won't go into great detail about the set-up here. It took us a long time to understand how to configure our environment, but as this case shows, it is probably worth it to be able to switch back and forth and take advantage of the strengths of the two languages. We used Python 3.7 with miniconda, after re-installing Anaconda following the recent upgrade to Catalina, in which Apple re-arranged our whole set-up.
# Choose Python 3.7 miniconda
reticulate::use_condaenv(
  condaenv = "r-miniconda",
  conda = "/Users/davidlucey/opt/anaconda3/bin/conda",
  required = TRUE
)
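An optional check to confirm that reticulate bound the intended interpreter:

# Verify which Python interpreter reticulate is using
reticulate::py_config()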
AWS provides the Python code to parse the blocks back into tabular form in [Textract Python Table Parser](https://github.com/awsdocs/aws-doc-sdk-examples/blob/master/python/example_code/textract/textract_python_table_parser.py). That script builds in the call to AWS with the Python boto client from the command line, but we didn't need those functions, so we modified it to take the already-returned response straight from our RStudio environment.
A second problem we encountered was that, when the response object was read into Python from R, blocks which had multiple children Ids were converted to lists, while those with single children were stored as strings. It took some time, but we patched this in the get_text() function below by converting the strings to single-item lists, so it should now work for any R user. We noted where we made modifications with comments below.
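The mismatch most likely comes from reticulate's conversion rules: a length-one R character vector converts to a Python string, while a longer vector converts to a Python list. A quick illustration (the ids are made up):

# Why the patch below is needed: length-one character vectors become a
# Python str, longer ones become a Python list (ids are hypothetical)
reticulate::r_to_py("id-1")             # Python str
reticulate::r_to_py(c("id-1", "id-2"))  # Python list of str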
import webbrowser, os
import json
import io
from io import BytesIO
import sys

def get_rows_columns_map(table_result, blocks_map):
    rows = {}
    for relationship in table_result['Relationships']:
        if relationship['Type'] == 'CHILD':
            for child_id in relationship['Ids']:
                cell = blocks_map[child_id]
                if cell['BlockType'] == 'CELL':
                    row_index = cell['RowIndex']
                    col_index = cell['ColumnIndex']
                    if row_index not in rows:
                        # create new row
                        rows[row_index] = {}
                    # get the text value
                    rows[row_index][col_index] = get_text(cell, blocks_map)
    return rows

def get_text(result, blocks_map):
    text = ''
    if 'Relationships' in result:
        for relationship in result['Relationships']:
            if relationship['Type'] == 'CHILD':
                if isinstance(relationship['Ids'], str):  # Modified here
                    relationship_ids = [relationship['Ids']]
                else:
                    relationship_ids = relationship['Ids']
                for child_id in relationship_ids:
                    word = blocks_map[child_id]
                    if word['BlockType'] == 'WORD':
                        text += word['Text'] + ' '
    return text

def get_table_csv_results(response):
    blocks = response['Blocks']  # Modified here
    blocks_map = {}
    table_blocks = []
    for block in blocks:
        blocks_map[block['Id']] = block
        if block['BlockType'] == "TABLE":
            table_blocks.append(block)
    if len(table_blocks) <= 0:
        return "<b> NO Table FOUND </b>"
    csv = ''
    for index, table in enumerate(table_blocks):
        csv += generate_table_csv(table, blocks_map, index + 1)
        csv += '\n\n'
    return csv

def generate_table_csv(table_result, blocks_map, table_index):
    rows = get_rows_columns_map(table_result, blocks_map)
    table_id = 'Table_' + str(table_index)
    # get cells.
    csv = 'Table: {0}\n\n'.format(table_id)
    for row_index, cols in rows.items():
        for col_index, text in cols.items():
            csv += '{}'.format(text) + "\t"
        csv += '\n'
    csv += '\n\n\n'
    return csv

# Removed main()
We called our Python get_table_csv_results() function from reticulate (as py$get_table_csv_results()) and show the raw parsed text below. We will not show the full clean-up of the raw text string here, which still involves some complicated regular expressions (please refer to our Github repo for the code we used), though a simplified sketch follows the output below.
# Call Python function above from R
page <- py$get_table_csv_results(response)
# Print text
cat(page)
Table: Table_1
General Fund Governmental Funds Governmental Funds
Assets
Cash and investments -- unrestricted $ 9,663,898 $ 21,470,155 $ 31,134,053
Cash and investments - restricted (for stabilization
purposes) 3,172,925 - 3,172,925
Receivables:
Property taxes 1,266,214 -- 1,266,214
Motor vehicle excise 879,694 - 879,694
Tax liens and foreclosures 2,110,587 -- 2,110,587
User charges 557,376 -- 557,376
Intergovernmenta 160,060 2,728,377 2,888,437
Other 5,584,192 2,184,952 7,769,144
Due from ARA 616,048 -- 616,048
Total assets 24,010,994 26,383,484 50,394,478
Deferred outflows of resources
None - -- -
Total deferred outflows of resources - -- -
Total assets and deferred outflows of resources $ 24,010,994 $ 26,383,484 $ 50,394,478
Liabilities
Warrants payable $ 1,850,514 $ 1,745,496 $ 3,596,010
Accounts payable and accrued expenses 682,383 311,354 993,737
Retainage payable - 164,392 164,392
Due to federal and state governments - 19,101 19,101
Notes payable - 6,499,066 6,499,066
Total liabilities 2,532,897 8,739,409 11,272,306
Deferred inflows of resources
Property taxes paid in advance 62,260 - 62,260
Deferred property tax revenues 3,809,278 - 3,809,278
Deferred user fees and fines 6,141,568 -- 6,141,568
Deferred revenue from ARA 616,048 616,048
Unearned income - 8,240 8,240
Deferred grant income - 253,192 253,192
Deferred loan income - 2,124,915 2,124,915
Total deferred inflows of resources 10,629,154 2,386,347 13,015,501
Fund balance
Nonspendable - 315,478 315,478
Restricted - 7,235,505 7,235,505
Committed 3,172,925 9,500,776 12,673,701
Assigned 2,100,746 -- 2,100,746
Unassigned 5,575,272 (1,794,031) 3,781,241
Total fund balance 10,848,943 15,257,728 26,106,671
Total liabilities, deferred inflows of resources and fund balance $ 24,010,994 $ 26,383,484 $ 50,394,478
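As a flavor of the clean-up (the full regular-expression version is in our Github repo), the tab-delimited string returned above can be split into rows and padded into a table. A simplified sketch:

# Simplified sketch of the clean-up; the full version is in our repo
lines <- strsplit(page, "\n")[[1]]
lines <- lines[!grepl("^Table:", lines) & nzchar(trimws(lines))]   # drop headers/blanks
cells <- strsplit(lines, "\t")                                     # split rows on tabs
n_col <- max(lengths(cells))
rows  <- lapply(cells, function(x) { length(x) <- n_col; trimws(x) })  # pad ragged rows
dt    <- data.table::as.data.table(do.call(rbind, rows))
head(dt)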
# Comparing the Textract Results
As a refresher from Evaluating Mass Muni CAFR Tabulizer Results - Part 3, we started with 149 Massachusetts municipalities, and 5 PDFs couldn't be read at all with OCR (because of their formatting). Of the remaining CAFRs, we were able to match all elements of 121 perfectly, and 23 had one or more elements which failed to match. Textract didn't really help with the PDFs which couldn't be read by OCR, although we didn't really try those, because our method relied on pdftools using OCR to get the page index.

For our Textract trial, we started with the 23 municipalities where there were known problems matching manually extracted data for 43 elements. Of the 125 tables requested from Textract (for about $2), all but 4 were returned without problems. Of the challenging elements, we successfully extracted 21 from 12 municipalities. Recall that we were able to accurately extract all but about 6% of the cases we wanted with pdftools and tabulizer, so combined with Textract, we got about 97% right. We didn't spend much time fine-tuning our regex and cleaning of the raw Textract output, so if this were going to be a repeated process, it could likely be improved further.
# Series summary
As a reminder, these are the tools demonstrated in this 5-part series:

- Find the location of a table on a page with regex matching and pdftools [Tabulizer and pdftools Together as Super-powers]
- Extract a particular page from a pdf with pdftools [Tabulizer and pdftools Together as Super-powers]
- Aggregate multiple PDF pages together [Tabulizer and pdftools Together as Super-powers]
- Use pdftools to find exact table parameters and accurately extract a large number of slightly varying tables, combining pdftools and tabulizer [Tabulizer and pdftools Together as Super-powers]
- Connect and interact with AWS s3 with the paws SDK [Scraping Failed Tabulizer PDFs with AWS Textract]
- Send requests to AWS Textract from s3 and use the bulk asynchronous operation with the paws SDK [Scraping Failed Tabulizer PDFs with AWS Textract]
- Convert a PDF page to PNG, and send and receive single-page requests to AWS Textract [This post]
- Study the anatomy of the AWS JSON response object, and use the Python parser from R with reticulate to convert it back into tabular form [This post]
# Conclusion
It has been over 10 years since the SEC mandated that company financial statements be reported in XBRL. Marc Joffe has advocated for similar requirements for municipal reporting, a subject which becomes all the more critical at a time when many governments are likely to need emergency funding. After our exploration of almost 150 PDFs, we believe it would be possible for auditors to come close to the efficacy of XBRL with a few mandatory guidelines for PDF formatting.

Although there were 73 problem elements, only a few patterns were common in the large majority of cases:
- Scanned images of printed documents
- Tables which ran over onto a second page without repeating the column headers
- Tables which included a second section on the same page with large indentations
- Lack of uniformity of header and footer formatting and naming
- Lack of uniformity in line-item names (e.g., Total Net Position vs. Net Position)
We now have a dataset which could be used to greatly improve the machine readability of municipal financial statements with a small number of prescribed formatting guidelines.