# Libraries
packages <-
c("data.table",
"stringr",
"rlist",
"tabulizer",
"pdftools",
"parallel",
"DT"
)
if (length(setdiff(packages,rownames(installed.packages()))) > 0) {
install.packages(setdiff(packages, rownames(installed.packages())))
}
invisible(lapply(packages, library, character.only = TRUE))
knitr::opts_chunk$set(comment=NA, fig.width=12, fig.height=8, out.width = '100%')
Introduction
This post will be a continuation of Parsing of Mass Municipal PDF CAFR’s with Tabulizer, pdftools and AWS Textract - Part 1 dealing with extracting data from PDFs using R. When Redwall discovered pdftools
, and its pdf_data()
function, which maps out every word on a pdf page by x-y coordinate, we thought that was interesting, but didn’t really know how to use it. We also didn’t have the regular expression skills, and were much more befuddled by the nested list structures than we are now.
As for tabulizer
, it took about a year before rJava magically started working properly, but even then we it wasn’t possible to consistently read a large number of tables of different sizes without cutting off fields in unexpected ways. Only in this Mass pdf scraping project have we realized that, by combining these two packages, it becomes possible to access data in reliable way, from a large number of varied pdf formats.
Our Massachusset’s municipal CAFR project provided a perfect opportunity to put all these pieces together. This blog post will consist of a step-by-step walk through which will hopefully help others avoid some of the pain that we experienced in getting to this point.
Walk Through Plan
To begin with, we had to download the pdfs from the CAFR Library at the Center for Municipal Finance. We won’t show the code to do the download here, but it can be found at reason_pdf_parser.R. In order to do this on the scale that we plan for this project, we had to build nested lists with the pdf metadata of 150 Massachussett’s CAFR pdfs. For now, we will just walk through a few key points using a single statment from the Abington, MA 2018 Statement of Net Position from the CAFR downloaded here.
# Set up pdf and pdf_path to directory
dir <- "/Users/davidlucey/Desktop/David/Projects/mass_munis/data/pdf_cafr/"
city <- "abington"
pdf <- paste0(city, "_2018.pdf", collapse="")
pdf_path <- paste0(dir, pdf, collapse = "")
# Run pdf_data on Abington CAFR
abington <- pdftools::pdf_data(pdf_path)
# Name each page of list for page index in pdf
names(abington) <- 1:length(abington)
# Look at structure of 2n element in 92-page nested list
str(abington[[2]])
Classes 'tbl_df', 'tbl' and 'data.frame': 266 obs. of 6 variables:
$ width : int 42 19 76 124 58 20 96 19 41 73 ...
$ height: int 15 15 15 15 15 15 15 15 15 15 ...
$ x : int 168 215 238 319 92 154 179 279 302 348 ...
$ y : int 72 72 72 72 102 102 102 102 102 102 ...
$ space : logi TRUE TRUE TRUE FALSE TRUE TRUE ...
$ text : chr "TOWN" "OF" "ABINGTON," "MASSACHUSETTS" ...
PDF Tools pdf_data Functionality
The above is a list of data.frames containing metadata of the location of every word on every one of the 92 pages of the Abington 2018 CAFR. The structure of the second page is shown above. But, we only need the key financial statements, so would like to drop the majority of pages which don’t have what we need. For example, we know that page 16 has the Statement of Net Position. We could search for that page using regular expressions as shown for variable sonp
below.
See how we extract only the head unique 5 lines of each page by y, paste those lines back together, then match our regular expression on the text of those lines looking for the phrase “STATEMENT OF NET POSITION”. However, there are several pages meeting this criteria, such as “Proprietary Funds” Statement of Net Position on page 20 (which we don’t want). Most of these other pages can be eliminated by choosing NOT to match the word “FUNDS” by the same process, so notice that we negate our second match with “!”. Hence, our sonp_index
comes back as 16, which can be used to filter out the remaining pages.
# Convert elements to data.table
abington <- mclapply(abington, setDT)
# Get index of Abington Statement of Net Position
sonp_index <-
which(
unlist(
mclapply(abington, function(page){
(str_detect(
paste(
# Reformat top 5 lines by y and look for match to "STATEMENT OF NET POSITION"
page$text[
page$y %in% head(unique(page$y), 5)
],
collapse = " "
),
"STATEMENT OF NET POSITION"
) &
# And requires both statements to be TRUE
!str_detect(
paste(
page$text[
# Reformat top 5 lines by y and look for non match to "FUNDS"
page$y %in% head(unique(page$y), 5)
],
collapse = " "
),
"FUNDS"
)
)
}
)
)
)
# Extract and View Statement of Net Position pdftools pdf_data metadata
sonp <- abington[sonp_index][[1]]
sonp
width height x y space text
1: 32 6 264 73 TRUE STATEMENT
2: 7 6 299 73 TRUE OF
3: 10 6 308 73 TRUE NET
4: 25 6 320 73 FALSE POSITION
5: 14 6 287 87 TRUE JUNE
---
337: 71 9 144 745 FALSE Massachusetts
338: 11 11 300 743 FALSE 13
339: 26 9 426 745 TRUE Basic
340: 43 9 456 745 TRUE Financial
341: 53 9 502 745 FALSE Statements
Above is the text grid of Abington’s Statement of Net Position as taken by pdftools
. Using this metadata, we can begin to put together exact area parameters for Tabula. Even better, we could programically do it for a large number of tabless. In our experience, this is important because the tabulizer
default lattice
method for tabular data can be unpredictable cutting off fields unexpectedly.
Tabulizer Area Coordinates
Tabulizer
specifies pages in blocks of 72 * inches, so a typical 8.5 x 11 verticle page would have dimensions of 612 x 720. This coordinate grid is used to specify the area
parameter (top
, left
, bottom
and right
). All of of Massachusett’s financial statement tables have a "$"
sign in the first and last rows, so those could be used to locate the top
or bottom
paramenters. In addition, all pages including financial statements have language referring users to the “notes to the financial statements” usually on the second to last line, which could be the “bottom”, or the midpoint between the bottom and the "$"
(if more room was needed).
# Determine if page is verticle or horizontal
x <- 8.5 * 72
y <- 11 * 72
max_x <- max(sonp$x)
max_y <- max(sonp$y)
orientation <-
ifelse(x < max_x, "horizontal", "verticle")
# TOP
# Keys on the first instance of the year "2018"
table_top <-
min(sonp$y[str_detect(sonp$text, "2018") & sonp$space==FALSE])
# Find the height at in the table_top row
height_top <- unique(sonp$height[sonp$y == table_top])
# Add table_top and height_top to avoid slicing row
top <- table_top + height_top
# BOTTOM
# Table Bottom marked by last instance of character "$"
table_bottom <-
max(sonp$y[str_detect(sonp$text, "\\$")])
# Height at bottom row of table
height_bottom <- unique(sonp$height[sonp$y == table_bottom])
# Bottom of table
bottom <- table_bottom + height_bottom
# LEFT
# Add some space to leftmost x coordinate to avoid slicing
left <-
ifelse( min(sonp$x) - 30 > 0,
min(sonp$x) - 30, 1 )
# RIGHT
# Find width at maximum "x" coordinate
width_max_x <- max(sonp$width[sonp$x == max_x])
# Add width at maximum "x" plus more space wether verticle or horizontal
right <-
max_x + width_max_x + ifelse(orientation == "verticle", 30, 50)
# FINAL AREA PARAMETER FOR TABULIZER AS INTEGER VECTOR
# Note the specification as an integer vector
a <- c(top, left, bottom, right)
# Show coordinates
a
[1] 93 24 681 585
We give an example for Abington’s Statement of Net Position above, starting with the maximum x
and y
, and determining the page orientation (ie: verticle or horizontal). After finding the location of the date line at the top, and walk down a little from there to set a table_top
variable. Typically, it is best to leave a little margin between the page header and the top of the table. The bottom of the table is set adding the height to the bottom line of the table, and left parameter is set by taking the smallest x
coordinate and reducing by a little to margin for error. We leave a larger margin for the right-most coordinate because this is where we have found that the most errors occur, often when the algorithm seems to try to squish the table into the available columns.
In our experience, the most problems come with missetting the top and right parameters. Indentation can also confuse the algorithm. Columns can be split in the middle into two columns, often at the far-rightmost, for example. In the end, we chose parameters of 93 (top), 24 (left), 681 (bottom) and 585 (right).
Tabulizer extract_table()
Function
Below we run our area parameters we derived above through tabulizer
. Note that the area parameter, itself an integer vector, is further wrapped as a list because not having this structure throws an error. In addition, avoid the half day of wheel spinning we experienced by specifying guess as “F” to over-ride the default lattice, otherwise your area parameter is ignored with no warning. Also, we use the sonp_index integer to specify the page of the pdf. There are several options for output which all work as expected, but data.frame seems most natural.
# Tabulizer extract_tables output is a list
abington_sonp <-
extract_tables(
pdf_path,
pages = sonp_index,
area = list(a),
guess = F,
output = "data.frame")
# Extract and print single element from list
abington_sonp <- abington_sonp[[1]]
abington_sonp
X
1
2
3 ASSETS
4 CURRENT:
5 Cash and cash equivalents................................................ $
6 Receivables, net of allowance for uncollectibles:
7 Real estate and personal property taxes...........................
8 Tax liens.....................................................................
9 Community preservation fund surtax.................................
10 Motor vehicle and other excise taxes.................................
11 User charges...............................................................
12 Departmental and other................................................
13 Intergovernmental.........................................................
14 Community preservation state share.................................
15 Special assessments......................................................
16 Tax foreclosures...............................................................
17 Total current assets...................................................
18 NONCURRENT:
19 Receivables, net of allowance for uncollectibles:
20 Special assessments......................................................
21 Capital assets, nondepreciable.............................................
22 Capital assets, net of accumulated depreciation.....................
23 Total noncurrent assets.............................................
24 TOTAL ASSETS..................................................................
25
26 DEFERRED OUTFLOWS OF RESOURCES
27 Deferred outflows related to pensions....................................
28 Deferred outflows related to other postemployment benefits......
29 TOTAL DEFERRED OUTFLOWS OF RESOURCES..................
30 LIABILITIES
31 CURRENT:
32 Warrants payable............................................................
33 Accrued payroll...............................................................
34 Health claims payable.........................................................
35 Accrued interest...............................................................
36 Accrued liabilities...............................................................
37 Capital lease obligations...................................................
38 Landfill closure..................................................................
39 Compensated absences......................................................
40 Notes payable..................................................................
41 Bonds payable..................................................................
42 Total current liabilities................................................
43 NONCURRENT:
44 Landfill closure..................................................................
45 Compensated absences......................................................
46 Net pension liability............................................................
47 Net other postemployment benefits liability..............................
48 Bonds payable..................................................................
49 Total noncurrent liabilities..........................................
50 TOTAL LIABILITIES...............................................................
51 DEFERRED INFLOWS OF RESOURCES
52 Deferred inflows related to pensions....................................
53 NET POSITION
54 Net investment in capital assets...............................................
55 Restricted for:
56 Permanent funds:
57 Expendable................................................................
58 Nonexpendable............................................................
59 Gifts and grants...............................................................
60 Community preservation...................................................
61 Unrestricted...........................................................................
62 TOTAL NET POSITION......................................................... $
X.1 X.2 Primary.Government X.3 X.4
1 NA Governmental Business-type
2 NA Activities Activities Total
3 NA
4 NA
5 NA 10,392,587 $ 7,449,193 $ 17,841,780
6 NA
7 NA 313,316 - 313,316
8 NA 882,182 34,598 916,780
9 NA 6 ,245 - 6,245
10 NA 387,455 - 387,455
11 NA - 1,930,158 1,930,158
12 NA - 149,296 149,296
13 NA 1,715,882 - 1,715,882
14 NA 70,735 - 70,735
15 NA - 32,137 32,137
16 NA 663,449 - 663,449
17 NA 14,431,851 9,595,382 24,027,233
18 NA
19 NA
20 NA - 10,712 10,712
21 NA 101,526,106 1,614,044 103,140,150
22 NA 26,998,272 31,944,596 58,942,868
23 NA 128,524,378 33,569,352 162,093,730
24 NA 142,956,229 43,164,734 186,120,963
25 NA
26 NA
27 NA 417,711 7,062 424,773
28 NA 1,982,740 19,982 2,002,722
29 NA 2,400,451 27,044 2,427,495
30 NA
31 NA
32 NA 660,037 403,389 1,063,426
33 NA 206,897 75,704 282,601
34 NA 311,064 - 311,064
35 NA 278,817 71,281 350,098
36 NA 50,638 2,247,039 2,297,677
37 NA - 53,845 53,845
38 NA 139,000 - 139,000
39 NA 346,271 26,896 373,167
40 NA 53,168 - 53,168
41 NA 2,462,040 491,136 2,953,176
42 NA 4,507,932 3,369,290 7,877,222
43 NA
44 NA 4,080,000 - 4,080,000
45 NA 881,952 20,919 902,871
46 NA 19,188,882 324,438 19,513,320
47 NA 67,618,712 681,460 68,300,172
48 NA 45,761,763 5,764,229 51,525,992
49 NA 137,531,309 6,791,046 144,322,355
50 NA 142,039,241 10,160,336 152,199,577
51 NA
52 NA 1,399,576 23,663 1,423,239
53 NA
54 NA 82,168,482 27,395,220 109,563,702
55 NA
56 NA
57 NA 99,189 - 99,189
58 NA 69,778 - 69,778
59 NA 1,088,568 - 1,088,568
60 NA 740,211 - 740,211
61 NA (82,248,365) 5,612,559 (76,635,806)
62 NA 1,917,863 $ 33,007,779 $ 34,925,642
Clean up
The tabulizer
output is still in a raw form with colums sometimes determined by indentations and x
values, such as the “$” signs. The numbers are in character form with commas and sometimes negative numbers are shown in parenthesis, and need to be parsed into numeric. The item names often have a long series of periods which need to be stripped. The biggest challenge is the column names which often include the first row of the full column name, and need to be rebuilt. This is not a small task and not what we were hoping to illustrate in this post, so we are just showing the output below. Please refer to our Github code for the a more complete explanation and solutions to many of these issues.
Final Product
Though there is still work to be done, the final product of this post is shown above. Single elements could be extracted to form a database, or the output could be saved to csv. The headers such as ASSETS or LIABILITIES could be nested. The main point is that short of XBRL, the data has been set free from the PDF in a machine readable form. Not only that, this general process can be repeated for a large number of slightly differing PDFs with a relatively high low error rate as we will show in the next post Evaluating Mass Muni CAFR Tabulizer Results - Part 3. In cases where errors do occur, a second layer can be used to run the more challenging PDFs through AWS Textract SDK. We will show how this is done in our next post.