Introduction
As financial professionals and analytic software lovers, the ability to efficiently load a large number of financial statements, and conduct an analysis has always been a key objective. In previous posts, Redwall Analytics worked with a 15-year time series of municipal Comprehensive Annual Financial Reports (CAFR) for 15 Fairfield County, CT towns Fairfield County Town Level Spending and Liabilities Gallop Since 2001. We also studied the worrisome long-term, time series of unfunded liabilities for all 169 Connecticut municipalities Connecticut City Unfunded Pension And OPEB Liabilities. We are probably most proud of our work replicating the complicated one-year (2016) spreadsheet analysis by Marc Joffe (Reason Foundation) and Mark Fitch (Yankee Institute) over the long-term Replicating Yankee Institute Risk Score Over 15 Years.
However, all of these were conducted using downloaded .csv data from the State’s website, not on the kind of structured XBRL data now required in real-time for public companies on the SEC’s Edgar website. Despite several attempts, XBRL had remained elusive up until now, but with daily practice, tasks previously beyond reach, suddenly become achievable.
Methodology
A good deal of credit and much of the code to extract XBRL data here from Edgar comes Aaron Mumala’s 2018 blog post: Accessing Financial Data from the SEC - Part 2. In addition, Micah Waldstein’s Parsing Functions in edgarWebR, as well as his edgarWebR package was helpful in understanding the structure and finding documents on the Edgar website, but the package doesn’t seem to be actively maintained, so we ultimately had to scrape the locations of the filings. Lastly, none of this would have been possible without Darko Bergant’s finstr and Roberto Bertolusso’s XBRL packages. To summarize the workflow to be described below, filing data is extracted with web scraping, the XBRL Instance Documents are downloaded and parsed into a list of data.frames with the XBRL package xbrlDoAll function, and finally, statements are organized into traditional Income Statements and Balance Sheets using the finstr package.
This will be the first in a three-part series. First, a quick walk-through to try to fill in some gaps from Aaron Mumala’s blog post and current documentation to help others up the learning curve with XBRL. Even better, we would welcome comments clarifying solutions to unresolved issues or even more efficient ways of extracting the same data than we have used. In the second part, we will use another free source of financial statement data (financialmodelingprep.com API) to pull 10-year’s of income statements for ~700 publically-listed pharma companies and explore R&D spending. Lastly, we will look to mine financial statement items for warning signs using metrics from a 2003 piece in the CFA Conference Proceeding.
Scraping Apple’s 10-K Filing Links from Edgar
We previously used edgarWebR for to find href links pertaining to filings, but because it stopped working with recent updates, so had to build the web scraper below. Below, we choose to query all past Apple XBRL 10-K filings up until last week. 20 results are available along with the web link (href) to each set of filings. Although filings prior to 2008 are available, these are in html format, and can’t be parsed with finstr and XBRL. A later effort may be made to retrieve these by scraping and parsing the html, but this is an advanced operation.
# Filter Edgar company search for "10-K" starting in "1990" in this case for AAPL (CIK 0000320193)
url <-
"https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000320193&type=10-K&dateb=1990&owner=exclude&count=25"
# Scrape filing page identfier numbers
filings <-
read_html(url) %>%
html_nodes(xpath='//*[@id="seriesDiv"]/table') %>%
html_table() %>%
as.data.table() %>%
janitor::clean_names()
# Drop pre XBRL filings
filings <- filings[str_detect(format, "Interactive")]
# Showing last 5 years
filings[1:5, c(1,3:4)]
filings
1: 10-K
2: 10-K
3: 10-K
4: 10-K
5: 10-K
description
1: Annual report [Section 13 and 15(d), not S-K Item 405]Acc-no: 0000320193-19-000119 (34 Act) Size: 12 MB
2: Annual report [Section 13 and 15(d), not S-K Item 405]Acc-no: 0000320193-18-000145 (34 Act) Size: 12 MB
3: Annual report [Section 13 and 15(d), not S-K Item 405]Acc-no: 0000320193-17-000070 (34 Act) Size: 14 MB
4: Annual report [Section 13 and 15(d), not S-K Item 405]Acc-no: 0001628280-16-020309 (34 Act) Size: 13 MB
5: Annual report [Section 13 and 15(d), not S-K Item 405]Acc-no: 0001193125-15-356351 (34 Act) Size: 9 MB
filing_date
1: 2019-10-31
2: 2018-11-05
3: 2017-11-03
4: 2016-10-26
5: 2015-10-28
There are generally about 15 elements in a filing package for a given year, as can be seen here for Apple in 2019. We only want to extract the XBRL Instance Document (XML), which is at the bottom of the table. In order to get this manually, we would have to click on the Document link. Instead, we show the steps to extract the actual links one by one using regex and again web scraping below. We also show the links for most recent five years below. Copy and pasting any of these links into the browser would show the 10-K for the relevant year.
# Extract filings identifiers with regex match of digits
pattern <- "\\d{10}\\-\\d{2}\\-\\d{6}"
filings <- filings$description
filings <-
stringr::str_extract(filings, pattern)
# Build urls for filings using filing numbers and Edgar's url structure
urls <-
sapply(filings, function(filing) {
# Rebuild URL to match Edgar format
url <-
paste0(
"https://www.sec.gov/Archives/edgar/data/320193/",
paste0(
str_remove_all(filing, "-"),"/"),
paste0(
filing, "-index.htm"),
sep="")
# Return Url
url
})
# Extract hrefs with links to 10-K filings
aapl_href <-
sapply(urls, function(url) {
# Pattern tomatch
href <- '//*[@id="formDiv"]/div/table'
# Table of XBRL Documents
page <-
read_html(url) %>%
xml_nodes('.tableFile') %>%
html_table()
# Extract document
page <- rbindlist(page)
document <-
page[str_detect(page$Description, "XBRL INSTANCE DOCUMENT")]$Document
# Take break not to overload Edgar
Sys.sleep(2)
# Reconstite as href link
href <-
paste0(str_remove(url, "\\d{18}.*$"),
str_extract(url, "\\d{18}"),
"/",
document)
# Return
href
})
# Show first 5 hrefs
aapl_href[1:5]
0000320193-19-000119
"https://www.sec.gov/Archives/edgar/data/320193/000032019319000119/a10-k20199282019_htm.xml"
0000320193-18-000145
"https://www.sec.gov/Archives/edgar/data/320193/000032019318000145/aapl-20180929.xml"
0000320193-17-000070
"https://www.sec.gov/Archives/edgar/data/320193/000032019317000070/aapl-20170930.xml"
0001628280-16-020309
"https://www.sec.gov/Archives/edgar/data/320193/000162828016020309/aapl-20160924.xml"
0001193125-15-356351
"https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/aapl-20150926.xml"
Parse XBRL For all Instance Documents Using XBRL Package
We have not shown here, but XBRL parsing for US companies is governed by the schemas from the US GAAP reported Taxonomy, which are updated annually. These can be found here on the XBRL US website XBRL Taxonomy. By downloading the relevant .xsd file, removing the .xml suffix at the end and moving it to in the “xbrl.Cache” file created when you run XBRL, these problems were resolved for XBRL Instance Documents. Redwall spent quite a bit of time figuring out that we needed currency, dei, exch, country and currency files among others. We did this manually and by trial and error, but it seems likely that there is a better way. A more efficient solution might be found here on Stack Overflow XBRL Package Error in File.
Below, we use the xbrlDoAll function from the XBRL package to extract the financial statements using the links from the aapl_href list we made above. Note that we try(), because the function would fail on some of the documents and stop at that point.
# This chunk was run previously and stmt_list was saved for the purposes of this blog post
# Disable stringsAsFactors so XBRL parsing works
options(stringsAsFactors = FALSE)
# Run xbrlDoAll and store in stmt_list
stmt_list <-
# apply function to each element of href_list
lapply(aapl_href, function(doc) {
# Extract XBRL of using specified href
try(XBRL::xbrlDoAll(doc, cache.dir = "/Users/davidlucey/Desktop/David/Projects/xbrl_investment/xbrl.Cache"))
})
# Filter elements which failed to parse
stmt_list <-
Filter(function(x) length(x) > 1, stmt_list)
As we have set it up in the code above, we loop through all Apple’s 10-K hrefs parsing with the XBRL package. This gives a list object for each annual reporting package, each in turn containing 9-10 data.frames (as shown below). Most of the data.frames relate to the taxonomy, governed by the .xsb files for that year. Starting in 2008, 10-Ks began being filed as “XBRL Instance Documents”, and these work seamlessly with the workflow described above. Around 2018, 10-K’s shifted to “Extended XBRL Instance Document” format, which we could not build into financial statements with finstr.
# Show one of extracted XBRL statements
summary(stmt_list[[3]])
Length Class Mode
element 8 data.frame list
role 5 data.frame list
calculation 11 data.frame list
context 13 data.frame list
unit 4 data.frame list
fact 9 data.frame list
definition 11 data.frame list
label 5 data.frame list
presentation 11 data.frame list
In the future, we would like to better understand the XBRL data structure, but again, it wasn’t easy to find much simple documentation on the subject. This post from Darko Bergant gives an excellent diagram of the structural hierarchy of the XBRL object Exploring XBRL files with R. He goes on in the same post: All values are kept in the fact table (in the fact field, precisely). The element table defines what are these values (the XBRL concepts, e.g. “assets”, “liabilities”, “net income” etc.). The context table defines the periods and other dimensions for which the values are reported.
Below, we show the first 50 lines of the “fact” list of the 2017 XBRL package. As noted above, data here is nested in levels, with the top level being the Balance Sheet, Income Statement, etc (again see Exploring XBRL files with R). It is possible to drill down to lower level items. Mr. Bergant also explains here how to for example extract a lower level item such as warranty information which wouldn’t be broken out at the top level How to get all the elements contained in the original XBRL?.
# Drill down on financial statement "fact" for individual year in stmt_list
stmt_list[[3]]$fact[1:10,c(2:4)]
contextId
1 FD2015Q4YTD
2 FD2015Q4YTD_us-gaap_StatementEquityComponentsAxis_us-gaap_AccumulatedOtherComprehensiveIncomeMember
3 FD2015Q4YTD_us-gaap_StatementEquityComponentsAxis_us-gaap_CommonStockIncludingAdditionalPaidInCapitalMember
4 FD2015Q4YTD_us-gaap_StatementEquityComponentsAxis_us-gaap_RetainedEarningsMember
5 FD2016Q4YTD
6 FD2016Q4YTD_us-gaap_StatementEquityComponentsAxis_us-gaap_AccumulatedOtherComprehensiveIncomeMember
7 FD2016Q4YTD_us-gaap_StatementEquityComponentsAxis_us-gaap_CommonStockIncludingAdditionalPaidInCapitalMember
8 FD2016Q4YTD_us-gaap_StatementEquityComponentsAxis_us-gaap_RetainedEarningsMember
9 FD2017Q4YTD
10 FD2017Q4YTD_us-gaap_StatementEquityComponentsAxis_us-gaap_AccumulatedOtherComprehensiveIncomeMember
unitId fact
1 usd 748000000
2 usd 0
3 usd 748000000
4 usd 0
5 usd 379000000
6 usd 0
7 usd 379000000
8 usd 0
9 usd 620000000
10 usd 0
Apply XBRL_get_statements via finstr Package to transform XBRL into Financial Statements
Next, we run finstr’s xbrl_get_statements on our stmt_list objects (ie: the XBRL Instance Documents) to convert the parsed XBRL objects into financial statements.
# This chunk was run previously and result_list was saved for the purposes of this blog post
result_list <-
lapply(stmt_list, function(stmt) {
try(xbrl_get_statements(stmt))
})
Error : Each row of output must be identified by a unique combination of keys.
Keys are shared for 34 rows:
* 6, 8
* 5, 7, 9
* 49, 51
* 48, 50
* 55, 57
* 54, 56
* 11, 13
* 10, 12
* 25, 27
* 24, 26
* 59, 61
* 58, 60
* 29, 31
* 28, 30
* 63, 64, 66
* 62, 65
As shown above, finstr’s xbrl_get_statements is unsuccessful on the first item (2019), which is the new “Extended XBRL Instance Document”, because of a problem with duplicate keys after the spreading the data.frame. It seems that finstr hasn’t been updated since 2017, so it likely has something to do with the problem. We asked for guidance on Stack Overflow Finstr Get XBRL Statement Error Parsing XBRL Instance Documents, but so far no luck. Also, it fails with the pre-2008 documents as expected, which were only available as html.
Catching Errors in Successfully Parsed Statements
There are many issues with XBRL including that companies classify items in differing ways, use different names for the same items and sometimes the there are errors in the calculation hierarchy. Classification differences will be challenging, but finstr has the check_statement function to show where there are calculation inconsistencies.
# Run check_statement on 2nd element of result_list (2018)
for(list in result_list[2]) {
print(lapply(list, check_statement))
}
$ConsolidatedBalanceSheets
Number of errors: 0
Number of elements in errors: 0
$ConsolidatedStatementsOfCashFlows
Number of errors: 6
Number of elements in errors: 1
$ConsolidatedStatementsOfComprehensiveIncome
Number of errors: 0
Number of elements in errors: 0
$ConsolidatedStatementsOfOperations
Number of errors: 0
Number of elements in errors: 0
Above we can see that there are six errors in the “CashAndCashEquivalentsPeriodIncreaseDecrease” element_id in the 2017 Statement of Cash Flows. We can then drill down to see what those are. In this case, it is not a mismatch between the original data and the amount calculated as a check. The check is just not there (NA). We need to do further research to fully understand this function.
check <-
check_statement(result_list[[2]]$ConsolidatedStatementsOfCashFlows, element_id = "CashAndCashEquivalentsPeriodIncreaseDecrease")
# Calculated is NA
check$calculated
[1] NA NA NA
Looking at Apple’s 2017 Balance Sheet Disclosure
Here we show the balance sheets for the first available document which is as of September 2018 (2019 does exist, but we were not able to parse that thus far). We could do the same for the Income Statement or the Statement of Cash Flows.
# Drill down to annual balance sheet from result_list
result_list[[2]]$ConsolidatedBalanceSheets
Financial statement: 2 observations from 2017-09-30 to 2018-09-29
Element 2018-09-29 2017-09-30
Assets = 365725 375319
+ AssetsCurrent = 131339 128645
+ CashAndCashEquivalentsAtCarryingValue 25913 20289
+ AvailableForSaleSecuritiesCurrent 40388 53892
+ AccountsReceivableNetCurrent 23186 17874
+ InventoryNet 3956 4855
+ NontradeReceivablesCurrent 25809 17799
+ OtherAssetsCurrent 12087 13936
+ AssetsNoncurrent = 234386 246674
+ AvailableForSaleSecuritiesNoncurrent 170799 194714
+ PropertyPlantAndEquipmentNet 41304 33783
+ OtherAssetsNoncurrent 22283 18177
LiabilitiesAndStockholdersEquity = 365725 375319
+ Liabilities = 258578 241272
+ LiabilitiesCurrent = 116866 100814
+ AccountsPayableCurrent 55888 44242
+ OtherLiabilitiesCurrent 32687 30551
+ DeferredRevenueCurrent 7543 7548
+ CommercialPaper 11964 11977
+ LongTermDebtCurrent 8784 6496
+ LiabilitiesNoncurrent = 141712 140458
+ DeferredRevenueNoncurrent 2797 2836
+ LongTermDebtNoncurrent 93735 97207
+ OtherLiabilitiesNoncurrent 45180 40415
+ CommitmentsAndContingencies 0 0
+ StockholdersEquity = 107147 134047
+ CommonStocksIncludingAdditionalPaidInCapital 40201 35867
+ RetainedEarningsAccumulatedDeficit 70400 98330
+ AccumulatedOtherComprehensiveIncomeLossNetOfTax -3454 -150
Our results_list is nested, so not so easy to work with so we use the purrr package to flatten it into 35 financial statement objects including all of the balance sheets, income statements and cash flows for the period. The same first balance sheet we have already looked at is shown, but now at the top level so that the underlying data is easier to analyze and plot.
# Flatten nested lists down to one list of all financial statements for the 10 year period
fs <- purrr::flatten(result_list)
# Now the balance sheet is at the top level of fs instead of nested
fs[[2]]
Financial statement: 2 observations from 2017-09-30 to 2018-09-29
Element 2018-09-29 2017-09-30
Assets = 365725 375319
+ AssetsCurrent = 131339 128645
+ CashAndCashEquivalentsAtCarryingValue 25913 20289
+ AvailableForSaleSecuritiesCurrent 40388 53892
+ AccountsReceivableNetCurrent 23186 17874
+ InventoryNet 3956 4855
+ NontradeReceivablesCurrent 25809 17799
+ OtherAssetsCurrent 12087 13936
+ AssetsNoncurrent = 234386 246674
+ AvailableForSaleSecuritiesNoncurrent 170799 194714
+ PropertyPlantAndEquipmentNet 41304 33783
+ OtherAssetsNoncurrent 22283 18177
LiabilitiesAndStockholdersEquity = 365725 375319
+ Liabilities = 258578 241272
+ LiabilitiesCurrent = 116866 100814
+ AccountsPayableCurrent 55888 44242
+ OtherLiabilitiesCurrent 32687 30551
+ DeferredRevenueCurrent 7543 7548
+ CommercialPaper 11964 11977
+ LongTermDebtCurrent 8784 6496
+ LiabilitiesNoncurrent = 141712 140458
+ DeferredRevenueNoncurrent 2797 2836
+ LongTermDebtNoncurrent 93735 97207
+ OtherLiabilitiesNoncurrent 45180 40415
+ CommitmentsAndContingencies 0 0
+ StockholdersEquity = 107147 134047
+ CommonStocksIncludingAdditionalPaidInCapital 40201 35867
+ RetainedEarningsAccumulatedDeficit 70400 98330
+ AccumulatedOtherComprehensiveIncomeLossNetOfTax -3454 -150
Next, we can graph the evolution of “Current” items over the 10-year period. Note that in 2017, Apple started calling them “Consolidated Balance Sheets”. Prior to that, they were named “Statements of Financial Position Classified”. Hence, we had to regex match using both “Balance” and “Position” to get the relevant documents for the full period. This clearly cumbersome and difficult to do at scale on a larger number of companies. We can see that both current assets and liabilities have tripled over the period.
# Regex match list items matching "Balance" and "Position" and rbind into bs (balance sheet)
bs <-
rbindlist(fs[str_detect(names(fs), "Balance|Position")], fill = TRUE)
# Drop any duplicates with same endDate
bs <- unique(bs, by = "endDate")
# Function to scale y axis label
scaleFUN <- function(x) paste("$",x/1000000000,"Billion")
current <- names(bs)[str_detect(names(bs), "Current")]
# Tidy and ggplot from within data.table bs object coloring variables
bs[, melt(.SD, id.vars = "endDate", measure.vars=current)][
][!is.na(value)][
][, ggplot(.SD,
aes(as.Date(endDate),
as.numeric(value),
color = variable)) +
geom_line() +
scale_y_continuous(labels = scaleFUN) +
labs(title = "Apple Current Items over Time",
caption = "Source: SEC Edgar") +
ylab("Amount") +
xlab("Year") +
theme_bw()]
We are curious about R&D spending over time, so here we took the income statements which again used changing names (Statements of Income and Statements of Operations) over time. It looks like Apple reduced its R&D spending sharply after the iPhone launch, but has ramped it up again to launch the wearables business. We will use this strategy to extract specific balance sheet and income statement items in Part 3 to mine for warning signs.
# Regex match list items matching "StatementOFIncome" and StatementsOfOperations", and rbind into is (income statement)
is <-
rbindlist(fs[str_detect(names(fs),"StatementOfIncome$|StatementsOfOperations$")], fill =
TRUE)
# Drop NA rows
is <-
is[!is.na(SalesRevenueNet)]
# Drop previous year's which are dupes
is <-
unique(is, by = "endDate")
# Mutate R&D to sales ratio variable and select columns needed
is <-
is[, rd_sales :=
ResearchAndDevelopmentExpense / SalesRevenueNet][
][, .(endDate, rd_sales)]
# Tidy and ggplot within data.table is object
is[, melt(.SD, id.vars = "endDate", measure.vars=c("rd_sales"))][
, ggplot(.SD, aes(as.Date(endDate), as.numeric(value), color = variable)) +
geom_line() +
scale_y_continuous(labels = scales::percent) +
labs(title = "Apple R&D-to-Sales Ratio has More than Doubled since 2012",
caption = "SEC Edgar") +
ylab("Amount") +
xlab("Year") +
theme_bw()]
Conclusion
After spending the few years solving many problems in R, the volume of discussion about XBRL still seems surprisingly sparse for such a big potential use case. In addition, most of development at least in R seemed to stop cold in 2017. It will be interesting to learn if this is because most the people who look at financial statements generally don’t use XBRL in analytic software yet, or because it is too inefficient to get clean data in a usable form. In the end, Aaron Mumala suggested that downloading and parsing XBRL from Edgar was probably still not ready for prime time.
Admittedly, the challenges discovered in this exercise because of changing statement names, financial statement items, data formats, we wondered if the SEC Edgar XBRL disclosures are suitable for a large scale analysis. Many providers charge for data parsing and cleaning, and it may be worth paying for, especially if real money will be involved. In Part 2 of this series, we will look at using a free outside provider of financial statement data (Financial Modeling Prep) and see what we find.