Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction

This is part 2 of a 3-part series on extracting XBRL data from secdatabase.com. In Learning SQL and Exploring XBRL with secdatabase.com – Part 1, we showed how to set up the database connection from RStudio. In this post, we will discuss the basics of XBRL and secdatabase.com’s SQL representation of it. We will try to understand the dimensions of the database over its history, and how to query it from RStudio. For example, how many unique companies per year by filing type are included? How many unique labels and concepts per company each year? Have report names been standardized so that data can be retrieved and compared across sectors and over time? How many revisions and extensions were used over time? The goal is to explore how the project has progressed since launch from the perspective of usability for financial analysis in its raw form.

In the first few years, there were several academic papers probing the usefulness of XBRL for end investors, but we found fewer more recently. In this analysis, we find that some of those concerns are starting to be resolved, but many remain valid today. We would like to forewarn that we are not experts on XBRL, and are making our best effort to understand this complicated data set. XBRL is unstructured, and may or may not have consistent fields, while SQL is by definition structured, so this adds complexity. In addition, most of the resources available about XBRL are for accounting professionals, and few that we could find, went into depth about financial analysis. The purpose of our blog is to record our learning so that we can refer back, and also to get feedback from others if there are better solutions.

We would also like to warn in advance that this post will include a lot of cod on how to query secdatabase.com with SQL as well as some quite detailed discussion of the XBRL data model, though we used R for charts and tables. It is intended to be helpful for readers who really want to better understand the history and structure XBRL, and possibly to try out secdatabase.com for themselves. Now would be a good time to drop off or skim through the tables if that will be too much detail. Lastly, all the queries shown were checked by secdatabase.com, who were very welcoming of the project. There is a comment section at the bottom, and we welcome any constructive feedback.

Series Index

• Part 1 – Learning SQL and Exploring XBRL with secdatabase.com
• Part 2 – Finding the Dimensions of secdatabase.com from 2010-2018 – Part 2
• Part 3 – How to extract company or sector line items from secdatabase.com – Part 3 (In Progress)

Hierarchial Taxonomy of XBRL in secdatabase.com

From an investor’s perspective, the most challenging thing about XBRL is learning the taxonomy to be able to extract the desired financial statement elements from deeply nested structures. We are going to do our best to lay it out succinctly, but it is likely that there will be inaccuracies. We relied on The XBRL Book for much of the information about XBRL, and a lot of responses from the secdatabase.com team about their database. The XBRL Book argues that everything we find to be complicated about XBRL are misconceptions, but after our deep dive, we respectfully disagree.

The root of XBRL data elements are called facts, which are recorded as string or numeric values, represented by a series of aspects (concept, period, entity and unit) within a context. Within secdatabase.com, facts are stored in three “data_point” tables, with the most important being “data_point_snapshot” (which shows the most current value after any revisions). As we showed in the last posts, all the “entities” (identified by “cik” or “company_name” fields in secdatabase.com) and all their filings (ie: 10-K’s, etc.) are tracked in the “company_submission” table. The combination of two or more aspects, called a “collision”, can uniquely specify the coordinates of a particular value or group of values (facts) within a context to be displayed in the presentation of a particular entity. Within secdatabase.com, this always involves joining two or more tables together, usually on at least the accession_number_int field.

Aside from the entity (“company_name” or “cik” in secdatabase), probably the most important aspect is the concept (such as “Revenues”, “Assets”, etc.), but without other aspects, no value can be uniquely specified. Instead of calling these concepts, secdatabase.com uses the field “datapoint_name”, which is included in three tables having the term “data_point” in them, as well as the “report_presentation_line_item” table. We have also heard concepts called “tags” in some cases. A particular “datapoint_name” is likely to have many “datapoint_ids” attached to it. For example, a single 10-K might have three separate facts specified by a “datapoint_name” like “Revenues” (for say 2017, 2018 and 2019), but each of these would have a unique “datapoint_id”. If the entity or year was not specified, a single “datapoint_name” might bring thousands of facts for “Revenues” (ie: for all the companies in that year). We found it complicated to specify the context to get the unique value we were hoping for, and often found ourselves with duplicates for some companies and none for others. We will discuss this more in Post 3 where we try to use the data for company or sector analysis.

An XBRL Instance is just a “bag of facts”, and a single fact might be shown in more than one “presentation” (statements and disclosures). For example, Total Revenues might be displayed in the Income Statement, but also in a disclosure of segment reporting of a region or product. A separate taxonomy schema defining concepts and a linkbase to organize specified concepts into graphs to be displayed in presentations. We discovered that the names used to describe the same element varied among companies, and the same company might even change the names and/or formatting of its own financial statements over time (ie: “statement – BALANCE SHEET” or “statement – Balance Sheet”), making them different to a computer. As a result, secdatabase has manufactured a identifier in the “report_presentation_section” called the “statement_type”, which tries to collapse these many variations into the main financial statements which we all expect (ie: Balance Sheet (B), Income Statement (I), etc.). The believe their field is right most of the time, but we found even this identifier to be an incomplete solution.

When a fact has been recorded against a concept (or “datapoint_name”) as a “string_value” or “numeric_value”, XBRL also assigns “label roles”, which can be used ultimately be used to specify how it will be displayed in a presentation (it’s concept or in secdatabase.com speak “database_name” is not displayed in the presentation). Every label has a “language” and a “label role” (called “preferred_label_role” by secdatabase.com), and like concepts, labels are part of a linkbase. A single concept may be associated with more than one label. If this seems confusing, that is because it is.

Distinct Companies and Filings per Annum Have Been Declining

We wanted to understand the universe of companies which have been filing in XBRL each year, which was achieved by a simple SQL query of the single “company_submission” table. In Figure 1 below, we group by “document_type” and “document_fiscal_year_focus”and eliminate duplicate filings with “DISTINCT”, then we count by year and “document_type”. We had heard that there are ~ 8,000 companies filing in XBRL, so adding up the 10-K, 10-K/A and 20-F’s, we get close to that many in 2012, but then easing off to ~ 6,600 in recent years. We assume companies have been withdrawing listings or being acquired at a faster rate than new companies have been issuing stock. Though there was a healthy increase in foreign companies filing 20-F’s, it is surprising to see a decline in equity listings during an historic bull market.

SELECT document_fiscal_year_focus AS year
,document_type AS filing
,COUNT(DISTINCT accession_number_int) AS num_filings
,COUNT(DISTINCT cik) AS num_companies
FROM sec_financial_statements.company_submission
WHERE document_fiscal_year_focus BETWEEN 2010 AND 2020
AND document_type in ('10-K', '10-K/A', '10-Q', '10-Q/A', '20-F', '20-F/A', '40-F')
GROUP BY document_fiscal_year_focus, document_type
ORDER BY document_fiscal_year_focus, filing;

Figure 1: Number of Edgar Filings by Type and Year

History of XBRL Challenges

In the early days after launch of XBRL, there was lot of academic research on the challenges of implementing XBRL. Here is a thorough accounting of the concerns from the beginning of 2013 SEC–Mandated XBRL Data at Risk of Being Irrelevant to Investors and Analysts. The key recommendations to avoid the risk of becoming obsolete to investors were: (1) reduce the error rate and limit unnecessary extensions, (2) improve the quality of data and (3) have technologists take over from accountants. The link to CEASA’s longer paper on the subject is here. As we will show, between vendors like XBRLogic and the industry advocate, XBRL US, the US branch of the non-profit supporting the implementation of digital business reporting standards, there are signs of recent improvement on some fronts.

Labels Role Types Contribute to Errors

One of the big contributors to presentation errors comes from labels. In the next query, we show the number of label roles each year, this time from the “report_presentation_line_item” table, which connects a group of concepts to a presentation. By far the most common is the terseLabel, followed by no or a missing label, verboseLabel and label (also known as the “standard” label). We also don’t understand why there would not be a label in so many cases. As we understand it, the terseLabel and verboseLabels are shorter and longer versions of label, respectively. We are surprised that there are so many more terseLabels than labels or verboseLabels, because the XBRL Book informs us that the standard label is supposed to be the most common. Both labels and terseLabels are then aggregated within the XBRL Taxonomy (which we showed at the end of Part 1) into totalLabels, which are always sums of other labels. The periodEndLabel and periodStartLabel are used to define the period or duration of time, and are used with other to define period. In the query below, we pull out the most common “preferred_label_role” over the whole period.

SELECT preferred_label_role
,COUNT(*) AS frequency
FROM sec_financial_statements.report_presentation_line_item
GROUP BY preferred_label_role
ORDER BY COUNT(*) DESC;

Figure 2: Frequency of Usage of Label Roles – All Periods

Because facts are supposed to be positive by default, there are several “negated” label roles, which are used if there is a need to flip the sign of something for display in a presentation. For example, the main concept for net income is “NetIncomeLoss”, which should by convention always have a positive numeric value, but would be displayed with a negatedLabel when there was a loss. However, we also found cases where the string or numeric value was already negative. Our understanding is that one of the most common areas where errors occur in XBRL is with negated labels mistakenly transposing the sign of line item. With all of this complexity, it is not surprising that the wrong labels or label roles have often been used. As we will discuss, label roles seem to be one area where there confusion by those who prepare XBRL statements, seemingly switching conventions from year to year. If those who prepare financial statements have struggled with it, the challenge would be doubly difficult for investors. After we spent several hours trying to understand label roles, we are still not sure we have it down.

The Problem of Concept Standardization

The complication surrounding concepts (in our data set called “datapoint_name”), seems to be another of the main challenges which has held XBRL back from wide use by investors. Companies have the discretion to make up custom concepts, known as “extensions” in XBRL parlance. The desire to do this company-by-company makes sense, but when considered across sectors or the market as a whole, it negatively impacts comparability. All of these issues are discussed in detail in this excellent summary by Idaciti In Data We Trust. In High Quality Data We Shine.

In Figure 3, the 25 most common fields in 2019 are shown. Our understanding is that XBRL is built on the idea of matching “as reported” financial statements. Though it seems like the highest level line items like “Assets”, “Liabilities” and “Shareholders Equity” should have 100% comparability across the 5,460 companies reporting 10-K’s in 2019, these come well short of the number. The number of companies using a field also falls off sharply after the first 25, so comparability will be much worse for more revealing items, like bad debts, capitalized items and accruals.

SELECT datapoints
,COUNT(*) AS num_occur
FROM (SELECT DISTINCT dp.datapoint_name AS datapoints,
dp.accession_number_int,
document_fiscal_year_focus
FROM sec_financial_statements.data_point_snapshot dp
INNER JOIN sec_financial_statements.company_submission cp
ON dp.accession_number_int = cp.accession_number_int
WHERE document_fiscal_year_focus = 2019
AND dp.unit = 'USD'
AND cp.document_type = '10-K')
GROUP BY datapoints
ORDER BY num_occur DESC
LIMIT 25;

Figure 3: Number of Occurances of Concept during 2019

In the query below, with results shown in Figure 4, we show the number of unique concepts (datapoint_name) by “label_role” over time. The number of unique concepts rise steadily until 2013 or 2014, and then start to come down slowly. The number of labels roles which are missing (ie: no label role at all recorded) shows the most improvement. So there is some evidence that an effort is being made, but as we will show in Part 3, there is still room to improve. The only thing to say about the SQL query are to highlight the two WHERE statements (1) rpli.datapoint_id = dp.datapoint_id and (2) rps.section_sequence_id = rpli.section_sequence_id. Without these two constraints, we found we were double-counting many data points.

SELECT year
,label_role
,COUNT(*) AS unique_labels
FROM( SELECT DISTINCT rpli.datapoint_name
,rpli.preferred_label_role AS label_role
,cs.document_fiscal_year_focus AS year
FROM sec_financial_statements.company_submission cs,
sec_financial_statements.report_presentation_section rps,
sec_financial_statements.report_presentation_line_item rpli,
sec_financial_statements.data_point_snapshot dp
WHERE cs.accession_number_int = rpli.accession_number_int
AND rpli.accession_number_int = rps.accession_number_int
AND rps.accession_number_int = dp.accession_number_int
AND rps.section_sequence_id  = rpli.section_sequence_id
AND rpli.datapoint_id = dp.datapoint_id
AND cs.document_fiscal_year_focus BETWEEN 2010 AND 2020
AND cs.document_type = '10-K'
AND cs.document_fiscal_period_focus = 'FY'
GROUP BY cs.document_fiscal_year_focus, rpli.datapoint_name, rpli.preferred_label_role)
GROUP BY year, label_role
ORDER BY year;

Figure 4: Annual Number of Unique Concepts by Label

Use of Extensions Has Declined, But Still Represent About 18% of Data Points

It is easy to understand why extensions are needed, but non-standard data points are big detractors from comparability, so it is recognized for a long time that fewer would be better. It sounds like the SEC is getting more serious more recently about discouraging them, but progress has been slow, especially for smaller companies. With the help of secdatabase.com, the query below filters on the “version” from the “data_point” table being null or not null in order to distinguish non-standard from standard data points. If version (ie: “us-gaap/2019”) is null, the data point is an extension. In our query, we have limited to just 10-K’s and 10-K/A’s.

We can see that almost all companies have at least a few extensions. Our understanding is that larger companies have been much better at sticking to the standard tags, but we have not grouped by company size for this analysis. The results of the query in Figure 5 shows that extensions peaked in about 2013, and have come down by almost half. We do also not show the calculations here, but there are about 30x as many unique non-standard as unique standard data point names, but because standard data point names are used so much more frequently, the ratio of non-standard to standard data points is much lower. As we show below, we calculate that the ratio has bounced between 16-25%, and in recent years, extensions have been about 18% of all data points. This ratio squares with with the SEC GAAP Trends Analysis for 2019.

SELECT year(cs.filing_date) AS filing_year
,count(CASE WHEN version IS NOT NULL THEN datapoint_name END) standard_datapoints_count
,count(CASE WHEN version IS NULL THEN datapoint_name END) non_standard_datapoints_name_count
,count(DISTINCT CASE WHEN version IS NULL THEN cs.cik END) total_company_count_with_extension
,count(DISTINCT cs.cik) total_company_count
FROM sec_financial_statements.data_point_snapshot dp
INNER JOIN sec_financial_statements.company_submission cs
ON dp.accession_number_int = cs.accession_number_int
WHERE CS.document_type IN ('10-K', '10-K/A') AND dp.segment_hash IS NULL
GROUP BY year(cs.filing_date)
ORDER BY 1 DESC

Figure 5: Standard Data Points and Extensions Over Time

Another Way to Look at Annual Number of Unique Tags

For the query below shown in Figure 6, we sum up the annual number of times the five most common label roles (label, terseLabel, verboseLabel, totalLabel and no label) were used for a unique concept (datapoint_name). We then summed up how many concepts were used that many times in a given year with that label role. If the “Concept Frequency in Year” (x-axis) is 1, each of those label roles were only used once for that year, 2 for those used twice, and so on. Taking for example the label facet in the chart below, the number used just 1x has generally been ~4,000-10,000 per year. Unique “terseLabel’s” were used 1x ~68,000-90,000. If there are typically 6-8k companies filing, there should ideally be only several hundred concepts used that many times for genuine comparability. Moving to the right of the x-axis, there are very few labels which were used more than 500 times. Curves for later years have come down somewhat and slowly flattened compared to the early years. Positive change would see the graph moving much further downwards from the left on the tag_frequency axis and flattening out.

SELECT year
,tag_frequency
,label_role
,COUNT(*) AS num_repeats
FROM( SELECT DISTINCT rpli.datapoint_name
,rpli.preferred_label_role AS label_role
,COUNT(*) as tag_frequency
,cs.document_fiscal_year_focus AS year
FROM sec_financial_statements.company_submission cs,
sec_financial_statements.report_presentation_line_item rpli,
sec_financial_statements.report_presentation_section rps,
sec_financial_statements.data_point_snapshot dp
WHERE cs.accession_number_int = rpli.accession_number_int
AND rpli.accession_number_int = rps.accession_number_int
AND rps.accession_number_int = dp.accession_number_int
AND rps.section_sequence_id  = rpli.section_sequence_id
AND cs.document_fiscal_year_focus BETWEEN 2010 AND 2020
AND rpli.datapoint_id = dp.datapoint_id
AND cs.document_type = '10-K'
AND cs.document_fiscal_period_focus = 'FY'
GROUP BY cs.document_fiscal_year_focus, rpli.datapoint_name, rpli.preferred_label_role)
GROUP BY year, tag_frequency, label_role
ORDER BY year;