Cleaning Raw SEC Filings Downloaded from EDGAR and Linking to Compustat

sina golara

Nov 7, 2016, 5:10:34 AM

to wrdssas

Hi, I have read many posts on this forum but am still looking for a comprehensive answer on cleaning EDGAR filings:

I have downloaded raw SEC filings from EDGAR and saved them as .txt files, for multiple companies.

1- Different formats: from the tags in the text, it is clear that the format of older files is different from that of newer ones. Can you explain what each format is and how to clean each for textual analysis?

2- I want to extract text from only one section (Item 1. Business Description). If I simply extract the text between "Item 1. Business Description" and "Item 2", I will have problems because these strings are repeated multiple times in the file (e.g., one instance of each is in the table of contents). So how do I write code that skips these occurrences and gives me only that section?

3- Is it safe to use the CIK in Compustat Annual Fundamentals to link these filings to financial data? I read that CIKs may only reflect the current status, not the historical status; in that case, how big would the error in matching be? (In other words, how much does a company's CIK change over time? Is it negligible?)

Note: I have used Python for downloading the files and will use Stata for analysis.

Thank you.

joost impink

Nov 7, 2016, 8:45:13 AM

to wrdssas

Hi Sina,

1. Older filings are 'flat text', recent filings are in HTML format.

2. You need some logic to determine which match is most likely to be the right one. For example, some minimum length or some keywords (that would appear in the 'real' business description and not in a table of contents).

3. I would expect you to lose about 30% of the Compustat Funda observations when matching on CIK in Funda. (Most of the unmatched observations arise because CIK is empty in Funda.)
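To make points 1 and 2 concrete, here is a minimal sketch in Python (the language already used for the download). It is one possible heuristic, not a definitive cleaner. Assumptions: the function names `to_plain_text` and `extract_item1` are hypothetical, the section heading starts a line and reads roughly "Item 1. Business", the section ends at the next "Item 1A" or "Item 2" heading, and 2,000 characters is an arbitrary minimum length that separates the real section from a table-of-contents entry. Real filings vary a lot, so the patterns would need tuning.

```python
import html
import re

# Rough hint that a filing is HTML rather than flat text (assumed tag list).
HTML_HINT_RE = re.compile(r"<(?:html|body|div|font|p)\b", re.IGNORECASE)
# Block-level tags become line breaks so headings stay at the start of a line.
BLOCK_RE = re.compile(r"</?(?:p|div|br|tr|table|h[1-6])\b[^>]*>", re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")

# Headings are only matched at the start of a line, which skips in-text
# cross-references such as "see Item 2 below".
ITEM1_RE = re.compile(r"^[ \t]*item\s+1\s*[.:\-]?\s*business",
                      re.IGNORECASE | re.MULTILINE)
END_RE = re.compile(r"^[ \t]*item\s+(?:1a|2)\b", re.IGNORECASE | re.MULTILINE)


def to_plain_text(raw):
    """Strip markup from HTML filings; leave flat-text filings mostly as-is."""
    if HTML_HINT_RE.search(raw):
        raw = BLOCK_RE.sub("\n", raw)
        raw = TAG_RE.sub(" ", raw)
        raw = html.unescape(raw)       # e.g. &nbsp; and &amp;
    return raw.replace("\xa0", " ")    # non-breaking spaces to plain spaces


def extract_item1(text, min_chars=2000):
    """Return the longest 'Item 1. Business' candidate, or None.

    Every heading match is a candidate. Table-of-contents entries are
    short because the next heading follows almost immediately, so they
    fail the minimum-length check.
    """
    candidates = []
    for start in ITEM1_RE.finditer(text):
        end = END_RE.search(text, start.end())
        stop = end.start() if end else len(text)
        candidates.append(text[start.start():stop].strip())
    long_enough = [c for c in candidates if len(c) >= min_chars]
    return max(long_enough, key=len) if long_enough else None
```

Usage would be along the lines of `section = extract_item1(to_plain_text(open(path, errors="ignore").read()))`, where `path` is one of the saved .txt files. A result of `None` means no candidate passed the length check, and those files are worth inspecting by hand. A keyword check on the candidates could be added on top of the length rule.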

Best Regards,

Joost

Charlotte Zhang

Mar 21, 2018, 5:24:35 PM

to wrdssas

Hi Joost and Sina,

This is helpful. I have some more questions related to this:

1. Do you have a sense of when CIK formats changed? Is there any way to transform one into another, for the purpose of analysis?

2. For my research I only matched a subset of CIKs, namely those from all 10D documents, and very few of those CIKs could actually be matched to a ticker symbol (say, 1%). What's more, I checked both data sources: there are close to 700,000 unique CIKs in EDGAR filings, but only around 40,000 unique CIKs that could be matched to GVKEY and CUSIP. Why is there such a discrepancy? (Note that I'm referring to the total population in the U.S., and not using the sample program.)

3. Is there any way I can use the entity name to match to other data sources?