Hi, I have read many posts on this forum but am still looking for a comprehensive answer on cleaning EDGAR filings:
I have downloaded raw SEC filings from EDGAR and saved them as .txt files, for multiple companies and multiple years.
1- Different formats: from the tags in the text, it is clear that the format of older files is different from that of newer ones. Can you explain what each format is and how to clean each for textual analysis?
2- I want to extract the text of only one section (Item 1. Business Description). If I simply extract the text between "Item 1. Business Description" and "Item 2", I will have problems because these strings are repeated multiple times in the file (e.g. one instance of each is in the table of contents). So how do I write code that skips these occurrences and gives me only the actual Item 1 section?
3- Is it safe to use the CIK in Compustat Annual Fundamentals to link these filings to financial data? I read that the CIK there may reflect only a company's current status, not its historical status. In that case, how large would the matching error be? (In other words, how often does a company's CIK change over time? Is it negligible?)
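To make question 2 concrete, here is a minimal Python sketch of the naive approach I have in mind. The sample text and the function name naive_item1 are made up for illustration; real filings are much messier. It takes the first "Item 1. Business Description" heading and the next "Item 2" after it, which is exactly what goes wrong:

```python
import re

# Made-up miniature filing: each heading appears once in the table of
# contents and once in the body, as in the real files.
SAMPLE = """TABLE OF CONTENTS
Item 1. Business Description ........ 3
Item 2. Properties ........ 10

PART I
Item 1. Business Description
We make widgets and sell them worldwide.
Item 2. Properties
We lease one office.
"""


def naive_item1(text):
    """Return the text between the first 'Item 1.' heading and the next 'Item 2'."""
    start = re.search(r"Item\s+1\.\s*Business Description", text, re.IGNORECASE)
    if start is None:
        return None
    # Search for 'Item 2' only in the text that follows the matched heading.
    rest = text[start.end():]
    end = re.search(r"Item\s+2", rest, re.IGNORECASE)
    if end is None:
        return None
    return rest[:end.start()].strip()


# Prints the table-of-contents fragment, not the business description.
print(naive_item1(SAMPLE))
```

On this sample the function returns only the dotted leader and page number from the table of contents, and never reaches the paragraph about widgets.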
Note: I have used Python for downloading the files and will use Stata for the analysis.
Thank you.