Most EDGAR docs (but not all) are available in a markup language that
is very poorly adhered to - no OCR required, but you do need to apply
many heuristics to determine the intent of whoever wrote the document.
For example, from memory, table columns are defined with a <C> tag, but
this tag simply marks the tab stop where a column begins; it is not an
XML-style tag that wraps content. So once you have found a table with
columns, you need to do some guesswork to determine what is a header,
versus units, versus actual column content.
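To make the <C> point concrete, here is a rough Python sketch that treats the marker line as a ruler of column offsets and slices each data row at those offsets. The function names (column_starts, split_row) and the sample rows are made up for illustration. The <S> marker for the row-label column is from memory, just like <C>. Real filings will often break the assumption that every cell starts exactly at its tab stop.

```python
import re


def column_starts(marker_line):
    """Return the character offsets where columns begin.

    Offset 0 is assumed to be the row-label column; every <C> found
    later on the line adds one more column start.
    """
    starts = [m.start() for m in re.finditer(r"<C>", marker_line)]
    return [0] + [s for s in starts if s > 0]


def split_row(line, starts):
    """Slice one text row at the given offsets and trim each cell."""
    bounds = starts + [None]  # None means "to the end of the line"
    return [line[a:b].strip() for a, b in zip(bounds, bounds[1:])]


# Hypothetical sample, built with ljust so the offsets are explicit.
marker = "<S>".ljust(20) + "<C>".ljust(10) + "<C>"
row = "Revenue".ljust(20) + "1,234".ljust(10) + "1,100"

starts = column_starts(marker)
print(starts)                  # [0, 20, 30]
print(split_row(row, starts))  # ['Revenue', '1,234', '1,100']
```

This only gets you raw cells. Deciding which rows are headers, which are units and which are data is still guesswork layered on top.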
And then within the columns, footnote references or other notes are
often included, and these are difficult to parse out. Some columns have
content that spans multiple rows, whereas other columns (or rows)
within the same table don't. And then if there is a page break inside
a table, best of luck trying to work out what happens on the next page,
as there is no standard to define this behavior - but it occurs
frequently.
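As an illustration of why the footnote references are awkward, here is a small Python sketch that tries to split a cell into a number and a trailing footnote marker. The marker styles it recognizes ("(1)", "(a)", "*") and the name parse_cell are assumptions for the example, not a description of any standard. Note the built-in ambiguity: accountants also write negatives in parentheses, so a bare "(12)" could be either a footnote or minus twelve. This sketch arbitrarily picks the number.

```python
import re

# Trailing footnote marker: "(1)", "(12)", "(a)", or one or more asterisks.
FOOTNOTE = re.compile(r"\s*(\((?:\d{1,2}|[a-z])\)|\*+)\s*$")


def parse_cell(text):
    """Return (value, footnote) where value is a float or None."""
    text = text.strip()
    note = None
    m = FOOTNOTE.search(text)
    # Only treat it as a footnote if something precedes the marker;
    # a cell that is nothing but "(12)" is read as a negative number.
    if m and m.start() > 0:
        note = m.group(1)
        text = text[:m.start()]
    text = text.replace("$", "").strip()
    negative = text.startswith("(") and text.endswith(")")
    digits = text.strip("() ").replace(",", "")
    try:
        value = float(digits)
    except ValueError:
        return None, note  # not a number: header text, "N/A", a dash, ...
    return (-value if negative else value), note


print(parse_cell("1,234(1)"))  # (1234.0, '(1)')
print(parse_cell("(1,234)"))   # (-1234.0, None)
print(parse_cell("$ 5.5*"))    # (5.5, '*')
```

Multi-row cells and tables that continue across a page break are not handled here at all.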
Every financial firm that I know of just pays boatloads of cash to one
of Reuters, Bloomberg, etc., who have armies of 'encoders' who
manually enter the data into a normalized format. And even then, if
you pony up the minimum of ~$10k/month for these feeds, you get to
deal with the joy of an entirely different set of problems that you
don't ever get to scratch the surface of when you write your own
parser (because you never get far enough along with solving _that_
problem). The second-order problems come from company restatements,
changing accounting standards, changing reporting periods, etc. Not to
mention the lack of a unified schema for company identification. CRSP
uses one standard, Bloomberg another, Reuters another (and it's
completely incompatible with the different standards used by different
companies that are part of the Thomson group). In the end you spend
time doing Levenshtein distance matching on names, and dealing with
the inevitable typos that appear in ISINs, CUSIPs and SEDOLs from your
data provider...
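For what it's worth, the matching work looks roughly like the Python sketch below: normalize the names, take the candidate with the smallest Levenshtein distance, and use the CUSIP check digit to catch at least some identifier typos. The helper names (normalize, best_match, cusip_is_valid) are made up for this example. The check-digit routine follows the standard CUSIP scheme as I understand it (values doubled in even positions, digits of each value summed). ISINs and SEDOLs have their own check schemes that are not shown.

```python
import re


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def normalize(name):
    """Upper-case, drop punctuation, collapse whitespace."""
    return " ".join(re.sub(r"[^A-Z0-9 ]", " ", name.upper()).split())


def best_match(name, candidates):
    """Return (normalized candidate, distance) for the closest candidate."""
    target = normalize(name)
    scored = [(levenshtein(target, normalize(c)), normalize(c))
              for c in candidates]
    dist, cand = min(scored)
    return cand, dist


def cusip_is_valid(cusip):
    """Check the ninth character of a CUSIP against the first eight."""
    cusip = cusip.upper()
    if len(cusip) != 9 or not cusip[-1].isdigit():
        return False
    total = 0
    for pos, ch in enumerate(cusip[:8], 1):
        if ch.isdigit():
            v = int(ch)
        elif "A" <= ch <= "Z":
            v = ord(ch) - ord("A") + 10
        elif ch in "*@#":
            v = {"*": 36, "@": 37, "#": 38}[ch]
        else:
            return False
        if pos % 2 == 0:
            v *= 2
        total += v // 10 + v % 10
    return (10 - total % 10) % 10 == int(cusip[-1])


print(best_match("Microsft Corp.", ["MICROSOFT CORP", "MICRON TECHNOLOGY INC"]))
# ('MICROSOFT CORP', 1)
print(cusip_is_valid("037833100"))  # True
print(cusip_is_valid("037833101"))  # False
```

A check digit only tells you that an identifier is internally consistent, not that it points at the right company, so the name matching is still needed.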
XBRL should change some of this, but it doesn't help for right now.
It's a pain.
On Jun 6, 1:41 pm, Bo Cowgill <
bo.cowg...@gmail.com> wrote:
> Will this application sometimes need to OCR, or are these generally text
> PDFs?
>