python edgar parser?


Lukasz Szybalski

Jun 5, 2009, 6:36:10 PM
to get-t...@googlegroups.com
Hello,
Does anybody know of a free EDGAR submission file parser written in Python?

Or of an overview of what information can be found in a filing?


Thanks,
Lucas

Josh Tauberer

Jun 6, 2009, 12:38:33 PM
to get-t...@googlegroups.com
Lukasz Szybalski wrote:
> Does anybody know of a free edgar submissions file parser written in python?
>
> Or an overview what information can be found in the filing.

There are different types of forms in EDGAR. My C parser for the
corporate ownership forms (iirc forms 3, 4, 5) is here:
http://razor.occams.info/code/repo/?/govtrack/sec

Josh

Bo Cowgill

Jun 6, 2009, 1:41:42 PM
to get-t...@googlegroups.com
Will this application sometimes need to OCR, or are these generally text PDFs?

josh reich

Jun 6, 2009, 1:57:41 PM
to get.theinfo
Most EDGAR docs (but not all) are available in a markup language that
is very poorly adhered to. No OCR is required, but you do need to apply
many heuristics to determine the intent of whoever wrote the document.
For example, from memory, table columns are defined with a <C> tag, but
this tag simply marks the tab stop that delimits the column; it is not
an XML-style tag. So once you have found a table with columns, you need
to do some guesswork to determine what is a header, versus units,
versus actual column content.
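
To make the tab-stop idea concrete, here is a minimal Python sketch. It assumes the marker line uses <S> for the leading label column as well as <C> for the data columns (the <S> marker is my assumption, not something stated above), and the sample table is invented for illustration. It only slices rows at the marker offsets; it does not attempt the header/units/content guesswork.

```python
import re

# Hypothetical column widths, used only to build an unambiguous sample table.
WIDTHS = (20, 11, 11)


def make_line(cells):
    """Render cells as one fixed-width text line (sample data only)."""
    return "".join(cell.ljust(w) for cell, w in zip(cells, WIDTHS)).rstrip()


def column_starts(marker_line):
    """Return the character offset of every <S> or <C> marker on the line."""
    return [m.start() for m in re.finditer(r"<[SC]>", marker_line, flags=re.IGNORECASE)]


def split_row(line, starts):
    """Slice a fixed-width row at the marker offsets and strip each cell."""
    bounds = starts + [None]  # None means "to the end of the line"
    return [line[a:b].strip() for a, b in zip(bounds, bounds[1:])]


# Invented sample: one marker line followed by two data rows.
sample_lines = [
    make_line(("<S>", "<C>", "<C>")),
    make_line(("Revenue", "1,000", "900")),
    make_line(("Net income", "120", "95")),
]

starts = column_starts(sample_lines[0])
rows = [split_row(line, starts) for line in sample_lines[1:]]
print(rows)
```

Real filings will break this quickly: right-aligned numbers can begin before the marker offset, and cells can wrap onto the next line, which is where the heuristics come in.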

And then within the columns, footnote references or other notes are
often included, and they are difficult to parse out. Some columns have
content that spans multiple rows, whereas other columns (or rows)
within the same table don't. And then if there is a page break inside
a table, best of luck trying to work out what happens on the next page,
as there is no standard to define this behavior, yet it occurs
frequently.

Every financial firm that I know of just pays boatloads of cash to one
of Reuters, Bloomberg, etc., who have armies of 'encoders' who
manually enter the data into a normalized format. And even then, if
you pony up the minimum of ~$10k/month for these feeds, you get to
deal with the joy of an entirely different set of problems that you
don't ever get to scratch the surface of when you write your own
parser (because you never get far enough along with solving _that_
problem). The second-order problems come from company restatements,
changing accounting standards, changing reporting periods, etc. Not to
mention the lack of a unified schema for company identification. CRSP
uses one standard, Bloomberg another, Reuters another (and it's
completely incompatible with the different standards from different
companies that are part of the Thomson group). In the end you spend
time doing Levenshtein distance matching on names, and dealing with
the inevitable typos that appear in ISINs, CUSIPs and SEDOLs from your
data provider...
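
For anyone who has not done the name matching before, a rough Python sketch follows. It is a plain dynamic-programming Levenshtein distance plus a hypothetical helper, closest_name, that picks the nearest candidate after lowercasing and collapsing whitespace. The company names are made up for the example, and real matching would need more normalization (suffixes like "Inc", punctuation, and so on).

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute each cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def normalize(name):
    """Lowercase and collapse runs of whitespace (a deliberately crude normalizer)."""
    return " ".join(name.lower().split())


def closest_name(query, candidates):
    """Return the candidate whose normalized form is nearest to the query."""
    return min(candidates, key=lambda c: levenshtein(normalize(query), normalize(c)))


# Invented example names.
candidates = ["Thomson Reuters Corp", "Bloomberg LP"]
print(closest_name("Thompson  Reuters Corp", candidates))
```

In practice you would also want a distance threshold, so that a name with no reasonable match is flagged for review rather than silently paired with the least-bad candidate.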

XBRL should change some of this, but it doesn't help for right now.

It's a pain.

On Jun 6, 1:41 pm, Bo Cowgill <bo.cowg...@gmail.com> wrote:
> Will this application sometimes need to OCR, or are these generally text
> PDFs?
>

Bo Cowgill

Jun 6, 2009, 2:01:26 PM
to get-t...@googlegroups.com
I'm interested in doing a similar project in which I extract lots of data from the bankruptcy filings in PACER (if I ever somehow get free/cheaper access to PACER) -- and effectively turn each bankruptcy filing into a row in several large CSV files. I've gotten started on the parser, but I suppose this is a good forum to see if anybody has done the same. These docs don't have a markup language, they're just PDFs.

Josh Tauberer

Jun 6, 2009, 2:05:57 PM
to get-t...@googlegroups.com
On 06/06/2009 01:41 PM, Bo Cowgill wrote:
> Will this application sometimes need to OCR, or are these generally text
> PDFs?

The forms I dealt with were (fortunately) XML within SGML. The only
challenge was that it is a very large amount of data.
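
As a rough illustration of the "XML within SGML" layout, here is a Python sketch (not the parser linked earlier in the thread). It assumes each embedded XML document sits between <XML> and </XML> lines inside the SGML wrapper; the sample submission and its element names are invented for the example and should be checked against real filings.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: embedded XML is delimited by <XML> ... </XML> in the SGML wrapper.
XML_BLOCK = re.compile(r"<XML>(.*?)</XML>", re.DOTALL | re.IGNORECASE)


def embedded_xml_documents(submission_text):
    """Parse every <XML>...</XML> block in the text and return the root elements."""
    return [
        ET.fromstring(match.group(1).strip())
        for match in XML_BLOCK.finditer(submission_text)
    ]


# Invented sample; element names are illustrative only.
SAMPLE = """<DOCUMENT>
<TYPE>4
<TEXT>
<XML>
<ownershipDocument>
  <issuer>
    <issuerName>Example Corp</issuerName>
  </issuer>
</ownershipDocument>
</XML>
</TEXT>
</DOCUMENT>
"""

docs = embedded_xml_documents(SAMPLE)
issuer_names = [doc.findtext("issuer/issuerName") for doc in docs]
print(issuer_names)
```

Given the volume of data, a real run would read one file at a time rather than holding many submissions in memory.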

- Josh Tauberer
- GovTrack.us

http://razor.occams.info

"Yields falsehood when preceded by its quotation! Yields
falsehood when preceded by its quotation!" Achilles to
Tortoise (in "Godel, Escher, Bach" by Douglas Hofstadter)

Lukasz Szybalski

Jun 16, 2009, 11:42:54 AM
to get-t...@googlegroups.com
On Sat, Jun 6, 2009 at 1:01 PM, Bo Cowgill<bo.co...@gmail.com> wrote:
> I'm interested in doing a similar project in which I extract lots of data
> from the bankruptcy filings in PACER (if I ever somehow get free/cheaper
> access to PACER) -- and effectively turn each bankruptcy filing in to at a
> row in several large CSV files. I've gotten started on the parser, but I
> suppose this is a good forum to see if anybody has done the same. These docs
> don't have a markup language, they're just PDFs.

As far as PACER goes, there is http://pacer.resource.org/recycling.html

Lucas
--
How to create python package?
http://lucasmanual.com/mywiki/PythonPaste
DataHub - create a package that gets, parses, loads, visualizes data
http://lucasmanual.com/mywiki/DataHub