Parsers for SEC EDGAR data


Skye Bender-deMoll

Jun 8, 2009, 1:19:05 PM
to get-t...@googlegroups.com
Hi Folks,

This seems like the perfect time to announce this morning's release of
the CorpWatch API, developed with support from Sunlight Foundation.

http://api.corpwatch.org

The API serves data parsed from Exhibit 21 of the 10-K, the listing of
subsidiaries of each corporation. Currently, the API is serving only
2008 data (the last full year), but we've parsed 2009 and the last ten
years, which will be available soon.

Press release about the associated GUI application is here:
http://www.corpwatch.org/article.php?id=15381

get-theinfo group wrote:

> Does anybody know of a free edgar submissions file parser written in python?
>

The CorpWatch API is written in Perl and only handles 10-K Exhibit 21.
Josh at GovTrack has parsers for some of the ownership forms.

> Or an overview what information can be found in the filing.


> Most EDGAR docs (but not all) are available in a very poorly adhered
> to mark up language - no OCR required, but you do need to apply many
> heuristics to determine the intent of whoever wrote the document. For
> example, from memory, table columns are defined with a <C> tag, but
> this tag simply indicates the tab stop for the column delimits, not an
> XML style tag. So once you have found a table with columns, you need
> to do some guess work to determine what is a header, versus units,
> versus actual column content.

Yes, this is very hard. There is no standard format, and for Exhibit
21, some companies even submit their data as images! In other cases,
HTML table markup is used, but not in a very helpful way. In fact, the
markup is so bad in some cases that we wonder if it was done
intentionally to obscure data. The parser we wrote (in Perl) uses lots
of crude heuristics and crazy regular expressions to attempt to guess
at the formatting. We believe we are getting about 95% correct.
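
For anyone who wants to try this in Python, here is a rough sketch of the
column-marker idea from the quoted message: treat the position of each
<S>/<C> marker as the start of a column and slice the rows below it at
those offsets. This is not the CorpWatch parser (that one is Perl). The
function names and the sample table, including the company names, are
made up for illustration, and real filings often overflow their columns
or leave the markers out entirely.

```python
import re

# Matches the stub (<S>) and column (<C>) markers in a marker line.
MARKER = re.compile(r"<[SC]>", re.IGNORECASE)


def column_starts(marker_line):
    """Return the character offsets where <S> or <C> markers begin."""
    return [m.start() for m in MARKER.finditer(marker_line)]


def split_row(line, starts):
    """Slice one text row at the given offsets and strip the padding."""
    bounds = starts + [None]
    return [line[a:b].strip() for a, b in zip(bounds, bounds[1:])]


def parse_table(text):
    """Return the rows that follow the first marker line.

    Lines before the marker line (captions, headers, units) are skipped;
    deciding what those lines mean is the guesswork described above.
    """
    starts = None
    rows = []
    for line in text.splitlines():
        if starts is None:
            # Still looking for the line that carries the <C> markers.
            if re.search(r"<C>", line, re.IGNORECASE):
                starts = column_starts(line)
            continue
        if line.strip().upper().startswith("</TABLE>"):
            break
        if line.strip():
            rows.append(split_row(line, starts))
    return rows


# Hypothetical sample, padded so the columns line up with the markers.
SAMPLE = "\n".join([
    "<TABLE>",
    "<CAPTION>",
    "Name of Subsidiary".ljust(30) + "Jurisdiction",
    "<S>".ljust(30) + "<C>",
    "Acme Holdings Ltd.".ljust(30) + "Delaware",
    "Acme Europe GmbH".ljust(30) + "Germany",
    "</TABLE>",
])

if __name__ == "__main__":
    for row in parse_table(SAMPLE):
        print(row)
```

Slicing on fixed offsets only works when the filer actually aligned the
text with the markers, which is why a real parser ends up needing extra
heuristics on top of something like this.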


> XBRL should change some of this, but it doesn't help right now.

We investigated XBRL a bit when starting. For what we wanted to do
(serve up subsidiary relationships), there seemed to be few advantages
at this point.

best,
-skye (one of the developers on the CorpWatch API)

Greg Elin

Jun 11, 2009, 7:00:03 AM
to get-t...@googlegroups.com
Skye,

Are you all going to connect this information to the corporate information on Freebase?
Have you listed the API with programmableweb?

very cool...btw

Greg Elin
gr...@fotonotes.net
http://twitter.com/gregelin
skype: fotonotes
aim: wiredbike
cell: 917-304-3488

Jonathan Gray

Jun 16, 2009, 9:16:17 AM
to get-t...@googlegroups.com, Skye Bender-deMoll, sup...@api.corpwatch.org, okfn-discuss
On Mon, Jun 8, 2009 at 7:19 PM, Skye Bender-deMoll<skye...@skyeome.net> wrote:
> This seems like the perfect time to announce this morning's release of
> the CorpWatch API, developed with support from Sunlight Foundation.

This is great news!

I've registered basic details on CKAN - please feel free to add
anything I may have missed!

http://ckan.net/package/read/corpwatch

Also, I'm not clear about how the data can be re-used. Is it open as in
opendefinition.org? If so, would you consider using a legal tool like
CC0 or the PDDL to make this explicit?

--
Jonathan Gray

Community Coordinator
The Open Knowledge Foundation
http://www.okfn.org
