Data extraction tool chain

Stuart Sierra

Apr 13, 2008, 12:01:12 AM
to process.theinfo
Hello, all,

I'm looking for a set of tools for dealing with large collections of
files (a few million). Ideally, I'd like to be able to write some
data extraction or conversion code, then "submit" it to a system that
will run that code on a collection of files and report the results.
Is there anything out there in the open-source world that approaches
this?

I've looked at Hadoop, but it's just the foundation -- the data
processing model. I'm looking for something with a little more
structure, to help me keep track of all my files while I run them
through multiple stages of processing. Right now I'm keeping it
mostly in my head, and my head doesn't scale well.
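
[A minimal sketch, in Python, of the kind of "write some extraction code, submit it, run it over the collection, report results" driver being asked for here. Nothing below is from AltLaw; the extract() step and the paths are placeholders.]

import json
from pathlib import Path
from concurrent.futures import ProcessPoolExecutor, as_completed

def extract(path):
    # Placeholder per-file extraction step: replace with real parsing code.
    text = path.read_text(errors="replace")
    return {"file": str(path), "chars": len(text)}

def run(corpus_dir, report_path, workers=8):
    files = [f for f in Path(corpus_dir).rglob("*") if f.is_file()]
    report = {"ok": [], "failed": []}
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(extract, f): f for f in files}
        for fut in as_completed(futures):
            path = futures[fut]
            try:
                report["ok"].append(fut.result())
            except Exception as exc:
                # Keep going on bad files; record the failure in the report.
                report["failed"].append({"file": str(path), "error": str(exc)})
    Path(report_path).write_text(json.dumps(report, indent=2))

if __name__ == "__main__":
    run("corpus", "report.json")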

Thanks,
Stuart Sierra, altlaw.org

Michal Migurski

Apr 13, 2008, 1:33:16 AM
to process...@googlegroups.com
Important questions:
Is this a one-time affair, or something you'll need to repeat on a
regular basis? Do all the files pretty much need to have the same
process applied to them? Are there any cross-dependencies, or could
you just divvy up the files into groups and churn through each one?

-mike.

----------------------------------------------------------------
michal migurski- mi...@stamen.com
415.558.1610

Stuart Sierra

Apr 14, 2008, 9:33:52 AM
to process.theinfo
Hi Mike, thanks for responding. The short answer to your questions
is: the worst possibilities.

I have 50-60 new files coming in daily, and I need to re-process the
entire corpus whenever I come up with new types of data to extract.

There are 3 or 4 major groups of files with completely different
processing needs, and a dozen or so minor groups requiring small
tweaks to the code.

All the files are cross-dependent. They're court cases, and I want to
graph the links between them by parsing citations (which follow a
somewhat consistent syntax). In addition, the citations tell me
things about the cases they cite, like the preferred form of the
title.

Also, the files contain errors that I need to handle gracefully. Some
data need to be hand-corrected.

I've been doing all of this with scripts, keeping the dependencies in
my head, and it's driving me nuts.

-Stuart

dwighthines

Apr 14, 2008, 10:19:19 AM
to process.theinfo
April 14, 2008

Dear Stuart,

The citations below are all from today's mailing from www.llrx.com.
They point you to good people who know about information problems and
solutions from different perspectives, mostly law. Contact the
founder and still editor and publisher, Sabrina I. Pacifici, who is
quite helpful: spacific at earthlink dot net.

Second, Investigative Reporters and Editors (www.ire.org) has a
subgroup, Computer-Assisted Reporters, and they have librarians who
are interested in what you are doing. One called me, but I don't
remember her name. Call the phone number on the IRE web page for
research support.


**The Tao of Law Librarianship: If the Books Go, Will They Still Want
Us?
http://www.llrx.com/columns/tao13.htm
Connie Crosby's column returns with an insightful clarion call about
the work in which we must engage now, collectively, to clarify, market
and invigorate our profession.

**E-Discovery Update: Minimizing E-Mail Archive Data Conversion Issues
http://www.llrx.com/columns/emaildataconversion.htm
According to Conrad J. Jacoby, e-mail conversion is done without a
second thought in many e-discovery projects, and the results are often
satisfactory to both producing and requesting parties. However, each
major e-mail archive architecture uses a fundamentally different
method for storing information about e-mail messages, and sometimes
some collateral damage will occur.

**Reference from Coast to Coast: Making A Federal Case Out of It
http://www.llrx.com/columns/reference57.htm
This month Jan Bissett and Margi Heinen review the expanding world of
federal case law sources available free on the web. They also
highlight the new feature of searching slip opinions that is now
available on a number of sites.

**The Government Domain: Rich Resources from the Librarians of the Fed
http://www.llrx.com/columns/govdomain34.htm
Peggy Garvin explores the extensive web resources produced by the
librarians of the Federal Reserve Bank of St. Louis, and relevant to
economists, researchers, and others all over the world.

**LLRX Court Rules, Forms and Dockets, continually updated by law
librarian Margaret Berkland
http://www.llrx.com/courtrules.

I thought that what you are trying to do is what WEX is doing.
Isn't it?

There are some other sources, but the most helpful will be the
administrative offices of the federal courts and of the state
courts. Start with www.fjc.org (the Federal Judicial Center).

Please keep us posted on your progress in getting the system out of
your head and into the computers.

Dwight Hines
150 Nesmith Ave.
St. Augustine, Florida 32084
904-829-1507

Lukasz Szybalski

Apr 14, 2008, 11:19:27 AM
to process...@googlegroups.com
On Mon, Apr 14, 2008 at 8:33 AM, Stuart Sierra
<the.stua...@gmail.com> wrote:
>
> Hi Mike, thanks for responding. The short answer to your questions
> is: the worst possibilities.
>
> I have 50-60 new files coming in daily, and I need to re-process the
> entire corpus whenever I come up with new types of data to extract.
>
> There are 3 or 4 major groups of files with completely different
> processing needs,

Just out of curiosity, what would these 4 major groups be? txt, pdf, ...?


> and a dozen or so minor groups requiring small tweaks to the code.

What type of tweaks are we talking about?

>
> All the files are cross-dependent. They're court cases, and I want to
> graph the links between them by parsing citations (which follow a
> somewhat consistent syntax).

Could you post a sample citation? Is it somewhat similar to book citations?


> In addition, the citations tell me
> things about the cases they cite, like the preferred form of the
> title.
>
> Also, the files contain errors that I need to handle gracefully. Some
> data need to be hand-corrected.

Would these hand-corrections be because of spelling issues or what exactly?

Lucas

Chris K Wensel

Apr 14, 2008, 11:56:41 AM
to process...@googlegroups.com
Hey Stuart

You might look at Cascading (www.cascading.org). It should help you
with creating complex processes for dealing with your data on Hadoop.

The pattern you described below is working well for a project managing
'financial' datasets.

We are also working on a Groovy 'builder' to provide even more
flexibility.

ckw

Chris K Wensel
ch...@wensel.net
http://chris.wensel.net/


Michal Migurski

Apr 14, 2008, 12:30:04 PM
to process...@googlegroups.com
On Apr 14, 2008, at 6:33 AM, Stuart Sierra wrote:

> Hi Mike, thanks for responding. The short answer to your questions
> is: the worst possibilities.
>
> I have 50-60 new files coming in daily, and I need to re-process the
> entire corpus whenever I come up with new types of data to extract.

Eek.


> All the files are cross-dependent. They're court cases, and I want to
> graph the links between them by parsing citations (which follow a
> somewhat consistent syntax). In addition, the citations tell me
> things about the cases they cite, like the preferred form of the
> title.

This doesn't necessarily mean you need to have both documents handy
when parsing each, though, right? I guess when I say cross-dependent,
I'm trying to figure out whether you can process document A separately
from document B, even if one references the other. Can the citations
be normalized independently for later comparison? I know that legal
syntax can be kind of a disaster.
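
[A sketch of the two-pass idea being suggested here, not from the thread: pass one touches each document in isolation, pass two joins across the corpus. The reporter pattern, normalize(), and the shape of the docs mapping are simplified assumptions.]

import re
from collections import defaultdict

CITATION = re.compile(r"\b(\d+)\s+(U\.S\.|F\.2d|F\.3d|F\. Supp\.)\s+(\d+)\b")

def normalize(vol, reporter, page):
    # Canonical string form, so pass two can match on plain equality.
    return "%s %s %s" % (vol, reporter, page)

def pass_one(text):
    # Per-document step: extract and normalize citations; needs no other documents.
    return [normalize(*m) for m in CITATION.findall(text)]

def pass_two(docs):
    # Corpus-wide step. docs maps doc_id -> (own_citation, cited_citations),
    # where own_citation comes from the document's metadata.
    # Returns a graph of doc_id -> set of doc_ids it cites.
    by_citation = {own: doc_id for doc_id, (own, _) in docs.items()}
    graph = defaultdict(set)
    for doc_id, (_, cited) in docs.items():
        for c in cited:
            if c in by_citation:
                graph[doc_id].add(by_citation[c])
    return dict(graph)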

Your problem seems super-hairy; maybe more detail can make it less so.
It sounds like what you're doing is well-covered by existing search-
related literature, and you know that Yahoo and Google don't need to
re-process the whole lot every time they encounter a new web page.

More details! =)

-mike.

Stuart Sierra

Apr 14, 2008, 5:22:40 PM
to process.theinfo
Hello, all! Thanks for the responses. Looks like this is interesting
to some people, so here's some more detail. I'm the lead (i.e. only)
full-time developer on AltLaw.org, a free web database / search engine
for U.S. court cases (the LLRX.com article mentions us!). It's all
open-sourced at lawcommons.org, although that site is a few months out
of date.

We get cases from 2 main sources: (1) public.resource.org, which
bought and published about a million pages of case law late last year;
and (2) a dozen federal court web sites, for the most recent cases.
My daily updates come from (2), but every few months (1) puts out some
more. There's much more data out there -- statutes, district courts,
state courts -- that I haven't had time to work with yet.

The public.resource.org corpus is nice XHTML with microformat-style
tagging, but they're generating it from plain text, so it's not
perfect.
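
[For illustration only: metadata in microformat-style XHTML can be pulled out by walking the tree and checking class attributes. The class names below are invented, not the actual public.resource.org markup, and the sketch assumes the files parse as well-formed XML.]

import xml.etree.ElementTree as ET

def extract_metadata(path):
    tree = ET.parse(path)
    meta = {}
    for el in tree.iter():
        classes = el.get("class", "").split()
        if "case-name" in classes:
            meta["title"] = (el.text or "").strip()
        elif "decision-date" in classes:
            meta["date"] = (el.text or "").strip()
    return meta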

The court web sites are a mess -- each court has its own web site with
its own circa-1995 interface. A volunteer wrote some scrapers to
download files and metadata from each site. Most of the files are PDF
or really crappy HTML generated by ... wait for it ... WordPerfect.
Another volunteer managed to hack up some code that extracts something
approximating XHTML from the PDFs.
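
[For context, one common route from PDF to something processable is shelling out to pdftotext from the Xpdf/Poppler tools. This is a sketch, not the volunteer's actual conversion code; the directory layout is made up.]

import subprocess
from pathlib import Path

def pdf_to_text(pdf_path, out_dir="text"):
    out = Path(out_dir) / (Path(pdf_path).stem + ".txt")
    out.parent.mkdir(parents=True, exist_ok=True)
    # -layout asks pdftotext to preserve the physical layout of the page.
    subprocess.run(["pdftotext", "-layout", str(pdf_path), str(out)], check=True)
    return out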

Legal citations follow a (mostly) regular format; see
http://www.law.cornell.edu/citation/ if you're interested. I have
Perl scripts (written by a volunteer, of course) to extract all the
citations in a document and save them in Plain-Old-XML.
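
[Not Stuart's actual Perl, but a rough sketch of that citations-to-Plain-Old-XML step. The reporter pattern covers only a few common reporters and the XML shape is a simplified assumption.]

import re
import xml.etree.ElementTree as ET

CITATION = re.compile(
    r"\b(\d+)\s+(U\.S\.|S\. ?Ct\.|F\.[23]d|F\. ?Supp\.(?: 2d)?)\s+(\d+)\b")

def citations_to_pox(text):
    root = ET.Element("citations")
    for vol, reporter, page in CITATION.findall(text):
        cite = ET.SubElement(root, "citation")
        cite.set("volume", vol)
        cite.set("reporter", reporter)
        cite.set("page", page)
    return ET.tostring(root, encoding="unicode")

# Example:
# citations_to_pox("See Brown v. Board of Education, 347 U.S. 483 (1954).")
# -> '<citations><citation volume="347" reporter="U.S." page="483" /></citations>'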

My current process looks something like this:
1. Get tarballs from public.resource.org and the court scrapers.
2. Convert PDF files into XHTML.
3. Convert messy HTML files into clean XHTML.
4. Extract just the "body text" from each XHTML file (for web pages).
5. Convert all the HTML into plain text (for indexing).
6. Extract metadata from XHTML and write it to RDF-Turtle.
7. Extract metadata from court scrapers and write it to RDF-Turtle.
8. Load the RDF data into a Sesame 2.0 RDF database.
9. Run citation-extractor on each text file.
10. Cross-reference citations with RDF database, generate more RDF.
11. Cross-reference everything to generate one XML document per case,
containing all metadata, plain text, HTML, and citation links.
12. Load XML documents into Solr, a server wrapper for Lucene and the
back-end for the altlaw.org web site.
13. Cry into my beer.
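
[A make-style sketch of the stage bookkeeping described in the numbered list above: a file is reprocessed only when its input is newer than its output, so re-runs touch only what changed. The stage names loosely follow steps 3 and 5; the directory layout and the two converter stubs are hypothetical, and a real build tool or the Cascading flows Chris mentions would do the same bookkeeping with less code.]

import re
from pathlib import Path

def clean_html(markup):
    # Placeholder for step 3 (messy HTML -> clean XHTML).
    return markup

def to_plain_text(markup):
    # Placeholder for step 5 (XHTML -> plain text for indexing).
    return re.sub(r"<[^>]+>", " ", markup)

def needs_rebuild(src, dst):
    return (not dst.exists()) or src.stat().st_mtime > dst.stat().st_mtime

def run_stage(name, in_dir, out_dir, in_ext, out_ext, convert):
    out_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(in_dir.glob("*" + in_ext)):
        dst = out_dir / (src.stem + out_ext)
        if needs_rebuild(src, dst):
            print("[%s] %s -> %s" % (name, src.name, dst.name))
            dst.write_text(convert(src.read_text(errors="replace")))

def pipeline(root):
    root = Path(root)
    run_stage("clean", root / "raw_html", root / "xhtml", ".html", ".xhtml", clean_html)
    run_stage("text", root / "xhtml", root / "text", ".xhtml", ".txt", to_plain_text)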

Thanks, Chris, for the link to Cascading -- I hadn't found that before
and it looks like it could be a big help. Any other suggestions
welcome!

-Stuart