-mike.
----------------------------------------------------------------
michal migurski- mi...@stamen.com
415.558.1610
Just out of curiosity.
What would these 4 major groups be? txt, pdf, ??
> and a dozen or so minor groups requiring small tweaks to the code.
What type of tweaks are we talking about?
>
> All the files are cross-dependent. They're court cases, and I want to
> graph the links between them by parsing citations (which follow a
> somewhat consistent syntax).
Could you post a sample citation? Is it somewhat similar to book citations?
> In addition, the citations tell me
> things about the cases they cite, like the preferred form of the
> title.
>
> Also, the files contain errors that I need to handle gracefully. Some
> data need to be hand-corrected.
Would these hand-corrections be for spelling issues, or for something else?
Lucas
You might look at Cascading (www.cascading.org). It should help you
with creating complex processes for dealing with your data on Hadoop.
The pattern you described below is working well for a project managing
'financial' datasets.
We are also working on a Groovy 'builder' to provide even more
flexibility.
ckw
On Apr 12, 2008, at 9:01 PM, Stuart Sierra wrote:
>
Chris K Wensel
ch...@wensel.net
http://chris.wensel.net/
> Hi Mike, thanks for responding. The short answer to your questions
> is: the worst possibilities.
>
> I have 50-60 new files coming in daily, and I need to re-process the
> entire corpus whenever I come up with new types of data to extract.
Eek.
> All the files are cross-dependent. They're court cases, and I want to
> graph the links between them by parsing citations (which follow a
> somewhat consistent syntax). In addition, the citations tell me
> things about the cases they cite, like the preferred form of the
> title.
This doesn't necessarily mean you need to have both documents handy
when parsing each, though, right? I guess when I say cross dependent,
I'm trying to figure out whether you can process document A separately
from document B, even if one references the other. Can the citations
be normalized independently for later comparison? I know that legal
syntax can be kind of a disaster.
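To make the question concrete, here is a rough Python sketch of what "normalize independently, compare later" could look like. The citation shape ("<volume> <reporter> <page>", e.g. "410 U.S. 113"), the reporter list, and the function names are all my assumptions for illustration, not anything taken from the actual corpus.

```python
import re
from collections import defaultdict

# Assumed citation shape: "<volume> <reporter> <page>", e.g. "410 U.S. 113".
# The reporter alternatives here are illustrative, not a complete list.
CITATION_RE = re.compile(r"(\d+)\s+(U\.\s?S\.|F\.\s?[23]d)\s+(\d+)")

def normalize(volume, reporter, page):
    """Collapse spacing and case variants into one comparable key."""
    reporter = re.sub(r"\s+", "", reporter).upper()
    return f"{int(volume)} {reporter} {int(page)}"

def extract_citations(doc_id, text):
    """Per-document step: needs only this document's text, no other files."""
    return [(doc_id, normalize(*m.groups())) for m in CITATION_RE.finditer(text)]

def build_graph(pairs):
    """Later step: group (citing, cited) pairs into adjacency sets."""
    graph = defaultdict(set)
    for citing, cited in pairs:
        graph[citing].add(cited)
    return graph
```

If something like this holds for your data, the extraction step runs once per file (so 50-60 new files a day means 50-60 small jobs), and only the cheap grouping step has to see everything.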
Your problem seems super-hairy; maybe more detail can make it less so.
It sounds like what you're doing is well-covered by existing search-
related literature, and you know that Yahoo and Google don't need to
re-process the whole lot every time they encounter a new web page.
More details! =)
-mike.