On Tue, 30 Oct 2012 11:49:29 +1100, michael douma @idea_org <michael.d...@gmail.com> wrote:
> So the question is really (2)+(3). What is a good approach for processing
> millions of multi-word expressions on 75 GB.
It's clearly a highly parallelizable problem: 3 million MWEs will not take
a huge amount of memory per-process when loaded into Aho-Corasick, so as
long as you have the computing resources, it should be straightforward
enough to solve using something like Hadoop (or even something less
heavy-duty as you don't have a fancy reduce operation).
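As a rough sketch of the "less heavy-duty" route, the input can be cut into byte ranges that end on line boundaries, with each range handed to a worker process. Everything named here (chunk_ranges, process_chunk, run_parallel, the PHRASES list) is hypothetical, and the naive bytes.replace loop only stands in for a real Aho-Corasick matcher:

```python
import os
from multiprocessing import Pool

# Hypothetical stand-in for the real list of multi-word expressions.
PHRASES = [b'New York', b'hot dog']


def chunk_ranges(path, n_chunks):
    """Return contiguous (start, end) byte ranges that end on line boundaries."""
    size = os.path.getsize(path)
    step = max(1, size // n_chunks)
    ranges, start = [], 0
    with open(path, 'rb') as f:
        while start < size:
            f.seek(min(start + step, size))
            f.readline()  # run on to the end of the line we landed in
            end = f.tell()
            ranges.append((start, end))
            start = end
    return ranges


def process_chunk(path, start, end):
    """Read one byte range and join each known phrase with underscores."""
    with open(path, 'rb') as f:
        f.seek(start)
        data = f.read(end - start)
    for phrase in PHRASES:  # naive placeholder for the real matcher
        data = data.replace(phrase, phrase.replace(b' ', b'_'))
    return data


def run_parallel(path, n_procs=4):
    """Process every chunk in a worker pool; results come back in order."""
    jobs = [(path, s, e) for s, e in chunk_ranges(path, n_procs)]
    with Pool(n_procs) as pool:
        return pool.starmap(process_chunk, jobs)
```

Because chunks break only at newlines and matching is per line anyway, no match is lost at a chunk boundary. Call run_parallel from under an `if __name__ == '__main__':` guard so that worker processes can import the module safely.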
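The component we're missing so far is how to perform the substitutions.
GNU fgrep -o prints only the matched text, and where patterns overlap it
reports the leftmost match, taking the longest pattern that starts there.
With -b, you also get each match's byte offset. So, for example: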
$ echo 'hello' | fgrep -bo -e 'he' -e 'hel'
0:hel
$ echo 'hello' | fgrep -bo -e 'he' -e 'el'
0:he
$ echo 'hello' | fgrep -bo -e 'he' -e 'ell'
0:he
$ echo 'hello' | fgrep -bo -e 'he' -e 'lo'
0:he
3:lo
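To pin down the matching rule those four runs suggest, here is a small Python emulation. It assumes the rule is "leftmost match, longest pattern at that position, no overlaps", which agrees with the outputs above but is inferred from them, not taken from grep's source. The function name is made up for illustration:

```python
def leftmost_longest(text, patterns):
    """Return (offset, match) pairs, scanning left to right without overlaps."""
    out, i = [], 0
    while i < len(text):
        # Among the patterns that start at position i, prefer the longest.
        best = max((p for p in patterns if p and text.startswith(p, i)),
                   key=len, default=None)
        if best is None:
            i += 1
        else:
            out.append((i, best))
            i += len(best)  # resume after the match, so nothing overlaps
    return out


for pats in (['he', 'hel'], ['he', 'el'], ['he', 'ell'], ['he', 'lo']):
    print(pats, leftmost_longest('hello', pats))
```

This scan tries every pattern at every position, so it only illustrates the semantics; it is not a substitute for fgrep on real data.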
You can use this mode and a substitution script like the following
(untested):
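    #!/usr/bin/env python
    import sys


    def read_fgrep_ob(f):
        """Yield (byte_offset, match) pairs from `fgrep -ob` output."""
        for line in f:
            # Split on the first colon only: the match may contain ':'.
            b, m = line.rstrip(b'\n').split(b':', 1)
            yield int(b), m


    def fgrep_sub(text_in, text_out, matches, sub_cb):
        """Copy text_in to text_out, replacing each match with sub_cb(match)."""
        offset = 0
        for match_offset, match_text in matches:
            # Copy the unmatched bytes that precede this match.
            text_out.write(text_in.read(match_offset - offset))
            # Consume the matched bytes from the input, then write the substitute.
            text_in.read(len(match_text))
            text_out.write(sub_cb(match_text))
            offset = match_offset + len(match_text)
        # Copy whatever follows the last match.
        text_out.write(text_in.read())


    if __name__ == '__main__':
        text_path = sys.argv[1]
        matches_path = sys.argv[2]
        # fgrep -b reports byte offsets, so do all I/O in binary mode.
        with open(text_path, 'rb') as text, open(matches_path, 'rb') as matches:
            fgrep_sub(text, sys.stdout.buffer, read_fgrep_ob(matches),
                      lambda s: s.replace(b' ', b'_'))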
GNU fgrep also has a -w option to match word boundaries.
Or, an alternative might be to use --color=yes instead of -ob, and you
will get matches marked inline with xterm colour codes!
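If you go the colour route, a filter along these lines could turn the marked-up output back into substituted text. It assumes each match is wrapped as ESC[<codes>m ESC[K ... ESC[m ESC[K, which is what I believe GNU grep emits by default; the exact sequences depend on the grep version and on GREP_COLORS, so check your own output first. The names MATCH_RE and sub_coloured are made up:

```python
import re

# Assumed wrapping of each match: a start code with at least one digit
# (e.g. ESC[01;31m), an optional ESC[K, the match, then ESC[m and an
# optional ESC[K. Verify against your grep's real output.
MATCH_RE = re.compile(
    r'\x1b\[[0-9;]+m(?:\x1b\[K)?(.*?)\x1b\[m(?:\x1b\[K)?', re.S)


def sub_coloured(line, sub_cb):
    """Replace each colour-marked match with sub_cb(match), dropping the codes."""
    return MATCH_RE.sub(lambda m: sub_cb(m.group(1)), line)


sample = 'I ate a \x1b[01;31m\x1b[Khot dog\x1b[m\x1b[K today'
print(sub_coloured(sample, lambda s: s.replace(' ', '_')))
```

This avoids the byte-offset bookkeeping entirely, at the cost of relying on the escape-code format.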
- Joel