Message from discussion
How to do large-scale search & replace of multiword expressions prior to LDA
Date: Tue, 30 Oct 2012 06:27:41 -0700 (PDT)
From: "michael douma @idea_org" <michael.douma.i...@gmail.com>
To: nltk-dev@googlegroups.com
Cc: "michael douma @idea_org" <michael.douma.i...@gmail.com>
Message-Id: <1e88f434-2974-4ad3-acde-1364c1f7bb4b@googlegroups.com>
In-Reply-To: <op.wmyxwtdcnxjllz@wifi-joel-2.cs.usyd.edu.au>
References: <fe429486-3cc0-4a74-aa12-50c9d86860c8@googlegroups.com>
<op.wmynmkcpnxjllz@joels-macbook.local>
<dec48cbd-a5ef-466a-82e5-5c03d3f836ad@googlegroups.com>
<op.wmyxwtdcnxjllz@wifi-joel-2.cs.usyd.edu.au>
Subject: Re: [nltk-dev] How to do large-scale search & replace of multiword
expressions prior to LDA
MIME-Version: 1.0
Content-Type: multipart/mixed;
boundary="----=_Part_135_21973882.1351603661406"
------=_Part_135_21973882.1351603661406
Content-Type: multipart/alternative;
boundary="----=_Part_136_14962131.1351603661406"
------=_Part_136_14962131.1351603661406
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Joel, thanks, particularly for suggesting fgrep's byte-offset option.
Splitting the job across multiple threads is no problem. We'll probably do
that with a script rather than via Hadoop.
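As a rough idea of what such a script might do first, here is a minimal
sketch that cuts one large file into byte ranges ending on newlines, so each
range can go to its own worker. fgrep matches within lines, so line-aligned
ranges should not split a match. The name chunk_ranges and the choice of
n_chunks are made up for illustration.

```python
import os

def chunk_ranges(path, n_chunks):
    """Split a file into up to n_chunks (start, end) byte ranges ending on newlines."""
    size = os.path.getsize(path)
    ranges = []
    start = 0
    with open(path, 'rb') as f:
        for i in range(1, n_chunks + 1):
            if start >= size:
                break
            # Aim for an even split, then extend to the end of the current line.
            target = max(start, size * i // n_chunks)
            f.seek(target)
            f.readline()  # consume the rest of the line that target falls in
            end = min(f.tell(), size)
            if end > start:
                ranges.append((start, end))
                start = end
    return ranges
```

Each (start, end) pair can then be read with seek(start) and read(end - start)
and processed independently; offsets reported for a chunk are relative to
that chunk's start.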
Any other approaches/suggestions? Let me know.
Michael
On Monday, October 29, 2012 9:47:11 PM UTC-4, Joel wrote:
>
> On Tue, 30 Oct 2012 11:49:29 +1100, michael douma @idea_org
> <michael.d...@gmail.com> wrote:
>
> > So the question is really (2)+(3). What is a good approach for
> > processing millions of multi-word expressions on 75 GB?
>
> It's clearly a highly parallelizable problem: 3 million MWEs will not take
> a huge amount of memory per-process when loaded into Aho-Corasick, so as
> long as you have the computing resources, it should be straightforward
> enough to solve using something like Hadoop (or even something less
> heavy-duty as you don't have a fancy reduce operation).
>
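For reference, a minimal sketch of the Aho-Corasick idea mentioned above,
assuming the MWEs fit in memory as plain strings. It reports every
occurrence, overlapping ones included, in a single pass over the text. The
names build_automaton and find_all are illustrative only, and a
dict-per-state layout like this is far less compact than a tuned
implementation would be.

```python
from collections import deque

def build_automaton(patterns):
    """Build Aho-Corasick tables: goto edges, failure links, outputs per state."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # patterns that end at each state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    # Breadth-first pass: shallower states are finished before deeper ones.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit patterns that end at the longest proper suffix state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_all(text, automaton):
    """Yield (start_offset, pattern) for every occurrence in text."""
    goto, fail, out = automaton
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            yield i - len(pat) + 1, pat
```

Build the automaton once per process and reuse it for every chunk of text
that process handles.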
> The component we're missing so far is how to perform the substitutions:
> GNU fgrep -o gives you the first, longest match. With -b, you get its byte
> offset. So, for example:
>
> $ echo 'hello' | fgrep -bo -e 'he' -e 'hel'
> 0:hel
> $ echo 'hello' | fgrep -bo -e 'he' -e 'el'
> 0:he
> $ echo 'hello' | fgrep -bo -e 'he' -e 'ell'
> 0:he
> $ echo 'hello' | fgrep -bo -e 'he' -e 'lo'
> 0:he
> 3:lo
>
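The selection rule those four examples illustrate (leftmost match first,
longest at that position, then continue after it) can be sketched in a few
lines. This is a naive scan meant only to reproduce the outputs quoted above,
not a description of how fgrep works internally; leftmost_longest is a
made-up name.

```python
def leftmost_longest(text, patterns):
    """Return [(offset, match)]: leftmost, then longest, non-overlapping matches."""
    results = []
    pos = 0
    while pos < len(text):
        # Longest pattern that starts exactly at pos, if any.
        best = max((p for p in patterns if p and text.startswith(p, pos)),
                   key=len, default=None)
        if best is None:
            pos += 1
        else:
            results.append((pos, best))
            pos += len(best)  # resume after the match, so overlaps are skipped
    return results
```

The scan tries every pattern at every position, so it is only suitable for
checking small cases by hand.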
> You can use this mode and a substitution script like the following
> (untested):
>
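> #!/usr/bin/env python
> import sys
>
> def read_fgrep_ob(f):
>     """Parse 'offset:match' lines as written by fgrep -ob."""
>     for l in f:
>         # Split on the first colon only; the match itself may contain colons.
>         b, m = l.rstrip(b'\n').split(b':', 1)
>         yield int(b), m
>
> def fgrep_sub(text_in, text_out, matches, sub_cb):
>     offset = 0
>     for match_offset, match_text in matches:
>         # Copy the unmatched bytes that precede this match.
>         while offset < match_offset:
>             s = text_in.read(match_offset - offset)
>             if not s:
>                 raise ValueError('match offset is past the end of the text')
>             offset += len(s)
>             text_out.write(s)
>         text_out.write(sub_cb(match_text))
>         # Skip the original match in the input so it is not copied again.
>         text_in.read(len(match_text))
>         offset += len(match_text)
>     # Copy whatever follows the last match.
>     text_out.write(text_in.read())
>
> if __name__ == '__main__':
>     text_path = sys.argv[1]
>     matches_path = sys.argv[2]
>     # Offsets are in bytes, so read and write bytes throughout.
>     with open(text_path, 'rb') as text, open(matches_path, 'rb') as matches:
>         fgrep_sub(text, sys.stdout.buffer, read_fgrep_ob(matches),
>                   lambda s: s.replace(b' ', b'_'))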
>
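Because -b reports byte offsets, the offset arithmetic only lines up if the
text is handled as bytes rather than decoded characters. A small in-memory
sketch for checking this on a sample, including non-ASCII text; the function
name substitute and the sample data are made up for illustration.

```python
def substitute(data, fgrep_lines, sub_cb):
    """Apply sub_cb to each match in data (bytes), given b'offset:match' lines."""
    out = []
    pos = 0
    for line in fgrep_lines:
        offset, _, match = line.rstrip(b'\n').partition(b':')
        offset = int(offset)
        # Guard against a match list that belongs to a different text.
        if data[offset:offset + len(match)] != match:
            raise ValueError('match list does not line up with the text')
        out.append(data[pos:offset])  # unchanged bytes before the match
        out.append(sub_cb(match))     # rewritten match
        pos = offset + len(match)
    out.append(data[pos:])            # unchanged tail
    return b''.join(out)

# 'é' takes two bytes in UTF-8, so the byte offset of 'new york' is 9, not 8.
sample = 'café in new york city\n'.encode('utf-8')
result = substitute(sample, [b'9:new york\n'], lambda m: m.replace(b' ', b'_'))
```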
> GNU fgrep also has a -w option to match word boundaries.
>
>
> Or, an alternative might be to use --color=yes instead of -ob, and you
> will get matches marked inline with xterm colour codes!
>
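A minimal sketch of consuming that colorized output, assuming each match is
wrapped in an SGR "on" sequence such as ESC[01;31m and a reset such as ESC[m,
each optionally followed by ESC[K, and that nothing else on the line is
colored. The exact sequences depend on the grep version and GREP_COLORS, so
check real output (for example with cat -v) before relying on this.
sub_colored is a made-up name.

```python
import re

# One SGR escape sequence, optionally followed by erase-to-end-of-line.
SGR = re.compile(r'\x1b\[([0-9;]*)m(?:\x1b\[K)?')

def sub_colored(line, sub_cb):
    """Rewrite the highlighted spans of one colorized line; drop the escape codes."""
    out = []
    pos = 0
    in_match = False
    for m in SGR.finditer(line):
        piece = line[pos:m.start()]
        out.append(sub_cb(piece) if in_match else piece)
        pos = m.end()
        # Empty or '0' parameters mean reset; anything else turns highlighting on.
        in_match = m.group(1) not in ('', '0')
    out.append(line[pos:])
    return ''.join(out)
```

Unlike the -ob route, this needs no offset bookkeeping, at the cost of
depending on the terminal escape format.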
> - Joel
>
------=_Part_136_14962131.1351603661406--
------=_Part_135_21973882.1351603661406--