Message from discussion How to do large-scale search & replace of multiword expressions prior to LDA

Date: Tue, 30 Oct 2012 06:27:41 -0700 (PDT)
From: "michael douma @idea_org" <michael.douma.i...@gmail.com>
To: nltk-dev@googlegroups.com
Cc: "michael douma @idea_org" <michael.douma.i...@gmail.com>
Message-Id: <1e88f434-2974-4ad3-acde-1364c1f7bb4b@googlegroups.com>
In-Reply-To: <op.wmyxwtdcnxjllz@wifi-joel-2.cs.usyd.edu.au>
References: <fe429486-3cc0-4a74-aa12-50c9d86860c8@googlegroups.com>
 <op.wmynmkcpnxjllz@joels-macbook.local>
 <dec48cbd-a5ef-466a-82e5-5c03d3f836ad@googlegroups.com>
 <op.wmyxwtdcnxjllz@wifi-joel-2.cs.usyd.edu.au>
Subject: Re: [nltk-dev] How to do large-scale search & replace of multiword
 expressions prior to LDA
MIME-Version: 1.0
Content-Type: multipart/mixed; 
	boundary="----=_Part_135_21973882.1351603661406"

------=_Part_135_21973882.1351603661406
Content-Type: multipart/alternative; 
	boundary="----=_Part_136_14962131.1351603661406"

------=_Part_136_14962131.1351603661406
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit

Joel, thanks, particularly for suggesting fgrep's byte-offset option.
Splitting the work across multiple threads is no problem. We'll probably
do the splitting with a script rather than with Hadoop.
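
For what it's worth, here is a minimal sketch of the script-based splitting.
It assumes the corpus is one large newline-delimited file and that no
multiword expression spans a line break. The name split_points and the
chunk count are made up for illustration.

```python
import os

def split_points(path, n_chunks):
    """Return (start, end) byte ranges covering the file, each ending on a newline."""
    size = os.path.getsize(path)
    bounds = [0]
    with open(path, 'rb') as f:
        for i in range(1, n_chunks):
            target = size * i // n_chunks
            if target <= bounds[-1]:
                continue  # a previous boundary already passed this target
            f.seek(target)
            f.readline()  # advance to the start of the next line
            pos = f.tell()
            if bounds[-1] < pos < size:
                bounds.append(pos)
    bounds.append(size)
    return list(zip(bounds[:-1], bounds[1:]))
```

Each (start, end) range can then go to its own worker. If a worker runs
fgrep -b on just its chunk, the offsets it reports are relative to the start
of that chunk, not to the whole file.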

Any other approaches/suggestions? Let me know. 

Michael


On Monday, October 29, 2012 9:47:11 PM UTC-4, Joel wrote:
>
> On Tue, 30 Oct 2012 11:49:29 +1100, michael douma @idea_org 
> <michael.d...@gmail.com> wrote:
>
> > So the question is really (2)+(3). What is a good approach for
> > processing millions of multi-word expressions on 75 GB?
>
> It's clearly a highly parallelizable problem: 3 million MWEs will not take 
> a huge amount of memory per-process when loaded into Aho-Corasick, so as 
> long as you have the computing resources, it should be straightforward 
> enough to solve using something like Hadoop (or even something less 
> heavy-duty as you don't have a fancy reduce operation). 
>
> The component we're missing so far is how to perform the substitutions: 
> GNU fgrep -o gives you the first, longest match. With -b, you get its byte 
> offset. So, for example: 
>
> $ echo 'hello' | fgrep -bo -e 'he' -e 'hel' 
> 0:hel 
> $ echo 'hello' | fgrep -bo -e 'he' -e 'el' 
> 0:he 
> $ echo 'hello' | fgrep -bo -e 'he' -e 'ell' 
> 0:he 
> $ echo 'hello' | fgrep -bo -e 'he' -e 'lo' 
> 0:he 
> 3:lo 
>
> You can use this mode and a substitution script like the following 
> (untested): 
>
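> #!/usr/bin/env python
>
> def read_fgrep_ob(f):
>      # Parse "offset:match" lines as written by fgrep -bo (binary mode).
>      for l in f:
>        b, m = l.rstrip(b'\n').split(b':', 1)
>        yield int(b), m
>
> def fgrep_sub(text_in, text_out, matches, sub_cb):
>      offset = 0
>      for match_offset, match_text in matches:
>        # Copy the unmatched text before this match through unchanged.
>        while offset < match_offset:
>          s = text_in.read(match_offset - offset)
>          if not s:
>            break
>          offset += len(s)
>          text_out.write(s)
>        # Consume the match from the input and write its replacement.
>        text_in.read(len(match_text))
>        text_out.write(sub_cb(match_text))
>        offset += len(match_text)
>      text_out.write(text_in.read())
>
> if __name__ == '__main__':
>      import sys
>      text_path = sys.argv[1]
>      matches_path = sys.argv[2]
>      # Byte offsets from fgrep -b only line up if everything stays bytes.
>      fgrep_sub(open(text_path, 'rb'), sys.stdout.buffer,
>          read_fgrep_ob(open(matches_path, 'rb')),
>          lambda s: s.replace(b' ', b'_'))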
>
> GNU fgrep also has a -w option to match word boundaries. 
>
>
> Or, an alternative might be to use --color=yes instead of -ob, and you 
> will get matches marked inline with xterm colour codes! 
>
> - Joel 
>
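
On the --color idea, here is a rough sketch of how the marked-up output
could be post-processed. It assumes the match markers look like GNU grep's
default ones: an SGR start sequence such as ESC[01;31m and an ESC[m reset,
each optionally followed by ESC[K. The function name and the regex are my
own, and the exact sequences should be checked against your grep and
GREP_COLORS setting.

```python
import re

# Assumed shape of a highlighted match: ESC[<params>m, optional ESC[K,
# the matched text, then ESC[m and an optional ESC[K.
MARKED = re.compile(
    r'\x1b\[[0-9;]+m(?:\x1b\[K)?(.*?)\x1b\[m(?:\x1b\[K)?',
    re.DOTALL)

def join_marked(line, sub_cb=lambda s: s.replace(' ', '_')):
    """Replace each highlighted span with sub_cb(span) and drop the colour codes."""
    return MARKED.sub(lambda m: sub_cb(m.group(1)), line)
```

One caveat: grep prints only the lines that contain a match, so lines
without any multiword expression would have to be carried through some
other way.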

------=_Part_136_14962131.1351603661406--

------=_Part_135_21973882.1351603661406--