How to do large-scale search & replace of multiword expressions prior to LDA

michael douma @idea_org

unread,

Oct 29, 2012, 4:41:23 PM10/29/12

to nltk...@googlegroups.com

I have a question about doing large-scale search & replace. I hope this is relevant to NLTK-DEV because many group members are experienced with various ways to process text.

BACKGROUND:
I'm building a data set to be used in an consumer educational app which links words together. The app is a follow-on of our WikiNodes app (http://www.idea.org/WikiNodes.htm), expanded to include both dictionary and Wikipedia terms. Our goal is to display a constellation of linked words and terms for students and the public to explore. To build our database of word relationships, we have had good success using latent Dirichlet allocation (LDA). We have a ~75 GB text corpus. Our methods are working well for single words (e.g., linking "onion" and "garlic" and "sauté"), but LDA doesn't handle multiword expressions. Therefore, we're thinking that if we search and replace ~3 million phrases, inserting underscores instead of spaces, e.g., "Eiffel_Tower", then we can use LDA and statistical methods directly. This is related to the task done by named-entity recognition and Wikification systems.

QUESTION:
What methods would you suggest for doing search/replace of ~3 million terms over ~75GB of text?

Also, I welcome any related comments or suggestions. Please be patient with my ignorance, I am not a Natural Language scholar.

Thanks,

Michael Douma
www.idea.org

Mikhail Korobov

unread,

Oct 29, 2012, 5:01:32 PM10/29/12

to nltk...@googlegroups.com

I'm also not an expert but it seems Aho-Corasick automaton may fit.

There is a "fgrep" unix console utility and several Python modules for this (https://github.com/scoder/acora and http://pypi.python.org/pypi/ahocorasick).

вторник, 30 октября 2012 г., 2:41:23 UTC+6 пользователь michael douma @idea_org написал:

michael douma @idea_org

unread,

Oct 29, 2012, 5:10:22 PM10/29/12

to nltk...@googlegroups.com

Thanks, Mikhail. I did not know about fgrep for Aho-Corasick. That will help solve a different piece of our problem. I have previously read about the substantially higher performance of fixed-string searching of Aho-Corasick.

Joel Nothman

unread,

Oct 29, 2012, 6:04:58 PM10/29/12

to nltk...@googlegroups.com, michael douma @idea_org

On Tue, 30 Oct 2012 07:41:23 +1100, michael douma @idea_org
<michael.d...@gmail.com> wrote:

> QUESTION:
> What methods would you suggest for doing search/replace of ~3 million
> terms
> over ~75GB of text?

Hi Michael,

Are you interested in methods for:

(1) determining what multi-word expressions to combine?
(2) performing the search and replace, given a list of multi-word
expressions to underscore?
(3) scaling the problem to 75GB of text?

Mikhail has proposed a solution to (2) but you say this will "solve a
different piece of our problem". So is (2) what you are looking for?

If so, multi-word expressions might include:

* anything matched by a named entity recogniser
* anything identical to a Wikipedia page title or redirect
* frequent, or highly associated (see nltk.metrics.association), word
collocations

These won't always predict the correct terms, and will find some
additional MWEs. There may be ways to reduce false positives or false
negatives... But it seems that if you are generally using single words as
if they have single meanings or functions, then a rough approach to this
component should suffice.

But please explicate a little more what part needs guidance.

Cheers,

- Joel

michael douma @idea_org

unread,

Oct 29, 2012, 8:49:29 PM10/29/12

to nltk...@googlegroups.com, michael douma @idea_org

Hi Joel,

I'm interested in (2), given the constraints of (3).

For (1), for discussion purposes, assume the list of multi-word expressions is all Wikipedia and Wiktionary topics. We could use other sources, but that's not relevant for now. We could add collocations in the future, but we are limiting to only multi-word expressions for which we have a definition.

For (2), there are many ways to do this, from regex (sed, awk, perl, ruby), to using hashes (google-dense, khash, etc.), possibly with a trie. Using CWG/DAWG to lookup a word at a time, or perhaps SQLite+Memcached. None seem ideal.

So the question is really (2)+(3). What is a good approach for processing millions of multi-word expressions on 75 GB.

Sorry if I'm just repeating myself, and thanks for trying to hone in. I feel like people experienced in these matters must have have already figured this out.

Michael

Joel Nothman

unread,

Oct 29, 2012, 9:47:07 PM10/29/12

to nltk...@googlegroups.com, michael douma @idea_org

On Tue, 30 Oct 2012 11:49:29 +1100, michael douma @idea_org
<michael.d...@gmail.com> wrote:

> So the question is really (2)+(3). What is a good approach for processing
> millions of multi-word expressions on 75 GB.

It's clearly a highly parallelizable problem: 3 million MWEs will not take
a huge amount of memory per-process when loaded into Aho-Corasick, so as
long as you have the computing resources, it should be straightforward
enough to solve using something like Hadoop (or even something less
heavy-duty as you don't have a fancy reduce operation).

The component we're missing so far is how to perform the substitutions:
GNU fgrep -o gives you the first, longest match. With -b, you get its byte
offset. So, for example:

$ echo 'hello' | fgrep -bo -e 'he' -e 'hel'
0:hel
$ echo 'hello' | fgrep -bo -e 'he' -e 'el'
0:he
$ echo 'hello' | fgrep -bo -e 'he' -e 'ell'
0:he
$ echo 'hello' | fgrep -bo -e 'he' -e 'lo'
0:he
3:lo

You can use this mode and a substitution script like the following
(untested):

#!/usr/bin/env/python

def read_fgrep_ob(f):
for l in f:
b, m = l.rstrip().split(':')
yield int(b), m

def fgrep_sub(text_in, text_out, matches, sub_cb):
offset = 0
for match_offset, match_text in matches:
while match_offset - offset:
s = text_in.read(match_offset - offset)
offset += len(s)
text_out.write(s)
text_out.write(sub_cb(match_text))
offset += len(match_text)
text_out.write(text_in.read())

if __name__ == '__main__':
import sys
text_path = sys.argv[1]
matches_path sys.argv[2]
fgrep_sub(open(text_path, 'rb'), sys.stdout,
read_fgrep_ob(open(matches_path)),
lambda s: s.replace(' ', '_'))

GNU fgrep also has a -w option to match word boundaries.

Or, an alternative might be to use --color=yes instead of -ob, and you
will get matches marked inline with xterm colour codes!

- Joel

michael douma @idea_org

unread,

Oct 30, 2012, 9:27:41 AM10/30/12

to nltk...@googlegroups.com, michael douma @idea_org

Joel, thanks.

Particularly for suggesting the byte offset option from fgrep. We can break it into multiple threads, no problem. We'll probably break it into threads via a script, not via Hadroop.