Text mining with a MongoDB approach


Nathan

Jul 18, 2012, 10:42:43 AM
to mongod...@googlegroups.com
Hello people!!! :)

I am working in text mining, and I would like to use MongoDB. My work, at the moment, consists of detecting entities (with dictionaries) and mapping them to their unique identifiers.

First, I have stored in MongoDB the sentences, tokens and chunks from the texts of interest, so I have a collection with a lot of documents with the following fields:

  • Identifier of the document.
  • Sentence, with the text, offset, and sentence number.
  • Tokens, a list where each item has the token text, offset, part-of-speech, lemma and token number.
  • Chunks, a list similar to Tokens.
Second, I have a large terminology list, where each term has one or more identifiers: {term: string, concept_id: list of strings}.
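
To make this concrete, the documents look roughly like this (the field names and values are only illustrative, not my exact schema):

# Rough sketch of one sentence document and one terminology document.
sentence_doc = {
    "doc_id": "DOC_0001",                   # identifier of the source document
    "sentence": "Aspirin inhibits COX-1.",  # sentence text
    "sentence_offset": 120,                 # offset of the sentence in the document
    "sentence_number": 7,                   # sentence number within the document
    "tokens": [
        {"text": "Aspirin", "offset": 120, "pos": "NN",
         "lemma": "aspirin", "number": 0},
        # ... one entry per token
    ],
    "chunks": [
        {"text": "Aspirin", "offset": 120, "label": "NP", "number": 0},
        # ... one entry per chunk
    ],
}

term_doc = {
    "term": "aspirin",             # the surface string to search for
    "concept_id": ["CONCEPT_001"], # one or more identifiers for the term
}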

With Python, I have a script that searches all the terminology from the dictionary against all the sentences to get a subset of sentences. Then I check, sentence by sentence, whether each term in the dictionary appears in the sentence or not. If it matches, I store the result in a new collection with the entity offsets.
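
A stripped-down sketch of that step (the collection and field names, and the use of re.escape, are just how I would sketch it, not my exact code):

import re
from pymongo import MongoClient

db = MongoClient()["textmining"]   # database name is just an example

# Load the terminology once; each entry is {"term": ..., "concept_id": [...]}.
terms = list(db.terminology.find({}))

for doc in db.sentences.find({}):
    text = doc["sentence"]
    for t in terms:
        # re.escape in case the term contains regex metacharacters
        for m in re.finditer(re.escape(t["term"]), text):
            db.entities.insert_one({
                "doc_id": doc["doc_id"],
                "sentence_number": doc["sentence_number"],
                "term": t["term"],
                "concept_id": t["concept_id"],
                "start": m.start(),   # offset of the entity in the sentence
                "end": m.end(),
            })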

The problem is that I am using Python to match regular expressions built from the terminology (in MongoDB) against the sentences (also in MongoDB). I have a lot of sentences (more than 1,000,000) and large dictionaries of around 300,000 entries, and this process is really slow...

I do not know if it is possible to use MongoDB better for this task and increase the speed. My processing takes days to finish. I like MongoDB and I hope to keep working with it, but I do not know if it can help me here.

If I search with regular expressions in MongoDB, is it possible to get the offset of the match?

Thank you all!!!

nth

A. Jesse Jiryu Davis

Jul 18, 2012, 3:00:19 PM
to mongod...@googlegroups.com
There's no way to get the offset of a regex match -- MongoDB queries always return whole documents and nothing else.

For speed, is there a way to arrange your data and / or your queries to avoid querying with regexes, or at least to only do case-sensitive prefix matches with regexes? E.g.,

db.collection.find( { sentence: /^prefix.../ } )

If a regex starts with "^" and a few constant characters, and is *not* case-insensitive, then it can use an index. Otherwise, each query requires a full collection scan.
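
In PyMongo the same idea would look roughly like this (the collection and field names are guesses based on your description):

import re
from pymongo import MongoClient

sentences = MongoClient()["textmining"]["sentences"]  # names are assumptions

# An index on the queried field lets an anchored, case-sensitive regex
# walk the index instead of scanning the whole collection.
sentences.create_index("sentence")

# Anchored with "^", no re.IGNORECASE flag: this can use the index.
cursor = sentences.find({"sentence": re.compile(r"^aspirin")})

# By contrast, an unanchored or case-insensitive regex forces a full scan:
# sentences.find({"sentence": re.compile(r"aspirin", re.IGNORECASE)})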

Nathan

Jul 19, 2012, 4:06:00 AM
to mongod...@googlegroups.com
So the only thing I can do is use MongoDB to store the sentences and run queries (sentences_collection.find({})) to get the sentences in a cursor. The problem is my Python algorithm. For each sentence in the cursor, my algorithm matches all the terms:

import re

for t in terms:
    # scan the sentence for the term pattern, wrapped in a capturing group
    matches = re.finditer(r'(' + t + ')', sentence)
    for m in matches:
        ini = m.start()  # start offset of the match in the sentence
        end = m.end()    # end offset of the match

Is there any way to find matches quickly?

A. Jesse Jiryu Davis

Jul 19, 2012, 9:26:04 AM
to mongod...@googlegroups.com
I think that's the fastest way, if 't' is a regular expression. Of course if 't' is just a normal string then you can do:

sentence.index(t)

... which would be faster than regex matching.
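
And if you need every occurrence, not just the first, a plain-string loop like this (just a sketch) gives you the same (start, end) offsets that finditer does:

def find_all(sentence, term):
    """Yield (start, end) offsets of every occurrence of a plain-string term."""
    start = sentence.find(term)
    while start != -1:
        yield start, start + len(term)
        start = sentence.find(term, start + 1)

Note that str.find returns -1 when there is no match, while str.index raises ValueError.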

In your first message you identified five or six distinct steps in your algorithm. Which takes the longest? Can you post some example code for that time-consuming step?