Hello people!!! :)
I am working in text mining, and I would like to use MongoDB. My work at the moment consists of entity detection (with dictionaries) and mapping each entity to its unique identifier.
First, I have stored sentences, tokens, and chunks from the texts of interest in MongoDB. So I have a collection with many records containing the following fields:
- Identifier of the document.
- Sentence: the text, its offset, and the sentence number
- Tokens: a list where each item has the token text, offset, part-of-speech, lemma, and token number
- Chunks: a list with a structure similar to Tokens
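For reference, a single record in that collection might look roughly like this (a sketch only; the field names and values are illustrative, not my exact schema):

```python
# Hypothetical example of one document in the sentences collection.
# Field names are illustrative guesses based on the description above.
doc = {
    "doc_id": "PMC12345",            # identifier of the source document
    "sentence": {
        "text": "Aspirin inhibits COX-1.",
        "offset": 120,               # character offset in the document
        "number": 7,                 # sentence number
    },
    "tokens": [
        {"text": "Aspirin", "offset": 120, "pos": "NNP",
         "lemma": "aspirin", "number": 0},
        # ... one item per token
    ],
    "chunks": [
        {"text": "Aspirin", "offset": 120, "number": 0},
        # ... similar structure to tokens
    ],
}
```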
Second, I have a large list of terminology, where each term has one or more identifiers: {term: string, concept_id: list of strings}
With Python, I have a script that searches all the terminology from the dictionary against all sentences to get a subset of sentences. Then I check, sentence by sentence, whether each term in the dictionary appears in that sentence or not. If it matches, I store the result in a new collection, together with the entity offsets.
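To make the question concrete, the matching step can be sketched like this (a simplified version of the approach described above; the function and field names are mine, not from my actual script):

```python
import re

def find_entities(sentences, terminology):
    """Match each term against each sentence and collect entity offsets.

    `sentences` is an iterable of dicts with a "text" field;
    `terminology` is an iterable of {"term": str, "concept_id": [str, ...]}.
    This is a simplified sketch, not the real production script.
    """
    results = []
    for entry in terminology:
        # Escape the term so regex metacharacters are matched literally,
        # and require word boundaries around it.
        pattern = re.compile(r"\b" + re.escape(entry["term"]) + r"\b",
                             re.IGNORECASE)
        for sent in sentences:
            for match in pattern.finditer(sent["text"]):
                results.append({
                    "doc_id": sent.get("doc_id"),
                    "term": entry["term"],
                    "concept_id": entry["concept_id"],
                    "start": match.start(),  # offset within the sentence
                    "end": match.end(),
                })
    return results
```

As you can see, Python's re.finditer gives me the match offsets, but every term is tested against every sentence, which is where the time goes.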
The problem is that I am using Python to match regular expressions built from the terminology (in MongoDB) against the sentences (in MongoDB). I have a lot of sentences (more than 1,000,000) and large dictionaries (around 300,000 entries), and this process is really slow...
I do not know if it is possible to make better use of MongoDB for this task and increase the speed. My processing takes days to finish. I like MongoDB and I hope to keep working with it, but I do not know if it can help me here.
If I search with regular expressions in MongoDB, is it possible to get the offset of the match???
Thank you all!!!
nth