Classifier implementation help

Kate

Sep 13, 2012, 5:02:17 PM
to mongod...@googlegroups.com
I have a tricky problem that I can't even begin to solve. I would like to score each text string by the words it contains. I have a csv file with a list of words and their score. Points for each text should be the sum of keyword scores; points should become a new field in each document.

Inputs:
word, score
girls, 50
boys, 30
dog, 20
cat, 9

Database:
{_id: 12, 'text': "i have a dog and a cat"}
{_id: 18, 'text': "girls and boys go to school"}

Desired output:
{_id: 12, 'text': "i have a dog and a cat", 'points': 29}
{_id: 18, 'text': "girls and boys go to school", 'points': 80}

Performance is a bit of an issue, as I have 30 million documents. 

As an aside, I'm having trouble finding resources for learning how to use MongoDB. Most resources I've found assume experience with some other database like SQL, whereas I'm learning on MongoDB. If you know of any good resources, please let me know.

Kate

Sep 13, 2012, 5:03:23 PM
to mongod...@googlegroups.com
Addendum: I'm more comfortable in pymongo, but anything would be of use at this point.

Thomas Rueckstiess

Sep 14, 2012, 2:00:29 AM
to mongod...@googlegroups.com
Hello Kate,

I don't see a way to do all this within the database. You would have to pull each document to the client side, calculate the score there, and then update the document with the score. Below are the steps in more detail (in Python/pymongo) and the code. Please use the code only as a guideline and double-check it; the details may differ for you depending on your setup, mongod port, etc.


1. import the scores from the csv file and create a dictionary in python with the word as key and the score as value.

2. get a cursor to all documents in the sentences collection that don't have a score yet. This is important because it allows you to restart the script if it doesn't complete in one run; updating 30 million documents can take a long time.

3. go through each document, calculate the score (ignoring words that don't have a score) and update the document with the score.
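
Step 3 is the only part with real logic, and it can be checked without a database. Here is a minimal sketch using the sample data from the question; the function name score_text is only illustrative:

```python
def score_text(text, scores):
    """Sum the scores of every known word in text; unknown words count as 0."""
    return sum(scores.get(word, 0) for word in text.split())


# sample data from the question
scores = {'girls': 50, 'boys': 30, 'dog': 20, 'cat': 9}

print(score_text("i have a dog and a cat", scores))       # 29
print(score_text("girls and boys go to school", scores))  # 80
```

Note that this splits on whitespace only, so it is case-sensitive, a word with punctuation attached (like "dog,") will not match, and a word that appears twice is counted twice. Whether that is what you want depends on your data.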



from pymongo import MongoClient   # replaces the older Connection class
import csv

# read all scores and create scores dictionary
scores = {}
with open('scores.txt', 'r') as f:
    # create csv reader to import scores (ignore the space after each comma)
    reader = csv.reader(f, skipinitialspace=True)

    # skip header in csv file
    next(reader)

    for line in reader:
        if line:   # skip empty lines
            scores[line[0]] = int(line[1])


# establish connection to mongodb
con = MongoClient(port=30000)   # add your host/port here if not using default values
db = con['test']                # use the correct database (here: test)

# get a cursor to all documents in the collection that don't have score yet
cursor = db.sentences.find({'score': {'$exists': False}})

# iterate over all documents in the cursor
for doc in cursor:
    # calculate score, ignoring words that don't have one
    sentence = doc['text']
    words = sentence.split()
    score = sum(scores[w] for w in words if w in scores)
    # update document
    # note: the question calls this field 'points'; to use that name,
    # change 'score' here and in the find() filter above
    db.sentences.update_one({'_id': doc['_id']}, {'$set': {'score': score}})
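
Since performance matters here: the loop above sends one update per document, which means one round trip to the server each time. A possible variant, only a sketch and untested against a real 30-million-document collection, groups the updates and sends them with pymongo's bulk_write. The helper names (batched, score_text, write_scores) and the batch size of 1000 are my own assumptions, not anything prescribed by MongoDB:

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items from iterable."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def score_text(text, scores):
    """Sum the scores of every known word in text; unknown words count as 0."""
    return sum(scores.get(word, 0) for word in text.split())


def write_scores(collection, scores, batch_size=1000):
    """Score all documents without a 'score' field, writing in batches."""
    # imported here so the helpers above can be tried without pymongo installed
    from pymongo import UpdateOne

    # only fetch the field we need (_id is always included)
    cursor = collection.find({'score': {'$exists': False}}, {'text': 1})
    total = 0
    for docs in batched(cursor, batch_size):
        ops = [
            UpdateOne({'_id': d['_id']},
                      {'$set': {'score': score_text(d['text'], scores)}})
            for d in docs
        ]
        # unordered: the server may apply the updates in any order
        collection.bulk_write(ops, ordered=False)
        total += len(ops)
    return total
```

You would call it as write_scores(db.sentences, scores). Because it still only selects documents without a score, it can be restarted in the same way as the simple loop.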



I hope this helps you solve your problem.

Best regards,
Thomas