Classifier implementation help

Kate

Sep 13, 2012, 5:02:17 PM

to mongod...@googlegroups.com

I have a tricky problem that I can't even begin to solve. I would like to score each text string by the words it contains. I have a csv file with a list of words and their scores. The points for each text should be the sum of the scores of the keywords it contains, and the points should be stored as a new field in each document.

Inputs:

word, score

girls, 50

boys, 30

dog, 20

cat, 9

Database:

{_id: 12, 'text': "i have a dog and a cat"}

{_id: 18, 'text': "girls and boys go to school"}

Desired output:

{_id: 12, 'text': "i have a dog and a cat", 'points': 29}

{_id: 18, 'text': "girls and boys go to school", 'points': 80}

Performance is a bit of an issue, as I have 30 million documents.

As an aside, I'm having trouble finding resources for learning how to use MongoDB. Most resources I've found assume experience with some other database like SQL, whereas MongoDB is the first database I'm learning. If you know of any good resources, please let me know.

Kate

Sep 13, 2012, 5:03:23 PM

to mongod...@googlegroups.com

Addendum: I'm more comfortable in pymongo, but anything would be of use at this point.

Thomas Rueckstiess

Sep 14, 2012, 2:00:29 AM

to mongod...@googlegroups.com

Hello Kate,

I don't see a way to do all this within the database. You would have to pull each document to the client side, calculate the score there, and then update the document with the score. Below are the steps in more detail (in Python/pymongo) and the code. Please use the code only as a guideline and double-check it; it may need to be slightly different for you, depending on your setup, mongod port, etc.

1. Import the scores from the csv file and create a Python dictionary with the word as key and the score as value.

2. Get a cursor to all documents in the sentences collection that don't have a score yet. This is important because it allows you to restart the script if it doesn't complete in one run; updating 30 million documents can take a long time.

3. Go through each document, calculate the score (ignoring words that don't have a score), and update the document with the score.
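Step 3 can be tried on its own before anything is run against the database. The following is a minimal sketch, assuming the four words and scores from your example; the helper name score_text is made up for illustration. It splits on whitespace only, so a word with punctuation or a capital letter attached (such as "Dog,") would not match.

```python
# Scores taken from the example csv in the question.
scores = {'girls': 50, 'boys': 30, 'dog': 20, 'cat': 9}


def score_text(text, scores):
    """Sum the scores of all whitespace-separated words; unknown words count as 0."""
    return sum(scores.get(word, 0) for word in text.split())


print(score_text("i have a dog and a cat", scores))       # 29
print(score_text("girls and boys go to school", scores))  # 80
```

Both results match the points in your desired output.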

from pymongo import MongoClient
import csv

# read the scores from the csv file into a dictionary {word: score}
scores = {}
with open('scores.txt', 'r') as f:
    reader = csv.reader(f)
    next(reader)  # skip header in csv file
    for line in reader:
        if line:  # ignore blank lines
            scores[line[0].strip()] = int(line[1])

# establish connection to mongodb
con = MongoClient(port=30000)  # add your host/port here if not using default values
db = con['test']  # use the correct database (here: test)

# get a cursor to all documents in the collection that don't have points yet
cursor = db.sentences.find({'points': {'$exists': False}})

# iterate over all documents in the cursor
for doc in cursor:
    # calculate points, ignoring words that don't have a score
    words = doc['text'].split()
    points = sum(scores[w] for w in words if w in scores)

    # update document with the new 'points' field
    db.sentences.update_one({'_id': doc['_id']}, {'$set': {'points': points}})
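Since you mentioned performance with 30 million documents: one round trip per update is likely to dominate the run time. One possible way to reduce it, assuming a pymongo version that provides UpdateOne and bulk_write (3.0 or later), is to send the updates in batches. This is only a sketch. The names batched and apply_points and the batch size of 1000 are illustrative choices, not something measured on your data, and the field is called points here to match the desired output in your question.

```python
def score_text(text, scores):
    """Sum the scores of all whitespace-separated words; unknown words count as 0."""
    return sum(scores.get(word, 0) for word in text.split())


def batched(iterable, size):
    """Yield lists of at most `size` items from `iterable`."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def apply_points(collection, scores, batch_size=1000):
    """Write a 'points' field to every document that lacks one, in batches."""
    # Imported here so the two helpers above can be tried without pymongo.
    from pymongo import UpdateOne

    # Fetch only the text field (plus _id, which is included by default).
    cursor = collection.find({'points': {'$exists': False}}, {'text': 1})
    for docs in batched(cursor, batch_size):
        ops = [
            UpdateOne({'_id': d['_id']},
                      {'$set': {'points': score_text(d.get('text', ''), scores)}})
            for d in docs
        ]
        # ordered=False: every update in the batch is attempted even if one fails.
        collection.bulk_write(ops, ordered=False)
```

It would be called as apply_points(db.sentences, scores), with db and scores set up as in the script above. Because it only selects documents without a points field, it can be restarted in the same way.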

I hope this helps you solve your problem.

Best regards,

Thomas
