CategorizedCorpusReader issue with using cat_map


Ron Lin

Jun 22, 2015, 3:42:12 PM
to nltk-...@googlegroups.com
I'm using the CategorizedPlaintextCorpusReader and running into an issue using the cat_map parameter to load the categories associated with each fileid.

I'm loading a plaintext corpus:

from nltk.corpus.reader import CategorizedPlaintextCorpusReader
catcorpus = CategorizedPlaintextCorpusReader(corpusdir, '.*', cat_map=corpus_dict)

cat_map is a parameter that accepts a dictionary mapping each fileid to its category, so corpus_dict['filename'] = category.

The possible category values are categories = ['CREDIT', 'DEBT', 'MERGER', 'STOCK'], so I would expect len(catcorpus.categories()) to return 4, but instead I get 20.

I'm able to load and manipulate my dataset successfully, but when I examine the categories, here's what I find:

>>> catcorpus.categories()
['A', 'B', 'C', 'D', 'E', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'O', 'P', 'R', 'S', 'T', 'U', 'V', 'W']

I have no idea where these came from. I've looked at the API documentation, but I don't see where these categories would come from there either.

It's unclear to me whether the problem is that cat_map doesn't work for assigning categories to fileids in a corpus. Most examples online use cat_pattern to assign the categories instead.

Ron Lin

Jun 22, 2015, 6:58:32 PM
to nltk-...@googlegroups.com
The problem was that each value in corpus_dict should be a list of categories, not a string. The assignment should be corpus_dict['filename'] = [category].
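
To illustrate why a string value produces single-letter categories, here is a minimal sketch in plain Python. It does not call NLTK; collect_categories is a hypothetical stand-in that assumes the reader builds its category list by looping over each value in cat_map, and the file names are made up.

```python
def collect_categories(cat_map):
    """Gather every category by iterating over each value in cat_map."""
    found = set()
    for fileid, categories in cat_map.items():
        # If `categories` is a str, this loop yields one character at a time.
        for category in categories:
            found.add(category)
    return sorted(found)

# Hypothetical file names, for illustration only.
wrong = {"a.txt": "DEBT", "b.txt": "STOCK"}        # values are strings
right = {"a.txt": ["DEBT"], "b.txt": ["STOCK"]}    # values are lists

wrong_categories = collect_categories(wrong)
right_categories = collect_categories(right)
print(wrong_categories)   # single letters drawn from "DEBT" and "STOCK"
print(right_categories)   # the two intended category names
```

With string values, every distinct letter becomes its own category; with list values, each category name is kept whole.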

Steven Bird

Jun 22, 2015, 7:39:47 PM
to nltk-...@googlegroups.com
Thanks for reporting back Ron.

Several methods in NLTK take a list of strings as one of the parameters, and will silently accept a string instead. When the method iterates over that string, it gets a sequence of (one-character) strings.

So it's worth noting that if ever you see one-character strings when you were expecting longer strings, you might have used a string instead of a list of strings as an argument to some method.

NLTK generally doesn't check argument types because this clutters the code and slows everything down.
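
One way to guard against this on the caller's side is to normalize arguments before handing them to NLTK. The sketch below is only an assumption about how you might do that in Python 3; as_list and normalize_cat_map are hypothetical helpers, not part of NLTK, and the file names are made up.

```python
def as_list(value):
    """Wrap a bare string in a list; copy any other iterable into a list."""
    if isinstance(value, str):
        return [value]
    return list(value)

def normalize_cat_map(cat_map):
    """Return a new dict whose values are all lists of category strings."""
    return {fileid: as_list(cats) for fileid, cats in cat_map.items()}

# Hypothetical file names, for illustration only.
raw = {"a.txt": "DEBT", "b.txt": ["STOCK", "MERGER"]}
fixed = normalize_cat_map(raw)
print(fixed)
```

The normalized dictionary can then be passed as cat_map, so a stray string value never reaches the reader.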

-Steven Bird

