I'm using the CategorizedPlaintextCorpusReader and running into an issue using the cat_map parameter to load the categories associated with each fileid.
I'm loading a plaintext corpus:
catcorpus = CategorizedPlaintextCorpusReader(corpusdir, '.*', cat_map=corpus_dict)
cat_map is a parameter that accepts a dictionary containing a fileid=>category mapping, so corpus_dict['filename'] = category.
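For illustration, here is a minimal sketch of how such a mapping might be built. The file names and the `labeled_files` list are hypothetical placeholders, not taken from my actual corpus:

```python
# Hypothetical (fileid, category) pairs; the real ones come from the dataset.
labeled_files = [
    ("doc1.txt", "CREDIT"),
    ("doc2.txt", "DEBT"),
    ("doc3.txt", "MERGER"),
    ("doc4.txt", "STOCK"),
]

# Build the fileid => category dictionary that is passed as cat_map.
corpus_dict = {}
for fileid, category in labeled_files:
    corpus_dict[fileid] = category  # each value is a single plain string

print(corpus_dict["doc1.txt"])
```

Note that in this sketch each value is one string per fileid, which matches how I am filling corpus_dict.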
The possible category values are categories = ['CREDIT', 'DEBT', 'MERGER', 'STOCK'], so I would expect only four possible categories. Therefore, len(catcorpus.categories()) should return 4, but instead I get 20.
So I'm able to load and manipulate my dataset successfully. When I examine my categories, here's what I find:
>>> catcorpus.categories()
['A', 'B', 'C', 'D', 'E', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'O', 'P', 'R', 'S', 'T', 'U', 'V', 'W']
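Every returned category is a single uppercase letter, so one check that might help narrow this down is comparing them against the individual characters of the category strings. A rough sketch, assuming a hypothetical corpus_dict with made-up file names:

```python
# Hypothetical mapping; real file names and labels come from the actual corpus.
corpus_dict = {
    "doc1.txt": "CREDIT",
    "doc2.txt": "DEBT",
    "doc3.txt": "MERGER",
    "doc4.txt": "STOCK",
}

# Collect the distinct characters that appear across all category strings.
letters = sorted({ch for label in corpus_dict.values() for ch in label})
print(letters)
```

If the letters printed here all show up in catcorpus.categories(), that would suggest the string values are being split into characters somewhere, though this is only a guess on my part.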
I have no idea where these came from. I've looked at the API documentation, but I don't see where these categories come from there either.
It's unclear to me whether the problem is that the cat_map parameter doesn't work for assigning categories to fileids in a corpus. Most examples online seem to use cat_pattern to assign the categories.