Classifying abbreviations into standard categories

Pat Yearick

Jun 6, 2016, 4:24:36 PM
to nltk-users
I am taking data from older (and newer) data systems and am working to build a process that maps food abbreviations from those databases into standard categories.

Older data systems with character limitations may abbreviate "Cheese" as "CHZ" or "Bacon" as "BACN". Phrases in the newer systems can, of course, support more characters, but they are still abbreviated just to save paper and don't always map into a clean hierarchy: "Bnls Buff Wing".

I have been trying to use nltk, Weka, and other packages to get started on a parsing and classification project. All I have running is a brute-force search across all my known terms, using either fuzzywuzzy or Jaccard similarity in SQL Server. I feel there should be a way to do this with nltk and similarity measures, and to make it set-based, but so far my problem has been that no match is found for an abbreviation, because abbreviations are not actual words the system knows about.
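
For comparison, here is roughly what I mean by Jaccard similarity, sketched in Python over character bigrams rather than whole words (the helper names are my own); an abbreviation like "CHZ" still shares character n-grams with "CHEESE" even though it isn't a dictionary word:

    def char_ngrams(s, n=2):
        # Set of character n-grams, e.g. 'CHZ' -> {'CH', 'HZ'}.
        s = s.upper()
        return {s[i:i + n] for i in range(len(s) - n + 1)}

    def jaccard(a, b, n=2):
        # Jaccard similarity of the two strings' n-gram sets.
        ga, gb = char_ngrams(a, n), char_ngrams(b, n)
        return len(ga & gb) / float(len(ga | gb)) if ga | gb else 0.0

    # e.g. jaccard('CHZ', 'CHEESE') > jaccard('CHZ', 'BACON')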

I am new to Python/nltk/gensim.

Here is my brute-force fuzzywuzzy code:

    from fuzzywuzzy import fuzz

    def classify(string, idict):
        # Exact-match lookup first.
        print('incoming string: ' + string)
        if string in idict:
            print('FOUND ' + string)
            return idict[string]
        print('NOT FOUND')
        # Fall back to a brute-force fuzzy scan over every known term.
        x = 'TBD'
        hi_fuzz = 0
        for key, value in idict.items():
            # token_sort_ratio ignores word order; partial_ratio is an
            # alternative for substring-style matches.
            f_ratio = fuzz.token_sort_ratio(key, string)
            if f_ratio > hi_fuzz:
                # Keep the best-scoring category seen so far.
                x = 'fuzzy: ' + value
                hi_fuzz = f_ratio
                print('fRatio: ' + str(f_ratio) + ' string: ' + string +
                      ' key: ' + key + ' value: ' + value)
        print(x)
        return x
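
Called with a small test dictionary (these entries are just illustrative), it looks like:

    idict = {'CHEESE': 'Dairy', 'BACON': 'Pork'}
    classify('BACN', idict)  # no exact hit, falls through to the fuzzy scan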

Does anyone have suggestions on a path to take?

Thank you,

Pat