word sorting script not giving me the results I am expecting.


jettam

Sep 26, 2017, 1:02:36 PM
to Python Programming for Autodesk Maya
I am trying to list the three most frequently occurring words in a text document. I have managed to distill the counts down to the top three, [13, 22, 24]. But for some reason my final print statement gives me four words instead of three, and not even in numerical order: [22, 22, 24, 13]. Could someone show me why this is happening?

I have attached the text file that I am reading from, called EinsteinCredo.txt.

''' Read this text file and return the top three most ocurring words '''

inFile = r'E:\ProfessionalDevelopment\python\Introduction to Python Scripting in Maya\week4\EinsteinCredo.txt'
wordList = []
occurences = []
with open(inFile, 'r') as fin:
    # removes the punctuation and splits the words into a list
    for line in fin:
        punct = ["'","?",".","!","?",",","\r\n","-"]
        for p in punct:
            line = line.replace(p,"").upper()
        line = line.split()
        for word in line:
            wordList.append(word)

# make a word count list
for x in wordList:
    occurences.append(wordList.count(x))

# make a dictionary of both the wordList and occurences
wordFrequencey = dict(zip(wordList,occurences))

# find the top three most occuring words
order = list(set(sorted(wordFrequencey.values())))

topThree = order[-3:]

# print the results
for k, v in wordFrequencey.items():
    if v in topThree:
        print 'the word " %s " occured %s times' % (k,v)

EinsteinCredo.txt

Justin Israel

Sep 26, 2017, 3:03:32 PM
to python_in...@googlegroups.com
Replying inline with comments and bugs...

On Wed, Sep 27, 2017 at 6:02 AM jettam <justin...@gmail.com> wrote:
I am trying to list the three most frequently occurring words in a text document. I have managed to distill the counts down to the top three, [13, 22, 24]. But for some reason my final print statement gives me four words instead of three, and not even in numerical order: [22, 22, 24, 13]. Could someone show me why this is happening?

I have attached the text file that I am reading from, called EinsteinCredo.txt.

''' Read this text file and return the top three most ocurring words '''

inFile = r'E:\ProfessionalDevelopment\python\Introduction to Python Scripting in Maya\week4\EinsteinCredo.txt'
wordList = []
occurences = []
with open(inFile, 'r') as fin:
    # removes the punctuation and splits the words into a list
    for line in fin:
        punct = ["'","?",".","!","?",",","\r\n","-"]

This has a duplicate for "?". You should define this once, outside the entire loop.
 

       
        for p in punct:
            line = line.replace(p,"").upper()

Just do the upper() once, before you split the string, instead of once for every time you replace punctuation.
 

        line = line.split()
        for word in line:
            wordList.append(word)

# make a word count list
for x in wordList:
    occurences.append(wordList.count(x))

Be aware that wordList can contain duplicate words, so you are computing and appending a count for the same word over and over.
 


# make a dictionary of both the wordList and occurences
wordFrequencey = dict(zip(wordList,occurences))

# find the top three most occuring words
order = list(set(sorted(wordFrequencey.values())))

This is a bug. You disassociate the values from the keys and rearrange them into a new list, which is also likely smaller than the key list, so none of the indices will match up any more. There is also no reason to call sorted() the way you are: a set is unordered, so passing the sorted list into set() discards that ordering again.
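To illustrate, here is a minimal sketch with made-up counts (the words and numbers are hypothetical, not taken from EinsteinCredo.txt). It suggests that set() drops the duplicate count and gives no ordering guarantee, so any sorting would have to happen after de-duplicating:

```python
# Hypothetical word counts, not taken from the real text file.
counts = {'AND': 24, 'THE': 22, 'OF': 22, 'TO': 13, 'A': 5}

# Sorting first gives an ordered list that still contains both 22s.
values = sorted(counts.values())

# Converting to a set collapses the two 22s into one and
# makes no promise about the order of what is left.
unique = set(values)

# If ordered, unique values are wanted, sort *after* building the set.
top_values = sorted(unique)[-3:]
```

Even then, top_values is only a list of numbers. Nothing in it records which words those numbers came from.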
 


topThree = order[-3:]

# print the results
for k, v in wordFrequencey.items():
    if v in topThree:
        print 'the word " %s " occured %s times' % (k,v)


This is the other part of the bug. Your approach of checking whether the count value is in your top 3 is inherently broken. What if different words have the same count, such as 22 in your example? You will end up printing every word whose count appears in topThree, however many there are.
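A small sketch of the tie problem, using invented counts in which two words share the count 22 (the data is an assumption for illustration only):

```python
# Hypothetical counts: 'THE' and 'OF' are tied at 22.
wordFrequencey = {'AND': 24, 'THE': 22, 'OF': 22, 'TO': 13, 'A': 5}
topThree = [13, 22, 24]

# Same test as the print loop: keep every pair whose count is in topThree.
matches = [(k, v) for k, v in wordFrequencey.items() if v in topThree]

# Three distinct counts, but four words carry them.
matched_counts = sorted(v for k, v in matches)
```

The four matches also come out in dictionary order rather than count order, which would explain output like [22, 22, 24, 13].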

Remember that other mail thread where I was suggesting that you just get rid of the whole multi-list and zip approach? It would be better to just build up a dictionary directly within that word loop. That way you have a unique mapping of words to their occurrences. Then you can use sorted(words.items(), words.get) in order to sort the words by their value, in reverse order. That resulting list will let you slice off the last three, which will be the (key, val) tuples. You will no longer have issues with managing separate key/value lists.

Let me know if you want the example, or if you just want to work through these suggestions on your own?
 

--
You received this message because you are subscribed to the Google Groups "Python Programming for Autodesk Maya" group.
To unsubscribe from this group and stop receiving emails from it, send an email to python_inside_m...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/python_inside_maya/0655c287-af16-4657-a808-e77d56d65ed3%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Justin Israel

Sep 26, 2017, 3:06:01 PM
to python_in...@googlegroups.com
I had a typo in this suggestion. I should have said to use:  sorted(words, key=words.get)
The result would be the keys sorted by their values. You can then slice off the last 3 and look up their occurrences again in the words dictionary.
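As a rough sketch of that suggestion, assuming a small hand-written words dictionary (the counts are invented): note that the key function has to be passed by keyword, because in Python 2 the second positional argument of sorted() is cmp, and in Python 3 key is keyword-only.

```python
# Hypothetical word -> count mapping.
words = {'AND': 24, 'THE': 22, 'OF': 22, 'TO': 13, 'A': 5}

# Sort the keys by their counts, lowest first.
ordered = sorted(words, key=words.get)

# The three most frequent words are now at the end of the list.
top_three = ordered[-3:]

# Look the counts up again in the dictionary, highest first.
for word in reversed(top_three):
    print('the word "%s" occurred %s times' % (word, words[word]))
```

Words tied on the same count keep whatever relative order the dictionary gave them, so with ties at the cutoff the choice of which word makes the top three is arbitrary.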

jettam

Sep 26, 2017, 8:02:36 PM
to Python Programming for Autodesk Maya
Thanks for your help Justin.  I would like an example.

"It would be better to just build up a dictionary directly within that word loop. That way you have a unique mapping of words to their occurrences. Then you can use sorted(words.items(), words.get) in order to sort the words by their value, in reverse order. That resulting list will let you slice off the last three, which will be the (key, val) tuples. You will no longer have issues with managing separate key/value lists. Let me know if you want the example"


Regarding the zipping of two lists to make a dictionary: I haven't noticed any disassociation of keys and values in the process. The length of the list did shrink, but only because the zipping process removes duplicates. So instead of seeing the word AND appear 24 times, it appears only once in the zipped dictionary, like this: {'AND':24}





Justin Israel

Sep 26, 2017, 10:56:16 PM
to python_in...@googlegroups.com
On Wed, Sep 27, 2017 at 1:02 PM jettam <justin...@gmail.com> wrote:
Thanks for your help Justin.  I would like an example.

Here is an example of the changes I had suggested:

occurences = {}
punct = set(["'", "?", ".", "!", ",", "\r\n", "-"])

with open(inFile, 'r') as fin:
    for line in fin:
        # strip the punctuation from the line
        for p in punct:
            line = line.replace(p, "")

        # count each upper-cased word
        for word in line.upper().split():
            try:
                occurences[word] += 1
            except KeyError:
                occurences[word] = 1

# sort the words by their counts, highest first
ordered = sorted(occurences, key=occurences.get, reverse=True)
topThree = ordered[:3]

for k in topThree:
    v = occurences[k]
    print 'the word " %s " occured %s times' % (k,v)
 

"It would be better to just build up a dictionary directly within that word loop. That way you have a unique mapping of words to their occurrences. Then you can use sorted(words.items(), words.get) in order to sort the words by their value, in reverse order. That resulting list will let you slice off the last three, which will be the (key, val) tuples. You will no longer have issues with managing separate key/value lists. Let me know if you want the example"


Regarding the zipping of two lists to make a dictionary: I haven't noticed any disassociation of keys and values in the process. The length of the list did shrink, but only because the zipping process removes duplicates. So instead of seeing the word AND appear 24 times, it appears only once in the zipped dictionary, like this: {'AND':24}

The zipping process doesn't remove duplicates. Repeated words collapse only because a dictionary keeps a single entry per key. Converting your list of values to a set is what removes the duplicate counts. Both calling sorted() and converting to a set() change the order of your values so that they no longer line up with the original keys. So you end up with an arbitrarily ordered list of values, and you can no longer map them back to the exact words, since different words can share the same count.
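A short sketch of where the duplicates actually disappear, using a tiny invented word list (an assumption for illustration, not the real file contents):

```python
# Hypothetical word list with 'AND' repeated three times.
wordList = ['AND', 'THE', 'AND', 'OF', 'AND']
occurences = [wordList.count(w) for w in wordList]

# zip() pairs the lists item by item and keeps every pair.
pairs = list(zip(wordList, occurences))

# dict() keeps one entry per key, so repeated words collapse here.
wordFrequencey = dict(pairs)

# set() keeps one entry per value, so equal counts collapse here,
# and the link back to the words is gone.
unique_counts = set(wordFrequencey.values())
```

So the dictionary itself is fine. The trouble starts once the counts are pulled out of it and de-duplicated on their own.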
 






jettam

Sep 27, 2017, 1:05:56 PM
to Python Programming for Autodesk Maya
This is great, thanks.   

I see in this section you are building the dictionary, assigning both keys and values, but I have some questions, see the inline comments:
        for word in line.upper().split():   # I get this, you split the words into a list
            try:
                occurences[word] += 1  # Looks like you are giving the dict keys. But I am not sure how you are also inserting values.
            except KeyError:  # Not sure what this is doing!
                occurences[word] = 1  # or this


Looks like you are sorting the occurences dict into descending order based on the values? So how are you telling sorted to look at the values to sort?
ordered = sorted(occurences, key=occurences.get, reverse=True)



Justin Israel

Sep 27, 2017, 2:34:30 PM
to Python Programming for Autodesk Maya


On Thu, Sep 28, 2017, 6:05 AM jettam <justin...@gmail.com> wrote:
This is great, thanks.   

I see in this section you are building the dictionary, assigning both keys and values, but I have some questions, see the inline comments:
        for word in line.upper().split():   # I get this, you split the words into a list
            try:
                occurences[word] += 1  # Looks like you are giving the dict keys. But I am not sure how you are also inserting values.

This is standard syntax for using a dictionary: you assign a value to a key. In this particular line, I am using the x += 1 form to add 1 to the value currently stored under that key. It is the equivalent of doing

occurences[word] = occurences[word] + 1


           
            except KeyError:  # Not sure what this is doing!
                occurences[word] = 1  # or this

If you try to access a key in a dictionary that does not exist, Python will raise a KeyError exception. So first I am trying to increment an existing key (making the assumption that we had already added the word before). If we have never added that word yet, we catch the error and just start the new value at 1. This is how you count up each time you see the same word. 

I could have avoided the exception by writing this a different way, where we actually check whether the key exists first:

if word in occurences:
    occurences[word] += 1
else:
    occurences[word] = 1

It has the same effect, but it needs an extra lookup of the key first to see whether it exists. The way I did it before uses the "easier to ask forgiveness than permission" approach.
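For comparison, here are two other common ways to write the same counting step, sketched as small helper functions. The function names and the sample sentence are made up for illustration; dict.get and collections.defaultdict are standard Python.

```python
from collections import defaultdict


def count_with_get(words):
    """Count words using dict.get with a default of 0 for unseen keys."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


def count_with_defaultdict(words):
    """Count words using a defaultdict that starts missing keys at int() == 0."""
    counts = defaultdict(int)
    for word in words:
        counts[word] += 1
    return dict(counts)


# Hypothetical sample input, upper-cased and split like the thread's code.
sample = 'the quick the lazy the'.upper().split()
get_result = count_with_get(sample)
defaultdict_result = count_with_defaultdict(sample)
```

Both should give the same mapping as the try/except version. Which one reads best is mostly a matter of taste.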



Looks like you are sorting the occurrences dict into a descending order based on the values?   So how are you telling sorted to look at the values fields to sort ? 
ordered = sorted(occurences, key=occurences.get, reverse=True)

By default, sort functions compare the items themselves when sorting. Here the items are the dictionary keys, and we don't want to sort by the keys. So sorted() accepts a key function that it calls on each item to produce the value to use for comparisons. Since a dictionary has a handy dict.get(key) method for getting the value for a key, we can just have it use that. Now it will sort the keys by looking up each one's value for the comparison.
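A minimal sketch of the difference, with invented counts (the dictionary contents are an assumption). The lambda version is only there to spell out what passing occurences.get does:

```python
# Hypothetical word -> count mapping.
occurences = {'A': 5, 'TO': 13, 'AND': 24, 'THE': 22}

# No key function: the words themselves are compared, so this is alphabetical.
by_word = sorted(occurences)

# key=occurences.get: each word is looked up and its count is compared instead.
by_count = sorted(occurences, key=occurences.get, reverse=True)

# The same thing with the lookup written out explicitly.
by_count_lambda = sorted(occurences, key=lambda word: occurences[word], reverse=True)
```

In every case the result is a list of keys. Only the value used to compare them changes.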



Michael Boon

Sep 27, 2017, 7:19:30 PM
to Python Programming for Autodesk Maya
I appreciate you're doing this as an exercise... but if you ever have to do it in real life, the Python standard library contains collections.Counter, which I think does exactly what you want, and probably much faster.
>>> # Find the ten most common words in Hamlet
>>> import re
>>> from collections import Counter
>>> words = re.findall(r'\w+', open('hamlet.txt').read().lower())
>>> Counter(words).most_common(10)
[('the', 1143), ('and', 966), ('to', 762), ('of', 669), ('i', 631),
 ('you', 554),  ('a', 546), ('my', 514), ('hamlet', 471), ('in', 451)]
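Adapted to the top-three question from this thread, that could look roughly like the sketch below. The top_words helper and the sample sentence are hypothetical, and the text is passed in as a string rather than read from EinsteinCredo.txt.

```python
import re
from collections import Counter


def top_words(text, n=3):
    """Return the n most common upper-cased words as (word, count) pairs."""
    # \w+ grabs runs of letters, digits and underscores, so punctuation
    # is skipped rather than removed character by character.
    words = re.findall(r'\w+', text.upper())
    return Counter(words).most_common(n)


# Hypothetical sample text.
sample = "The cat and the dog and the bird."
result = top_words(sample, 2)
```

One difference from the replace() approach: \w+ splits on apostrophes, so a word like "don't" would be counted as two separate words instead of being joined into one.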



jettam

Sep 28, 2017, 11:25:31 AM
to Python Programming for Autodesk Maya
Thank you again.  :) 