Speed of dictionary in wordcount.py exercise


Geoffrey Eisenbarth

Sep 22, 2014, 10:45:39 AM
to python-g...@googlegroups.com
When I first approached the wordcount exercise, I had defined my wordcount dictionary as such:

def word_count_dict(filename):
  file = open(filename, 'rU')

  words = []  # list of words in file
  dict = {}   # returns how many times the word has been used

  words = file.read().split()
  file.close()
  words = [word.lower() for word in words]
  words = sorted(words)

  for word in words:
    dict[word] = words.count(word)
  return dict

I then realized that the for loop was looping over duplicates, so I looked around and found out about the set() function, which gets rid of duplicates and should speed up my code. Even then, it takes a bit longer than the solution provided, which seems a bit more "brute-force" to me (quoted below). I was hoping to get some insight into why the solution code is so much quicker.

def word_count_dict(filename):
  word_count = {}  # Map each word to its count
  input_file = open(filename, 'r')
  for line in input_file:
    words = line.split()
    for word in words:
      word = word.lower()
      # Special case if we're seeing this word for the first time.
      if not word in word_count:
        word_count[word] = 1
      else:
        word_count[word] = word_count[word] + 1
  input_file.close()  # Not strictly required, but good form.
  return word_count

Robert Mandić

Sep 22, 2014, 4:51:19 PM
to python-g...@googlegroups.com
def get_trimmed(word):
    chars = [',','.','!','?','"',"'",':',';','-','`',')']
    for c in chars:
        word = word.replace(c,'')
    return word.lower()

def get_words(filename):
    d = {}
    for line in open(filename).read().split():
        word = get_trimmed(line)
        d.setdefault(word, 0)
        d[word] += 1
    return d

This was my solution when doing the class :D
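The setdefault(word, 0) call is the trick here: it inserts a zero count only when the word is new, so the increment always works and there's no need for a "first time we see this word" special case. A minimal illustration:

```python
counts = {}
for word in ['to', 'be', 'or', 'not', 'to', 'be']:
    counts.setdefault(word, 0)  # create the key with value 0 only if missing
    counts[word] += 1

print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```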





--
Lp, Robert

Geoffrey Eisenbarth

Sep 22, 2014, 4:54:47 PM
to python-g...@googlegroups.com
Yeah, I just figured using the .count method would be quicker than iterating over the words, but it doesn't seem to be so. Wasn't sure if anyone could provide insight as to why.

Thanks for the response, though!


fracjackmac

Sep 30, 2014, 4:51:19 PM
to python-g...@googlegroups.com
Hello Geoffrey,

Do the math on your code to determine the number of times you're handling the ENTIRE word list: 1) read the list in, 2) convert each word to lower case, 3) sort the list, 4) count the occurrences of each word.  Each trip through the list takes at a minimum some time X, so you're looking at a time of at least 4X to get through your code.

As you discovered, the 4th item was adding extra time to your original computation because you made N assignments to the dictionary, where N is the total number of words in the list. Switching to the 'set' object (I assume your loop now looks like 'for word in set(words)') ensures that you make only a single assignment for each unique word, which saves considerable time when many words appear multiple times in the list.

Now, let's do the math on the solution provided: The entire word list is handled just once, so time of at least 1X to get through the list. While there are more "assignment" operations with this code compared to your modified version using the 'set' object, there is less time consumed churning through the list.
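To make the difference concrete, here's an illustrative timing sketch; the word list and sizes are made up for demonstration, not taken from your file:

```python
import timeit

# Synthetic word list: 200 unique words, each appearing 50 times (10,000 total).
words = ['word%d' % (i % 200) for i in range(10000)]

def count_with_list_count(words):
    # One full scan of the list per unique word: roughly O(N * U) work.
    counts = {}
    for word in set(words):
        counts[word] = words.count(word)
    return counts

def count_single_pass(words):
    # One scan total: O(N) work.
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

t1 = timeit.timeit(lambda: count_with_list_count(words), number=10)
t2 = timeit.timeit(lambda: count_single_pass(words), number=10)
print('list.count: %.3fs  single pass: %.3fs' % (t1, t2))
```

Both functions produce identical dictionaries; the single-pass version just touches the list far fewer times.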

One observation that should help your version run faster: eliminate the 'sort' step. That may cut the work by as much as 25%, possibly more. The only real value the sort adds is helping identify unique words, and the 'set' object already does that auto-magically behind the scenes.


fjm

Geoffrey Eisenbarth

Oct 1, 2014, 10:00:14 AM
to python-g...@googlegroups.com
Thanks for the step-by-step help; now that you point it out, it's quite obvious to me! Good point about sorted() as well: since I'm putting the words into a dictionary (and using set()), their order doesn't matter. Because I eventually needed to print the words in order by frequency, I had it in my mind that I needed to sort first, but that step should come when pulling them out of the dictionary.
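Something like this is what I mean by sorting at output time (the dictionary contents here are just made-up sample data):

```python
word_count = {'the': 12, 'cat': 3, 'sat': 5}

# Sort the (word, count) pairs by count, highest first, only when printing.
by_frequency = sorted(word_count.items(), key=lambda item: item[1], reverse=True)
for word, count in by_frequency:
    print(word, count)
```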

gbe
