counting occurrences of a string in excel file, using python?

4,568 views
Skip to first unread message

notsonewdetective

unread,
Aug 30, 2011, 9:11:12 AM8/30/11
to python-excel
Hello!

I'm new to Python and loving it. It's the firs programming language
that I can somewhat comprehend.
I've got to do some work with some excel files. Work which involves a
lot of reading and noting down the occurrences of certain words. So I
will need to write a script that does it for us.

I've just today figured out how I can write to, and read from excel
files, using python. But since I am still so very new; I have no idea
how to make a simple count of the occurences of a string inside of
excel. I know that in regular, python allows you to do this via the
[variablename].count But other than that, I am completely lost.


I use xlwt and xlrd to read and write to excel files. And I seem to
be able understand this particular package.


Below is the code that will let me write to excel files ( i got this
code from a tutorial and I think I understand it better than most of
the others that I have come across):
------------------------------------------------------------------------------------------------------------------------------------

import xlwt

book = xlwt.Workbook(encoding="utf-8")

sheet1 = book.add_sheet("Python Sheet 1")
sheet2 = book.add_sheet("Python Sheet 2")
sheet3 = book.add_sheet("Python Sheet 3")

sheet1.write(0, 0, "This is the First Cell of the First Sheet")
sheet2.write(0, 0, "This is the First Cell of the Second Sheet")
sheet3.write(0, 0, "This is the First Cell of the Third Sheet")
sheet2.write(1, 10, "This is written to the Second Sheet")
sheet3.write(0, 2, "This is part of a list of information in the Third
Sheet")
sheet3.write(1, 2, "This is part of a list of information in the Third
Sheet")
sheet3.write(2, 2, "This is part of a list of information in the Third
Sheet")
sheet3.write(3, 2, "This is part of a list of information in the Third
Sheet")

book.save("python_spreadsheet.xls")


------------------------------------------------------------------------------------------------------------------------------------------

And here is the code that I wil use to read from the excel files:



------------------------------------------------------------------------------------
import xlrd

book = xlrd.open_workbook("python_spreadsheet.xls")

for sheet_name in book.sheet_names():
sheet = book.sheet_by_name(sheet_name)
print sheet.row_values(0)[0]

-------------------------------------------------------------------------------


Now, I just have no idea how I can use this FOR loop to count the
occurrences of a particular string such as 'corrupted'. And write it
to another excel file.

Any help would be much appreciated!

John Machin

unread,
Aug 30, 2011, 5:26:58 PM8/30/11
to python...@googlegroups.com
Read the tutorial that you can access via www.python-excel.org. As well as iteration over sheets, you need to iterate over rows and columns. Test each cell to see if it is a text cell and if so, test to see if the cell's value contains your target string and if so, add 1 to a counter.

You seem to want to then write 1 value (the count) to an Excel file however you have demonstrated this already; what's your problem with writing?

derek

unread,
Aug 31, 2011, 2:17:44 AM8/31/11
to python-excel
The following code will work, assuming that each cell contains an
entire string that you need to search for:

book = xlrd.open_workbook("python_spreadsheet.xls")
values = []
for sheet in book.sheets():
print sheet.name
for row in range(sheet.nrows):
for col in range(sheet.ncols):
values.append(sheet.cell(row,col).value)
print values.count("This is part of a list of information in the Third
Sheet")

Of course, if you need to search for individual words, it can be more
tricky ... because then you first need to decide what exactly
constitutes "a word". Python will, of course, allow you to split up a
string very easily (see http://docs.python.org/library/stdtypes.html#str.split
and similar functions).

Enjoy Python - its a great language!

Darryl Klassen

unread,
Aug 31, 2011, 11:09:25 AM8/31/11
to python-excel
notsonewdetective,

I think i would simply create a dictionary that contains the key
values of the words/phrases you are looking for and values of the
occurrences.

i.e.

test_dict = {'word1':0,'word2':0}

And then to populate it - each time you come across a word in the
dictionary - add 1 to the count.

test_dict['word1'] = test_dict['word1'] + 1

for you example you could write an IF statement:

if sheet.row_values(0)[0] in test_dict:
test_dict[sheet.row_values(0)[0]] =
test_dict[sheet.row_values(0)[0]] + 1

I too am fairly new to Python and working with XLS files, but this
method should work.

I hope this helps.

Gerard Blais

unread,
Aug 31, 2011, 11:39:17 AM8/31/11
to python...@googlegroups.com
Great start!   Might as well use a defaultdict, while you're at it:

import collections

test_dict = collections.defaultdict(int)

and for each cell value:

   test_dict[value] += 1

Gerry

--
You received this message because you are subscribed to the Google Groups "python-excel" group.
To post to this group, send an email to python...@googlegroups.com.
To unsubscribe from this group, send email to python-excel...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/python-excel?hl=en-GB.




--

John Machin

unread,
Aug 31, 2011, 4:39:42 PM8/31/11
to python...@googlegroups.com
-1 Building a list of all values in the spreadsheet just so you can use the list.count method is more than a trifle excessive.

John Machin

unread,
Aug 31, 2011, 4:51:26 PM8/31/11
to python...@googlegroups.com
Your code would appear to be providing a count of the number of occurences of ALL unique values in the spreadsheet.

Providing a count of the number of occurrences of the items in a provided set or list doesn't need defaultdict:

import xlrd
def count_occurrences(filename, queries):
    book = xlrd.open_workbook(filename)
    tally = dict((q, 0) for q in queries)
    for sheet in book.sheets():
        for rowx in xrange(sheet.nrows):
            for value in sheet.row_values(rowx):
                if value in tally:
                    tally[value] += 1
    return tally

Adnan Omar

unread,
Sep 1, 2011, 3:24:57 AM9/1/11
to python...@googlegroups.com
Thanks for the help everyone!

I'll try and get to work on this in the weekend. Should see a few replies from me then.


Ciao. 

--
You received this message because you are subscribed to the Google Groups "python-excel" group.
To view this discussion on the web, visit https://groups.google.com/d/msg/python-excel/-/-QK0ilWQHw4J.

Adnan Omar

unread,
Sep 8, 2011, 4:09:59 PM9/8/11
to python...@googlegroups.com
Hello everyone!
I have finally gotten back. I had internet access, but couldn't really do a lot in terms of Programming since I did not have my computer with me. 

Thanks to your help. I have managed to understand how I can look for a specific string and count it, and print that count unto a separate file! Absolutely grateful for this.

I have one problem left.
How can I make it so that the cells are written two, without me specifying manually what rows and columns to use?
Below is my complete code: 

---------------------------------------------------------
import xlrd
import xlwt

program = 'on'

while program == 'on':
      filename = raw_input("What file ")
      
      book = xlrd.open_workbook(filename)
      values = []
      for sheet in book.sheets():
          print sheet.name
          for row in range(sheet.nrows):
              for col in range(sheet.ncols):
                  values.append(sheet.cell(row,col).value)
      print values.count('Consecutive spaces')
      doublespaces = values.count('Consecutive spaces')
      corruptedchar = values.count('Corrupt character')
      unexpected = values.count('Unexpected # translations')
      print unexpected
      unexpected = unexpected - 1
      corruptedchar = corruptedchar - 1 
      doublespaces = doublespaces  - 1 
      lang = filename
      book = xlwt.Workbook(encoding="utf-8")
      
      sheet1 = book.add_sheet("ERRORSHEET")

      sheet1.write(0, 0, "LANGUAGE")
      sheet1.write(0, 1, "Double Spaces")
      sheet1.write(0, 2, "Corrupted Characters")
      sheet1.write(0, 3, "Unexpected Duplicate translation")
      sheet1.write(1, 0, lang)
      sheet1.write(1, 1, doublespaces)
      sheet1.write(1, 2, corruptedchar)
      sheet1.write(1, 3, unexpected)
      


      program = raw_input("On or Off")
 
 
 
 
book.save("errors.xls")
---------------------------------------------------- 



ignore the lines where some variables (doublespaces etc.) are have  -1 taken from them. I do that so as to avoid the fact that there is always an extra occurrence that is not really an error in all of the files I process ( it's a place in the excel file where the errors are defined before they are printed by the program that creates them)
This completely irrelevant, so just ignore it!

So, as said before. How can I get the program to write to cells without me specifying what cell to write to? Sort of like how you can use .append to just add something to a file on a new line, without overriding it.
 Because I want the script to read another excel file, and just take the values for ( doublespaces,unexpected,corruptedchar) in that file, and just put them underneath the column headers I set already, without overriding the values I had already collected from processing the first excel file. So just as I said above, how can I append them to the sheet?

Hope you guys get to read this!

Truly greatful I am :) 

-Adnan
Message has been deleted

John Machin

unread,
Sep 8, 2011, 5:23:07 PM9/8/11
to python...@googlegroups.com
On Fri, September 9, 2011 6:09 am, Adnan Omar wrote:

> I have one problem left.
> How can I make it so that the cells are written two, without me specifying
> manually what rows and columns to use?

Untested:

import xlrd
import xlwt
import sys

obook = xlwt.Workbook() # don't need encoding="utf-8"
# Your code won't work -- you use "book" for both input and output.
sheet1 = obook.add_sheet("ERRORSHEET")


sheet1.write(0, 0, "LANGUAGE")
sheet1.write(0, 1, "Double Spaces")
sheet1.write(0, 2, "Corrupted Characters")
sheet1.write(0, 3, "Unexpected Duplicate translation")

# get filenames from the command line
for outrowx, filename in enumerate(sys.argv[1:], 1):


book = xlrd.open_workbook(filename)
values = []
for sheet in book.sheets():

for rowx in xrange(sheet.nrows):
values.extend(sheet.row_values(rowx))


doublespaces = values.count('Consecutive spaces')
corruptedchar = values.count('Corrupt character')
unexpected = values.count('Unexpected # translations')

lang = filename
for colx, value in enumerate((lang, doublespaces - 1, corruptedchar -
1, unexpected - 1)):
sheet1.write(outrowx, colx, value)

obook.save("errors.xls")

Adnan Omar

unread,
Sep 9, 2011, 6:23:40 AM9/9/11
to python...@googlegroups.com
 I am very sorry for this.

I am having a hard time understanding some of what you did there. It's completely my fault. I wanted to get this finished and go over the code and understand what exactly I had done. 
Whenever I am making this. I have a hard time when it comes to for loops. 
I don't think I understand the general layout of for loops. Even though i have read countless tutorials.
I understand for loops that work like this ( very simple ones) 

li = ['a', 'b', 'e']
for s in li:
print s

I understand that what I am doing here is looping through the list. And then assigning 's' to each value in this list, and then printing each value.
But beyond this, I have no idea how for loops work. Particularly, when I come across something like this: 

 for rowx in xrange(sheet.nrows):
           values.extend(sheet.row_values(rowx))

Xrange ( or any range) confuses me. And I don't understand how it takes these arguments, or where these arguments are coming from , and what object they can be ( functions? variables?)  ( sheet.nrows) for instance. 

And this line:
____________________________________
  values.extend(sheet.row_values(rowx))
____________________________________

I don't get it at all! 

I would really appreciate it if someone could let me in how for loops work in this regard. I think it's the most difficult thing for me to understand yet. And it also happens to be the most useful.

I have also noticed that I  have some difficulties with with the rules regarding methods,  and how to use methods, particularly when using methods from modules ( which I will try and read up on today) 
 


--
You received this message because you are subscribed to the Google Groups "python-excel" group.

John Yeung

unread,
Sep 9, 2011, 9:04:22 AM9/9/11
to python...@googlegroups.com
> I don't think I understand the general layout of for loops. Even
> though i have read countless tutorials.

It is hard for me to believe you have read countless *Python*
tutorials if you are confused by for loops and range. You said
earlier that Python is the first programming language you "can
somewhat comprehend" so I get that you're not a programmer.

The questions you are asking here are very basic, fundamental Python
questions and belong on a list or forum for Python beginners or
programming beginners, not on a list specifically for working with
Excel using Python.

I recommend reading through the official Python tutorial, which is
part of the standard documentation at www.python.org (or, if you are
using Windows, the same material is already on your computer,
reachable from the Start menu, listed as "Python Manuals" in the same
place that you would invoke the Python command-line or GUI).

If that tutorial is too advanced and programmer-oriented for you, I
recommend joining the python-tutor mailing list:

http://mail.python.org/mailman/listinfo/tutor

There you will find lots of helpful, patient Python programmers who
are accustomed to and willing to explain things to beginners.

Also check out

http://wiki.python.org/moin/BeginnersGuide/NonProgrammers

(That last link is also mentioned in the python-tutor info page.)

Good luck!

John Y.

Adnan Omar

unread,
Sep 9, 2011, 9:34:42 AM9/9/11
to python...@googlegroups.com
This is exactly what I needed. Thanks a lot John : ) 







--
Reply all
Reply to author
Forward
0 new messages