converting cell value to string problems

11,526 views
Skip to first unread message

Aron Kisdi

unread,
Dec 5, 2011, 1:51:35 PM12/5/11
to openpyx...@googlegroups.com
Hello,
 
I am trying to read a large workbook and manipulate cells if they have a certain phrase. I am having some encoding issues. The worksheet contains some non-ascii characters (extended ascii), so when I try to convert cell.value to str I get a UnicodeEncodeError. What can I do?
 
I tried all sort of things but it seems that cell.value is NoneType. Is this right?
 
Example:
 
 
wb = load_workbook(filename = r'example.xlsm')
ws = wb.get_sheet_by_name(name = 'sheet1')
col = "B"
for row in xrange(1, ws.get_highest_row()):
 text_in_cell = ws.cell('%s%s'%(col, row)).value
 text_in_cell = str(text_in_cell)
 if text_in_cell[0:3] == "EIDA":
  print "do stuff"
 
Any ideas?
 
Thanks,
 
Aron
 
 

Aron Kisdi

unread,
Dec 6, 2011, 5:05:34 AM12/6/11
to openpyx...@googlegroups.com
Short example attached. I read that openpyxl read cells as unicode, what am I doing wrong?
Also I noticed that when I save the manipulated result like this all the formatting is lost... How can I manipulate a file and preserve the formatting?
 
Many Thanks,
 
Aron

encoding and formating bug.py
encoding_bug.xlsx

sjma...@lexicon.net

unread,
Dec 6, 2011, 2:57:16 PM12/6/11
to openpyx...@googlegroups.com


On Tuesday, 6 December 2011 21:05:34 UTC+11, Áron wrote:
Short example attached. I read that openpyxl read cells as unicode, what am I doing wrong?

See below.
 
Also I noticed that when I save the manipulated result like this all the formatting is lost... How can I manipulate a file and preserve the formatting?

Please ask a separate question.
 
On 5 December 2011 18:51, Aron Kisdi <kis...@gmail.com> wrote:
Hello,
 
I am trying to read a large workbook and manipulate cells if they have a certain phrase. I am having some encoding issues. The worksheet contains some non-ascii characters (extended ascii), so when I try to convert cell.value to str I get a UnicodeEncodeError. What can I do?
 
I tried all sort of things but it seems that cell.value is NoneType. Is this right?
 
Example:
 
 
wb = load_workbook(filename = r'example.xlsm')
ws = wb.get_sheet_by_name(name = 'sheet1')
col = "B"
for row in xrange(1, ws.get_highest_row()):
 text_in_cell = ws.cell('%s%s'%(col, row)).value
 text_in_cell = str(text_in_cell)
 if text_in_cell[0:3] == "EIDA":
  print "do stuff"
 
Any ideas?
 


It is correct that openpyxl produces a unicode object when the cell contains text. Change your sample script to include the line

print row, type(text_in_cell), repr(text_in_cell)

just before the line containing the str() call. You will see that cell B4 contains

u'the next character is from extended ascii \u2265'

The error message from the next line (which you should have included in your first message) is

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2265' in position 42: ordinal not in range(128)

This is quite expected. You are trying to bash into ASCII a character that can't be represented in ASCII.

The short answer to your problem is "Don't do that". Just omit the attempted str conversion.

Longer answer:

In general: get your input text data into Python unicode objects. (Any Excel file reader does this for you, without being told to).

Work in unicode. This includes using unicode literals u"etcetc" for at least all literals that won't fit in ASCII. You should really use unicode literals for all text literals if you are planning to use Python 3.X some time soon (if so, read http://python3porting.com/toc.html).

When it comes to output, ensure that the unicode objects are encoded in a manner appropriate to the receiver. Any Excel file writer does this for you, without being told to. Likewise any XML file writer. You will have trouble writing to sys.stdout if your script is running in a Command Prompt window on Windows if your text includes characters that can't be encoded in a legacy encoding e.g. cp850 -- this is a wholly separate issue, and doesn't obviate the general principles.

Hints:

1. Use repr() for debugging -- see above.

2.

| >>> import unicodedata
| >>> unicodedata.name(text_in_cell[42])
| 'GREATER-THAN OR EQUAL TO'
| >>>

3. Read these:

| http://www.joelonsoftware.com/articles/Unicode.html

| http://docs.python.org/howto/unicode.html

Hope this helps,
John





Aron Kisdi

unread,
Dec 6, 2011, 3:55:26 PM12/6/11
to openpyx...@googlegroups.com
Thank you for the excellent answer, now I understand what was going on and the article on unicode was very useful.
 
First I did it without the string conversion which I added to avoid some errors but at the same time causing more...
 
For the benefit of others I learnt the following things as well (correct me if I'm wrong on any)
 
An empty cell will be NoneType so adding something like
 
if value: (which is the same as if not value == None:)
where value is cell.value
is useful if you want to manipulate the string inside a cell.
 
if you want to change the type of all cell to strings (including excel time datetime.datetime) you can use
unicode(value)
 
Thanks for taking the time to answer my question, I'm sure the reply will be useful for many people to come, it certainly was for me.
 
Thanks,
 
Aron

James Parris

unread,
Aug 28, 2012, 2:37:58 PM8/28/12
to openpyx...@googlegroups.com
I had a similar issue, so rather than create my own post I thought I'd bump this one.  My problem is many ascii characters don't have equivalents in the proprietary version of latin-1 that AIX uses.  So my reporting jobs were choking whenever someone tried to put in a ® or a © or a ™. 

Here is the fix I applied to openpyxl to get this to work:

In:
/openpyxl-1.5.8-py2.6.egg/openpyxl/cell.py
Lines 199,200.

I changed this:

        if not isinstance(value, unicode):
            value = unicode(value, self.encoding)

To this:

        if not isinstance(value, unicode):
            value = unicode(value, self.encoding, errors='replace')

That will continue the process while switching out the unreadable characters with this thing:  �  (u'\ufffdabc').  Since I am just creating internal reports, my customers don't care if they see "Store Visa�" rather than "Store Visa™"

I hope that helps.
Reply all
Reply to author
Forward
Message has been deleted
0 new messages