Not able to extract strings containing apostrophe or single quotation mark of a excel cell

5,369 views
Skip to first unread message

pradnya

unread,
Nov 7, 2011, 8:08:14 AM11/7/11
to openpyxl-users
hi ,
i have used openpyxl module to import excel data into database, data
contains apostrophe mark and other punctuations marks .but when tried
to run i got error
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019'
in position
4: character maps to <undefined>

when tried to find this python code
import unicodedata
unicodedata.name(u'\u2019')
i get : 'RIGHT SINGLE QUOTATION MARK'

is there some problem data in the cell or are there any issues with
parsing of data in openpyxl module

Sample data giving me error is :The Banker's algorithm is used

Eric Gazoni

unread,
Nov 9, 2011, 3:50:41 AM11/9/11
to openpyx...@googlegroups.com
Hello,
maybe you could use the encoding parameter when creating your workbook:

wb = openpyxl.workbook.Workbook(encoding='utf-8') # replace utf-8 by the
encoding you're actually using in your database ;-)

By default, it's using utf-8, and make sure you've got the latest
version of openpyxl, as encoding was only added recently.

Cheers,
Eric


Le 7/11/11 14:08, pradnya a �crit :

Brent Hoover

unread,
Nov 9, 2011, 8:37:34 AM11/9/11
to openpyx...@googlegroups.com
It's also possible that the spreadsheet is using 'latin-1' encoding rather than utf8. This is the default encoding for Windows and the right quotation mark would be an example of a character that would be not be the same between the two character sets.

I can give you a better answer later today once I have a chance to look at the code again.

Brent Hoover
Computer Scientist
323-687-3265

pradnya

unread,
Nov 11, 2011, 1:08:35 AM11/11/11
to openpyxl-users
hi,
Thanks for your reply.
But Eric the version i am using is 1.5.6 and i m not creating object
of workbook
here is the code

from openpyxl.reader.excel import load_workbook
def main():
wb = load_workbook(filename = u'test.xlsx')
for sheet in wb.worksheets:
for row in sheet.rows:
for cell in row:
print cell.value

and the xlsx file contains two rows as

1 Subject Test 1
2 Question What is the banker’s algorithm ?

and the error is :

Traceback (most recent call last):
File "test.py", line 9, in <module>
main()
File "test.py", line 7, in main
print cell.value
File "C:\Python26\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019'
in position
18: character maps to <undefined>

Eric Gazoni

unread,
Nov 11, 2011, 8:21:02 AM11/11/11
to openpyx...@googlegroups.com
Ok, I see what's the issue is:

have a look there :
http://stackoverflow.com/questions/3224268/python-unicode-encode-error

When you're printing cell.value, actually you're trying to decode the
string contained inside value using your system default encoding. Try to
do "print cell.value.decode('utf-8')" if your workbook is in utf-8 for
example.

Cheers,
Eric

Le 11/11/11 07:08, pradnya a �crit :


> hi,
> Thanks for your reply.
> But Eric the version i am using is 1.5.6 and i m not creating object
> of workbook
> here is the code
>
> from openpyxl.reader.excel import load_workbook
> def main():
> wb = load_workbook(filename = u'test.xlsx')
> for sheet in wb.worksheets:
> for row in sheet.rows:
> for cell in row:
> print cell.value
>
> and the xlsx file contains two rows as
>
> 1 Subject Test 1

> 2 Question What is the banker�s algorithm ?

pradnya

unread,
Nov 12, 2011, 1:52:49 AM11/12/11
to openpyxl-users
Thanks Eric,
cell.value.encode('ascii','ignore')
solved the issue little bit
the output i m getting for What is the banker's algorithm ?
is What is the bankers algorithm ?

Is there any way to get the apostrophe ?

Eric Gazoni

unread,
Nov 12, 2011, 4:58:22 AM11/12/11
to openpyx...@googlegroups.com
So, here you're asking to display the content using 'ascii' character
map, and to 'ignore' what's not in the character map (that's why it
displays the regular alphabet letters, but not the apostrophe that was
causing issue since the begining).

To get the apostrophe, I'd suggest to use a different encoding, that
contains the apostrophe. That's why it's important to know what is the
original encoding of your file, usually in europe, you can try 'latin1'
or 'utf-8' and get lucky.

Cheers,
Eric

Le 12/11/11 07:52, pradnya a �crit :

sjma...@lexicon.net

unread,
Nov 13, 2011, 3:59:24 PM11/13/11
to openpyx...@googlegroups.com
An XLSX file is a zipped collection of XML documents. An XML parser will return text in unicode objects. What encoding was used in the XML document is utterly irrelevant.

Your problem is nothing to do with openpyxl at all. Your problem can be summarised in the following one-line script:

print u'\u2019\n

You are currently running this in a Command Prompt window on a Windows machine. sys.stdout.encoding is set to the antique MS-DOS cp437. Result: splat.

Option 1: Run your script in an environment (e.g. IDLE) where writing unicode objects to stdout produces a better outcome.

Option 2: Substitute replacement characters before you write to stdout. You are likely to find other non-cp437-able characters in other workbooks. You may want to use the unicode.translate method (http://docs.python.org/library/stdtypes.html#str.translate) instead of just your_unicode_object.replace(u"\u2019", u"'")

sjma...@lexicon.net

unread,
Nov 13, 2011, 9:26:01 PM11/13/11
to openpyx...@googlegroups.com

On Thursday, November 10, 2011 12:37:34 AM UTC+11, Brent Hoover wrote:
It's also possible that the spreadsheet is using 'latin-1' encoding rather than utf8. This is the default encoding for Windows and the right quotation mark would be an example of a character that would be not be the same between the two character sets.

 What on earth does "the spreadsheet is using XYZ encoding" mean? If a workbook is "in" anything, it is Unicode, which is not an encoding.
 
latin-1 is not the default encoding for windows. The default depends on the locale of the individual PC ... cp1252 for Western European languages, cp1251 for languages written in Cyrillic script, cp1250 for Eastern European languages, plus others: one each for Greek/Turkish/Arabic/Hebrew/Vietnamese/Japanese/Korean and 2 for Chinese (Traditional & Simplified).
 
In any case none of this has anything to do with the OP's problem -- see my other answer.
Reply all
Reply to author
Forward
0 new messages