Data extraction problem using xlrd

vandevel

unread,

May 5, 2009, 4:23:15 PM5/5/09

to python-excel

Hello,

I am experiencing a problem reading an excel file using xlrd; it seems
that strings are being converted to floats.
For example, the following line is explicitly formatted as text in the
spreadsheet as:

21116,1952,42104,0121.1,241.9,S45.03,350.02

However, once this line is parsed by xlrd, values such as 21116 end up
being converted to 21116.0(once again despite explicit text
formatting). Using the same spreadsheet in a test, I was able to
retain the formatting in the aforementioned line when using the
pyexcelerator xls2csv script as well as gnumeric's ssconvert utility.
Since those tools were able to perform a seemingly simple conversion,
I am wondering if there is a deficiency
in the data type detection facility of xlrd or if there is something
that I am overlooking.

Any advice would be appreciated!

Daniel Burke

unread,

May 5, 2009, 8:03:07 PM5/5/09

to python...@googlegroups.com

You can not "explicitly format it as text", excel will treat any string that only contains numbers as a number, regardless of what formats are applied to it. (most of the time)

If you want it to stay as a string, the canonical thing to do is prefix the string with an apostrophe.

This is a feature of Excel, not a bug in xlrd.

--Dan
"Beware of bugs in the above code; I have only proved it correct, not tried it." -- D. Knuth

John Machin

unread,

May 5, 2009, 8:37:17 PM5/5/09

to python...@googlegroups.com

On 6/05/2009 6:23 AM, vandevel wrote:
> Hello,

Hi vandevel,
Welcome to the group.

>
> I am experiencing a problem reading an excel file using xlrd; it seems
> that strings are being converted to floats.

Things are often not what they seem, and this is one of those cases.
xlrd definitely does not convert strings to floats.

> For example, the following line is explicitly formatted as text in the
> spreadsheet as:
>
> 21116,1952,42104,0121.1,241.9,S45.03,350.02

For a start, Excel doesn't format "lines", it formats cells
individually. The result of the formatting is displayed on the user's
screen; the result does not exist *in* the spreadsheet file.

> However, once this line is parsed by xlrd, values such as 21116 end up
> being converted to 21116.0(once again despite explicit text
> formatting).
>
> Using the same spreadsheet in a test, I was able to
> retain the formatting in the aforementioned line when using the
> pyexcelerator xls2csv script as well as gnumeric's ssconvert utility.

Perhaps we have a different understanding of "text formatting". Perhaps
you believe that all the cell values in the spreadsheet are strings, not
floats. Firstly obviously S45.03 can't be a float [unless it's a typo;
if so do yourself and the rest of the world a favour by using copy/paste
when reporting such things]. Less obviously 0121.1 must be a string;
if it were a float value with leading-zero display formatting in Excel,
there is no way that pxExcelerator's xls2csv utility would reproduce the
leading zero.

If the first cell displays as 21116.0 after you have extracted it using
xlrd, then it must be stored in the spreadsheet file as a float.

You can see for yourself ... assuming your data is in Excel's row 1,
type this formula =cell("type",a1) into cell A2 and than copy it across.
From the Excel help: """
"type" Text value corresponding to the type of data in the cell. Returns
"b" for blank if the cell is empty, "l" for label if the cell contains a
text constant, and "v" for value if the cell contains anything else.
"""

If you haven't been changing the default horizontal alignment, increase
all the column widths; text values will cling to the left and numeric
values will cling to the right.

> Since those tools were able to perform a seemingly simple conversion,

Huh? I thought you didn't want any conversion!

As an accidental consequence of the way Excel compresses floats when
writing them to the file, pyExcelerator returns a Python int object
instead of a Python float object for *some* Excel float values that can
be represented as integer values. This is why you are seeing 21116
instead of 21116.0 in the xls2csv output. Actually you are seeing
"21116" because the xls2csv wraps quotes around everything ... possibly
giving casual observers the impression that it was a string right from
the beginning.

pyExcelerator's xls2csv is doing these further transformations on each
value retrieved:
v = values[(row_idx, col_idx)]
if isinstance(v, unicode):
v = v.encode('cp866', 'backslashreplace')
else:
v = `v`
v = '"%s"' % v.strip()
Encoding text using an IBM PC-DOS Russian codec?
Anyone under age 30 know what `v` does without looking it up in the manual?
Stripping leading and trailing whitespace is sound forensic practice?

> I am wondering if there is a deficiency
> in the data type detection facility of xlrd or if there is something
> that I am overlooking.

Which do you think has the greater likelihood?

Cheers,

John

Chris Withers

unread,

May 6, 2009, 3:48:50 AM5/6/09

to python...@googlegroups.com

John Machin wrote:
> Anyone under age 30 know what `v` does without looking it up in the manual?

Ooo oo, I know this one...

Aw, shucks, I don't meet the criterea...

I think I'm right in thinking that this little gem goes away in Python 3
and you have to use the (more sane) repr(v) instead :-)

Chris

--
Simplistix - Content Management, Zope & Python Consulting
- http://www.simplistix.co.uk

Wouter van Heyst

unread,

May 7, 2009, 3:14:25 AM5/7/09

to python...@googlegroups.com

On Wed, May 06, 2009 at 08:48:50AM +0100, Chris Withers wrote:
>
> John Machin wrote:
> > Anyone under age 30 know what `v` does without looking it up in the manual?
>
> Ooo oo, I know this one...
>
> Aw, shucks, I don't meet the criterea...
>
> I think I'm right in thinking that this little gem goes away in Python 3
> and you have to use the (more sane) repr(v) instead :-)

I knew that, and meet the criteria, woo! ;)

Wouter van Heyst

Chris Withers

unread,

May 7, 2009, 5:32:10 AM5/7/09

to python...@googlegroups.com

Wouter van Heyst wrote:
>> I think I'm right in thinking that this little gem goes away in Python 3
>> and you have to use the (more sane) repr(v) instead :-)
>
> I knew that, and meet the criteria, woo! ;)

Excellent, your prize is that you now owe everyone over 30 on the list a
beer ;-)

Wouter van Heyst

unread,

May 7, 2009, 6:03:48 AM5/7/09

to python...@googlegroups.com

On Thu, May 07, 2009 at 10:32:10AM +0100, Chris Withers wrote:
>
> Wouter van Heyst wrote:
> >> I think I'm right in thinking that this little gem goes away in Python 3
> >> and you have to use the (more sane) repr(v) instead :-)
> >
> > I knew that, and meet the criteria, woo! ;)
>
> Excellent, your prize is that you now owe everyone over 30 on the list a
> beer ;-)

I'll go register for EuroPython, you can claim your beer there ;)

Wouter van Heyst

vandevel

unread,

May 7, 2009, 2:10:35 PM5/7/09

to python-excel

Thanks John,

> Perhaps we have a different understanding of "text formatting". Perhaps
> you believe that all the cell values in the spreadsheet are strings, not
> floats. Firstly obviously S45.03 can't be a float [unless it's a typo;
> if so do yourself and the rest of the world a favour by using copy/paste
> when reporting such things].

You are correct S45.03 is not a float. The example of concern, was the
value 21116, which when parsed by other tools, remains an integer,
whereas it seems to be converted to a float when using xlrd. It may
well be that Excel performs the conversion internally and that xlrd
faithfully extracts
the data without applying transformation, however, it strikes me that
the other tools are taking it one step further and preserving the
representation of the value
as it is displayed on the screen.

>
> > Since those tools were able to perform a seemingly simple conversion,
>
> Huh? I thought you didn't want any conversion!

Many apologies for the semantic ambiguity; by conversion, I meant from
XLS to CSV(as stated, xls2csv was used). Sorry to trip you up.

Cheers!

vandevel
>
> Cheers,
>
> John

John Machin

unread,

May 7, 2009, 8:17:53 PM5/7/09

to python...@googlegroups.com

On 8/05/2009 4:10 AM, vandevel wrote:
> Thanks John,
>
>> Perhaps we have a different understanding of "text formatting". Perhaps
>> you believe that all the cell values in the spreadsheet are strings, not
>> floats. Firstly obviously S45.03 can't be a float [unless it's a typo;
>> if so do yourself and the rest of the world a favour by using copy/paste
>> when reporting such things].
>
> You are correct S45.03 is not a float. The example of concern, was the
> value 21116, which when parsed by other tools, remains an integer,

It was not an integer to start with.

> whereas it seems to be converted to a float when using xlrd.

As explained, it was a float to start with.

> It may
> well be that Excel performs the conversion internally

"Excel performs the conversion internally" ... what conversion???

> and that xlrd
> faithfully extracts
> the data without applying transformation,

There is no "may well be" about it; xlrd faithfully extracts the data,
period.

> however, it strikes me that
> the other tools are taking it one step further and preserving the
> representation of the value
> as it is displayed on the screen.

In the case of pyExcelerator's xls2csv, this is, as explained,
fortuitous/accidental/not-deliberate/coincidental.

That is *NOT* the goal of xlrd. xlrd's goal is to extract the data as
is, irrespective of whatever formatting may have been applied, and tell
you what internal type it was, just in case you have any vague concerns
about data integrity.

What you see on the screen may be a stinking heap inside, like a column
that looks like dates, but as the result of an Excel 2000 text import
DMY/MDY confusion bug, contains some seemingly valid dates [the ones
where the DAY was 12 or less!] and the remainder text and all stuffed up
except for the ones where the day and the month were equal.

>
>>> Since those tools were able to perform a seemingly simple conversion,
>> Huh? I thought you didn't want any conversion!
>
> Many apologies for the semantic ambiguity; by conversion, I meant from
> XLS to CSV(as stated, xls2csv was used). Sorry to trip you up.

There is no need to apologise for what you didn't manage to achieve :-)

Reply all

Reply to author

Forward