Hi vandevel,
Welcome to the group.
>
> I am experiencing a problem reading an excel file using xlrd; it seems
> that strings are being converted to floats.
Things are often not what they seem, and this is one of those cases.
xlrd definitely does not convert strings to floats.
> For example, the following line is explicitly formatted as text in the
> spreadsheet as:
>
> 21116,1952,42104,0121.1,241.9,S45.03,350.02
For a start, Excel doesn't format "lines", it formats cells
individually. The result of the formatting is displayed on the user's
screen; the result does not exist *in* the spreadsheet file.
> However, once this line is parsed by xlrd, values such as 21116 end up
> being converted to 21116.0(once again despite explicit text
> formatting).
>
> Using the same spreadsheet in a test, I was able to
> retain the formatting in the aforementioned line when using the
> pyexcelerator xls2csv script as well as gnumeric's ssconvert utility.
Perhaps we have a different understanding of "text formatting". Perhaps
you believe that all the cell values in the spreadsheet are strings, not
floats. Firstly obviously S45.03 can't be a float [unless it's a typo;
if so do yourself and the rest of the world a favour by using copy/paste
when reporting such things]. Less obviously 0121.1 must be a string;
if it were a float value with leading-zero display formatting in Excel,
there is no way that pxExcelerator's xls2csv utility would reproduce the
leading zero.
If the first cell displays as 21116.0 after you have extracted it using
xlrd, then it must be stored in the spreadsheet file as a float.
You can see for yourself ... assuming your data is in Excel's row 1,
type this formula =cell("type",a1) into cell A2 and than copy it across.
From the Excel help: """
"type" Text value corresponding to the type of data in the cell. Returns
"b" for blank if the cell is empty, "l" for label if the cell contains a
text constant, and "v" for value if the cell contains anything else.
"""
If you haven't been changing the default horizontal alignment, increase
all the column widths; text values will cling to the left and numeric
values will cling to the right.
> Since those tools were able to perform a seemingly simple conversion,
Huh? I thought you didn't want any conversion!
As an accidental consequence of the way Excel compresses floats when
writing them to the file, pyExcelerator returns a Python int object
instead of a Python float object for *some* Excel float values that can
be represented as integer values. This is why you are seeing 21116
instead of 21116.0 in the xls2csv output. Actually you are seeing
"21116" because the xls2csv wraps quotes around everything ... possibly
giving casual observers the impression that it was a string right from
the beginning.
pyExcelerator's xls2csv is doing these further transformations on each
value retrieved:
v = values[(row_idx, col_idx)]
if isinstance(v, unicode):
v = v.encode('cp866', 'backslashreplace')
else:
v = `v`
v = '"%s"' % v.strip()
Encoding text using an IBM PC-DOS Russian codec?
Anyone under age 30 know what `v` does without looking it up in the manual?
Stripping leading and trailing whitespace is sound forensic practice?
> I am wondering if there is a deficiency
> in the data type detection facility of xlrd or if there is something
> that I am overlooking.
Which do you think has the greater likelihood?
Cheers,
John
Ooo oo, I know this one...
Aw, shucks, I don't meet the criterea...
I think I'm right in thinking that this little gem goes away in Python 3
and you have to use the (more sane) repr(v) instead :-)
Chris
--
Simplistix - Content Management, Zope & Python Consulting
- http://www.simplistix.co.uk
I knew that, and meet the criteria, woo! ;)
Wouter van Heyst
Excellent, your prize is that you now owe everyone over 30 on the list a
beer ;-)
I'll go register for EuroPython, you can claim your beer there ;)
Wouter van Heyst
It was not an integer to start with.
> whereas it seems to be converted to a float when using xlrd.
As explained, it was a float to start with.
> It may
> well be that Excel performs the conversion internally
"Excel performs the conversion internally" ... what conversion???
> and that xlrd
> faithfully extracts
> the data without applying transformation,
There is no "may well be" about it; xlrd faithfully extracts the data,
period.
> however, it strikes me that
> the other tools are taking it one step further and preserving the
> representation of the value
> as it is displayed on the screen.
In the case of pyExcelerator's xls2csv, this is, as explained,
fortuitous/accidental/not-deliberate/coincidental.
That is *NOT* the goal of xlrd. xlrd's goal is to extract the data as
is, irrespective of whatever formatting may have been applied, and tell
you what internal type it was, just in case you have any vague concerns
about data integrity.
What you see on the screen may be a stinking heap inside, like a column
that looks like dates, but as the result of an Excel 2000 text import
DMY/MDY confusion bug, contains some seemingly valid dates [the ones
where the DAY was 12 or less!] and the remainder text and all stuffed up
except for the ones where the day and the month were equal.
>
>>> Since those tools were able to perform a seemingly simple conversion,
>> Huh? I thought you didn't want any conversion!
>
> Many apologies for the semantic ambiguity; by conversion, I meant from
> XLS to CSV(as stated, xls2csv was used). Sorry to trip you up.
There is no need to apologise for what you didn't manage to achieve :-)