Creating the correct cell data type from raw text

1,287 views
Skip to first unread message

Dan White

unread,
Sep 12, 2009, 8:35:38 PM9/12/09
to python-excel
HI Python-Excel,

Thanks for providing such a great module, xlrd and xlwt are truly
handy. I have a question on the most pythonic way to take raw data
and type it correctly when using xlwt. Forgive me if this is covered
in the documentation, but what's the best way to take the comma-
separated tokens in the RAWDATA string below and correctly set their
datatype? If I do nothing, everything is copied as-is as strings into
the spreadsheet, which I don't want. The script below produces the
result I'm looking for, but I wonder if there is a more idiomatic way
using some built-in methods in xlwt for writing the correct data
type. Any suggestions? Thanks --- Dan

#
import re
from xlwt import Workbook

def doworksheet():
'''Load a worksheet with data'''
RAWDATA = 'A,1,3 more words,2.34501'
wb = Workbook()
ws = wb.add_sheet('Data')
tokens = RAWDATA.split(',')
for c,data in enumerate(tokens):
if re.match('^\d+$',data):
cell = int(data)
elif re.match('^\d+\.\d*$',data):
cell = float(data)
else:
cell = data
ws.write(0,c,cell)
wb.save('test.xls')
print 'Wrote test.xls'

if __name__ == "__main__":
doworksheet()

Jon Clements

unread,
Sep 13, 2009, 10:14:29 AM9/13/09
to python-excel
From the docs on https://secure.simplistix.co.uk/svn/xlwt/trunk/xlwt/doc/xlwt.html:

label
The data value to be written. An int, long, or decimal.Decimal
instance is converted to float. A unicode instance is written as is. A
str instance is converted to unicode using the encoding (default:
'ascii') specified when the Workbook instance was created. A
datetime.datetime, datetime.date, or datetime.time instance is
converted into Excel date format (a float representing the number of
days since (typically) 1899-12-31T00:00:00, under the pretence that
1900 was a leap year). A bool instance will show up as TRUE or FALSE
in Excel. None causes the cell to be blank -- no data, only
formatting. An xlwt.Formula instance causes an Excel formula to be
written. [NC]

... so it looks like you're doing it the right way. The only thing is
I'd probably get rid of the regexp and use something similar to:

for colno, colval in enumerate(tokens):
for T in (int, float, str):
try:
colval = T(colval)
except ValueError:
continue
else:
ws.write(0, colno, colval)
break

And it might also be worth adding in a datetime or Decimal type
there...

Cheers,
Jon.

John Machin

unread,
Sep 13, 2009, 12:24:25 PM9/13/09
to python...@googlegroups.com
On 13/09/2009 10:35 AM, Dan White wrote:

> what's the best way to take the comma-
> separated tokens in the RAWDATA string below and correctly set their
> datatype? If I do nothing, everything is copied as-is as strings into
> the spreadsheet, which I don't want. The script below produces the
> result I'm looking for, but I wonder if there is a more idiomatic way
> using some built-in methods in xlwt for writing the correct data
> type.

In general, when you /know/ what type of Excel cell you want to write,
it is better to use directly the Row.set_cell_XXXX(col_index, value,
style) methods. Note that sheet.write() calls one of these after a fair
amount of testing to determine which one to call.

The values of XXXX for the 5 basic cell types are: number, text,
boolean, blank, and error.

"number" means float. There is no point in testing your input for int
separately unless you want to format int and float data cells differently.

"date" is NOT a basic type. A date is just a float number of days since
one of two epochs plus a "number format" that will display it as a date.
See the xlrd docs for the gory details. You can write a date cell by
using Row.set_cell_date(colx, a_datetime.datetime_object,
an_appropriate_style).


> #
> import re
> from xlwt import Workbook

from xlwt import Workbook, easyxf

int_style = easyxf('', num_format_str='#,##0')
float_style = easyxf('', num_format_str='#,##0.00')

>
> def doworksheet():
> '''Load a worksheet with data'''
> RAWDATA = 'A,1,3 more words,2.34501'
> wb = Workbook()
> ws = wb.add_sheet('Data')
> tokens = RAWDATA.split(',')

Consider using the csv module even when you think you don't need it.

r = ws.row(0)

> for c,data in enumerate(tokens):
> if re.match('^\d+$',data):

Negative numbers?

r.set_cell_number(c, float(data), int_style)


> elif re.match('^\d+\.\d*$',data):

The above pattern is not very general. Apart from negatives, it doesn't
handle .123 and 1e42. Consider stealing this from ExcelFormulaLexer.py:

flt_const_pattern = r"""
(?:
(?: \d* \. \d+ ) # .1 .12 .123 etc 9.1 etc 98.1 etc
|
(?: \d+ \. ) # 1. 12. 123. etc
)
# followed by optional exponent part
(?: [Ee] [+-]? \d+ ) ?
"""
and prepend "-?" to catch negatives.

r.set_cell_number(c, float(data), float_style)
> else:
> r.set_cell_text(c, data)

You may wish to consider using try/except instead of regular
expressions. For smallish input, speed doesn't matter. Ease of
understanding your code a week later is not enhanced by using that
float-matching regex :-)

With largish data (e.g. database query output) and a knowledge of what
kind of data is present in each column, a different approach is possible:

[OTTOMH, untested, caveat lector]
"""
Date,Ticker,Qty,Price
2009-09-14,BHP,123,456.789
"""
# setup
def date_conv(s):
return datetime.strpdate(s, '%Y-%m-%d')
cdate = (dateconv, Row.set_cell_date, date_style)
ctext = (None, Row.set_cell_text, text_style)
cint = (int, Row.set_cell_number, date_style)
cprice = (float, Row.set_cell_number, price_style)
config = (cdate, ctext, cint, cprice)

# for each row
row = ws.row(row_index)
for colx, data in enumerate(tokens):
convfunc, setfunc, style = config[c]
if convfunc:
data = convfunc(data)
setfunc(row, colx, data, style)

HTH,
John

Dan White

unread,
Sep 15, 2009, 12:08:17 AM9/15/09
to python-excel
Hi John,

Thanks for the help, let me try out your suggestions,

--- Dan

Chris Withers

unread,
Sep 18, 2009, 6:37:01 AM9/18/09
to python...@googlegroups.com
John Machin wrote:
> In general, when you /know/ what type of Excel cell you want to write,
> it is better to use directly the Row.set_cell_XXXX(col_index, value,
> style) methods. Note that sheet.write() calls one of these after a fair
> amount of testing to determine which one to call.

For most people's uses, I don't think the overhead is actually worth
worrying about compared to the simplicity of just having to remember to
do sheet.write(...)

>> for c,data in enumerate(tokens):
>> if re.match('^\d+$',data):
>
> Negative numbers?
>
> r.set_cell_number(c, float(data), int_style)
>> elif re.match('^\d+\.\d*$',data):
>
> The above pattern is not very general. Apart from negatives, it doesn't
> handle .123 and 1e42. Consider stealing this from ExcelFormulaLexer.py:

Maybe simpler, how about just doing the try-except dance:

for type in (int,blah...):
try:
value = type(string)
except:
pass
else:
break

I'd also highly recommend this excellent time/date parser for spotting
dates:

http://labix.org/python-dateutil#head-1443e0f14ad5dff07efd465e080d1110920673d8-2

cheers,

Chris

--
Simplistix - Content Management, Batch Processing & Python Consulting
- http://www.simplistix.co.uk

Reply all
Reply to author
Forward
0 new messages