Contributing..?

487 views
Skip to first unread message

Jon Clements

unread,
Oct 8, 2009, 8:45:40 AM10/8/09
to python-excel
Hi all,

Is there anything I can contribute back to the library (xlrd in
particular)? I have some free-time until end of mid Nov. (before
everything goes mad), and thought as this is one of the major
libraries installed on all my boxes, then perhaps I might be able to
do something in return...

Anyway, any suggestions are welcome.

Cheers,

Jon.

Chris Withers

unread,
Oct 8, 2009, 11:07:45 AM10/8/09
to python...@googlegroups.com

The one the majority of people ask about is formula support in xlrd.
John has some experimental code for this, I believe, but taking that and
making it into bulletproof, production ready code ready for an xlrd
release would be met with much cheers and many beers :-)

Chris

--
Simplistix - Content Management, Batch Processing & Python Consulting
- http://www.simplistix.co.uk

Jon Clements

unread,
Oct 8, 2009, 11:36:19 AM10/8/09
to python-excel
Thanks Chris -- and hope you received my last email.

Sounds good to me, but can you elaborate?

Simple example though: "=sum(A1:A2000)", what should xlrd return
there?

I'm just reading through the specs provided by OO -- I originally
wanted some pivot table support, but the doc's aren't trivial and
pretty much useless. F support looks reasonable, if a little MeSsy. :)

Cheers,


Jon.

John Yeung

unread,
Oct 8, 2009, 4:23:44 PM10/8/09
to python...@googlegroups.com
On Thu, Oct 8, 2009 at 11:07 AM, Chris Withers <ch...@simplistix.co.uk> wrote:
>
> Jon Clements wrote:
>> Is there anything I can contribute back to the
>> library (xlrd in particular)? I have some free-time [...]

>
> The one the majority of people ask about is formula
> support in xlrd. John has some experimental code
> for this, I believe, but taking that and making it into
> bulletproof, production ready code ready for an xlrd
> release would be met with much cheers and many beers :-)

It would indeed. I agree with Chris that it seems to be the itch most
in need of scratching. Personally, I am not an xlrd user (yet), so
what would have more immediate interest to me is R1C1 support in xlwt.
I would guess that it's both technically less ambitious and would
provide less overall benefit than formula-extracting in xlrd.

For what it's worth, if you (or anyone else reading this) happen to be
comfortable with Java, there is a project which might be useful as a
resource (in terms of seeing the code someone else has used to tackle
the problem of reading formulas in Excel sheets):

http://www.formulacompiler.org/

John Y.

Chris Withers

unread,
Oct 9, 2009, 6:47:10 AM10/9/09
to python...@googlegroups.com
Jon Clements wrote:
>
> Simple example though: "=sum(A1:A2000)", what should xlrd return
> there?

I'm coming at this with my xlutils maintainers hat on with the most
common complaint being that xlutils.copy does't copy formulae.
This is because xlrd doesn't read them. For me, just being able to get
the above text would be fine I suspect. However, I'm sure others would
want more... John will no doubt comment on his plans at some point :-)

Chris

Chris Withers

unread,
Oct 9, 2009, 6:47:47 AM10/9/09
to python...@googlegroups.com
John Yeung wrote:
> in need of scratching. Personally, I am not an xlrd user (yet), so
> what would have more immediate interest to me is R1C1 support in xlwt.

What do you mean by R1C1 support?

Chris

John Machin

unread,
Oct 9, 2009, 8:46:31 AM10/9/09
to python...@googlegroups.com
On 9/10/2009 2:36 AM, Jon Clements wrote:
>
>
> On 8 Oct, 16:07, Chris Withers <ch...@simplistix.co.uk> wrote:
>> Jon Clements wrote:
>>> Is there anything I can contribute back to the library (xlrd in
>>> particular)? I have some free-time until end of mid Nov. (before
>>> everything goes mad), and thought as this is one of the major
>>> libraries installed on all my boxes, then perhaps I might be able to
>>> do something in return...
>>> Anyway, any suggestions are welcome.
>> The one the majority of people ask about is formula support in xlrd.
>> John has some experimental code for this,

What I have consists of the decompile_formula function in formula.py.
This produces a string from the Excel formula bytecode. It is currently
used as a sparring partner for the xlwt formula parser. It needs
robusticising and should be made to work with all types of formula, and
with all opcodes.

Then what is needed is code to handle FORMULA, SHRFMLA and ARRAY records
and build data structures such that it can be determined quickly whether
a given (rowx, colx) has a formula and if so what type and what is the
text of the formula.

All of this must be optional:
book = xlrd.open_workbook(..., extract_formulas=False)

>> I believe, but taking that and
>> making it into bulletproof, production ready code ready for an xlrd
>> release would be met with much cheers and many beers :-)
>>

>

> Sounds good to me, but can you elaborate?
>
> Simple example though: "=sum(A1:A2000)",

The "=" character is swallowed by the UI ...

> what should xlrd return
> there?

u'SUM(A1:A2000)'
or the equivalent in R1C1 notation if the user desires [not at first
release]

sum(1,4,9,16) could come back as SUM(1,4,9,16) or SUM(1;4;9;16) ... the
list separator should presumably be extracted from the user's locale.


> I'm just reading through the specs provided by OO -- I originally
> wanted some pivot table support, but the doc's aren't trivial and
> pretty much useless. F support looks reasonable, if a little MeSsy. :)

Yup. If you ever find yourself thinking that some parts of XLS files are
cute or elegant, it means that you haven't read the docs or they've left
something out or both ;-)

Cheers,

John

John Machin

unread,
Oct 9, 2009, 9:06:11 AM10/9/09
to python...@googlegroups.com
On 9/10/2009 7:23 AM, John Yeung wrote:
> On Thu, Oct 8, 2009 at 11:07 AM, Chris Withers <ch...@simplistix.co.uk> wrote:
>> Jon Clements wrote:
>>> Is there anything I can contribute back to the
>>> library (xlrd in particular)? I have some free-time [...]
>> The one the majority of people ask about is formula
>> support in xlrd. John has some experimental code
>> for this, I believe, but taking that and making it into
>> bulletproof, production ready code ready for an xlrd
>> release would be met with much cheers and many beers :-)
>
> It would indeed. I agree with Chris that it seems to be the itch most
> in need of scratching. Personally, I am not an xlrd user (yet), so
> what would have more immediate interest to me is R1C1 support in xlwt.
> I would guess that it's both technically less ambitious and would
> provide less overall benefit than formula-extracting in xlrd.

Ever thought of having a go at it yourself?

>
> For what it's worth, if you (or anyone else reading this) happen to be
> comfortable with Java,

Bit of an oxymoron ... rummaging through a Java codebase is like
"spending a wet Sunday afternoon in a Welsh graveyard with no cigarettes".

> there is a project which might be useful as a
> resource (in terms of seeing the code someone else has used to tackle
> the problem of reading formulas in Excel sheets):
>
> http://www.formulacompiler.org/

To see the code that they are calling, you have to get Andy Khan's
jexcelapi verson 2.6.10 and apply their patches ;-)

Cheers,
John

Jon Clements

unread,
Oct 9, 2009, 12:48:58 PM10/9/09
to python-excel
Wow - this is quite an interesting challenge! Although, my gut feeling
is I'd get more pleasure squeezing lemon juice into my eyes, then
trying to extract them with a cocktail stick... :)
Quite frankly John, I'm really not sure how you've had the patience to
get the library this far :)

Anyway, if it's agreeable to you John, I'm going to write (mostly for
myself I s'pose) a list of what needs doing, in what order, etc... and
post it to this list/put on web somewhere. Hopefully then you can
review it, make sure it fits in with the xlrd design etc...and your
experience would be most appreciated, as *cross fingers* you'll be
able to point out blackspots or parts of the library I might be able
to re-use etc...

For instance, should Cell have a new ctype, 'XL_CELL_FORMULA',
whose .value is the constructed text from the bytecode, and a .result
which is the actual result of the formula. Or should .value always be
the result, and formula cells provide a .text property or __str__ to
show the constructed?
[Calls to *_values() expensive, but intuitive, of course, the
alternative is relying on the fact the user knows it's a formula --
although having .result == .value where ctype != XL_CELL_FORMULA
works]

Of course, please forgive my naivety about some things, but willing to
have a good crack at this. I'm going through the xlrd codebase and OOo
docs! (ouch). Hopefully, if we can set the goal posts, and agree what
the end result should be and how it should be interface wise etc..

Cheers,

Jon.


Umm, now where's that lemon juice? :0

Jon Clements

unread,
Oct 10, 2009, 9:02:35 AM10/10/09
to python-excel
Okay replying to myself, that's pretty lame :)

Would any of the group members fancy whipping up an xls with formulae,
as a test bed? That way, I'm not inclined to bias my test data with my
code. Also, it'd be handy to have a formula that references another
formula.....

If someone would be so obliging that'd be great.

Cheers,

Jon.

Jon Clements

unread,
Oct 10, 2009, 9:32:07 AM10/10/09
to python-excel
2009/10/10 Jon Clements <jon...@googlemail.com>:
Some toy code and not enough sleep, now I have a question.

A "trivial" 'SUM(a1:a100)', where some are not numeric, should return
what? Do we take a best guess at all the remaining values, or raise
some exception. Should we emulate what the UI does, or be more strict?

I can already see this is going to be a nasty project. Oh well, I'll
do what I can!

Cheers,

Jon.

John Machin

unread,
Oct 10, 2009, 10:16:53 AM10/10/09
to python...@googlegroups.com
On 11/10/2009 12:32 AM, Jon Clements wrote:
> Some toy code and not enough sleep, now I have a question.
>
> A "trivial" 'SUM(a1:a100)', where some are not numeric, should return
> what?

Already answered that. The goal is to produce the decompiled formula,
like I said u'SUM(A1:A100)'

> Do we take a best guess at all the remaining values, or raise
> some exception. Should we emulate what the UI does, or be more strict?

None of the above. The goal does NOT include calculating anything. That
would would take one person a year or two or ...

Repeating what I said:
"""
What I have consists of the decompile_formula function in formula.py.
This produces a string from the Excel formula bytecode. It is currently
used as a sparring partner for the xlwt formula parser. It needs
robusticising and should be made to work with all types of formula, and
with all opcodes.

Then what is needed is code to handle FORMULA, SHRFMLA and ARRAY records
and build data structures such that it can be determined quickly whether
a given (rowx, colx) has a formula and if so what type and what is the
text of the formula.

All of this must be optional:
book = xlrd.open_workbook(..., extract_formulas=False)
"""

Please feel free to ask if you don't understand any of the above.

Please feel compelled to publish for discussion what APIs you are
intending to provide before writing any code.

Cheers,
John

John Machin

unread,
Oct 10, 2009, 10:47:44 AM10/10/09
to python...@googlegroups.com
On 10/10/2009 3:48 AM, Jon Clements wrote:

>
> Anyway, if it's agreeable to you John, I'm going to write (mostly for
> myself I s'pose) a list of what needs doing, in what order, etc... and
> post it to this list/put on web somewhere. Hopefully then you can
> review it, make sure it fits in with the xlrd design etc...and your
> experience would be most appreciated, as *cross fingers* you'll be
> able to point out blackspots or parts of the library I might be able
> to re-use etc...
>
> For instance, should Cell have a new ctype, 'XL_CELL_FORMULA',
> whose .value is the constructed text from the bytecode, and a .result
> which is the actual result of the formula.

No. That would be majorly backwardly incompatible. Currently all cell
objects have a .value attribute, irrespective of whether that was
derived from a formula or not.

Even if we were starting from scratch, we wouldn't want to do that.
Imagine writing code like this:

# count the products whose price is over $100
# [yes I know there's a COUNTIF() function]
n = 0
for cell in some_iterable:
if (cell.result if cell.has_formula() else cell.value) > 100:
n += 1

> Or should .value always be
> the result, and formula cells provide a .text property or __str__ to
> show the constructed?

Calling it .formula might be less vague.

> [Calls to *_values() expensive, but intuitive, of course, the
> alternative is relying on the fact the user knows it's a formula --
> although having .result == .value where ctype != XL_CELL_FORMULA
> works]

I don't understand any of that paragraph.

BTW you need to understand that the Cell class is a sham. Cell objects
are constructed on the fly from the real data when requested. Have a
look in sheet.py. The real data for formulas will need to be sparse,
otherwise the folk with big files and a handful of formulas wouldn't
like us very much.


> Of course, please forgive my naivety about some things, but willing to
> have a good crack at this. I'm going through the xlrd codebase and OOo
> docs! (ouch). Hopefully, if we can set the goal posts, and agree what
> the end result should be and how it should be interface wise etc..

Sounds good.

Cheers,
John

Reply all
Reply to author
Forward
0 new messages