The one the majority of people ask about is formula support in xlrd.
John has some experimental code for this, I believe, but taking that and
making it into bulletproof, production ready code ready for an xlrd
release would be met with much cheers and many beers :-)
Chris
--
Simplistix - Content Management, Batch Processing & Python Consulting
- http://www.simplistix.co.uk
It would indeed. I agree with Chris that it seems to be the itch most
in need of scratching. Personally, I am not an xlrd user (yet), so
what would have more immediate interest to me is R1C1 support in xlwt.
I would guess that it's both technically less ambitious and would
provide less overall benefit than formula-extracting in xlrd.
For what it's worth, if you (or anyone else reading this) happen to be
comfortable with Java, there is a project which might be useful as a
resource (in terms of seeing the code someone else has used to tackle
the problem of reading formulas in Excel sheets):
http://www.formulacompiler.org/
John Y.
I'm coming at this with my xlutils maintainers hat on with the most
common complaint being that xlutils.copy does't copy formulae.
This is because xlrd doesn't read them. For me, just being able to get
the above text would be fine I suspect. However, I'm sure others would
want more... John will no doubt comment on his plans at some point :-)
Chris
What do you mean by R1C1 support?
Chris
What I have consists of the decompile_formula function in formula.py.
This produces a string from the Excel formula bytecode. It is currently
used as a sparring partner for the xlwt formula parser. It needs
robusticising and should be made to work with all types of formula, and
with all opcodes.
Then what is needed is code to handle FORMULA, SHRFMLA and ARRAY records
and build data structures such that it can be determined quickly whether
a given (rowx, colx) has a formula and if so what type and what is the
text of the formula.
All of this must be optional:
book = xlrd.open_workbook(..., extract_formulas=False)
>> I believe, but taking that and
>> making it into bulletproof, production ready code ready for an xlrd
>> release would be met with much cheers and many beers :-)
>>
>
> Sounds good to me, but can you elaborate?
>
> Simple example though: "=sum(A1:A2000)",
The "=" character is swallowed by the UI ...
> what should xlrd return
> there?
u'SUM(A1:A2000)'
or the equivalent in R1C1 notation if the user desires [not at first
release]
sum(1,4,9,16) could come back as SUM(1,4,9,16) or SUM(1;4;9;16) ... the
list separator should presumably be extracted from the user's locale.
> I'm just reading through the specs provided by OO -- I originally
> wanted some pivot table support, but the doc's aren't trivial and
> pretty much useless. F support looks reasonable, if a little MeSsy. :)
Yup. If you ever find yourself thinking that some parts of XLS files are
cute or elegant, it means that you haven't read the docs or they've left
something out or both ;-)
Cheers,
John
Ever thought of having a go at it yourself?
>
> For what it's worth, if you (or anyone else reading this) happen to be
> comfortable with Java,
Bit of an oxymoron ... rummaging through a Java codebase is like
"spending a wet Sunday afternoon in a Welsh graveyard with no cigarettes".
> there is a project which might be useful as a
> resource (in terms of seeing the code someone else has used to tackle
> the problem of reading formulas in Excel sheets):
>
> http://www.formulacompiler.org/
To see the code that they are calling, you have to get Andy Khan's
jexcelapi verson 2.6.10 and apply their patches ;-)
Cheers,
John
Already answered that. The goal is to produce the decompiled formula,
like I said u'SUM(A1:A100)'
> Do we take a best guess at all the remaining values, or raise
> some exception. Should we emulate what the UI does, or be more strict?
None of the above. The goal does NOT include calculating anything. That
would would take one person a year or two or ...
Repeating what I said:
"""
What I have consists of the decompile_formula function in formula.py.
This produces a string from the Excel formula bytecode. It is currently
used as a sparring partner for the xlwt formula parser. It needs
robusticising and should be made to work with all types of formula, and
with all opcodes.
Then what is needed is code to handle FORMULA, SHRFMLA and ARRAY records
and build data structures such that it can be determined quickly whether
a given (rowx, colx) has a formula and if so what type and what is the
text of the formula.
All of this must be optional:
book = xlrd.open_workbook(..., extract_formulas=False)
"""
Please feel free to ask if you don't understand any of the above.
Please feel compelled to publish for discussion what APIs you are
intending to provide before writing any code.
Cheers,
John
>
> Anyway, if it's agreeable to you John, I'm going to write (mostly for
> myself I s'pose) a list of what needs doing, in what order, etc... and
> post it to this list/put on web somewhere. Hopefully then you can
> review it, make sure it fits in with the xlrd design etc...and your
> experience would be most appreciated, as *cross fingers* you'll be
> able to point out blackspots or parts of the library I might be able
> to re-use etc...
>
> For instance, should Cell have a new ctype, 'XL_CELL_FORMULA',
> whose .value is the constructed text from the bytecode, and a .result
> which is the actual result of the formula.
No. That would be majorly backwardly incompatible. Currently all cell
objects have a .value attribute, irrespective of whether that was
derived from a formula or not.
Even if we were starting from scratch, we wouldn't want to do that.
Imagine writing code like this:
# count the products whose price is over $100
# [yes I know there's a COUNTIF() function]
n = 0
for cell in some_iterable:
if (cell.result if cell.has_formula() else cell.value) > 100:
n += 1
> Or should .value always be
> the result, and formula cells provide a .text property or __str__ to
> show the constructed?
Calling it .formula might be less vague.
> [Calls to *_values() expensive, but intuitive, of course, the
> alternative is relying on the fact the user knows it's a formula --
> although having .result == .value where ctype != XL_CELL_FORMULA
> works]
I don't understand any of that paragraph.
BTW you need to understand that the Cell class is a sham. Cell objects
are constructed on the fly from the real data when requested. Have a
look in sheet.py. The real data for formulas will need to be sparse,
otherwise the folk with big files and a handful of formulas wouldn't
like us very much.
> Of course, please forgive my naivety about some things, but willing to
> have a good crack at this. I'm going through the xlrd codebase and OOo
> docs! (ouch). Hopefully, if we can set the goal posts, and agree what
> the end result should be and how it should be interface wise etc..
Sounds good.
Cheers,
John