Grabbing the text of formulas from an Excel spreadsheet with xlrd

3,917 views
Skip to first unread message

Matthew

unread,
Dec 22, 2010, 1:35:05 AM12/22/10
to python-excel
Hi,

I've got a system where I'm producing xls files with xlwt which
include formulas. Sometimes I need to read those files back in with
xlrd and "patch" them before writing them back out again (yes, it's
nasty). For my purposes, it's enough to know that the cell contained
a formula and get the text of that formula, then dump it back out
again with xlwt. I don't care what the formula evaluates to, so that
makes the problem somewhat simpler.

I noticed that this capability is pretty much already implemented in
formula.py as decompile_formula(), but there's no way to bubble its
results back to the surface. Actually, it seems that this function is
only used for debugging.

I was planning on just hacking in a "dump_formula" flag to the reader
and maybe a new cell type XL_CELL_FORMULA so that I can detect the
cells and write the correct data back out again, but it sounded kind
of too easy. Are there some gotchas with that function which I should
know about? Some reason it's not available like that already?

Thanks,

- Matthew Duggan

John Machin

unread,
Dec 22, 2010, 3:26:27 AM12/22/10
to python-excel


On Dec 22, 5:35 pm, Matthew <matthew.c.dug...@gmail.com> wrote:
> Hi,
>
> I've got a system where I'm producing xls files with xlwt which
> include formulas.  Sometimes I need to read those files back in with
> xlrd and "patch" them before writing them back out again (yes, it's
> nasty).  For my purposes, it's enough to know that the cell contained
> a formula and get the text of that formula, then dump it back out
> again with xlwt. I don't care what the formula evaluates to, so that
> makes the problem somewhat simpler.
>
> I noticed that this capability is pretty much already implemented in
> formula.py as decompile_formula(), but there's no way to bubble its
> results back to the surface.  Actually, it seems that this function is
> only used for debugging.

Yes, for debugging enhancements to the xlwt formula compiler.


> I was planning on just hacking in a "dump_formula" flag to the reader
> and maybe a new cell type XL_CELL_FORMULA so that I can detect the
> cells and write the correct data back out again, but it sounded kind
> of too easy.  Are there some gotchas with that function which I should
> know about?  Some reason it's not available like that already?

Nix on XL_CELL_FORMULA. Why is covered (along with other questions) in
this:

http://groups.google.com/group/python-excel/browse_frm/thread/95d33d7703585b7/355db74f8cbdc9d6?lnk=gst&q=formula#355db74f8cbdc9d6

At the moment, it works only on "cell" formulas. These are what xlwt
writes; each cell has its own tailored formula. However suppose you
have numbers in col C that are calculated by a formula that was typed
into C2 and copied down as far as needed. Excel will store that
formula in a "shared formula" record once and most of the cells will
have a "cell formula" record that's just a stub referring to the
shared formula. You might say, "So what? xlwt doesn't write those".
True, but unless you lock the file away, somebody will open it with
Excel and save it again; goodbye cell formulas, hello shared formulas.

The shared formula is very similar to a cell formula; the main
difference is that as it is being decompiled, you need to record the
positions of all the relative references in the formula so that you
can construct a tailored formula for any sharing cell when asked. And
of course a dictionary of shared formulas needs to be maintained ...
key is (rowx, colx) of "parent" cell.

Which brings us to a design decision: do we decompile every formula as
we meet them in the input file, or do we do a JIT decompilation when
asked. A priori I would tend to favour the latter. In your case I
would expect every formula-containing cell to be processed exactly
once, so it wouldn't matter to you. However others may not be
accessing each cell, so the JIT strategy would be better.

You may like to have a look at this: http://sc.openoffice.org/excelfileformat.pdf
... items of interest are:

Section 3 (page 28) Formulas in general
Section 4.8 (p 96) Shared formulas
and the sections on SHAREDFMLA records and FORMULA records that are
referenced in section 4.8

Cheers,
John


Matthew

unread,
Dec 22, 2010, 9:16:41 AM12/22/10
to python-excel
Hi John,

Thanks for replying with all the information, sorry I didn't find that
previous thread. Also thanks for the pointers into the docs - I've
had a look at the sections you mentioned.

As I worried, the problem is much more complex than it first seems.
I'd rather not implement something half-arsed that only fixes my
problem, so I'm afraid to say I might have to wimp out at this point.
For my problem I can probably just use a workaround of trying to work
out when/where to regenerate the formulas from scratch.

Thanks again for the reply, and many thanks for the work on xlrd/xlwr
- they're really useful!

- Matthew Duggan

On Dec 22, 5:26 pm, John Machin <sjmac...@lexicon.net> wrote:
> On Dec 22, 5:35 pm, Matthew <matthew.c.dug...@gmail.com> wrote:
>
> > Hi,
>
> > I've got a system where I'm producing xls files with xlwt which
> > include formulas.  Sometimes I need to read those files back in with
> > xlrd and "patch" them before writing them back out again (yes, it's
> > nasty).  For my purposes, it's enough to know that the cell contained
> > a formula and get the text of that formula, then dump it back out
> > again with xlwt. I don't care what the formula evaluates to, so that
> > makes the problem somewhat simpler.
>
> > I noticed that this capability is pretty much already implemented in
> > formula.py as decompile_formula(), but there's no way to bubble its
> > results back to the surface.  Actually, it seems that this function is
> > only used for debugging.
>
> Yes, for debugging enhancements to the xlwt formula compiler.
>
> > I was planning on just hacking in a "dump_formula" flag to the reader
> > and maybe a new cell type XL_CELL_FORMULA so that I can detect the
> > cells and write the correct data back out again, but it sounded kind
> > of too easy.  Are there some gotchas with that function which I should
> > know about?  Some reason it's not available like that already?
>
> Nix on XL_CELL_FORMULA. Why is covered (along with other questions) in
> this:
>
> http://groups.google.com/group/python-excel/browse_frm/thread/95d33d7...

Jarryd Garisch

unread,
Jul 9, 2015, 8:25:50 AM7/9/15
to python...@googlegroups.com
Hi John, I'm hoping you still frequent these!

I too have a similar issue. I want to read in an excel doc, make some edits and generally preserve things I haven't edited. In my case, I just want to preserve formulas exactly as is.

I came across this workaround for preserving formatting to cells being written to (quite possibly yours) on SO:

def copy2(wb):
    w = XLWTWriter()
    process(
        XLRDReader(wb,'anon.xls'),
        w
    )
    return w.output[0][1], w.style_list

My thought are similar approach could be used for the formulas. My really rough hacky workaround is do a replace of "=" in all the cells of workbook with some string of junk so that these formulas are treated as text.

I wish to then get rid of my junk string and put the ='s back in the final output.

I think it would be best for my workaround to happen in some modified copy function of xlutils, but I'm currently not following exactly what's going on between copy() and process() in xltuils.filter.py.

Any suggestions where in the code I should put my string replacement for this workaround so that it occurs as the copy is made. (Alternatively, if this isn't practical I would probably just try to iterate over each cell and do replacements then sometime after call to copy / copy2 is made)

Alfred Vachris

unread,
Jul 9, 2015, 12:06:58 PM7/9/15
to python...@googlegroups.com

I replace "=" with "|="
Does the trick
Easy to undo
Regards
Alfred

--
You received this message because you are subscribed to the Google Groups "python-excel" group.
To unsubscribe from this group and stop receiving emails from it, send an email to python-excel...@googlegroups.com.
To post to this group, send email to python...@googlegroups.com.
Visit this group at http://groups.google.com/group/python-excel.
For more options, visit https://groups.google.com/d/optout.

Charlie Clark

unread,
Jul 9, 2015, 12:08:57 PM7/9/15
to python...@googlegroups.com
Am .07.2015, 17:32 Uhr, schrieb Alfred Vachris <alfred....@gmail.com>:

> I replace "=" with "|="

FWIW

In openpyxl we don't evaluate the formula but give the user the choice of
the formula (default), or the cached value (data_only=True)

Charlie
--
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Kronenstr. 27a
Düsseldorf
D- 40217
Tel: +49-211-600-3657
Mobile: +49-178-782-6226

John Yeung

unread,
Jul 9, 2015, 12:10:58 PM7/9/15
to python-excel
On Thu, Jul 9, 2015 at 12:08 PM, Charlie Clark
<charli...@clark-consulting.eu> wrote:
> In openpyxl we don't evaluate the formula but give the user the choice of
> the formula (default), or the cached value (data_only=True)

Is there any particular reason why you can't expose both the formula
and the value?

John Y.

Charlie Clark

unread,
Jul 9, 2015, 12:20:30 PM7/9/15
to python...@googlegroups.com
Am .07.2015, 18:10 Uhr, schrieb John Yeung <gallium....@gmail.com>:

> Is there any particular reason why you can't expose both the formula
> and the value?

At the same time you mean? The main reason is that we don't evaluate the
formula (we don't ever plan to, either). In a read/write context this
introduces an almost quantum uncertainty as to whether the cached value is
still valid. Defaulting to the formula is the safe choice. In practice, it
seems people want either one or the other.

In API terms it would also add complexity: which of the two is the value?
Much better to expose the formula that could be evaluated by a function if
desired.

John Yeung

unread,
Jul 9, 2015, 2:11:56 PM7/9/15
to python-excel
On Thu, Jul 9, 2015 at 12:20 PM, Charlie Clark
<charli...@clark-consulting.eu> wrote:
> Am .07.2015, 18:10 Uhr, schrieb John Yeung <gallium....@gmail.com>:
>
>> Is there any particular reason why you can't expose both the formula
>> and the value?
>
> At the same time you mean? The main reason is that we don't evaluate the
> formula (we don't ever plan to, either). In a read/write context this
> introduces an almost quantum uncertainty as to whether the cached value is
> still valid. Defaulting to the formula is the safe choice. In practice, it
> seems people want either one or the other.

I agree that you shouldn't evaluate the formula. That's a monumental
task, well beyond an Excel file handler.

But I disagree that there's anything wrong with the formula and the
cached value being out of sync. Inherent in the file format is that
the formula and the value are separate things, with no assurance that
they correspond to each other.

Keep in mind that even Excel (the Microsoft program) does not
guarantee that the value displayed in the worksheet is up to date.
Sure, most people use the default calculation setting of "automatic",
but you can easily set it to manual, and you can even turn off
"recalculate before save".

> In API terms it would also add complexity: which of the two is the value?
> Much better to expose the formula that could be evaluated by a function if
> desired.

In API terms, it would not be at all complex. The value is the value
and the formula is the formula. What could be simpler?

I agree 100% with "Much better to expose the formula that could be
evaluated by a function if desired." but to me that statement means
"Always expose the value and always expose the formula." Hiding the
value goes against both Pythonic sensibility and Excel functionality.

If you want to be like Excel, the thing to do is provide a "value as
displayed in the worksheet" (which the API would call "value") and a
"value as displayed in the formula bar" (which the API would call
"formula", even though sometimes this is just a copy of the value).
This way, if someone wants the "formula bar value", and they don't
want to do (pseudocode)

value = cell.formula if cell.has_formula() else cell.value

and they don't want to do the old-school (pseudocode)

value = cell.formula or cell.value

they can just do (pseudocode)

value = cell.formula

I get that OpenPyXL is relatively mature now, and it's highly
undesirable to make breaking changes to its API. But if there is
nothing *structurally* preventing you from providing the formula and
the value, how about a new method or property that gives you the one
you're missing? (I do think it's unfortunate that "data_only" was
chosen as the name of a keyword parameter, because then it reads
poorly to have a workbook open as "data only" and then also have a
method or property that amounts to "well, give me the formula anyway".
I guess the less contradictory choice is to have a new method or
property that always gives you the cached value, no matter what mode
you opened the workbook with.)

John Y.

Charlie Clark

unread,
Jul 9, 2015, 2:56:26 PM7/9/15
to python...@googlegroups.com
> But I disagree that there's anything wrong with the formula and the
> cached value being out of sync. Inherent in the file format is that
> the formula and the value are separate things, with no assurance that
> they correspond to each other.

I think the ambiguity and possible inconsistency is a real problem. Until
recently, we've also had problems with shared formulae, though that is now
resolved.

> Keep in mind that even Excel (the Microsoft program) does not
> guarantee that the value displayed in the worksheet is up to date.
> Sure, most people use the default calculation setting of "automatic",
> but you can easily set it to manual, and you can even turn off
> "recalculate before save".

I know that. I remember the days (before Excel) when manual recalculation
was standard.

>> In API terms it would also add complexity: which of the two is the
>> value?
>> Much better to expose the formula that could be evaluated by a function
>> if
>> desired.
>
> In API terms, it would not be at all complex. The value is the value
> and the formula is the formula. What could be simpler?

That's not simple in my view. As previously stated, people tend to want
one or the other. When editing files, which a lot of users do, you want
consistency. When reading you tend to want one or the other. Recently had
a case when somebody needed both but he said reading the file twice was
the least of his problems.

> If you want to be like Excel, the thing to do is provide a "value as
> displayed in the worksheet" (which the API would call "value") and a
> "value as displayed in the formula bar" (which the API would call
> "formula", even though sometimes this is just a copy of the value).
> This way, if someone wants the "formula bar value", and they don't
> want to do (pseudocode)

We don't want to be like Excel.

> value = cell.formula if cell.has_formula() else cell.value
>
> and they don't want to do the old-school (pseudocode)
>
> value = cell.formula or cell.value
>
> they can just do (pseudocode)
>
> value = cell.formula

Sure, I can model it but I still get nightmares thinking about actually
using it. An alternative might be a property for the cached value as that
would not be preserved when saving the file: you have to consider the
values dirty then.

> I get that OpenPyXL is relatively mature now, and it's highly
> undesirable to make breaking changes to its API.

Well, I hope to break it less often than in the past… ;-)

> But if there is
> nothing *structurally* preventing you from providing the formula and
> the value, how about a new method or property that gives you the one
> you're missing? (I do think it's unfortunate that "data_only" was
> chosen as the name of a keyword parameter, because then it reads
> poorly to have a workbook open as "data only" and then also have a
> method or property that amounts to "well, give me the formula anyway".
> I guess the less contradictory choice is to have a new method or
> property that always gives you the cached value, no matter what mode
> you opened the workbook with.)

Names are hard. Of course, if it were possible then data_only would be
removed as a keyword.

If someone were to submit a PR then it would be considered but at the
moment it's unlikely to be accepted because, as noted above, the current
API seems to serve well enough so YAGNI.

John Yeung

unread,
Jul 9, 2015, 10:54:04 PM7/9/15
to python-excel
On Thu, Jul 9, 2015 at 2:56 PM, Charlie Clark
<charli...@clark-consulting.eu> wrote:
>> But I disagree that there's anything wrong with the formula and the
>> cached value being out of sync. Inherent in the file format is that
>> the formula and the value are separate things, with no assurance that
>> they correspond to each other.
>
> I think the ambiguity and possible inconsistency is a real problem.

It's a real problem, but I don't believe it's a problem for a *file*
handling library to solve. What I expect and want from a file handling
library is to tell me what is in the file (if I'm reading) and to
faithfully write what I tell it to write (if I'm writing).

>> Keep in mind that even Excel (the Microsoft program) does not
>> guarantee that the value displayed in the worksheet is up to date.
>
> I know that. I remember the days (before Excel) when manual recalculation
> was standard.

And recalculation is way out of the scope of a file-reader and file-writer.

>> In API terms, it would not be at all complex. The value is the value
>> and the formula is the formula. What could be simpler?
>
> That's not simple in my view. As previously stated, people tend to want one
> or the other.

And if you give them both, then they can still choose one or the other.

> When editing files, which a lot of users do, you want
> consistency.

And giving users both does not reduce consistency.

> When reading you tend to want one or the other. Recently had a
> case when somebody needed both but he said reading the file
> twice was the least of his problems.

And that was one person (who doesn't sound like a programmer, to be
honest). I am not really understanding how you can be so sanguine that
"read the file twice" is actually a *better* API than "here are both:
you can choose to always use the worksheet value or always use the
formula-bar-value or alternate between them as you like".

> We don't want to be like Excel.

It sure seems like you do, in some ways. Not in all. You seem like you
prefer to cater to the uninitiated rather than to the knowledgeable,
like Excel tends to, and there is nothing inherently wrong in that.

Just as some believe computing systems should use decimal math by
default (instead of IEEE 754) so that its users don't have to
understand the intricacies of floats, you apparently believe that it's
better to hide information than to require users to understand
inherent properties of the file format.

>> value = cell.formula if cell.has_formula() else cell.value
>>
>> and they don't want to do the old-school (pseudocode)
>>
>> value = cell.formula or cell.value
>>
>> they can just do (pseudocode)
>>
>> value = cell.formula
>
> Sure, I can model it but I still get nightmares thinking about actually
> using it.

I'm trying not to sound disrespectful, but it doesn't seem like you
understood what I said. How would it be a nightmare? Everyone who
wants the cached value (who currently uses data_only=True) would be
able to get it simply by always using cell.cached_value (what I like
to call cell.value). Doesn't seem hard. Everyone who wants the formula
when there is one (who currently uses anything other than
data_only=True) would be able to get it by always using
cell.formula_bar_value. They don't have to know or care that sometimes
it's a formula and sometimes not (but they still could if they want).
Again, fail to see the difficulty. Each camp would get *precisely* the
API that they already have (modulo spelling, in the
backward-incompatible pseudocode example, but not even with changed
spelling in my fully backward compatible proposal). And then the third
camp, who is currently reading the file twice (?!?!) would gain a sane
API.

If you think it's a nightmare, is it because you think the cognitive
burden of realizing that values and formulas are two different things,
and stored independently, is just too much to ask of users?

(Maybe it is. I know that it's a lot to ask users to accept that
round(2.675, 2) == 2.67, even though round(2.75, 1) == 2.8.)

John Y.

Charlie Clark

unread,
Jul 10, 2015, 7:33:39 AM7/10/15
to python...@googlegroups.com
> And if you give them both, then they can still choose one or the other.

>> When editing files, which a lot of users do, you want
>> consistency.
>
> And giving users both does not reduce consistency.

It does, because the cached value is not guaranteed in any way when
editing files.

>> When reading you tend to want one or the other. Recently had a
>> case when somebody needed both but he said reading the file
>> twice was the least of his problems.
>
> And that was one person (who doesn't sound like a programmer, to be
> honest). I am not really understanding how you can be so sanguine that
> "read the file twice" is actually a *better* API than "here are both:
> you can choose to always use the worksheet value or always use the
> formula-bar-value or alternate between them as you like".

It was an example of the one enquiry I've seen about this. The guy
certainly seems to me like he knows how to program.

>> We don't want to be like Excel.
>
> It sure seems like you do, in some ways. Not in all. You seem like you
> prefer to cater to the uninitiated rather than to the knowledgeable,
> like Excel tends to, and there is nothing inherently wrong in that.
>
> Just as some believe computing systems should use decimal math by
> default (instead of IEEE 754) so that its users don't have to
> understand the intricacies of floats, you apparently believe that it's
> better to hide information than to require users to understand
> inherent properties of the file format.

Actually, we expose a lot of the internal details. We try and provide a
clean Python API of working with them. Hence, 1-indexing, slices, etc.

>>> value = cell.formula if cell.has_formula() else cell.value
>>>
>>> and they don't want to do the old-school (pseudocode)
>>>
>>> value = cell.formula or cell.value
>>>
>>> they can just do (pseudocode)
>>>
>>> value = cell.formula
>>
>> Sure, I can model it but I still get nightmares thinking about actually
>> using it.
>
> I'm trying not to sound disrespectful, but it doesn't seem like you
> understood what I said.

There's nothing disrespectful about a difference of opinion.

> How would it be a nightmare? Everyone who
> wants the cached value (who currently uses data_only=True) would be
> able to get it simply by always using cell.cached_value (what I like
> to call cell.value).

The value is *always* the formula. A cached_value is then contingent upon
the value being a formula.

> Doesn't seem hard. Everyone who wants the formula
> when there is one (who currently uses anything other than
> data_only=True) would be able to get it by always using
> cell.formula_bar_value. They don't have to know or care that sometimes
> it's a formula and sometimes not (but they still could if they want).
> Again, fail to see the difficulty. Each camp would get *precisely* the
> API that they already have (modulo spelling, in the
> backward-incompatible pseudocode example, but not even with changed
> spelling in my fully backward compatible proposal). And then the third
> camp, who is currently reading the file twice (?!?!) would gain a sane
> API.

The API is sane and clear about what is available.

> If you think it's a nightmare, is it because you think the cognitive
> burden of realizing that values and formulas are two different things,
> and stored independently, is just too much to ask of users?

In a read-only (or write-only as xlsxwriter does) world I'd have no
trouble exposing the cached value, which would always have the same degree
of reliability - we assume it's good. This is no longer the case when the
file is editable. It seems to me that this is crux of the matter.

EOT

John Yeung

unread,
Jul 16, 2015, 5:56:55 PM7/16/15
to python-excel
On Fri, Jul 10, 2015 at 7:33 AM, Charlie Clark
<charli...@clark-consulting.eu> wrote:
>>> When editing files, which a lot of users do, you want
>>> consistency.
>>
>> And giving users both does not reduce consistency.
>
> It does, because the cached value is not guaranteed in any way when editing
> files.

The cached value is not guaranteed in any way regardless of whether
you expose the formula in the same pass or not.

> In a read-only (or write-only as xlsxwriter does) world I'd have no trouble
> exposing the cached value, which would always have the same degree of
> reliability - we assume it's good. This is no longer the case when the file
> is editable. It seems to me that this is crux of the matter.

There are disagreements where both parties understand each other and
there is just a difference of opinion. But I haven't gotten the sense
that you understand what I'm saying, so I'm not sure this is one of
those disagreements. Let me try to build an understanding:

You say that you're OK with a read-only package exposing the formula
and cached value in the same pass. Fine. We can agree on that.

I don't agree that we assume it's good. The nature of the file format
is such that there is intrinsically no guarantee that a cached value
matches a formula. You already agreed with me in a previous e-mail
that even a file that is saved by the genuine article Microsoft Excel
can have stale cached values (due to manual calculation mode).

But let's put that aside. When you said "in a read-only world I'd have
no trouble exposing the cached value, which would always have the same
degree of reliability" perhaps the gist was "if it's read-only, we
can't make things worse". That I can agree with. If the cached values
match the formulas, reading them both at the same time doesn't change
that. If the cached values do not match the formulas, well, reading
them both at the same time doesn't change that either. OK.

Now, you say you are even OK with a write-only package exposing both
the formula and the cached value. I have trouble seeing why you would
be OK with this, because if you can write the formula and also write
the cached value, then you can definitely write them such that they do
not match.

So, let me assume you'd like to be as conservative as possible and
retract your endorsement of making both the cached value and the
formula writable properties in a write-only package.

Now, here is my question for you: How does making only one of the
properties writable (as openpyxl does) help you?

Let's say you have just read in a file whose cached values all match
their formulas. And let's say you've chosen for openpyxl to expose the
cached values. Am I correct that openpyxl then lets you write anything
you like in the cached value, but leaves the formula untouched? If so,
then clearly exposing the cached value as a writable property is not
protecting you from inconsistency at all.

Or, let's say you have chosen for openpyxl to expose only the
formulas. So, is it true that openpyxl lets you overwrite an existing
formula with a new one? Now, if the cached value had been consistent
with the original formula, and you've just changed the formula, how
has openpyxl protected you from inconsistency?

If anything, exposing both the formula and the cached value
*increases* the user's ability to keep things consistent, because they
can update the formula, and then in the same pass, update the cached
value to match.

Well, that's quite an onus on the user, to keep both the formula and
the cached value in sync, isn't it? Yes, indeed it is. But that is
fundamental to the nature of the file format. If you want them to be
in sync, either the library should do it for you, or the library
should give you the ability to do it yourself. And currently, openpyxl
does neither. But it DOES let you take a synchronized file and put it
out of sync.

John Y.
Reply all
Reply to author
Forward
0 new messages