Formula- Grammar

Markus CLERMONT

unread,

Sep 6, 2001, 2:51:10 AM9/6/01

to

Hello!

I'm doing currently some research in debugging excel-spreadsheets. As we
are currently
developping a tool for detecting errors in spreadsheets, we have to
parse
Excel-Formulas. Therefore I'd need a formula-grammar for excel-formulas.

If anyone could help me, obtaining it (or maybe you have already one?),
please
contact me. My e-mail adress is ma...@uni-klu.ac.at

Markus
--
***************************************************************
Man muss Chaos in sich tragen, um einen Stern zu gebären.
Nietzsche
***************************************************************

mark.vcf

Norman Harker

unread,

Sep 6, 2001, 5:34:24 AM9/6/01

to

Hi Markus!

Sent list of complete list of Excel formulas, sources, classifications and
syntax.

hth
"Markus CLERMONT" <ma...@isys.uni-klu.ac.at> wrote in message
news:3B971CDE...@isys.uni-klu.ac.at...

Zorba Eisenhower

unread,

Sep 6, 2001, 8:01:16 AM9/6/01

to

"The Excel Auditor" at www.bygsoftware.com

Markus CLERMONT <ma...@isys.uni-klu.ac.at> wrote in message news:<3B971CDE...@isys.uni-klu.ac.at>...

Harlan Grove

unread,

Sep 6, 2001, 2:44:31 PM9/6/01

to

Markus CLERMONT <ma...@isys.uni-klu.ac.at> wrote...

>I'm doing currently some research in debugging excel-spreadsheets.
>As we are currently developping a tool for detecting errors in
>spreadsheets, we have to parse Excel-Formulas. Therefore I'd
>need a formula-grammar for excel-formulas.

Meaning a BNF-style grammar? I don't think there is one. For the most part
it's very basic: 5 infix arithmetic operators (+, -, *, /, ^), 6 infix
comparison operators (=, <>, <, <=, >, >=), locale-varying argument
separator and decimal 'point' (usually , or ; and . or ,) taken from
Windows' Regional Settings (that alone would make a BNF grammar rather a
tricky thing to write), numeric constants, string constants, error constants
(all begin with #), function calls as function-name(argument-list), and then
there's the harder stuff.

- range references, including collections, intersections and 3D blocks
- array constants
- whitespace in limited instances
- no more than 7 levels of formula nesting
- no argument list longer than 29 arguments
- anything not interpreted as one of the items above that matches the
regular expression [_A-Za-z][_0-9A-Za-z]* but not 'R' or 'C' would be
interpreted as defined names

Not sure why you need this. Excel won't allow users to enter syntactically
invalid formulas, so you'll never find any cell formulas in Excel workbooks
for which you'd need a grammar to check them. If the tool you're developing
would locate argument type errors in function calls within cell formulas,
that's problematic. In may instances Excel can use numeric and text values
interchangeably, like awk or perl.

So likely all you really need is a means of identifying function calls
within formulas. The syntax isn't materially different from C. Function
names match the regular expression [A-Za-z][_0-9A-Za-z]* (no restrictions,
but several built-in functions), function names are immediately followed by
left parenthesis, then there's the argument list, and finally the closing
right parenthesis. So the regular expression [A-Za-z][_0-9A-Za-z]*[(]
(assuming you're not inside a double-quoted string) matches the beginning of
any function call.

Dave Peterson

unread,

Sep 6, 2001, 7:57:09 PM9/6/01

to

Application.International(xlListSeparator)
Application.International(xlDecimalSeparator)

====
Now things should be easy <vbg>.

--

Dave Peterson
ec3...@msn.com

Markus CLERMONT

unread,

Sep 7, 2001, 2:21:21 AM9/7/01

to

Indeed, I need a BCNF- Style Grammar.

Our approach for tracing errors in spreadsheets tries to classify
formulas by their logical structure- so we try to find formulas,
which are not only identical- but also similar ones should be
revealed (i.e A1 + A2 is similar to A3+ A4, but not to $A$1 + A2).

Therefore we must be able to parse formulas- even if Excel guarantees
me,
that the entry is syntactically correct, because we classify the
formulas regarding to the degree of similarity between the
abstract syntax trees.

Maybe someone has a clue how to get the grammar?

Markus.

mark.vcf

Harlan Grove

unread,

Sep 7, 2001, 2:51:53 AM9/7/01

to

Markus CLERMONT <ma...@isys.uni-klu.ac.at> wrote...
...

>Our approach for tracing errors in spreadsheets tries to classify
>formulas by their logical structure- so we try to find formulas,
>which are not only identical- but also similar ones should be
>revealed (i.e A1 + A2 is similar to A3+ A4, but not to $A$1 + A2).

Read up on R1C1-style range references. =A1+A2 if entered in B1 becomes
=RC[-1]+R[1]C[-1] and =A3+A4 if entered in B3 becomes the same R1C1-style
formula. =$A$1+A2 in B1 becomes =R1C1+R[1]C[-1]. R1C1-style addressing is
ideal for spotting differences between formulas in different cells.

>Therefore we must be able to parse formulas- even if Excel
>guarantees me, that the entry is syntactically correct, because
>we classify the formulas regarding to the degree of similarity
>between the abstract syntax trees.

Not proven. Most errors in simple formulas (those with no function calls)
would be revealed by R1C1-style addressing. Errors in function calls require
comparing argument data types to function definitions.

Anyway, download the Gnu awk source code. The formula grammar for Excel
doesn't differ much from awk's expression grammar (excluding awk's getline
built-in) other than that all Excel formulas must start with an equal sign,
'=', and very limited whitespace is allowed. Range address and manipulation
syntax would be the VERY tricky part.

>Maybe someone has a clue how to get the grammar?

Unless Microsoft publishes it in the Excel Software Developer's Kit or MSDN,
I suppose you'd have to go to work for Microsoft to see it. They might be
willing to give it to you, but I'd bet you'll need to reverse engineer it.

Markus CLERMONT

unread,

Sep 7, 2001, 8:47:11 AM9/7/01

to

Harlan Grove wrote:

>
> Read up on R1C1-style range references. =A1+A2 if entered in B1 becomes
> =RC[-1]+R[1]C[-1] and =A3+A4 if entered in B3 becomes the same R1C1-style
> formula. =$A$1+A2 in B1 becomes =R1C1+R[1]C[-1]. R1C1-style addressing is
> ideal for spotting differences between formulas in different cells.
>

I'm well aware of the R1C1- style. Nevertheless RC[-1] +R[1]C[-1] should
also be recognised to be the same as R[1]C[-1] + RC[-1].

> Not proven. Most errors in simple formulas (those with no function calls)
> would be revealed by R1C1-style addressing. Errors in function calls require
> comparing argument data types to function definitions.

You're right. But it is also true, that most of the remaining errors can
be
revealed by detecting irregularities in the functionality of formulas in
a certain geometrical region.

>
> Anyway, download the Gnu awk source code. The formula grammar for Excel
> doesn't differ much from awk's expression grammar (excluding awk's getline
> built-in) other than that all Excel formulas must start with an equal sign,
> '=', and very limited whitespace is allowed. Range address and manipulation
> syntax would be the VERY tricky part.

You are perfectly right. Thangs for the hint- I'll have a look at the
source
code. Maybe I will also have a look at Sun's StarCalc Source. But that
is
a reengineering problem, too :-)

mark.vcf

Harlan Grove

unread,

Sep 7, 2001, 12:55:45 PM9/7/01

to

Markus CLERMONT <ma...@isys.uni-klu.ac.at> wrote...

>I'm well aware of the R1C1- style. Nevertheless RC[-1] +R[1]C[-1] should
>also be recognised to be the same as R[1]C[-1] + RC[-1].

This is NOT a syntactic problem. If you want to create a tool that produces
all mathematically equivalent reorderings of formula subexpressions, you're
embarking on a nightmare of a programming task. Just imagine how many
possibilities you'd have to store for something like the following
equivalents

=SUM(A1:D4)

=SUM(SUM(A1:A4),SUM(B1:B4),SUM(C1:C4),SUM(D1:D4))

=SUM(SUM(A1:D1),SUM(A2:D2),SUM(A3:D3),SUM(A4:D4))

=SUM(A1,B2,C3,D4,B1,C2,D3,A4,C1,D2,A3,B4,D1,A2,B3,C4)

etc. I don't think I need to tell you that such permutations grow very
rapidly.

Even if you really do want to do this, it's NOT a syntactic matter. The
grammar doesn't care that '+' is commmutative while '-' isn't, it only cares
that the tokens on either side of these operators are valid operands, so to
the parser '+' and '-' are identical in nature. Operator commutivity is a
SEMANTIC matter, and I just don't see how a grammar provides any relevant
semantic information.

Caveat: I've only ever used [f]lex and yacc|bison, so you may be using tools
about which I'm completely ignorant. But you are trying to do something
outside the scope of simply parsing formulas.

> . . . But it is also true, that most of the remaining errors can

>be revealed by detecting irregularities in the functionality of formulas in
>a certain geometrical region.

Those irregularities would more likely be spotted when a few cells in a
range have different formulas when using R1C1-style references. You just
need string comparison to do this, not a formula grammar.

ONE THING I JUST REMEMBERED. Excel has one perverse syntactic difference
from almost every other piece of numerical software I've ever used: unary
minus has higher precedence than exponentiation, so =-1^2 gives 1 rather
than -1, i.e, =-1^2 is equivalent to =(-1)^2.

Jody Goldberg

unread,

Sep 9, 2001, 3:47:23 AM9/9/01

to

In article <3B986761...@isys.uni-klu.ac.at>, Markus CLERMONT wrote:
>
> Maybe someone has a clue how to get the grammar?

Gnumeric (www.gnumeric.org) has a decent handle built lexer and yacc
based parser.