Functional CSL processor?

Frank Bennett

unread,
Jan 25, 2009, 2:15:00 AM1/25/09
to zotero-dev
I had an interesting little experience today that started me
thinking. In the current Zotero implementation of CSL, if you call a
macro that is defined twice with the same name, it actually "executes"
twice.

That surprised me at first, but then I realized that the CSL processor
is built as a monolithic script interpreter, with the CSL style files
as a bundle of parameters that have to be stepped through one by one
on each cycle. It seems like it would be more efficient to set up the
interpreter as a state engine that first generates a set of inter-
related functions from the CSL spec, and then accepts the Item as a
state object to be passed between functions until completion.

I'm just daydreaming on this, but would that be correct? Or is it six
of one and half a dozen of the other, in terms of performance?

Bruce D'Arcus

unread,
Jan 25, 2009, 7:18:54 AM1/25/09
to zoter...@googlegroups.com
On Sun, Jan 25, 2009 at 2:15 AM, Frank Bennett <bierc...@gmail.com> wrote:

[...]

> I'm just daydreaming on this, but would that be correct? Or is it six
> of one and half a dozen of the other, in terms of performance?

I don't know, but:

a) the Zotero implementation was first written before there were
macros, so the design likely reflects that

b) take a look at the Haskell implementation, which I believe is
strongly functional:

<http://code.haskell.org/citeproc-hs/>

Bruce

PS - It would be great if at some point we could have a
pure-Javascript CSL processor (rather than E4X).

Erik Hetzner

unread,
Jan 25, 2009, 11:19:21 AM1/25/09
to zoter...@googlegroups.com
In my opinion this is a good idea. It would be faster, probably, but
more importantly it would be easier to maintain. I started work on
something like what I imagine you have in mind but got bogged down
with other work & with trying to figure out Zotero's internals.
Attached if you are interested.

The best thing to do might be to write something like this from
scratch as Bruce says, as pure javascript, & then to see if it
couldn't be integrated into Zotero. Certainly I found that learning
Zotero internals at the same time that you are trying to write new
code is difficult.

-Erik

On Sun, Jan 25, 2009 at 4:18 AM, Bruce D'Arcus <bda...@gmail.com> wrote:
>
> On Sun, Jan 25, 2009 at 2:15 AM, Frank Bennett <bierc...@gmail.com> wrote:
>
> [...]
>
>> I'm just daydreaming on this, but would that be correct? Or is it six
>> of one and half a dozen of the other, in terms of performance?

[...]

cite2.js

Bruce D'Arcus

unread,
Jan 25, 2009, 11:36:11 AM1/25/09
to zoter...@googlegroups.com
On Sun, Jan 25, 2009 at 11:19 AM, Erik Hetzner <ehet...@gmail.com> wrote:

> In my opinion this is a good idea. It would be faster, probably, but
> more importantly it would be easier to maintain. I started work on
> something like what I imagine you have in mind but got bogged down
> with other work & with trying to figure out Zotero's internals.
> Attached if you are interested.

Cool!

> The best thing to do might be to write something like this from
> scratch as Bruce says, as pure javascript, & then to see if it
> couldn't be integrated into Zotero. Certainly I found that learning
> Zotero internals at the same time that you are trying to write new
> code is difficult.

A couple of things:

First, I was thinking that JQuery might help with the sort of basic
parsing and XML support that E4X provides for the current Zotero code
(and in browsers, could help with additional functionality).

Second, WRT the generic vs. Zotero-specific issue, the approach
that most CSL implementations take is to define an independent data
representation and then write different input drivers to map to that,
and different output drivers to produce the output (XHTML, ODF, TeX,
RTF, etc.). So, in other words, I'd expect that a rewritten cite.js
(or csl.js) file would not know anything about Zotero.
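
To make that concrete, here is a rough sketch of the kind of split I
mean (all of the names and the record shape below are invented for
illustration, not taken from any existing code):

// the processor core only ever sees a neutral item shape
function processorRender(item) {
    return item.author + ", " + item.title + " (" + item.year + ")";
}

// input driver: map an application-specific record to the neutral shape
function fromZoteroLikeRecord(record) {
    return { author: record.creatorString,
             title: record.fields.title,
             year: record.fields.date };
}

// output drivers: turn the neutral result into one target format each
function toHtml(text) { return "<p>" + text + "</p>"; }
function toPlainText(text) { return text; }

var record = { creatorString: "Doe, J.",
               fields: { title: "Some title", date: "2009" } };
print(toHtml(processorRender(fromZoteroLikeRecord(record))));
// <p>Doe, J., Some title (2009)</p>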

Bruce

Dan Stillman

unread,
Jan 25, 2009, 2:09:24 PM1/25/09
to zoter...@googlegroups.com
On 1/25/09 11:36 AM, Bruce D'Arcus wrote:
>> The best thing to do might be to write something like this from
>> scratch as Bruce says, as pure javascript, & then to see if it
>> couldn't be integrated into Zotero. Certainly I found that learning
>> Zotero internals at the same time that you are trying to write new
>> code is difficult.
>>
>
> A couple of things:
>
> First, I was thinking that JQuery might help on the sort of basic
> parsing and XML support that E4X provides for the current Zotero code
> (and in browsers, could help with additional functionality).
>

I can't speak for Simon, but, for what it's worth, I don't see us
integrating something into Zotero that had jQuery as a requirement.
Abstracting the processor so that it can more easily be used outside of
Zotero is a worthy (and, I suspect, fairly easy) goal, but E4X is an
ECMA standard that adds a pretty critical language feature, and
replacing it in Zotero with a large third-party JavaScript code base (as
good as it is) wouldn't really make sense.

Bruce D'Arcus

unread,
Jan 25, 2009, 2:46:57 PM1/25/09
to zoter...@googlegroups.com
On Sun, Jan 25, 2009 at 2:09 PM, Dan Stillman <dsti...@zotero.org> wrote:

> I can't speak for Simon, but, for what it's worth, I don't see us
> integrating something into Zotero that had jQuery as a requirement.
> Abstracting the processor so that it can more easily be used outside of
> Zotero is a worthy (and, I suspect, fairly easy) goal, but E4X is an
> ECMA standard that adds a pretty critical language feature, and
> replacing it in Zotero with a large third-party JavaScript code base (as
> good as it is) wouldn't really make sense.

I see your point, but a "standard" isn't particularly relevant unless
it's widely implemented. An E4X-only library effectively means it's
Mozilla-only ATM; right*?

But that's admittedly somewhat orthogonal to Frank's and Erik's
concerns (and certainly yours). I was just expressing the hope that if
anybody did bother to rewrite the code, there would be room for it to
work more widely. The problem is really parsing the CSL file, I guess.

Bruce

* A quick search suggests that there's some movement on implementing
it in WebKit, but I'm not sure of the status. I'd guess that MS will
never implement it.

skornblith

unread,
Jan 25, 2009, 3:53:29 PM1/25/09
to zotero-dev, xbibli...@lists.sourceforge.net
The Zotero CSL parser is certainly not the cleanest piece of code, as
it has evolved from a much different version of CSL to its current
representation. Cleaning it up is a good idea, and I would be very
receptive to attempts to do so. I haven't gotten around to it because
I have very little time to contribute at the moment, and at this point
the parser is fairly stable.

I don't think either JQuery or E4X ought to be a requirement for a
functional CSL parser. The current parser's use of the E4X predicate
filter is minimal and could be trivially avoided. The main advantage
to E4X is that the whole CSL can be manipulated as an object. If your
approach is to compile everything from the start, DOM XML should be
sufficient, although probably a little messier and more annoying.
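
For anyone who hasn't looked at the two side by side, the difference
is roughly this (the namespace URI is the CSL one; the rest is just an
illustration -- the E4X half runs in Rhino/SpiderMonkey, the DOM half
in a browser, and E4X assumes the XML declaration has been stripped
first):

var cslString = '<style xmlns="http://purl.org/net/xbiblio/csl">'
              + '<macro name="author"/></style>';

// E4X: the whole style is one scriptable object
default xml namespace = "http://purl.org/net/xbiblio/csl";
var style = new XML(cslString);
var authorMacro = style.macro.(@name == "author"); // predicate filter

// DOM: bulkier, but available nearly everywhere
var doc = new DOMParser().parseFromString(cslString, "text/xml");
var macros = doc.getElementsByTagNameNS(
    "http://purl.org/net/xbiblio/csl", "macro");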

The sample code that Erik wrote is very nice, and much cleaner than
the current implementation! If there is interest in continuing this
work, I would be happy to provide help/clarification where necessary.
The biggest sticking point that I foresee is the <substitute> element
of <names>. If macros are compiled independently of the CSL, then to
implement this feature properly, one must keep track of another state
besides item/citation. This can certainly be worked out, but it's
probably better to plan out a solution in advance than to run into
this issue later in the coding process.

Simon

On Jan 25, 11:46 am, "Bruce D'Arcus" <bdar...@gmail.com> wrote:

Bruce D'Arcus

unread,
Jan 25, 2009, 6:31:56 PM1/25/09
to zoter...@googlegroups.com
In response to Simon's question about handling substitutions, here's
how Andrea handled it in his Haskell implementation ...

---------- Forwarded message ----------
From: andrea rossato <andre...@unitn.it>
Date: Sun, Jan 25, 2009 at 5:51 PM
Subject: Re: Fwd: Functional CSL processor?
To: Bruce D'Arcus <xxxx...@gmail.com>


On Sun, Jan 25, 2009 at 04:42:00PM -0500, Bruce D'Arcus wrote:

> Andrea: so how did you handle the substitution case Simon notes below?

Easily... the style elements (names, macros, text, etc.) are parsed
into a recursive data type, Element, defined in Style.hs [1]. Among
the constructors of this data type there is 'Names'. 'Names' holds
the variable attribute value of the <names> element, the <name> and
<label> elements, and the list of substitutions - which are Element
types too. If the variables evaluate to nothing (do not produce any
output), then the substitutions are tried. This is the relevant part
of the code, in Eval.hs:

evalElement :: Element -> State EvalState [Output]
evalElement el
  [...]
  | Names s n fm d sub <- el = ifEmpty (evalNames s n d)
                                       (withName (getName n) $
                                          evalElements sub)
                                       (appendOutput fm)
  | Substitute (e:els) <- el = ifEmpty (consuming $ evalElement e)
                                       (getFirst els)
                                       id

[...]

> I can't quite figure out your code (well, really Haskell) ;-)

Haskell can be quite difficult to read, indeed.

Hope this helps.

Andrea

[1] http://code.haskell.org/citeproc-hs/docs/Text-CSL-Style.html#t%3AElement

skornblith

unread,
Jan 25, 2009, 11:23:48 PM1/25/09
to zotero-dev, andrea....@ing.unitn.it, xbibli...@lists.sourceforge.net
Unless I'm missing something, this explanation doesn't seem to address
the tricky part of substitution in such a model. If a substitution is
made, it should prevent the variable that has been substituted from
being displayed later in the citation. If I have a macro

<macro name="author">
  <names variable="author">
    <name name-as-sort-order="all" and="symbol" sort-separator=", "
          initialize-with=". " delimiter=", "
          delimiter-precedes-last="always"/>
    <label form="short" prefix=" (" suffix=".)"
           text-case="capitalize-first"/>
    <substitute>
      <names variable="editor"/>
    </substitute>
  </names>
</macro>

and a bibliography entry

<layout suffix=".">
  <text macro="author" suffix="."/>
  <text macro="issued" suffix=" "/>
  <text macro="title"/>
  <names variable="editor"><!-- ... --></names>
</layout>

then, if there is no author, the second occurrence of the editor
variable should be ignored, and the editor should only be printed
once. This requires maintaining some kind of state regarding what
should be ignored. This construction remains from the first days of
CSL, and we could replace it with conditionals, but to do so would
require 20+ extra lines of not particularly intuitive logic for most
author-date styles.
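
In very rough terms, the extra state might look something like this
(everything here is invented for illustration -- formatNames() and the
rest are placeholders, not csl.js code):

// remember which variables a <substitute> has already consumed
var state = { substitutedVariables: {} };

function formatNames(names) { return names.join(", "); } // placeholder

function evalNames(item, variable, substitutes, state) {
    if (state.substitutedVariables[variable]) {
        return ""; // suppressed by an earlier substitution
    }
    if (item[variable] && item[variable].length) {
        return formatNames(item[variable]);
    }
    for (var i = 0; i < substitutes.length; i++) {
        var sub = substitutes[i];
        if (item[sub] && item[sub].length) {
            state.substitutedVariables[sub] = true; // e.g. editor: skip it later
            return formatNames(item[sub]);
        }
    }
    return "";
}

// with no author, the editor substitutes in, and the later occurrence
// of the editor variable prints nothing
var item = { editor: ["Roe, R."] };
print(evalNames(item, "author", ["editor"], state)); // Roe, R.
print(evalNames(item, "editor", [], state));         // (empty)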

Frank Bennett

unread,
Feb 6, 2009, 2:05:12 AM2/6/09
to zotero-dev
On Jan 26, 5:53 am, skornblith <si...@simonster.com> wrote:
> The Zotero CSL parser is certainly not the cleanest piece of code, as
> it has evolved from a much different version of CSL to its current
> representation. Cleaning it up is a good idea, and I would be very
> receptive to attempts to do so. I haven't gotten around to it because
> I have very little time to contribute at the moment, and at this point
> the parser is fairly stable.
>
> I don't think either JQuery or E4X ought to be a requirement for a
> functional CSL parser. The current parser's use of the E4X predicate
> filter is minimal and could be trivially avoided. The main advantage
> to E4X is that the whole CSL can be manipulated as an object. If your
> approach is to compile everything from the start, DOM XML should be
> sufficient, although probably a little messier and more annoying.
>
> The sample code that Erik wrote is very nice, and much cleaner than
> the current implementation! If there is interest in continuing this
> work, I would be happy to provide help/clarification where necessary.
> The biggest sticking point that I foresee is the <substitute> element
> of <names>. If macros are compiled independently of the CSL, then to
> implement this feature properly, one must keep track of another state
> besides item/citation. This can certainly be worked out, but it's
> probably better to plan out a solution in advance than to run into
> this issue later in the coding process.

While I can't make a firm promise to come up with something useful, I
would like to take a look at this problem, at least if no one else is
likely to latch hold to it in the short term. At the least, I might
be able to do some of the routine work of producing functions to cover
attribute primitives to be digested by a compiler. I'll be slow, but
having raised this issue, it doesn't seem right to just walk away.

I'm familiar with using test suites in Python, and I'd like to have
something similar in place to exercise scraps of javascript. Are
there any recommended tools for this? On a quick look around, JSUnit
seems to be popular. Is that likely to work, or should I look
elsewhere?

Frank Bennett

Bruce D'Arcus

unread,
Feb 6, 2009, 8:48:57 AM2/6/09
to zoter...@googlegroups.com
On Fri, Feb 6, 2009 at 2:05 AM, Frank Bennett <bierc...@gmail.com> wrote:

> While I can't make a firm promise to come up with something useful, I
> would like to take a look at this problem, at least if no one else is
> likely to latch hold to it in the short term.

If you manage to make progress, it might be good to put this in a
public SCM repo so that others might contribute as time permits.

Bruce

Frank Bennett

unread,
Feb 6, 2009, 9:13:29 PM2/6/09
to zotero-dev


On Feb 6, 10:48 pm, "Bruce D'Arcus" <bdar...@gmail.com> wrote:
> On Fri, Feb 6, 2009 at 2:05 AM, Frank Bennett <biercena...@gmail.com> wrote:
> > While I can't make a firm promise to come up with something useful, I
> > would like to take a look at this problem, at least if no one else is
> > likely to latch hold to it in the short term.
>
> If you manage to make progress, it might be good to put this in a
> public SCM repo so that others might contribute as time permits.

Absolutely. After I get some sort of test framework in place, that
must happen.



> Bruce

Frank Bennett

unread,
Feb 8, 2009, 3:55:26 PM2/8/09
to zotero-dev
Small steps. I see the test suite for Mozilla itself now. Runs from
the command line, I guess that will be the way to go.


> > Bruce

Bruce D'Arcus

unread,
Feb 8, 2009, 4:13:04 PM2/8/09
to zoter...@googlegroups.com
On Sun, Feb 8, 2009 at 3:55 PM, Frank Bennett <bierc...@gmail.com> wrote:

> Small steps. I see the test suite for Mozilla itself now. Runs from
> the command line, I guess that will be the way to go.

I really know nothing about JS testing, but why would it follow that
you'd use a Mozilla-specific testing option?

BTW, not sure it's relevant, but I recently came across this:

<http://ejohn.org/blog/fireunit/>

Bruce

Frank Bennett

unread,
Feb 8, 2009, 4:40:26 PM2/8/09
to zotero-dev
On Feb 9, 6:13 am, "Bruce D'Arcus" <bdar...@gmail.com> wrote:
> On Sun, Feb 8, 2009 at 3:55 PM, Frank Bennett <biercena...@gmail.com> wrote:
> > Small steps.  I see the test suite for Mozilla itself now.  Runs from
> > the command line, I guess that will be the way to go.
>
> I really know nothing about JS testing, but why would it follow that
> you'd use a Mozilla-specific testing option?

Because I don't know what I'm doing. :/

The Mozilla JS unit test page I was looking at was out of date; the
archive hierarchy has changed since it was written, and a quick flip
through the existing sources was not enlightening.

> BTW, not sure it's relevant, but I recently came across this:
>
> <http://ejohn.org/blog/fireunit/>

Looks interesting; I'll try to learn more about this. A lot of the
testing stuff for JS seems to be aimed at testing UI behaviour in the
browser, embedding the testing code as a tagged environment in some
sort of web page. I'm not sure that's right for this task. I've used
the Python and Plone test case suites in the past, and I'd like to
find something similar that I can run quickly from the command line --
to be honest, I'm about point-and-clicked out from my recent efforts
to abuse Zotero. Several tools also seem to be cast in Java, and I'd
prefer to avoid learning a new language in its entirety just to get
simple tests up and running.

One of the commenters on the fireunit page at the link mentions
FireWatir:

http://wiki.openqa.org/display/WTR/FireWatir

I don't know anything about this, but it might be a way forward.


> Bruce

Frank Bennett

unread,
Feb 8, 2009, 5:04:48 PM2/8/09
to zotero-dev
May have found one. This looks like it might do the trick:

http://www.notesfromatooluser.com/2008/11/unit-testing-in-javascript.html

Frank


> > Bruce

Frank Bennett

unread,
Feb 8, 2009, 5:54:06 PM2/8/09
to zotero-dev
>  http://www.notesfromatooluser.com/2008/11/unit-testing-in-javascript....

The D.O.H. test framework does look good. It's apparently run from
Rhino in the command line interface, which means JS only, no browser
functionality. I assume that for this task that's not going to be a
problem, but if that's mistaken ... please let me know.




> Frank
>
> > > Bruce

Bruce D'Arcus

unread,
Feb 8, 2009, 7:45:32 PM2/8/09
to zoter...@googlegroups.com
On Sun, Feb 8, 2009 at 5:54 PM, Frank Bennett <bierc...@gmail.com> wrote:

> The D.O.H. test framework does look good. It's apparently run from
> Rhino in the command line interface, which means JS only, no browser
> functionality. I assume that for this task that's not going to be a
> problem, but if that's mistaken ... please let me know.

Well, my perspective: the CSL-related processing should care nothing
at all about the browser, or the application: it's just taking some
input (CSL file and data, probably as JSON) and generating output.

I would think that for Zotero there'd just be some little driver code
that maps the Mozilla storage stuff to the processing input.

But Dan or Simon obviously know much more about the
Zotero/Firefox-specific details.

Bruce

Frank Bennett

unread,
Feb 8, 2009, 8:18:00 PM2/8/09
to zotero-dev
On Feb 9, 9:45 am, "Bruce D'Arcus" <bdar...@gmail.com> wrote:
I think I have the beginnings of a sandbox going, with a readline-
enabled JS at the command line, the test suite, and Erik's code.
Should be able to start building broken tests pretty soon. I'm
feeling a bit relieved to have the basics to hand, and will take a
look at Bruce's suggestions for repositories before starting to fiddle
around later in the week.

Rhino 1.7r1 was able to load and parse a CSL file successfully with
xml = new XML( ... ); this is all very new to me ... would that be
E4X at work?

I'm thinking that what this should do is generate a compiled CSL
object, with a method that accepts a data blob with a bunch of string
attributes (an Item), and spits out a nested list object. The non-
list elements in the list would be text blobs, each containing a
string, formatting hints, and an inherited rendering hook. The list
object would have a method that accepts a spec blob with info on how
to handle each of the hints, and would apply it to its content by
walking the list before spitting out a string object on completion.
This would separate the parsing and evaluation work from the
generation of the final string output, which should be a little easier
to build, follow and maintain. Would also make it easier to build
extensions for export formats other than HTML and RTF.
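
In toy form, and with every name invented on the spot, I mean
something like this (not working processor code, just the shape of
the idea):

// render a nested list of text blobs against a format spec
function renderBlobs(blobs, spec) {
    var out = "";
    for (var i = 0; i < blobs.length; i++) {
        var blob = blobs[i];
        if (blob instanceof Array) {
            out += renderBlobs(blob, spec); // nested list: recurse
        } else {
            var str = blob.str;
            if (blob.hint && spec[blob.hint]) {
                str = spec[blob.hint](str); // apply the rendering hook
            }
            out += (blob.prefix || "") + str + (blob.suffix || "");
        }
    }
    return out;
}

// what the compiled style's evaluate(Item) step might hand back
var blobs = [
    { str: "Doe, J.", suffix: " " },
    [ { str: "2009", prefix: "(", suffix: "). " } ],
    { str: "Some title", hint: "italic" }
];

// one spec per output format; here, HTML
var htmlSpec = { italic: function (s) { return "<i>" + s + "</i>"; } };
print(renderBlobs(blobs, htmlSpec));
// Doe, J. (2009). <i>Some title</i>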

If that general concept makes sense, I can start building little
pieces of tested code, which might eventually mature into an actual
CSL processor.


> Bruce

Bruce D'Arcus

unread,
Feb 8, 2009, 8:49:34 PM2/8/09
to zoter...@googlegroups.com
On Sun, Feb 8, 2009 at 8:18 PM, Frank Bennett <bierc...@gmail.com> wrote:

.....

> Rhino 1.7r1 was able to load and parse a CSL file successfully with
> xml = new XML( ... ); this is all very new to me ... would that be
> E4X at work?

Yeah; no such support in regular JS (though libraries like JQuery can
provide similar kinds of convenience).

> I'm thinking that what this should do is generate a compiled CSL
> object, with a method that accepts a data blob with a bunch of string
> attributes (an Item), and spits out a nested list object. The non-
> list elements in the list would be text blobs, each containing a
> string, formatting hints, and an inherited rendering hook. The list
> object would have a method that accepts a spec blob with info on how
> to handle each of the hints, and would apply it to its content by
> walking the list before spitting out a string object on completion.
> This would separate the parsing and evaluation work from the
> generation of the final string output, which should be a little easier
> to build, follow and maintain. Would also make it easier to build
> extensions for export formats other than HTML and RTF.
>
> If that general concept makes sense, I can start building little
> pieces of tested code, which might eventually mature into an actual
> CSL processor.

Your explanation makes sense, but I'd strongly suggest you post
questions like this to the xbib dev list, since Andrea is the expert
in how to do this with a functional approach*, he's got a working
implementation behind him (which I've not yet used a lot, but my
testing shows it's solid, and really fast), and I don't believe he's
on this list ;-)

Bruce

* Though XSL is a functional language too, so I guess I have some
useful experience as well.

Frank Bennett

unread,
Feb 12, 2009, 6:46:27 PM2/12/09
to zotero-dev
On Feb 9, 10:49 am, "Bruce D'Arcus" <bdar...@gmail.com> wrote:
> On Sun, Feb 8, 2009 at 8:18 PM, Frank Bennett <biercena...@gmail.com> wrote:
>
> .....
>
> > Rhino 1.7r1 was able to load and parse a CSL file successfully with
> > xml = new XML( ... );  this is all very new to me ... would that be
> > E4X at work?
>
> Yeah; no such support in regular JS (though libraries like JQuery can
> provide similar kinds of convenience).

The slowly emerging code for a reimplementation of the CSL processor
is now up in SVN, with a hat tip to Bruce for access permission at
xbiblio:

SVN: svn co https://xbiblio.svn.sourceforge.net/svnroot/xbiblio/citeproc-js
citeproc-js
Browse: http://xbiblio.svn.sourceforge.net/viewvc/xbiblio/citeproc-js/

It's still in a rudimentary state, but starting to take shape.
@Simon, if you can spare a moment, now might be a good time to take a
quick look and see if I'm shooting myself in the foot in any obvious
way.

Now that the fundamentals of the output engine are done, I've been
thinking about how to handle the CSL file, and I've come to the same
conclusion as Bruce, that this can be done in native Javascript. The
parsing is hugely simplified by the fact that CSL files contain no
text nodes; if there are factory function primitives for each element,
we can build a compiler by just splitting the tags to a list, and
stepping through registering frozen functions as either child or
sibling until we run out of stuff to be processed. Then the item
object can be thrown at the compiler and Bob's your uncle. It'll take
a while to write, but the coding will be very simple once the basic
framework is in place. Hopefully it will run fast as well.

Frank

Bruce D'Arcus

unread,
Feb 12, 2009, 8:09:41 PM2/12/09
to zoter...@googlegroups.com
On Thu, Feb 12, 2009 at 6:46 PM, Frank Bennett <bierc...@gmail.com> wrote:

> Now that the fundamentals of the output engine are done, I've been
> thinking about how to handle the CSL file, and I've come to the same
> conclusion as Bruce, that this can be done in native Javascript. The
> parsing is hugely simplified by the fact that CSL files contain no

> text nodes; ....

Except, of course, some of the cs:info metadata (title, id, etc.). But
that'd be handled separately from the main macro, etc. stuff.

Bruce

Frank Bennett

unread,
Feb 12, 2009, 8:58:05 PM2/12/09
to zotero-dev
On Feb 13, 10:09 am, "Bruce D'Arcus" <bdar...@gmail.com> wrote:
> On Thu, Feb 12, 2009 at 6:46 PM, Frank Bennett <biercena...@gmail.com> wrote:
> > Now that the fundamentals of the output engine are done, I've been
> > thinking about how to handle the CSL file, and I've come to the same
> > conclusion as Bruce, that this can be done in native Javascript.  The
> > parsing is hugely simplified by the fact that CSL files contain no
> > text nodes; ....
>
> Except, of course, some of the cs:info metadata (title, id, etc.). But
> that'd be handled separately from the main macro, etc. stuff.

Yep, that stuff is someone else's problem. :)

> Bruce

skornblith

unread,
Feb 12, 2009, 9:45:08 PM2/12/09
to zotero-dev
On Feb 12, 3:46 pm, Frank Bennett <biercena...@gmail.com> wrote:
> On Feb 9, 10:49 am, "Bruce D'Arcus" <bdar...@gmail.com> wrote:
>
> > On Sun, Feb 8, 2009 at 8:18 PM, Frank Bennett <biercena...@gmail.com> wrote:
>
> > .....
>
> > > Rhino 1.7r1 was able to load and parse a CSL file successfully with
> > > xml = new XML( ... );  this is all very new to me ... would that be
> > > E4X at work?
>
> > Yeah; no such support in regular JS (though libraries like JQuery can
> > provide similar kinds of convenience).
>
> The slowly emerging code for a reimplementation of the CSL processor
> is now up in SVN, with a hat tip to Bruce for access permission at
> xbiblio:
>
> SVN:  svn co https://xbiblio.svn.sourceforge.net/svnroot/xbiblio/citeproc-js
> citeproc-js
> Browse: http://xbiblio.svn.sourceforge.net/viewvc/xbiblio/citeproc-js/
>
> It's still in a rudimentary state, but starting to take shape.
> @Simon, if you can spare a moment, now might be a good time to take a
> quick look and see if I'm shooting myself in the foot in any obvious
> way.
>
> Now that the fundamentals of the output engine are done, I've been
> thinking about how to handle the CSL file, and I've come to the same
> conclusion as Bruce, that this can be done in native Javascript.  The
> parsing is hugely simplified by the fact that CSL files contain no
> text nodes; if there are factory function primitives for each element,
> we can build a compiler by just splitting the tags to a list, and
> stepping through registering frozen functions as either child or
> sibling until we run out of stuff to be processed.  Then the item
> object can be thrown at the compiler and Bob's your uncle.  It'll take
> awhile to write, but the coding will very simple once the basic
> framework is in place.  Hopefully it will run fast as well.

From this post and xml.js, I get the impression that you're planning
on writing your own XML parser? This is unnecessary. There are two XML
APIs for JavaScript: E4X (which is supported only by SpiderMonkey,
Rhino, and Flash's ActionScript interpreter ATM), and DOM XML (which
is supported everywhere or nearly everywhere). DOM XML is a much
bulkier API than E4X, but certainly preferable to coding your own
parser. The method for initializing an XML document differs between IE
and all other browsers, but beyond this, the API is the standard W3C
DOM interface.
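
The familiar cross-browser pattern looks roughly like this (generic
browser code, nothing Zotero-specific):

function parseXml(str) {
    if (typeof DOMParser != "undefined") {
        // Mozilla, WebKit, Opera
        return new DOMParser().parseFromString(str, "text/xml");
    } else {
        // Internet Explorer
        var doc = new ActiveXObject("Microsoft.XMLDOM");
        doc.async = false;
        doc.loadXML(str);
        return doc;
    }
}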

Also, you might want to consider using the JSDoc Toolkit
(http://code.google.com/p/jsdoc-toolkit/) when commenting your code.
At the moment, its use in the Zotero codebase is pretty sparse, but it
provides very useful interface overviews with little effort.

Otherwise, the code looks very promising so far. Keep up the good
work!

Simon

Frank Bennett

unread,
Feb 13, 2009, 5:16:00 AM2/13/09
to zotero-dev
Simon,

Thanks for taking a look. I've never tackled something like this
before, it gives me that little bit of necessary courage.

I spent some time exploring E4X/DOM yesterday, and I'm not sure
whether the lights will go on for me before the very small amount of
parsing code that this will need has already been written.

Out of the box, my Rhino happily loads text via XML(<string>), but the
instance has no methods and only a single node, which doesn't seem
quite right. I'm not sure whether it's meant to be DOM or E4X, but
after tugging at the object a few times, I concluded it just wasn't
working. I checked around the Net and found a note that xbean.jar is
needed for Rhino to run E4X. So I grabbed the sources, then installed
ant, then scratched around to find a javac compiler. Then did some
incantations to get ant to find the compiler, and then compiled the
sources, and found xbean.jar in the release directory. Then I went to
bed. Still don't know whether xbean.jar will change Rhino's behavior,
and still clueless about how to correctly load a document to the
parser (or what dialect of object it will deliver). I also seem to
have an allergy that is specific to XML documentation, since I've
never been able to grasp what it's trying to say in cookbook terms
that I can understand. So ... I'm getting a little bit gun-shy of all
this XML type of stuff. :)

Despite the rant, I'm not whining; it's just that the heavy lifting is
all in the compiler end of things, and timewise it probably makes more
sense for me to stick with what I know, clear this little hurdle for
the present, and get on with that end of things. If the bundle is
well structured and readable, it can easily be adapted later to DOM or
E4X by someone who knows those tools, if that's the better way to go
in the long term.

> Also, you might want to consider using the JSDoc Toolkit (http://
> code.google.com/p/jsdoc-toolkit/) when commenting your code. At the
> moment, its use in the Zotero codebase is pretty sparse, but it
> provides very useful interface overviews with little effort.

Will definitely explore this.

Bruce

unread,
Feb 13, 2009, 10:08:22 AM2/13/09
to zotero-dev


On Feb 13, 5:16 am, Frank Bennett <biercena...@gmail.com> wrote:

...

> Out of the box, my Rhino happily loads text via XML(<string>), but the
> instance has no methods and only a single node, which doesn't seem
> quite right.  I'm not sure whether it's meant to be DOM or E4X, but
> after tugging at the object a few times, I concluded it just wasn't
> working.  I checked around the Net and found a note that xbean.jar is
> needed for Rhino to run E4X.  So I grabbed the sources, then installed
> ant, then scratched around to find a javac compiler.  Then did some
> incantations to get ant to find the compiler, and then compiled the
> sources, and found xbean.jar in the release directory.  Then I went to
> bed.  Still don't know whether xbean.jar will change Rhino's behavior,
> and still clueless about how to correctly load a document to the
> parser (or what dialect of object it will deliver).  I also seem to
> have an allergy that is specific to XML documentation, since I've
> never been able to grasp what it's trying to say in cookbook terms
> that I can understand.  So ... I'm getting a little bit gun-shy of all
> this XML type of stuff.  :)

When Simon mentioned that he thought you might be writing your own
parser, what did he mean?

One thing that complicates XML is namespaces. For example, this:

<root xmlns="http://ex.net"/>

.. is the same as:

<foo:root xmlns:foo="http://ex.net"/>

APIs usually account for that.

You have three options for the CSL parsing:

1) use the standard JS DOM stuff; have not used it myself, but I
understand it's not the easiest thing to work with

2) use E4X (as Simon did); I'm sure this is significantly easier than 1

3) use a third-party JS library like JQuery; this gives the benefit of
1 and 2, with some added weight

My hunch is that you might start with 1 and see how that works. If
it's too onerous, then things get more complicated. One option might
be to have a csl-e4x.js file and leave room to port it to the third
option.

> Despite the rant, I'm not whining; it's just that the heavy lifting is
> all in the compiler end of things, and timewise it probably makes more
> sense for me to stick with what I know, clear this little hurdle for
> the present, and get on with that end of things.  If the bundle is
> well structured and readable, it can easily be adapted later to DOM or
> E4X by someone who knows those tools, if that's the better way to go
> in the long term.

Yeah; just keep that in mind as a goal though.

> > Also, you might want to consider using the JSDoc Toolkit (http://
> > code.google.com/p/jsdoc-toolkit/) when commenting your code. At the
> > moment, its use in the Zotero codebase is pretty sparse, but it
> > provides very useful interface overviews with little effort.
>
> Will definitely explore this.

Yes, +1; these sorts of tools are really useful.

Bruce

Robert Forkel

unread,
Feb 13, 2009, 11:01:14 AM2/13/09
to zoter...@googlegroups.com
if i understand the code here
http://xbiblio.svn.sourceforge.net/viewvc/xbiblio/citeproc-js/src/xml.js?revision=545&view=markup
correctly, i'd say that is implementing your own parser.
you may want to have a look here:
http://www.w3schools.com/Xml/xml_parser.asp
to see how to instantiate a proper xml parser (unfortunately you'll
need a couple try..catch blocks to make it work in all browsers). the
dom you get from the parser will provide methods like
"getElementsByTagName" which are sort of clumsy to use, but at least
you don't have to deal with xml intricacies like encodings and
namespaces and so on.
regards,
robert

Bruce D'Arcus

unread,
Feb 13, 2009, 11:06:02 AM2/13/09
to zoter...@googlegroups.com
On Fri, Feb 13, 2009 at 11:01 AM, Robert Forkel <xrot...@googlemail.com> wrote:
>
> if i understand the code here
> http://xbiblio.svn.sourceforge.net/viewvc/xbiblio/citeproc-js/src/xml.js?revision=545&view=markup
> correctly, i'd say that is implementing you own parser.

Yeah, you're right; and that will fail on valid CSL files that use
namespace prefixes.

Bruce

Frank Bennett

unread,
Feb 13, 2009, 6:37:16 PM2/13/09
to zotero-dev
On Feb 14, 1:06 am, "Bruce D'Arcus" <bdar...@gmail.com> wrote:
> On Fri, Feb 13, 2009 at 11:01 AM, Robert Forkel <xrotw...@googlemail.com> wrote:
>
> > if i understand the code here
> >http://xbiblio.svn.sourceforge.net/viewvc/xbiblio/citeproc-js/src/xml...
> > correctly, i'd say that is implementing you own parser.
>
> Yeah, you're right; and that will fail on valid CSL files that use
> namespace prefixes.

Okay, I've gotten the e4x example in the Rhino distribution archive to
run successfully. There was a small error in the syntax of one
example that had it crashing, but now I see how it works. Could use
this, and it certainly looks like it would be convenient for many
purposes, but I still don't see how it would make this particular task
any simpler. Let me implement something basic, and then see where we
stand.

Meanwhile, keep those eggs and rotten tomatoes coming. :)

Frank


> Bruce

Frank Bennett

unread,
Feb 13, 2009, 6:37:40 PM2/13/09
to zotero-dev
On Feb 14, 1:06 am, "Bruce D'Arcus" <bdar...@gmail.com> wrote:
> On Fri, Feb 13, 2009 at 11:01 AM, Robert Forkel <xrotw...@googlemail.com> wrote:
>
> > if i understand the code here
> >http://xbiblio.svn.sourceforge.net/viewvc/xbiblio/citeproc-js/src/xml...
> > correctly, i'd say that is implementing you own parser.
>
> Yeah, you're right; and that will fail on valid CSL files that use
> namespace prefixes.

Submit a test case.

> Bruce

Frank Bennett

unread,
Feb 14, 2009, 5:53:52 AM2/14/09
to zotero-dev
On Feb 14, 1:01 am, Robert Forkel <xrotw...@googlemail.com> wrote:
> if i understand the code here http://xbiblio.svn.sourceforge.net/viewvc/xbiblio/citeproc-js/src/xml...
> correctly, i'd say that is implementing you own parser.
> you may want to have a look here: http://www.w3schools.com/Xml/xml_parser.asp
> to see how to instantiate a proper xml parser (unfortunately you'll
> need a couple try..catch blocks to make it work in all browsers). the
> dom you get from the parser will provide methods like
> "getElementsByTagName" which are sort of clumsy to use, but at least
> you don't have to deal with xml intricacies like encodings and
> namespaces and so on.
> regards,
> robert

I've checked in code for a parser, together with tests proving that it
captures attributes correctly, copes with namespaces and comments, and
handles escaped quotes within attribute arguments. I've used the APA
style for some of the test cases, and the parser chews through it
without complaint. It's a cute little thing at 133 lines of code, and
has the advantage of allowing the CSL to be handled simultaneously as
a list and as a nested tag environment, which is the functionality I
was after. From here, we can build a compiled function from the CSL
with a simple iterator. After that's out of the way, it will just be
a matter of writing the factory functions for the compiler, which will
be extremely straightforward once the first few examples have been
done and tested.

This currently stands at 500 lines of code, including the beginnings
of the output engine and some utility functions copied in from
Zotero. With the current version of csl.js weighing in at 2962 lines,
we still have some room to maneuver.

Frank

Robert Forkel

unread,
Feb 14, 2009, 6:15:00 AM2/14/09
to zoter...@googlegroups.com
hi frank,
i wouldn't want to make things more complicated than necessary. but
here's another xml intricacy which i actually do use quite a bit:
named entities. you can define named strings in a DOCTYPE declaration
in xml which you can then reference in the body of the document using
the &name; syntax.
here's a small example (run in the firebug interactive interpreter):

>>> var xml = '<?xml version="1.0"?>\n<!DOCTYPE r [<!ENTITY t "text">]>\n<r>&t;</r>';
>>> var p = new DOMParser();
>>> var doc=p.parseFromString(xml,"text/xml");
>>> doc.getElementsByTagName('r')[0].firstChild.nodeValue
"text"

note the reference &t; in the content of the r element got replaced by
"text" as mandated by the xml spec.
so a full xml parser is probably not something you want to build from scratch.
regards,
robert

skornblith

unread,
Feb 16, 2009, 12:06:07 AM2/16/09
to zotero-dev
Frank,

I've taken a look at the new parser code. I recognize that you have
specific reasons for using regexps to parse the XML, but from the code
that's there so far, those reasons still aren't clear to me. The DOM
parser does actually allow you to handle XML as both a list and a
nested tag environment. Getting all tags with a certain tag name at
any level of the hierarchy requires no code at all: you can simply use
the getElementsByTagNameNS() method. If for some reason you need a
list of all tags, it's still simpler than parsing the entire document
with regexps:

// this code not actually tested
var element = xmlDoc.documentElement;
var newlist = new Array();
while(true) {
    if(element.nodeType == 1) { // element node
        newlist.push(this.createTag(element));
    }

    if(element.hasChildNodes()) {
        element = element.firstChild;
    } else {
        // no children: climb back up until a next sibling is found,
        // stopping once we are back at the document element
        while(!element.nextSibling && element != xmlDoc.documentElement) {
            element = element.parentNode;
        }
        if(element == xmlDoc.documentElement) { // top level
            break;
        }
        element = element.nextSibling;
    }
}

To get all nodes at a given level of the hierarchy, you can use the
childNodes attribute. The advantage of parsing everything into a flat
list and then recreating a hierarchy is really not clear to me. It
would be possible (indeed, very easy) to rewrite xml.js to use the DOM
parser, but in the end, it seems to me that you will end up
reinventing a lot of functionality that the DOM parser already
provides in builder.js as well. Additionally, since you are going to
need to use the DOM parser to resolve XML entities anyway, you will
definitely take a speed hit by parsing the document twice. If there
are specific situations where you aren't sure whether it's possible to
use DOM methods to get the desired behavior, I'd be happy to discuss
these issues further, but otherwise, it seems like the regexp parsing
methods will only add cruft to the code.

Simon

Frank Bennett

unread,
Feb 16, 2009, 12:42:45 AM2/16/09
to zotero-dev
Thanks and apologies from this end. That code was a byproduct of
being hemmed in on three sides, unable to get something DOM-ish
working in Rhino, unsure of what the DOM would deliver, and unable to
change development environments. But I got E4X running this morning,
and I now see (with painful clarity) that all of this is so.

The pain has not been without benefit, though. I now see clearly how
things could fit together with a minimum of code and a minimum of
fuss. Elements and attributes could be defined as template functions,
and invoked with call or apply on the E4X XML objects themselves.
With a few little helper functions to choose regions of the style for
execution, this should basically do it.

I've torn out all of the parser stuff, and am replacing it with code
that instantiates E4X, and gives wrappers to the E4X commands needed
for addressing, in case someone decides to use another DOM
implementation.

So I think I'm back on track. To avoid another cartwheel, though, I
should probably ask for your take on this latest idea -- is it right
that there is no point in slurping names and parameters from the DOM
into Javascript objects? If so, this will be very straightforward and
quick to implement. I love it when that happens, but I'd like to
check that I'm not being mistaken again (!).

Frank

Frank Bennett

unread,
Feb 16, 2009, 7:23:03 AM2/16/09
to zotero-dev
Progress here. I've checked in code for a small recursive execution
function that operates on an E4X object, together with one test as a
quick demo. Still needs to be thoroughly tested, and there are no
wrapper functions yet, but you should be able to throw arbitrary
portions of the CSL object at it, together with an Item object, and
get back the string (or set up configuration, or whatever) that the
CSL is meant to generate for that chunk. One thing this will mean is
that the 150-odd lines of code in CSL.Global that extracts locale
strings and installs them as JS objects can go away; we can just merge
the locale into the CSL object prior to execution, and the wrappers
will grab the correct terms as a matter of course.
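
In miniature, the execution function amounts to something like the
following (a throwaway example that runs in E4X-enabled Rhino; the
element names and handlers here are invented, not the checked-in
code):

var style =
    <style>
      <citation>
        <layout>
          <text variable="title"/>
          <text variable="year" prefix=" (" suffix=")"/>
        </layout>
      </citation>
    </style>;

var handlers = {
    text: function (node, item) {
        var value = item[node.@variable.toString()] || "";
        return value
            ? node.@prefix.toString() + value + node.@suffix.toString()
            : "";
    }
};

function execute(node, item) {
    var result = "";
    if (handlers[node.localName()]) {
        result += handlers[node.localName()](node, item);
    }
    for each (var child in node.children()) { // E4X iteration
        result += execute(child, item);
    }
    return result;
}

print(execute(style, { title: "A Title", year: "2009" }));
// A Title (2009)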

I want to be the first to point out that this would never have worked
with the parser-that-Frank-built. :)

Bruce D'Arcus

unread,
Feb 16, 2009, 11:48:39 AM2/16/09
to zoter...@googlegroups.com
On Mon, Feb 16, 2009 at 7:23 AM, Frank Bennett <bierc...@gmail.com> wrote:

> Progress here. I've checked in code for a small recursive execution
> function that operates on an E4X object, together with one test as a
> quick demo. Still needs to be thoroughly tested, and there are no
> wrapper functions yet, but you should be able to throw arbitrary
> portions of the CSL object at it, together with an Item object, and
> get back the string (or set up configuration, or whatever) that the
> CSL is meant to generate for that chunk. One thing this will mean is
> that the 150-odd lines of code in CSL.Global that extracts locale
> strings and installs them as JS objects can go away; we can just merge
> the locale into the CSL object prior to execution, and the wrappers
> will grab the correct terms as a matter of course.

Just try to leave room for non-E4X implementations.

I just recalled, for example, that the JQuery project has abstracted
out their selector engine for possible implementation in other
frameworks. Here's an (XML) example from the unit tests:

jQuery.get('data/dashboard.xml', function(xml) {
  var titles = [];
  jQuery('tab', xml).each(function() {
    titles.push(jQuery(this).attr('title'));
  });
  equals( titles[0], 'Location', 'attr() in XML context: Check first title' );
  equals( titles[1], 'Users', 'attr() in XML context: Check second title' );
  start();
});

So the "jQuery('tab', xml).each" bit iterates through all "tab"
elements in the file, and the "jQuery(this).attr('title')" pulls out
the title attributes.

An interesting integration of tests, BTW.

<http://github.com/jeresig/sizzle/tree/master>

Bruce

Frank Bennett

unread,
Feb 16, 2009, 5:27:52 PM2/16/09
to zotero-dev
On Feb 17, 1:48 am, "Bruce D'Arcus" <bdar...@gmail.com> wrote:
What's in place so far would be simple to adapt, because all the E4X
is in two functions (CSL.Xml and the private function
CSL.Builder.setXml). I'm wondering now whether Build should be used
to execute the XML directly for each citation item, or to pre-compile
the style into a nested list of functions that can be executed on the
item by a secondary wrapper. If JQuery's portable functionality is
read-only, that would be an argument for the latter approach. One of
the neat possibilities that this opens up is overloading of portions
of the XML, which would simplify the handling of locales. It would
also allow macro overloading, so that dependent styles don't have to
copy all of the code from the master version in the sources. If
JQuery is read only against the XML, pre-compiling would make sense.
It would probably also run faster. So I guess that's the next thing
to look at.

It shouldn't be too difficult, actually. We can fashion a copy of the
CSL in ordinary JS objects that offer the sole E4X method that is used
for the walk (node.children()). Then a second wrapper (same as the
first, for E4X) would be able to walk that tree once for each overload
operation, replacing the relevant function. After that, we're right
back to the simple model, with everything broken out into individual
functions for each attribute and element.
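
A quick throwaway illustration of the overload step (names invented;
this isn't the real builder code):

// plain JS nodes exposing the same children() call the E4X walk uses
function Node(name, render) {
    this.name = name;
    this.render = render;
    this.kids = [];
}
Node.prototype.children = function () { return this.kids; };

// walk the copied tree once per overload, swapping in the replacement
function overload(node, targetName, replacement) {
    if (node.name == targetName) {
        node.render = replacement;
    }
    var kids = node.children();
    for (var i = 0; i < kids.length; i++) {
        overload(kids[i], targetName, replacement);
    }
}

var root = new Node("layout", function () { return ""; });
root.kids.push(new Node("macro-author",
    function () { return "master author form"; }));

// a locale or dependent style replaces just the piece it cares about
overload(root, "macro-author",
    function () { return "localized author form"; });
print(root.children()[0].render()); // localized author form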

This should come together pretty quickly once this basic
infrastructure is in place. A lot of the code for the tricky bits
(names, shudder) can be copied across from the existing
implementation. Doing the tests will take some work, but once they're
done, we'll know that it's rock solid.

Frank

Frank Bennett

unread,
Feb 18, 2009, 9:29:25 PM2/18/09
to zotero-dev
On Feb 17, 7:27 am, Frank Bennett <biercena...@gmail.com> wrote:
> On Feb 17, 1:48 am, "Bruce D'Arcus" <bdar...@gmail.com> wrote:
[snip]
> What's in place so far would be simple to adapt, because all the E4X
> is in two functions (CSL.Xml and the private function
> CSL.Builder.setXml).  I'm wondering now whether Build should be used
> to execute the XML directly for each citation item, or to pre-compile
> the style into a nested list of functions that can be executed on the
> item by a secondary wrapper.  If JQuery's portable functionality is
> read-only, that would be an argument for the latter approach.  One of
> the neat possibilities that this opens up is overloading of portions
> of the XML, which would simplify the handling of locales.  It would
> also allow macro overloading, so that dependent styles don't have to
> copy all of the code from the master version in the sources.  If
> JQuery is read only against the XML, pre-compiling would make sense.
> It would probably also run faster.  So I guess that's the next thing
> to look at.
>
> It shouldn't be too difficult, actually.  We can fashion a copy of the
> CSL in ordinary JS objects that offer the sole E4X method that is used
> for the walk (node.children()).  Then a second wrapper (same as the
> first, for E4X) would be able to walk that tree once for each overload
> operation, replacing the relevant function.  After that, we're right
> back to the simple model, with everything broken out into individual
> functions for each attribute and element.

I've finally settled on a design, after several rewrites and
cul-de-sacs. The style will be represented in memory as a flat list
of bare JS objects, with no nesting. Each object will carry a set of
functions to perform the steps appropriate at that stage of
processing. There will be no jiggery-pokery over locale terms or
macros; these will be exploded into the tag list by the compiler, so
that execution is in a straight line, start to finish. All of the
awkward conditional decision tree navigation stuff will be sited in
the compiler; at runtime, the only string conditionals executed will
be the specific comparisons needed to evaluate attributes against
item content. The new design should speed things up a little (that's
probably putting it mildly).
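
As a mock-up (invented names, not the real compiler output), the
runtime side reduces to something like this:

// the compiler explodes macros and locale terms into one linear list
// of tokens, so the runtime just walks the list once, top to bottom
var tokens = [
    { exec: function (item, out) { out.push(item.author); } },
    { exec: function (item, out) { out.push(" ("); } },
    { exec: function (item, out) { out.push(item.year); } },
    { exec: function (item, out) { out.push("). "); } },
    { exec: function (item, out) {
        if (item.title) { out.push(item.title); } } }
];

function processItem(tokens, item) {
    var output = [];
    for (var i = 0; i < tokens.length; i++) {
        tokens[i].exec(item, output); // straight line, no tree walking
    }
    return output.join("");
}

print(processItem(tokens,
    { author: "Doe, J.", year: "2009", title: "Some title" }));
// Doe, J. (2009). Some title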

I'm including execution wrappers for the DOM navigation commands, to
make it easy to port to non-E4X systems.

Frank

Frank Bennett

unread,
Feb 24, 2009, 10:51:01 PM2/24/09
to zotero-dev
On Feb 13, 8:46 am, Frank Bennett <biercena...@gmail.com> wrote:
> On Feb 9, 10:49 am, "Bruce D'Arcus" <bdar...@gmail.com> wrote:
>
> > On Sun, Feb 8, 2009 at 8:18 PM, Frank Bennett <biercena...@gmail.com> wrote:
>
> > .....
>
> > > Rhino 1.7r1 was able to load and parse a CSL file successfully with
> > > xml = new XML( ... );  this is all very new to me ... would that be
> > > E4X at work?
>
> > Yeah; no such support in regular JS (though libraries like JQuery can
> > provide similar kinds of convenience).
>
> The slowly emerging code for a reimplementation of the CSL processor
> is now up in SVN, with a hat tip to Bruce for access permission at
> xbiblio:
>
> SVN:  svn co https://xbiblio.svn.sourceforge.net/svnroot/xbiblio/citeproc-js
> citeproc-js
> Browse: http://xbiblio.svn.sourceforge.net/viewvc/xbiblio/citeproc-js/
>
> It's still in a rudimentary state, but starting to take shape.
> @Simon, if you can spare a moment, now might be a good time to take a
> quick look and see if I'm shooting myself in the foot in any obvious
> way.
>
> Now that the fundamentals of the output engine are done, I've been
> thinking about how to handle the CSL file, and I've come to the same
> conclusion as Bruce, that this can be done in native Javascript.  The
> parsing is hugely simplified by the fact that CSL files contain no
> text nodes; ...

For posterity, I'll just note here that this was one of several things
I was wrong about when I started in on this. Locale terms are text
nodes, so whatever parser is used needs to be able to recognize and
handle them... which goes to show ... well ... what Robert said.

Frank

> ... if there are factory function primitives for each element,