CSL processor news

Frank Bennett

Mar 3, 2009, 8:41:39 PM

to zotero-dev

I've been doing some work on a new CSL processor (citeproc-js), and
have made some progress. The next academic term is approaching,
though, and I'll be battening down the work on citeproc-js over the
next few days. I'll be pretty much leaving the code alone until
sometime during the summer, but I don't claim ownership of it, and any
work by others on the project will be very welcome as far as I'm
concerned. (In fact, it's probably better for the long term if I'm
not the primary maintainer. I'm a hobbyist, my skill level is not
that high, and my ability to focus on programming issues varies with
the season.) Before downing tools, I'll go through the code to update
the comments and bring them into line with the state of the code. The
test suites don't show much organization, but I'll probably leave
those alone for the present.

It's been an exciting ride over the past month. There's a lot still
to do, but most of the seriously worrisome issues have been cleared.
Here are some of the highlights:

- The commented code is available online at:
http://gsl-nagoya-u.net/http/pub/citeproc-js-doc/index.html

- The sources are available at: http://xbiblio.svn.sourceforge.net/viewvc/xbiblio/citeproc-js/

- There are 93 tests in the test suite covering the work to date, all
of which pass.

- CSL.Build and CSL.Configure assemble a style object containing only
the functions needed to render the requested style (no more if/then/
else over the entire CSL language to determine the operation to
perform).

- CSL.Render produces output from the compiled style. End-to-end
input/output testing of added functionality is now possible.

- Locale loading and overloading from the CSL style both work
correctly.

- Conditional branching is implemented efficiently and works
correctly.

- Macros are resolved in CSL.Build and vanish, simplifying the rest of
the code base.

- The E4X XML parser has been encased in a wrapper, and the wrapper
has been tested against both E4X and a homebrew JavaScript-only parser
bundled with the code. Support for other parsers can be added with
very little additional effort.

- The beginnings of a similarly modular system for retrieving Item
data are in place, which will permit the engine to accept item keys as
well as full Item objects as input.

- Persistent state awareness across a series of cites (i.e. within a
"citation") is available, and can be exploited to control
capitalization and splicing.
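
The build-then-render idea in the list above (a style compiled into just the functions it needs, with macros resolved at build time) can be illustrated with a minimal sketch. This is not the citeproc-js code: `buildStyle`, `render`, and the toy style format are hypothetical names invented for illustration, and the syntax is modern JavaScript rather than what Rhino accepted in 2009.

```javascript
// Hypothetical sketch: none of these names are citeproc-js APIs.
// Build step: turn a tiny style description into a flat list of closures,
// so rendering never has to re-inspect the style language.
function buildStyle(nodes, macros) {
  const tokens = [];
  for (const node of nodes) {
    if (node.macro) {
      // Macros are expanded here and no longer exist at render time.
      tokens.push(...buildStyle(macros[node.macro], macros));
    } else if (node.text) {
      const variable = node.text;
      const prefix = node.prefix || "";
      const suffix = node.suffix || "";
      // Each token is a function that already knows what it has to do.
      tokens.push((item, out) => {
        if (item[variable]) out.push(prefix + item[variable] + suffix);
      });
    }
  }
  return tokens;
}

// Render step: run the compiled functions in order against one item.
function render(tokens, item) {
  const out = [];
  for (const fn of tokens) fn(item, out);
  return out.join("");
}

const macros = {
  author: [{ text: "author", suffix: ". " }],
};
const style = [
  { macro: "author" },
  { text: "title", suffix: " " },
  { text: "year", prefix: "(", suffix: ")" },
];

const compiled = buildStyle(style, macros);
const rendered = render(compiled, { author: "Doe", title: "A Book", year: "2009" });
console.log(rendered); // Doe. A Book (2009)
```

The point of the design is that all decisions about the style are made once, in the build step; the render loop only calls functions.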

Here are some things that still need to be done:

- Standard end-to-end test fixtures for CSL styling are needed, and
machinery to digest and apply them needs to be built.

- A disambiguation/sorting registry needs to be completed and tested,
then integrated into CSL.Render and tested again. Some framework code
for this is in tests/test_speed.js (the CSL.Registry code is wrong,
and should be redone from scratch, using the test_speed.js code as a
rough model).

- Separate opt/token areas, similar to state.citation and
state.bibliography, need to be established for producing sort keys and
for the disambiguation of entries according to parameters that still
need to be defined. These areas will be populated with tokens by
CSL.Build. For sort keys, this will be done with a build method on the
sort tag, and for disambiguation through a function invoked at the
close of /citation, after the various disambiguate attributes have
been collected from the style (the description is a little opaque, but
it should make sense when you look through the code). These two
special token renderers will be needed to get the registry going.

- Name formatting needs to be implemented. I'm guessing that 90% or
more of the remaining work to implement the engine lies here and in
disambiguation.

- Date formatting needs to be implemented, together with the localized
dates proposal recently added (I think) to CSL.

- Various options and special formatting attributes need to be
implemented.

- All of the additions above need to be rigorously tested, so that the
engine doesn't blow up in our faces when we turn on the switch (!).

I'll be following the list while I'm "away", so feel free to post
questions if you look at the code and something doesn't make sense.

Frank Bennett
Nagoya
2009-03-04

Frank Bennett

Mar 4, 2009, 4:56:28 AM

to zotero-dev

On Mar 4, 10:41 am, Frank Bennett <biercena...@gmail.com> wrote:
> I've been doing some work on a new CSL processor (citeproc-js), and
> have made some progress. The next academic term is approaching,
> though, and I'll be battening down the work on citeproc-js over the
> next few days. I'll be pretty much leaving the code alone until
> sometime during the summer, but I don't claim ownership of it, and any
> work by others on the project will be very welcome as far as I'm
> concerned. (In fact, it's probably better for the long term if I'm
> not the primary maintainer. I'm a hobbyist, my skill level is not
> that high, and my ability to focus on programming issues varies with
> the season.) Before downing tools, I'll go through the code to update
> the comments and bring them into line with the state of the code. The
> test suites don't show much organization, but I'll probably leave
> those alone for the present.

As part of my tidying up, I've checked in a fun little demo of code
for possible use in a disambiguation/sorting registry. The demo
generates 1,000 random titles using characters from Latin Extended-A,
and presents a sorted list in Rhino using the system locale. Apart
from testing the registry itself (an object with mixed characteristics
of a list and a hashed data store), it shows how well or poorly a
locale sort against arbitrary languages with Romance-ish alphabets
works out. It's a little slower than the original demo, but still
very serviceable for once-per-session instantiation.
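
For readers without the repository to hand, the general shape of such a demo might look like the sketch below. It is an assumption-laden illustration, not the checked-in code: `randomTitle` and `Registry` are hypothetical names, and `localeCompare` stands in for whatever locale-aware comparison the real demo uses under Rhino.

```javascript
// Hypothetical sketch; names and structure are assumptions, not the checked-in demo.

// Build a random title from code points in Latin Extended-A (U+0100..U+017F).
function randomTitle(length) {
  let title = "";
  for (let i = 0; i < length; i++) {
    title += String.fromCharCode(0x0100 + Math.floor(Math.random() * 0x80));
  }
  return title;
}

// A registry with list-like and hash-like behaviour:
// entries are reachable by key, and also as a locale-sorted list.
class Registry {
  constructor() {
    this.byKey = new Map();
    this.sortedList = [];
  }
  add(key, title) {
    this.byKey.set(key, { key, title });
  }
  // Sort once after loading, using the runtime's default locale.
  refresh() {
    this.sortedList = Array.from(this.byKey.values());
    this.sortedList.sort((a, b) => a.title.localeCompare(b.title));
  }
  get(key) {
    return this.byKey.get(key);
  }
}

const registry = new Registry();
for (let i = 0; i < 1000; i++) {
  registry.add("item" + i, randomTitle(12));
}
registry.refresh();

// Show the first few titles in sorted order (output varies per run).
console.log(registry.sortedList.slice(0, 5).map((entry) => entry.title));
```

Sorting once in `refresh()` rather than on every insert matches the once-per-session use described above; a registry that must stay sorted under incremental updates would need a different strategy.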

Enjoy!

Frank

Bruce D'Arcus

Mar 4, 2009, 12:42:40 PM

to zoter...@googlegroups.com, development discussion for xbiblio

Hi Frank,

On Tue, Mar 3, 2009 at 8:41 PM, Frank Bennett <bierc...@gmail.com> wrote:

> I've been doing some work on a new CSL processor (citeproc-js), and
> have made some progress. The next academic term is approaching,
> though, and I'll be battening down the work on citeproc-js over the
> next few days. I'll be pretty much leaving the code alone until
> sometime during the summer,

Whose "summer"; the one down south, or up north? So June, or December?

Also, just a couple more questions ...

> but I don't claim ownership of it, and any
> work by others on the project will be very welcome as far as I'm
> concerned. (In fact, it's probably better for the long term if I'm
> not the primary maintainer. I'm a hobbyist, my skill level is not
> that high, and my ability to focus on programming issues varies with
> the season.) Before downing tools, I'll go through the code to update
> the comments and bring them into line with the state of the code. The
> test suites don't show much organization, but I'll probably leave
> those alone for the present.
>
> It's been an exciting ride over the past month. There's a lot still
> to do, but most of the seriously worrisome issues have been cleared.

Can you estimate what percentage is complete?

> Here are some of the highlights:
>
> - The commented code is available online at:
> http://gsl-nagoya-u.net/http/pub/citeproc-js-doc/index.html
>
> - The sources are available at: http://xbiblio.svn.sourceforge.net/viewvc/xbiblio/citeproc-js/
>
> - There are 93 tests in the test suite covering the work to date, all
> of which pass.

With test-driven development, you write the tests first, then write
the code until they pass.

But given that all tests pass, yet you've said there's still "a lot to
do", I take it that's not exactly the approach you've taken.

So would it be fair to say that the next step really ought to be to
sort out the remaining tests?

If yes, do you have some input on what they should be?

Or, if you can find some remaining time, can you imagine starting to
put those in place so that others can enter and figure out how to make
them pass?

BTW, I converted your TODO.pdf to a text file in the repo for editing purposes.

Bruce

Sean Takats

Mar 4, 2009, 1:34:28 PM

to zoter...@googlegroups.com

Many, many thanks to Frank for getting the ball rolling on this
important task. I would note that his code includes incredibly
comprehensive comments which will greatly assist anyone who wishes to
join the effort. -Sean

Frank Bennett

Mar 4, 2009, 6:09:17 PM

to zoter...@googlegroups.com, development discussion for xbiblio

On Thu, Mar 5, 2009 at 2:42 AM, Bruce D'Arcus <bda...@gmail.com> wrote:
>
> Hi Frank,
>
> On Tue, Mar 3, 2009 at 8:41 PM, Frank Bennett <bierc...@gmail.com> wrote:
>
>> I've been doing some work on a new CSL processor (citeproc-js), and
>> have made some progress. The next academic term is approaching,
>> though, and I'll be battening down the work on citeproc-js over the
>> next few days. I'll be pretty much leaving the code alone until
>> sometime during the summer,
>
> Whose "summer"; the one down south, or up north? So June, or December?

Why, my summer, of course! Things will quieten down here again in July/August.

> Also, just a couple more questions ...
>
>> but I don't claim ownership of it, and any
>> work by others on the project will be very welcome as far as I'm
>> concerned. (In fact, it's probably better for the long term if I'm
>> not the primary maintainer. I'm a hobbyist, my skill level is not
>> that high, and my ability to focus on programming issues varies with
>> the season.) Before downing tools, I'll go through the code to update
>> the comments and bring them into line with the state of the code. The
>> test suites don't show much organization, but I'll probably leave
>> those alone for the present.
>>
>> It's been an exciting ride over the past month. There's a lot still
>> to do, but most of the seriously worrisome issues have been cleared.
>
> Can you estimate what percentage is complete?

I tend to be over-optimistic. But in terms of time, I'd say it's
maybe 40% done in the coding. It's about 1600 lines at the moment,
half the size of the csl.js in Zotero. I'd expect it to swell
significantly over the current implementation in total size, because
the definitions of individual attributes in citeproc-js are more
verbose at the compiler level (although the runtime it will generate
will be much more spartan).

>> Here are some of the highlights:
>>
>> - The commented code is available online at:
>> http://gsl-nagoya-u.net/http/pub/citeproc-js-doc/index.html
>>
>> - The sources are available at: http://xbiblio.svn.sourceforge.net/viewvc/xbiblio/citeproc-js/
>>
>> - There are 93 tests in the test suite covering the work to date, all
>> of which pass.
>
> With test-driven development, you write the tests first, then write
> the code until they pass.
>
> But given that all tests pass, yet you've said there's still "a lot to
> do", I take it that's not exactly the approach you've taken.

Oh, darn, I messed up _again_! Sorry about that, I'll try to do
better in the future. :)

But seriously, I felt my way in a spiral, with bits of code, then
tests, then rewriting of the code to make it more readable. Some
parts of the code have been rewritten three or four times as new
issues came up. I've watched XP teams work, and it's been a similar
process, except that I didn't have a programming partner and I was a
neophyte in the language when I started writing -- and I wrote a lot
more verbal commentary as I went along because I'm chatty by nature.
You gets what you pays for.

> So would it be fair to say that the next step really ought to be to
> sort out the remaining tests?
>
> If yes, do you have some input on what they should be?

Yep, absolutely. The only big piece of infrastructure still to be
built is the disambiguation/sorting registry. I can certainly provide
internal unit tests for that, if there's need.

For the CSL language, anyone building an engine would want to have at
least one test for each element, attribute and option, and test suites
for known hard cases (like et al., disambiguation, and sorting). It
would be great if you as the language designer could provide the
hard-case items, to be sure behaviour is defined as you intend.
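
As a rough illustration of what "one test per element, attribute and option" could look like in practice, here is a minimal fixture-driven runner. Everything in it is hypothetical: the fixture shape is not the layout proposed in data/README-3.txt, and `toyRender` is a stand-in for a real processor, included only so the sketch runs on its own.

```javascript
// Hypothetical fixture shape and runner; not the layout in data/README-3.txt.

// Stand-in for a real processor: renders one variable with optional affixes.
function toyRender(styleSpec, item) {
  const value = item[styleSpec.variable];
  if (value === undefined) return "";
  return (styleSpec.prefix || "") + value + (styleSpec.suffix || "");
}

// Each fixture pairs a style fragment and input item with the expected output.
const fixtures = [
  { name: "text_plain", style: { variable: "title" },
    input: { title: "A Book" }, expected: "A Book" },
  { name: "text_prefix_suffix", style: { variable: "year", prefix: "(", suffix: ")" },
    input: { year: "2009" }, expected: "(2009)" },
  { name: "text_missing_variable", style: { variable: "title", prefix: "[" },
    input: {}, expected: "" },
];

// Apply every fixture to a render function and collect the mismatches.
function runFixtures(fixtureList, renderFn) {
  const failures = [];
  for (const fixture of fixtureList) {
    let actual;
    try {
      actual = renderFn(fixture.style, fixture.input);
    } catch (err) {
      // A crash counts as a failure rather than aborting the whole run.
      actual = "ERROR: " + err.message;
    }
    if (actual !== fixture.expected) {
      failures.push({ name: fixture.name, expected: fixture.expected, actual });
    }
  }
  return { total: fixtureList.length, failures };
}

const report = runFixtures(fixtures, toyRender);
console.log(report.total + " fixtures, " + report.failures.length + " failures");
```

Because the fixtures are plain data, they could be written by someone who never touches the engine, which is the division of labour discussed in this thread.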

> Or, if you can find some remaining time, can you imagine starting to
> put those in place so that others can enter and figure out how to make
> them pass?

Sure thing. Dividing the work between test-writing and coding is
ideal. There is a proposed generic test layout from Simon (with a
couple of tiny changes by me) at data/README-3.txt in the archive. If
the layout can be agreed and a file hierarchy set up somewhere, I'll
be happy to chip in.

> BTW, I converted your TODO.pdf to a text file in the repo for editing purposes.

Thanks, that was a hasty addition, to be sure I didn't lose the message.

> Bruce
