Serializing and attaching DOM elements in translators (>help!<)


Frank Bennett

Mar 25, 2010, 6:36:53 PM
to zotero-dev
I have hit a show-stopping issue while building a translator, and I'm
wondering if it can be overcome.

What I'm wanting to do is refactor some content that is presented as a
single monolithic page into subelements that can be attached to a
Zotero item. I have implemented and tested the Javascript refactoring
machinery in a browser page, but there seem to be two obstacles to
using the code inside a translator:

(1) XMLSerializer is not available (unless I'm missing something); and
(2) The content of an attachment of type text/html cannot be supplied
from a JS string variable (unless I'm missing something).

I imagine there are probably very good security reasons for these
restrictions, but this is a very important use case for us, and I'd
very much like to explore some way forward.

The pages I'm working from might help illustrate what I'm trying to
do. The site index is here:

http://www.japaneselawtranslation.go.jp/law/search_nm/?re=02

Clicking on an alphabetical index button in that page gives a list of
statutes. Clicking on a statute link gives you the statute as a single
flat page, embedded in a tangle of frame wrappers. The site is very
slow, and it is possible but very cumbersome to point at fragments
(provisions) in a given law. The site doesn't do versioning, so when
a law is revised, you lose the old version, which is, well, bad.

This content is the product of a lab at Nagoya University, working
together with government agencies. The content is of high quality,
but as you can see, the clumsiness of the interface is an obstacle to
its use. With a Z translator, we can release the value of the work
behind this site. A statute can be split into individual provisions,
and the HTML can be refactored to give it a more usable interface,
like this:

http://gsl-nagoya-u.net/http/pub/PROVISION.html

Night and day, really. If provisions in this form are attached to a
Zotero item, we get provision-level granularity in searches,
annotation, a graceful bilingual interface, and version snapshots --
it opens up the whole project in a way that, for budgetary and other
reasons, is not going to happen on server side in the foreseeable
future.

So ... this is kind of a plea for help, I guess. Is there a
possibility of opening up XMLSerializer (or a method that uses it) in
the translator sandbox, and of attaching string content to an item?

If Zotero developers are open to the idea, I'd be happy to work up
some code for review. This is so, so close to being there.

Frank Bennett

Frank Bennett

Mar 25, 2010, 9:29:53 PM
to zotero-dev
I've looked at XMLSerializer, and see that it's not a happy tool.
The Mozilla project points out that it's not part of a standard, and I see
that other products use other methods with different behavior ... not
a good place to go.

The pages for this project seem to be close enough to well-formed
XHTML, and I've gotten a sample to parse in E4X, and to dump as a
string. I'll have to rework the page refactoring code, but that's no
big deal. With that item out of the way, the only remaining hurdle is
how to get a string variable into an attachment. Is there a method
for doing this that I've missed?

Frank

Dan Stillman

Mar 26, 2010, 4:20:40 AM
to zoter...@googlegroups.com
On 3/25/10 9:29 PM, Frank Bennett wrote:
> The pages for this project seem to be close enough to well-formed
> XHTML, and I've gotten a sample to parse in E4X, and to dump as a
> string. I'll have to rework the page refactoring code, but that's no
> big deal. With that item out of the way, the only remaining hurdle is
> how to get a string variable into an attachment. Is there a method
> for doing this that I've missed?
>

No, there's no method currently for getting strings into attachments
from the translator architecture.

You can, however, pass a DOMDocument via the 'document' property of the
object added to item.attachments. I don't know if anyone has ever tried
this, but there's a chance you could use
doc.implementation.createDocument() and importNode() to create an
individual document from a particular node in the original document and
then pass that. You might run into some security errors, though.
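
A minimal sketch of the approach suggested above, assuming the code runs somewhere a DOM is available (for example, inside a translator's scrape function). The helper names `makeStandaloneDocument` and `attachDocument`, the `title` values, and the plain `item` object are illustrative assumptions and not part of Zotero; only the 'document' property on the attachment object comes from the message itself.

```javascript
// XHTML namespace, so the generated elements are real XHTML elements.
var XHTML_NS = "http://www.w3.org/1999/xhtml";

// Build a new, independent document whose <body> holds a deep copy of `node`.
// `doc` is the original page's document; `node` is any node inside it.
function makeStandaloneDocument(doc, node, title) {
  // createDocument(namespace, rootName, doctype) returns a fresh XML
  // document with a single <html> root element.
  var newDoc = doc.implementation.createDocument(XHTML_NS, "html", null);
  var root = newDoc.documentElement;

  var head = newDoc.createElementNS(XHTML_NS, "head");
  var titleEl = newDoc.createElementNS(XHTML_NS, "title");
  titleEl.appendChild(newDoc.createTextNode(title));
  head.appendChild(titleEl);
  root.appendChild(head);

  var body = newDoc.createElementNS(XHTML_NS, "body");
  // importNode(node, true) deep-copies the node into the new document;
  // the original page is left untouched.
  body.appendChild(newDoc.importNode(node, true));
  root.appendChild(body);

  return newDoc;
}

// Add the generated document to an item-like object (hypothetical shape)
// through the 'document' property of the attachment entry.
function attachDocument(item, newDoc, title) {
  item.attachments.push({ title: title, document: newDoc });
  return item;
}
```

In a translator this might be called as `attachDocument(item, makeStandaloneDocument(doc, provisionNode, "Article 1"), "Article 1")`, where `provisionNode` stands for whatever node holds a single provision.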

Failing that, this is probably something we could support, and we'd of
course be happy to have you work on this. But it's worth giving the
above a shot before we discuss implementation details.

Frank Bennett

Mar 26, 2010, 9:58:31 AM
to zoter...@googlegroups.com
On Fri, Mar 26, 2010 at 5:20 PM, Dan Stillman <dsti...@zotero.org> wrote:
> On 3/25/10 9:29 PM, Frank Bennett wrote:
>>
>> The pages for this project seem to be close enough to well-formed
>> XHTML, and I've gotten a sample to parse in E4X, and to dump as a
>> string.  I'll have to rework the page refactoring code, but that's no
>> big deal.  With that item out of the way, the only remaining hurdle is
>> how to get a string variable into an attachment.  Is there a method
>> for doing this that I've missed?
>>
>
> No, there's no method currently for getting strings into attachments from
> the translator architecture.
>
> You can, however, pass a DOMDocument via the 'document' property of the
> object added to item.attachments. I don't know if anyone has ever tried
> this, but there's a chance you could use doc.implementation.createDocument()
> and importNode() to create an individual document from a particular node in
> the original document and then pass that. You might run into some security
> errors, though.

Brilliant. No security errors, the document builds without a hitch.
I didn't realize how much horsepower was behind the 'document' property.
There is one sticking point remaining now, and I have one additional
question.

The sticking point is that in domsaver.js (at about lines 179 and
1060) and in attachments.js (at about line 500) there is code that
calls document.location.href. For document objects created in this
way, .location is null, so the attachment fails there with an error.
If the null value is trapped and, say,
'http://www.example.com/index.html' is passed instead, everything
works.
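
The workaround described above can be sketched as a small guard. This is an illustration only: `getDocumentHref` and its `fallback` parameter are hypothetical names, not the actual code in domsaver.js or attachments.js.

```javascript
// Return the document's URL, or `fallback` when the document has no
// location (as happens for documents built with createDocument()).
function getDocumentHref(doc, fallback) {
  if (doc.location && doc.location.href) {
    return doc.location.href;
  }
  return fallback;
}
```

Callers would then use `getDocumentHref(document, someFallbackURL)` wherever they currently read `document.location.href` directly.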

Not sure how that should be handled. Personally, I'm fine with a
bogus URL, but a pointer back to the original document would be more
meaningful, if a way can be found to pass it. I can't touch
document.location, though, even to overwrite it; any attempt to do
anything other than read its value triggers a call on QueryInterface()
(if I remember correctly), which fails because its .initialize
component is unavailable. If falling back to a bogus URL is acceptable,
it's a very simple fix, and I can put up a patch right away.

The additional question I had is whether literal JS code can be passed
into the DOM object as CDATA (including < and > characters). The
pages I want to build will make use of Javascript, and there's no way
for the translator to install a JS library file for them to link to,
so it looks like a /* <![CDATA[ */ code /* ]]> */ construct will be
necessary. It turns out that E4X clobbers the CDATA wrapper and
corrupts the content; if DOM leaves that stuff alone, that will be
very good news indeed.

Thanks for your help on this, Dan. When it comes together, this is
going to be very well received by our students.

Frank

>
> Failing that, this is probably something we could support, and we'd of
> course be happy to have you work on this. But it's worth giving the above a
> shot before we discuss implementation details.
>

Dan Stillman

Mar 28, 2010, 6:51:10 PM
to zoter...@googlegroups.com

Were you able to actually get it working? Even commenting out the
various checks on the DOMDocument, I wasn't able to create a functional
snapshot.

So rather than adding in lots of try/catch blocks to handle a fake
document -- and worrying going forward about any code that touches the
document object -- I'm now inclined to say that we should just add a
Zotero.Attachments.importFromString() function to create attachments and
expose that from the translator architecture through a property
('content'?) on the object added to the attachments array.

> The additional question I had is whether literal JS code can be passed
> into the DOM object as CDATA (including < and > characters). The
> pages I want to build will make use of Javascript, and there's no way
> for the translator to install a JS library file for them to link to,
> so it looks like a /* <![CDATA[ */ code /* ]]> */ construct will be
> necessary. It turns out that E4X clobbers the CDATA wrapper and
> corrupts the content; if DOM leaves that stuff alone, that will be
> very good news indeed.

Not sure about this one.

- Dan

Frank Bennett

Mar 28, 2010, 8:54:07 PM
to zoter...@googlegroups.com

Yes, beautifully. Sorry for not posting sooner, I had a response in
draft, and then must have gotten distracted. I've tested the
translator against some fairly large statutes (300+ attachments), and
it runs like a top.

I'll backtrack a little now, and extract a patch of the small changes
I made to Zotero to get it working. I did a little further exploring
yesterday, and it seems that docs created with createDocument() are
indeed expected to have a null location. It comes in from the window,
apparently, and is not part of the document itself.

>
> So rather than adding in lots of try/catch blocks to handle a fake

> document—and worrying going forward about any code that touches the document
> object—I'm now inclined to say that we should just add a


> Zotero.Attachments.importFromString() function to create attachments and
> expose that from the translator architecture through a property ('content'?)
> on the object added to the attachments array.

Now that I'm familiar with the DOM, I kind of like the structural
discipline it imposes. There were several little things that I picked
up from studying the docs that I might have missed if I'd leapt in
with string output.

>
>> The additional question I had is whether literal JS code can be passed
>> into the DOM object as CDATA (including < and > characters).  The
>> pages I want to build will make use of Javascript, and there's no way
>> for the translator to install a JS library file for them to link to,
>> so it looks like a /* <![CDATA[ */ code /* ]]> */ construct will be
>> necessary.  It turns out that E4X clobbers the CDATA wrapper and
>> corrupts the content; if DOM leaves that stuff alone, that will be
>> very good news indeed.
>
> Not sure about this one.

Turns out that createCDATASection() is only available on
XML objects, but we can use a straight createTextNode() inside of a
script node in XHTML, and that works just fine.
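
A sketch of that approach, assuming `doc` is an XHTML document such as one returned by createDocument() and `parent` is an element in it (typically the head). The helper name `addInlineScript` is hypothetical.

```javascript
// Append an inline <script> element to `parent`, carrying `code` as a
// plain text node. No CDATA wrapper is involved: the text node holds the
// source verbatim, including any "<" or ">" characters.
function addInlineScript(doc, parent, code) {
  var script = doc.createElementNS("http://www.w3.org/1999/xhtml", "script");
  script.setAttribute("type", "text/javascript");
  script.appendChild(doc.createTextNode(code));
  parent.appendChild(script);
  return script;
}
```

For example, `addInlineScript(newDoc, headElement, "if (a < b) { toggle(); }")` would give the generated page its own script without linking to an external library file.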

I'll post the translator to zotero-dev soon, after I do some code
cleanup and work out that patch. The output pages are mischievous and
pretty, I think you'll enjoy this one. :)

Frank


>
> - Dan

Frank Bennett

Mar 28, 2010, 11:05:15 PM
to zotero-dev
On Mar 29, 9:54 am, Frank Bennett <biercena...@gmail.com> wrote:

> On Mon, Mar 29, 2010 at 7:51 AM, Dan Stillman <dstill...@zotero.org> wrote:
> > On 3/26/10 9:58 AM, Frank Bennett wrote:
>
> >> On Fri, Mar 26, 2010 at 5:20 PM, Dan Stillman<dstill...@zotero.org>

I've uploaded two files to zotero-dev:

http://groups.google.com/group/zotero-dev/web/Japanese%20Law%20Translation.js
http://groups.google.com/group/zotero-dev/web/TRANSLATOR_GENERATED_DOCUMENTS.patch

The patch steps past three points where documents are checked for
their URL, and gives them a bogus one. This looks okay to me, but see
what you think.

The translator contains a short note about what it does, and provides
the URL of a small-ish statute for use in testing. A lab in our Uni
produced the DTD behind the site content, and is involved in the
translation effort. As the translator refactors the pages, I've taken
the liberty of including a non-intrusive attribution button in the
pages so that users can apportion credit (or blame) for what they see
on screen.

If the patch is acceptable, it would be great to have this translator
in mainstream Zotero. Law schools and law firms here will get good
mileage out of this, I think, and it might help influence the attitude
toward open systems within the trade.

Frank

Frank Bennett

Mar 28, 2010, 11:20:39 PM
to zotero-dev

>
> The patch steps past three points where documents are checked for
> their URL, and gives them a bogus one.  This looks okay to me, but see
> what you think.
>
> The translator contains a short note about what it does, and provides
> the URL of a small-ish statute for use in testing.  A lab in our Uni
> produced the DTD behind the site content, and is involved in the
> translation effort.  As the translator refactors the pages, I've taken
> the liberty of including a non-intrusive attribution button in the
> pages so that users can apportion credit (or blame) for what they see
> on screen.
>
> If the patch is acceptable, it would be great to have this translator
> in mainstream Zotero.  Law schools and law firms here will get good
> mileage out of this, I think, and it might help influence the attitude
> toward open systems within the trade.

Double-checked that the files were accessible on zotero-dev, and found
that Google Groups would not deliver the patch file for some reason.
Tried changing the filename in various ways, but it's now telling me
flat-out that the upload fails. Here it is as a github gist:

http://gist.github.com/347326

Frank

Dan Stillman

Mar 29, 2010, 2:43:02 AM
to zoter...@googlegroups.com
On 3/28/10 8:54 PM, Frank Bennett wrote:
> On Mon, Mar 29, 2010 at 7:51 AM, Dan Stillman<dsti...@zotero.org> wrote:
>
>> So rather than adding in lots of try/catch blocks to handle a fake
>> document -- and worrying going forward about any code that touches the document
>> object -- I'm now inclined to say that we should just add a

>> Zotero.Attachments.importFromString() function to create attachments and
>> expose that from the translator architecture through a property ('content'?)
>> on the object added to the attachments array.
>>
> Now that I'm familiar with the DOM, I kind of like the structural
> discipline it imposes. There were several little things that I picked
> up from studying the docs that I might have missed if I'd leapt in
> with string output.
>

All right. But can you get it to work without the dummy URL? I'd like to
avoid that.

Frank Bennett

Mar 29, 2010, 2:54:00 AM
to zoter...@googlegroups.com
On Mon, Mar 29, 2010 at 3:43 PM, Dan Stillman <dsti...@zotero.org> wrote:
> On 3/28/10 8:54 PM, Frank Bennett wrote:
>>
>> On Mon, Mar 29, 2010 at 7:51 AM, Dan Stillman<dsti...@zotero.org>
>>  wrote:
>>
>>>
>>> So rather than adding in lots of try/catch blocks to handle a fake
>>> document—and worrying going forward about any code that touches the
>>> document
>>> object—I'm now inclined to say that we should just add a

>>> Zotero.Attachments.importFromString() function to create attachments and
>>> expose that from the translator architecture through a property
>>> ('content'?)
>>> on the object added to the attachments array.
>>>
>>
>> Now that I'm familiar with the DOM, I kind of like the structural
>> discipline it imposes.  There were several little things that I picked
>> up from studying the docs that I might have missed if I'd leapt in
>> with string output.
>>
>
> All right. But can you get it to work without the dummy URL? I'd like to
> avoid that.

It's not pretty, I agree. I'll have a look.

Frank Bennett

Mar 29, 2010, 5:26:53 AM
to zotero-dev
On Mar 29, 3:54 pm, Frank Bennett <biercena...@gmail.com> wrote:

> On Mon, Mar 29, 2010 at 3:43 PM, Dan Stillman <dstill...@zotero.org> wrote:
> > On 3/28/10 8:54 PM, Frank Bennett wrote:
>
> >> On Mon, Mar 29, 2010 at 7:51 AM, Dan Stillman<dstill...@zotero.org>

> >>  wrote:
>
> >>> So rather than adding in lots of try/catch blocks to handle a fake
> >>> document—and worrying going forward about any code that touches the
> >>> document
> >>> object—I'm now inclined to say that we should just add a
> >>> Zotero.Attachments.importFromString() function to create attachments and
> >>> expose that from the translator architecture through a property
> >>> ('content'?)
> >>> on the object added to the attachments array.
>
> >> Now that I'm familiar with the DOM, I kind of like the structural
> >> discipline it imposes.  There were several little things that I picked
> >> up from studying the docs that I might have missed if I'd leapt in
> >> with string output.
>
> > All right. But can you get it to work without the dummy URL? I'd like to
> > avoid that.
>
> It's not pretty, I agree.  I'll have a look.

This is a better patch:

http://gist.github.com/347637

By default, it sets the URL for documents that are missing
document.location to "about:blank", which seems the normal
expectation. To provide for internally generated documents, it
attempts to read the href attribute from a base element in the head,
and if that's not present, it just fails in the customary way. This
gives us a way to attach documents with no URL when we intend to do
so, without interfering with normal failures.
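
One possible reading of that lookup order, as a sketch. The function name `resolveSnapshotURL` is hypothetical, and the actual patch in the gist may differ, in particular in how it treats a document that has neither a location nor a base element; this sketch assumes that case yields "about:blank". It also searches the whole document for a base element rather than only the head.

```javascript
// Decide which URL to record for a document that is about to be saved.
function resolveSnapshotURL(doc) {
  // Ordinary browser documents carry a location supplied by their window.
  if (doc.location && doc.location.href) {
    return doc.location.href;
  }
  // Generated documents can declare their origin with <base href="..."/>.
  var bases = doc.getElementsByTagName("base");
  if (bases.length > 0) {
    var href = bases[0].getAttribute("href");
    if (href) {
      return href;
    }
  }
  // Assumption: with no location and no base element, use about:blank.
  return "about:blank";
}
```

A translator that wants a link back to the source page would add a base element to the generated document's head, with href set to the URL of the original page, before attaching it.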

Anyway, see what you think. I've tested it, and setting the base
element to the URL of the target page produces a correct link back to
that document, as per normal.

Frank

Frank Bennett

Mar 31, 2010, 5:00:32 AM
to zotero-dev

Dan,

Hope the patch passes muster, but if it's hard to work in at the
moment it's no biggie; I can run with the patch locally, and serve as
the main provider of this content for the time being. The translator
generates a standard Zotero item that will work just fine with a
normal 2.0 client.

Just in case the topic comes up, there's no copyright issue with this
one. Despite the copyright notice on the JLT site, the laws and the
translations are supplied directly by government, and there is a
specific and explicit exception in the Japanese Copyright Act for this
category of content. I've spoken with the leader of the team at the
center of the production and publication process for the JLT site, and
he was happy to hear that I was exploiting his underlying XML
structures in a mashup.

In final testing, I've run into a couple of performance issues, which
I'll post about separately.

Frank
