standardizing translators, bookmarklets, etc.


Bruce D'Arcus

unread,
Jun 19, 2010, 12:11:14 PM6/19/10
to zoter...@googlegroups.com
Prompted by recent discussions around a new-ish project at Sakai [1],
and by this comment ...

<http://feedback.mendeley.com/forums/4941-mendeley-feedback/suggestions/217527-work-with-citeulike-and-zotero-on-citation-import->

... as well as my continued belief that writing translators is too damned hard.

Is there anything that can be done to make it much easier to:

1) share translators among different projects, and so avoid
duplication of effort

2) author translators (lower barrier to entry; the barrier is too high for me)

3) add Zotero-like import capabilities to other clients (other
browsers, but also mobile devices)

...?

Bruce

[1] <http://confluence.sakaiproject.org/pages/viewpage.action;jsessionid=112E65AEBB675337BB78185216BE457C?pageId=65865751>

Bruce D'Arcus

unread,
Jun 20, 2010, 11:34:19 AM6/20/10
to zoter...@googlegroups.com
On Sat, Jun 19, 2010 at 12:11 PM, Bruce D'Arcus <bda...@gmail.com> wrote:

> Is there anything that can be done to make it much easier to ...

I guess to get a little more concrete, can people with translator
experience point me to three or four exemplars of common types of
translators? E.g. maybe a simple one that can just be coded with
simple CSS selectors to more complex ones that require regex
manipulation and such?

I'm still wondering if it's possible to do most of this with some sort
of simple definition that ties into some generic functions.

Bruce

Bruce D'Arcus

unread,
Jun 20, 2010, 11:45:56 AM6/20/10
to zoter...@googlegroups.com

And to cite the previous discussion on this (which never came to anything):

<http://groups.google.com/group/zotero-dev/browse_thread/thread/6da7483217c2b178>

I really want to make this stuff sharable across projects, and easier
to write and debug.

Bruce

Frank Bennett

unread,
Jun 20, 2010, 5:10:42 PM6/20/10
to zotero-dev
On Jun 21, 12:45 am, "Bruce D'Arcus" <bdar...@gmail.com> wrote:
> On Sun, Jun 20, 2010 at 11:34 AM, Bruce D'Arcus <bdar...@gmail.com> wrote:
> > On Sat, Jun 19, 2010 at 12:11 PM, Bruce D'Arcus <bdar...@gmail.com> wrote:
>
> >> Is there anything that can be done to make it much easier to ...
>
> > I guess to get a little more concrete, can people with translator
> > experience point me to three or four exemplars of common types of
> > translators? E.g. maybe a simple one that can just be coded with
> > simple CSS selectors to more complex ones that require regex
> > manipulation and such?
>
> > Am still wondering if it's possible to do most of this with some sort
> > of simple definition that ties into some generic functions.
>
> And to cite the previous discussion on this (which never came to anything):
>
> <http://groups.google.com/group/zotero-dev/browse_thread/thread/6da748...>
>
> I really want to make this stuff sharable across projects, and easier
> to write and debug.

It sounds like what you're after is an encapsulated cross-platform
scraper engine, written in Javascript, that can run the Zotero
translators in other environments?

Re substantial simplification, most of the jiggery-pokery in screen-
scraping translators does serve a purpose. In moving all of that
logic to a higher level of abstraction, there would be a risk, at
least, of ending up with a mix of syntactic sugar that is as
complicated as the original. A shorter path to making things more
accessible to potential translator authors might be to work up better
documentation on the existing utility functions.

It's a more modest step, but there's an additional discussion thread
here, with some comments by Dan about plans for new utility functions
with cleaner syntax:

http://forums.zotero.org/discussion/12086/meta-request-easier-tools-to-write-translators-with/#Comment_58744

Frank




Bruce D'Arcus

unread,
Jun 20, 2010, 5:33:46 PM6/20/10
to zoter...@googlegroups.com
On Sun, Jun 20, 2010 at 5:10 PM, Frank Bennett <bierc...@gmail.com> wrote:
> On Jun 21, 12:45 am, "Bruce D'Arcus" <bdar...@gmail.com> wrote:
>> On Sun, Jun 20, 2010 at 11:34 AM, Bruce D'Arcus <bdar...@gmail.com> wrote:
>> > On Sat, Jun 19, 2010 at 12:11 PM, Bruce D'Arcus <bdar...@gmail.com> wrote:
>>
>> >> Is there anything that can be done to make it much easier to ...
>>
>> > I guess to get a little more concrete, can people with translator
>> > experience point me to three or four exemplars of common types of
>> > translators? E.g. maybe a simple one that can just be coded with
>> > simple CSS selectors to more complex ones that require regex
>> > manipulation and such?
>>
>> > Am still wondering if it's possible to do most of this with some sort
>> > of simple definition that ties into some generic functions.
>>
>> And to cite the previous discussion on this (which never came to anything):
>>
>> <http://groups.google.com/group/zotero-dev/browse_thread/thread/6da748...>
>>
>> I really want to make this stuff sharable across projects, and easier
>> to write and debug.
>
> It sounds like what you're after is an encapsulated cross-platform
> scraper engine, written in Javascript, that can run the Zotero
> translators in other environments?

Actually, ideally, not written in any programming language at all. But
if absolutely necessary, then at least requiring substantially less
code. The times when I've needed a translator, the scraping is
conceptually very simple. I could have written them in XSLT very
quickly.

And on the "other environments": yes, including on servers.

> Re substantial simplification, most of the jiggery-pokery in screen-
> scraping translators does serve a purpose.  In moving all of that
> logic to a higher level of abstraction, there would be a risk, at
> least, of ending up with a mix of syntactic sugar that is as
> complicated as the original.  A shorter path to making things more
> accessible to potential translator authors might be to work up better
> documentation on the existing utility functions.
>
> It's a more modest step, but there's an additional discussion thread
> here, with some comments by Dan about plans for new utility functions
> with cleaner syntax:
>
> http://forums.zotero.org/discussion/12086/meta-request-easier-tools-to-write-translators-with/#Comment_58744

Ah, I never saw that thread.

Bruce

skornblith

unread,
Jun 20, 2010, 6:57:51 PM6/20/10
to zotero-dev
The types of translators are pretty closely related to the types of
pages that databases provide. Off the top of my head, there are three
major ways that sites provide search results:

1) Search results where one can check a bunch of check boxes, click a
button, and download a file to get all the references (e.g., Voyager
Library Catalog)
2) Search results where individual pages for each result have to be
parsed before continuing (e.g., Google Books)
3) Search results that include direct links to reference data, or
where reference data locations can be inferred from the URL (e.g.,
NCBI PubMed, Google Scholar after some cookie manipulation)

Then there are three major ways that one can parse information from
individual pages:

1) Scrape information directly from page content or meta tags (e.g.,
NYTimes.com)
2) Direct link to reference data (e.g., Nature)
3) Link to a link to reference data, or various other levels of
indirection (e.g., EBSCOhost)

I'm skeptical that this can really be done efficiently without any
programming code at all. It should be possible to simplify some things
with better utility functions, but a large proportion of major sites
require some non-trivial manipulations.

Simon

On Jun 20, 8:34 am, "Bruce D'Arcus" <bdar...@gmail.com> wrote:

Bruce D'Arcus

unread,
Jun 20, 2010, 7:27:27 PM6/20/10
to zoter...@googlegroups.com
Thanks for the overview analysis, Simon. So ...

On Sun, Jun 20, 2010 at 6:57 PM, skornblith <si...@simonster.com> wrote:
> The types of translators are pretty closely related to the types of
> pages that databases provide. Off the top of my head, there are three
> major ways that sites provide search results:
>
> 1) Search results where one can check a bunch of check boxes, click a
> button, and download a file to get all the references (e.g., Voyager
> Library Catalog)
> 2) Search results where individual pages for each result have to be
> parsed before continuing (e.g., Google Books)
> 3) Search results that include direct links to reference data, or
> where reference data locations can be inferred from the URL (e.g.,
> NCBI PubMed, Google Scholar after some cookie manipulation)
>
> Then there are three major ways that one can parse information from
> individual pages:
>
> 1) Scrape information directly from page content or meta tags (e.g.,
> NYTimes.com)
> 2) Direct link to reference data (e.g., Nature)
> 3) Link to a link to reference data, or various other levels of
> indirection (e.g., EBSCOhost)

In all cases that I've come across where I thought I'd like to write a
translator, it's been of the second group, type 1. E.g. I'm reading a
page (at the Christian Science Monitor, or NPR, or the ACLU) and I
need to add the metadata. And as I've said, these are typically
trivial from just a conceptual perspective.

To me, the ability to pull in multiple item metadata from search
results is another order of complexity, and not even strictly
necessary for what I'm talking about.

> I'm skeptical that this can really be done efficiently without any
> programming code at all. It should be possible to simplify some things
> with better utility functions, but a large proportion of major sites
> require some non-trivial manipulations.

So I guess given all this I'm wondering if there's some low-hanging
fruit that could be greatly simplified?

Bruce

Sean Takats

unread,
Jun 21, 2010, 12:21:19 AM6/21/10
to zoter...@googlegroups.com
I just want to add to this discussion that there are also decidedly non-technical issues at stake, namely licensing and copyright. It's entirely possible that other applications and services are using Zotero translators in whole or in part, or are deriving their extraction logic from translator code, but we presently have little way of knowing, since the licensing status of translator code may be considered ambiguous. In other words, one way to ensure that translators emerge as a viable standard would be to motivate all parties to contribute to the code. -Sean

> --
> You received this message because you are subscribed to the Google Groups "zotero-dev" group.
> To post to this group, send email to zoter...@googlegroups.com.
> To unsubscribe from this group, send email to zotero-dev+...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/zotero-dev?hl=en.
>

Bruce D'Arcus

unread,
Jun 21, 2010, 7:17:46 PM6/21/10
to zoter...@googlegroups.com
On Sun, Jun 20, 2010 at 7:27 PM, Bruce D'Arcus <bda...@gmail.com> wrote:

> In all cases that I've come across where I thought I'd like to write a
> translator, it's been of the second group, type 1. E.g. I'm reading a
> page (at the Christian Science Monitor, or NPR, or the ACLU) and I
> need to add the metadata. And as I've said, these are typically
> trivial from just a conceptual perspective.

To put a really fine point on this ...

I'm just now reading an article at the Atlantic. I realize Zotero
isn't detecting it, so I figure there's no translator.

Time for me to figure out how to parse the following ...

title = div.articleHead > h1
author = div.articleHead > h5.author > a.author
date = div.articleHead > h4.issueDetails > a.issueTitle [this one
needs more parsing though]

... = two minutes.
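[Editorial sketch: the three selector lines above are, in effect, a declarative field map. A hedged illustration of how generic code might consume such a map; `scrapeWithSelectors` and its calling convention are assumptions for illustration, not part of Zotero's translator API.]

```javascript
// Hypothetical generic scraper driven by a field -> CSS selector map.
// The idea: the site-specific part of a simple translator becomes
// pure data, and shared code does the walking.
function scrapeWithSelectors(doc, selectorMap) {
  var item = {};
  for (var field in selectorMap) {
    var node = doc.querySelector(selectorMap[field]);
    if (node) item[field] = node.textContent.trim();
  }
  return item;
}

// The Atlantic selectors above, expressed as data:
var atlanticSelectors = {
  title:  "div.articleHead > h1",
  author: "div.articleHead > h5.author > a.author",
  date:   "div.articleHead > h4.issueDetails > a.issueTitle"
};
```

Fields whose selectors match nothing are simply left off the item; the date would still need the extra parsing noted above.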

Bruce

skornblith

unread,
Jun 21, 2010, 9:43:26 PM6/21/10
to zotero-dev
Well, you could do this (which is entirely untested, and took me about
2 minutes):

function detectWeb(doc) {
    return "newspaperArticle";
}

function doWeb(doc) {
    var item = new Zotero.Item("newspaperArticle");
    item.title = doc.evaluate('//div[@class="articleHead"]/h1',
        doc, null, XPathResult.ANY_TYPE, null).iterateNext().textContent;
    var author = doc.evaluate('//div[@class="articleHead"]/h5[@class="author"]/a[@class="author"]',
        doc, null, XPathResult.ANY_TYPE, null).iterateNext().textContent;
    item.creators.push(Zotero.Utilities.cleanAuthor(author, "author"));
    item.date = doc.evaluate('//div[@class="articleHead"]/h4[@class="issueDetails"]/a[@class="issueTitle"]',
        doc, null, XPathResult.ANY_TYPE, null).iterateNext().textContent;
    [...]
    item.complete();
}

which is a little repetitive because we don't have XPath helper
functions (although they would just be syntactic sugar, and they are
on the drawing board), but I don't think it's really that difficult.
Then again, most translators are significantly more complicated than
this because they also scrape search results.
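[Editorial sketch: the "syntactic sugar" Simon mentions might look something like the helper below. `xpathText` is a hypothetical name, not an existing Zotero utility; `doc.evaluate` is the standard DOM Level 3 XPath API.]

```javascript
// Hypothetical helper that wraps the repeated
// doc.evaluate(...).iterateNext().textContent boilerplate.
// 0 is the numeric value of XPathResult.ANY_TYPE in the DOM spec,
// used here so the sketch has no external constants.
function xpathText(doc, expr) {
  var result = doc.evaluate(expr, doc, null, 0, null);
  var node = result.iterateNext();
  return node ? node.textContent : null;
}
```

With it, each field in the translator above collapses to one line, e.g. `item.title = xpathText(doc, '//div[@class="articleHead"]/h1');`.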

Simon

On Jun 21, 4:17 pm, "Bruce D'Arcus" <bdar...@gmail.com> wrote:

Bruce D'Arcus

unread,
Jun 21, 2010, 11:13:24 PM6/21/10
to zoter...@googlegroups.com
On Mon, Jun 21, 2010 at 9:43 PM, skornblith <si...@simonster.com> wrote:
> Well, you could do this (which is entirely untested, and took me about
> 2 minutes):
>
> function detectWeb(doc) {
>     return "newspaperArticle";
> }
>
> function doWeb(doc) {
>     var item = new Zotero.Item("newspaperArticle");
>     item.title = doc.evaluate('//div[@class="articleHead"]/h1',
>         doc, null, XPathResult.ANY_TYPE, null).iterateNext().textContent;
>     var author = doc.evaluate('//div[@class="articleHead"]/h5[@class="author"]/a[@class="author"]',
>         doc, null, XPathResult.ANY_TYPE, null).iterateNext().textContent;
>     item.creators.push(Zotero.Utilities.cleanAuthor(author, "author"));
>     item.date = doc.evaluate('//div[@class="articleHead"]/h4[@class="issueDetails"]/a[@class="issueTitle"]',
>         doc, null, XPathResult.ANY_TYPE, null).iterateNext().textContent;
>     [...]
>     item.complete();
> }
>
> which is a little repetitive because we don't have XPath helper
> functions (although they would just be syntactic sugar, and they are
> on the drawing board), but I don't think it's really that difficult.

Maybe it's because I started programming in XSLT/XPath, but I find the
above code heinously ugly/complex/confusing. Why can't it be as simple
as:

{
'title': '//div[@class="articleHead"]/h4[@class="issueDetails"]/a[@class="issueTitle"]'
}

E.g. just feed the parser a map of xpath (or similarly
simple-but-powerful) expressions?

Yes, I know things immediately get more complicated when you need to
parse the result, so some syntactic sugar on a helper function might
also be more appealing. Of course, xpath has functions as well that
work pretty nicely.

> Then again, most translators are significantly more complicated than
> this because they also scrape search results.

"Most"? Does this include "most" of the yet-to-do translators as well
(examples like I mentioned: NPR, the Atlantic, etc.)?

For users, how important is the functionality compared to the single
page stuff? For me the answer is not much at all (I believe I've used
multiple import once in the past few years).

Is there a way the simpler stuff I'm looking for could complement the
more hairy search results stuff?

Bruce

skornblith

unread,
Jun 22, 2010, 5:36:44 AM6/22/10
to zotero-dev
We can certainly write a helper function to take input that looks like
that—I'll look into it. I think scraping multiple authors could be a
sticky issue, though, and I'd welcome input on that.

> > Then again, most translators are significantly more complicated than
> > this because they also scrape search results.
>
> "Most"? Does this include "most" of the yet-to-do translators as well
> (examples like I mentioned: NPR, the Atlantic, etc.)?
>
> For users, how important is the functionality compared to the single
> page stuff? For me the answer is not much at all (I believe I've used
> multiple import once in the past few years).
>
> Is there a way the simpler stuff I'm looking for could complement the
> more hairy search results stuff?

This would make the Zotero UI inconsistent, which I'm not convinced we
really want to do. If you ask me, we should either drop the ability to
scrape search results (in consultation with the Zotero community), or
it should be there in every translator. Then again, I don't find
writing translators to scrape search results particularly difficult.
Perhaps this is another case where better helper functions could be
useful? We already have one (Zotero.Utilities.getItemArray()), but it
isn't widely used (it appears in only about 10% of translators),
perhaps because there are too many cases it can't handle.

Simon

Bruce D'Arcus

unread,
Jun 22, 2010, 9:12:54 AM6/22/10
to zoter...@googlegroups.com
On Tue, Jun 22, 2010 at 5:36 AM, skornblith <si...@simonster.com> wrote:

...

> We can certainly write a helper function to take input that looks like
> that—I'll look into it. I think scraping multiple authors could be a
> sticky issue, though, and I'd welcome input on that.

Do we have any data on the sorts of contributor list cases we can expect?

I would guess there are some cases where each contributor is wrapped
in its own tag, such that could specific it with xpath of css
selectors, but that the more common case is free text like "John Doe
and Jane Smith"?

Let me stick to xpath to avoid reinventing the wheel and talk about
something that works now.

'//div[@class='author']' pulls in all such elements anywhere in the
tree, and so works for the first case I note.

Something like 'tokenize(' and ', */*/div[@class='author'])' would
work for a simple example of the second case.

Alternately, you could write a function for this such that you could
just do something like:

splitNames(*/*/div[@class='author'])

E.g. we know certain variables are always simple strings, but that
others are lists. That helps.

We further know that xpath (at least) can represent both. That helps some more.

We just need a way to say "do this magic on this string".

At least that's my idea. Does it make sense?
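[Editorial sketch: one hedged guess at the string half of the splitNames idea, splitting a free-text byline such as "John Doe and Jane Smith" before each name is handed to something like Zotero.Utilities.cleanAuthor. splitNames is Bruce's hypothetical helper, not an existing Zotero function, and the separator set here is an assumption.]

```javascript
// Split a free-text contributor string on common separators
// (comma, semicolon, "and", ampersand), dropping empty pieces.
// \b around "and" keeps it from matching inside names like "Alexander".
function splitNames(byline) {
  return byline
    .split(/\s*(?:;|,|\band\b|&)\s*/)
    .filter(function (name) { return name.length > 0; });
}
```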

Bruce

Bruce D'Arcus

unread,
Jun 22, 2010, 12:45:14 PM6/22/10
to zoter...@googlegroups.com
To avoid confusion (because I sometimes get dyslexic when typing
fast!), this ...

On Tue, Jun 22, 2010 at 9:12 AM, Bruce D'Arcus <bda...@gmail.com> wrote:

> I would guess there are some cases where each contributor is wrapped
> in its own tag, such that could specific it with xpath of css

> selectors ...

.. should be:

I would guess there are some cases where each contributor is wrapped

in its own tag, such that one could specify it with xpath or css
selectors ...

Bruce

Erik Hetzner

unread,
Jun 23, 2010, 12:25:15 AM6/23/10
to zoter...@googlegroups.com
At Mon, 21 Jun 2010 23:13:24 -0400,

Bruce D'Arcus wrote:
> Maybe it's because I started programming in XSLT/XPATH, but I find the
> above code heinously ugly/complex/confusing. Why can't it be so simple
> as:
>
> {
> 'title': '//div[@class="articleHead"]/h4[@class="issueDetails"]/a[@class="issueTitle"]'
> }
>
> E.g. just feed the parser a map of xpath (or similarly
> simple-but-powerful) expressions?
>
> Yes, I know things immediately get more complicated when you need to
> parse the result, so some syntactic sugar on a helper function might
> also be more appealing. Of course, xpath has functions as well that
> work pretty nicely.

Hi,

I think Bruce is right here. There is a lot of boilerplate code in
translators, from what I have seen. To quote from “How to write a
zotero translator” [1]:

The following section of code works for 99% of websites; therefore
you can use it as a template.

As a first step of the kind of thing that could be possible, I have
made a translator framework, which works for Digital Humanities
Quarterly and the Atlantic Monthly (though I have not done the
multiple results for the Atlantic, & it only works on the articles,
not the blog). The site specific code looks like this:

function mkScraper(type) {
if (type == "magazineArticle") {
return new Scraper ({
title : new Xpath('//head/meta[@name="title"]/@content').remove(/- Magazine - The Atlantic/).trim().first(),
itemType : 'magazineArticle',
publicationTitle : "The Atlantic Monthly",
date : new Xpath('//div[@class="articleHead"]/h4[@class="issueDetails"]/a[@class="issueTitle"]').match(/([\/A-Za-z\/]+ [0-9]+)/).first(),
creators : new Xpath('//div[@class="articleHead"]/h5[@class="author"]/a[@class="author"]').cleanAuthor("author")
});
}
}

function detectWeb(doc, url) {
return "magazineArticle";
}

and for the more complicated DHQ:

function mkScraper (itemType) {
if (itemType == "journalArticle") {
return new Scraper ({
title : new Xpath('//h1[@class="articleTitle"]').first(),
itemType : 'journalArticle',
publicationTitle : "Digital Humanities Quarterly",
date : new Xpath("//div[@id=\"pubInfo\"]").match(/(.*)Volume\s+\d+\s+Number\s+\d+/).first(),
volume : new Xpath("//div[@id=\"pubInfo\"]").match(/.*Volume\s+(\d+)\s+Number\s+\d+/).first(),
issue : new Xpath("//div[@id=\"pubInfo\"]").match(/.*Volume\s+\d+\s+Number\s+(\d+)/).first(),
creators : new Xpath('//div[@class="author"]/a[1]').cleanAuthor("author"),
attachments : function (doc, url) { return [{ url: url, title:"DHQ Snapshot", mimeType:"text/html" }]; }
});
} else if (itemType == "multiple") {
return new MultiScraper({
itemTrans : mkScraper("journalArticle"),
items : new Xpath('//div[@id="mainContent"]/div/p/a').raw()
});
}
}

function detectWeb(doc, url) {
if (new Xpath('//div[@class="DHQarticle"]').evaluate(doc)) {
return "journalArticle";
} else if (new Xpath('//div[@id="mainContent"]/div/p').evaluate(doc)) {
return "multiple";
} else {
return undefined;
}
}

Most of the magic is in the expressions that look like:

new Xpath('//head/meta[@name="title"]/@content').remove(/- Magazine - The Atlantic/).trim().first()

This is evaluated later to do the following:

1. Select an xpath, map it into an array.
2. Remove instances of the regex in each string in the array
3. Call Zotero.Utilities.trim() on each member of the array.
4. Get the first element of the array, since there should only be one.
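[Editorial sketch: assuming the xpath evaluation has already produced an array of strings, the deferred-chain machinery behind such an expression can be sketched in a few lines. Names mirror Erik's, but this is an illustration, not his actual framework code; in the real framework the initial array would come from evaluating the xpath against the document later.]

```javascript
// Each chained call queues a transformation over an array of
// strings; evaluate() runs the queue in order.
function Chain(values) {
  this.values = values;
  this.ops = [];
}
Chain.prototype.remove = function (re) {
  this.ops.push(function (a) { return a.map(function (s) { return s.replace(re, ""); }); });
  return this;
};
Chain.prototype.trim = function () {
  this.ops.push(function (a) { return a.map(function (s) { return s.trim(); }); });
  return this;
};
Chain.prototype.first = function () {
  this.ops.push(function (a) { return a[0]; });
  return this;
};
Chain.prototype.evaluate = function () {
  return this.ops.reduce(function (acc, op) { return op(acc); }, this.values);
};
```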

I think that a system like this would be easier both for experienced
authors of translators, to avoid boilerplate code, and for new authors
of translators, who would need to learn less (I would hope).

This is only a first stab at an idea. If others like it, I can put it
up somewhere. Something like this could be cut & pasted into every
translator file if Zotero proper does not want to do something along
these lines.

best, Erik

1. http://niche-canada.org/member-projects/zotero-guide/chapter15.html

Erik Hetzner

unread,
Jun 23, 2010, 12:27:36 AM6/23/10
to zoter...@googlegroups.com
At Tue, 22 Jun 2010 21:25:15 -0700,
Erik Hetzner wrote:
>
> […]

Whoops, attached are the translators.

best, Erik

Atlantic.js
Digital Humanities Quarterly.js

Bruce D'Arcus

unread,
Jun 23, 2010, 9:19:18 AM6/23/10
to zoter...@googlegroups.com
On Wed, Jun 23, 2010 at 12:25 AM, Erik Hetzner <ehet...@gmail.com> wrote:

> ... The site specific code looks like this:

Definitely closer to what I was looking for. Death to boilerplate!

> This is only a first stab at an idea. If others like it, I can put it
> up somewhere. Something like this could be cut & pasted into every
> translator file if Zotero proper does not want to do something along
> these lines.

I'd say definitely "put it somewhere."

Bryan, how does this fit with any ideas you had?

Two other questions:

1) so what about my somewhat provocative claim that parsing search
results is not an important feature, and so adds unnecessary
complexity?

2) what is the output of these translators? Is it some Zotero-specific
object, or generic JSON that Zotero ingests? I ask this in part
because we've got a few different JSON bib representations, and some
impetus to bring them together (certainly on the CSL side, but likely
more broadly).

Bruce

trevor.j...@gmail.com

unread,
Jun 23, 2010, 10:11:27 AM6/23/10
to zotero-dev
I am not a frequent commenter on the dev list, but I thought I could
add some info on the value that translating search results provides.

1. Aggregation sites, like Google Scholar, really only have search
result pages. Dropping support for search results would effectively
mean no longer supporting sites like this.

2. Search results are critical for Zotero use cases that involve
grabbing in bulk. There are a range of use cases where Zotero is a
valuable tool for pulling together relatively large sets of material.
In my own work I have used Zotero to pull together a set of 300
newspaper articles from Proquest historical newspapers, and in another
case 400 images from flickr. I have spoken to social and natural
scientists who are engaged in similar bulk projects, in their case
with articles in Google Scholar and other big databases of articles.
The frequency of "Google Scholar locked me out after grabbing 100
articles in 2 min" posts that show up in the Zotero forums suggests
that this is a relatively common practice among users.

On Jun 23, 9:19 am, "Bruce D'Arcus" <bdar...@gmail.com> wrote:

Erik Hetzner

unread,
Jun 23, 2010, 11:47:08 AM6/23/10
to zoter...@googlegroups.com
At Wed, 23 Jun 2010 09:19:18 -0400,

Bruce D'Arcus wrote:
> I'd say definitely "put it somewhere."

OK, it is here:

http://e6h.org/~egh/hg/zotero-transfw

> Bryan, how does this fit with any ideas you had?
>
> Two other questions:
>
> 1) so what about my somewhat provocative claim that parsing search
> results is not an important feature, and so adds unnecessary
> complexity?

As I understand it, search results means parsing multiple items per page,
& presenting a dialog? My version of the Digital Humanities Quarterly
translator does this. I am currently looking at the Google Scholar
translator to see how it could fit in - it will certainly require some
changes. From what I have seen I don’t see search results as being
particularly more complicated than single results.

For what it’s worth, I use the Google Scholar feature a lot.

It seems to me that there are a number of reasonably simple
translators that could have boilerplate removed, and a few complex
ones that cannot (MARC, etc.). It would be nice to make the simpler
ones simpler.

> 2) what is the output of these translators? Is it some Zotero-specific
> object, or generic JSON that Zotero ingests? I ask this in part
> because we've got a few different JSON bib representations, and some
> impetus to bring them together (certainly on the CSL side, but likely
> more broadly).

That could be possible. There is no generic output, but different
functions could be used to build the Zotero item, or a generic JSON
representation, from the same Scraper object that is created by the
author.

best, Erik

Duncan Johnson

unread,
Jun 23, 2010, 11:29:56 AM6/23/10
to zotero-dev
Like Trevor, I'm a lurker around here, but I'd like to chime in. I
concur with Trevor that there really are good reasons to keep parsing
search results.

I use Zotero mostly to import search results from my own library
catalog as well as resources like JSTOR and EBSCO. This is especially
handy when I'm starting a research project and want to dump a number
of records into my Zotero store before I go to pull the items from the
shelf or download the PDFs for reading.

The amount of added work involved if I had to import those records one-
by-one would render Zotero practically unusable for me.

I'm just guessing, but we probably could write some nice helper
functions to eliminate the boilerplate code for translator authors,
but I don't want to lose search results parsing along the way.

The Mozilla Ubiquity project did a great job at generalizing a lot of
boilerplate JavaScript so that useful scripts could be written in
under 10 lines of code (although making truly elegant ones still took
more than that). If you guys want to look through that project for
ideas the codebase can be found here: https://wiki.mozilla.org/Labs/Ubiquity

On Jun 23, 10:11 am, "trevor.johnow...@gmail.com"

skornblith

unread,
Jun 23, 2010, 11:51:06 PM6/23/10
to zotero-dev
This looks great, and I would love to integrate it into Zotero. I
think I would prefer it if the framework could be called from the
doWeb() function, although it's not absolutely essential.

Simon

skornblith

unread,
Jun 23, 2010, 11:58:40 PM6/23/10
to zotero-dev
On Jun 23, 6:19 am, "Bruce D'Arcus" <bdar...@gmail.com> wrote:
The Zotero.Item object that translators use (not to be confused with
the similarly named Zotero.Item object in the extension) is almost
JSON. There is an additional complete() method that tells Zotero to
save the item, but beyond that, the data representation is only tied
to Zotero through the names of the fields.
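[Editorial sketch of the "almost JSON" point: a translator-side item can be modeled as a plain object plus a complete() callback supplied by the host, which is what makes the data portable to environments other than Zotero. makeItem and its onComplete parameter are hypothetical, shown only to illustrate the decoupling.]

```javascript
// A translator-side item as a plain object; the host (Zotero or
// anything else) decides what complete() does with it.
function makeItem(itemType, onComplete) {
  var item = { itemType: itemType, creators: [] };
  item.complete = function () { onComplete(item); };
  return item;
}
```

A non-Zotero host could pass an onComplete that serializes the item to generic JSON instead of saving it to a Zotero library.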

As this relates to using Zotero translators outside of Zotero, it
would be fairly easy to pipe the output from a web translator into an
export translator to get translator output in any format that Zotero
can export (including BIBO RDF).

Simon

Erik Hetzner

unread,
Jun 24, 2010, 2:50:44 AM6/24/10
to zoter...@googlegroups.com
At Wed, 23 Jun 2010 20:51:06 -0700 (PDT),

skornblith wrote:
> This looks great, and I would love to integrate it into Zotero. I
> think I would prefer it if the framework could be called from the
> doWeb() function, although it's not absolutely essential.

Hi Simon,

Thanks. That change seems sensible; all we need is:

function doWeb(doc, url) { fwDoWeb(doc, url); }

for the simple case.

I have made some more changes to try to get Google Scholar to work.
Everything except attachments is currently working in my Google
Scholar version. It is a little stranger than a simple scraper, but I
think it makes sense.

I am sure that if I look at more translators holes in the abstractions
will appear. I have however added as an example a San Francisco
Chronicle scraper, which was very easy to add with the framework.

best, Erik

PS: Code is available here:

http://e6h.org/~egh/hg/zotero-transfw

Bruce D'Arcus

unread,
Jun 24, 2010, 10:57:20 AM6/24/10
to zoter...@googlegroups.com
On Wed, Jun 23, 2010 at 11:51 PM, skornblith <si...@simonster.com> wrote:

> This looks great, and I would love to integrate it into Zotero. I
> think I would prefer it if the framework could be called from the
> doWeb() function, although it's not absolutely essential.

Can I just start from the beginning and ask: what does "doWeb" mean,
and is it really specific/descriptive enough for utmost clarity? I
mean, the documentation page says:

"Once Zotero displays an icon in the browser's address bar, it is
ready to run the piece of code that will create a new Zotero item and
populate its fields with metadata. This function is called doWeb, and
it is usually significantly more complicated than detectWeb."

This kind of leaves me wondering why we don't call it "parseContent"
or something, and describe it (to tie to my previous questions about
sharing this infrastructure) as:

"Once Zotero displays an icon in the browser's address bar, it is
ready to run the piece of code that will create a JSON object that can
be loaded into Zotero. This function is called parseContent, and it is
usually significantly more complicated than detectWeb."

Bruce

Bruce D'Arcus

unread,
Jun 24, 2010, 11:10:27 AM6/24/10
to zoter...@googlegroups.com
On Wed, Jun 23, 2010 at 11:29 AM, Duncan Johnson <dtk.j...@gmail.com> wrote:

> Like Trevor, I'm a lurker around here, but I'd like to chime in. I
> concur with Trevor that there really are good reasons to keep parsing
> search results.
>
> I use Zotero mostly to import search results from my own library
> catalog as well as resources like JSTOR and EBSCO. This is especially
> handy when I'm starting a research project and want to dump a number
> of records into my Zotero store before I go to pull the items from the
> shelf or download the PDFs for reading.
>
> The amount of added work involved if I had to import those records one-
> by-one would render Zotero practically unusable for me.
>
> I'm just guessing, but we probably could write some nice helper
> functions to eliminate the boilerplate code for translator authors,
> but I don't want to lose search results parsing along the way.

That's fine. My primary argument is that the needs of the complex
case should not hamstring the ease of use of the simple case.

Bruce

skornblith

unread,
Jun 24, 2010, 7:03:34 PM6/24/10
to zotero-dev
I agree that doWeb() isn't necessarily the ideal name, but
there's also a doImport() and a doSearch(), which can coexist with
doWeb() in the same translator and are also involved in parsing
content, and a doExport(), which is arguably less involved in parsing
content. Perhaps translateWeb()? mainWeb()? I should just be able to
do a find/replace on the existing translators, so I don't think it
would be problematic to change the name.
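
The find/replace Simon mentions could be as simple as a one-liner over the translator files; this is an illustrative sketch only (it assumes GNU sed and a translators/ directory, and creates a sample file so it is self-contained):

```shell
# Illustrative only: rename doWeb to translateWeb across translator
# files. Assumes GNU sed (-i, \b word boundaries); the translators/
# directory and sample file are stand-ins for a real checkout.
mkdir -p translators
printf 'function doWeb(doc, url) { return doWeb; }\n' > translators/Example.js
sed -i 's/\bdoWeb\b/translateWeb/g' translators/*.js
```

The `\b` word boundaries keep a hypothetical `doWebSomething` identifier from being mangled by the rename.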

Simon

Benjamin M. Miller

unread,
Jul 2, 2010, 5:59:53 PM7/2/10
to zoter...@googlegroups.com
Erik,

I've been playing around with the framework, and I want to verify my impression of how it works: 
As far as I can tell, FW.Scraper always works with a single item-type at a time; for databases with more than one item-type, you just call FW.Scraper several times, each time with unique "detect" criteria. So basically, you never have to write an if {} statement: the if-then behavior of detectWeb is all built into the generic code. 

Is that right? 

Ben


Erik Hetzner

unread,
Jul 2, 2010, 7:10:26 PM7/2/10
to zoter...@googlegroups.com
At Fri, 2 Jul 2010 17:59:53 -0400, Benjamin M. Miller wrote:

Hi Ben,

Yes, that is correct. The detect is used for both detectWeb and doWeb
so that the right icon is displayed & the right scraper is used.

More exactly: if a scraper's detect criteria evaluate to a non-empty
array, that scraper is used for the page. If there is one scraper
without a detect, it is always used. If there are multiple scrapers
whose detects evaluate to non-empty arrays, or which have no detect,
the behavior is undefined.

Thanks for asking about this. I have added this information to the
README file.
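
That selection rule can be sketched in standalone JavaScript; this is an illustration of the rule as described above, not the framework's actual code, and the scraper objects are invented for the example:

```javascript
// Standalone sketch of the scraper-selection rule: a scraper is chosen
// if its detect yields a non-empty array; a scraper with no detect
// acts as a catch-all. Not the FW framework source.
function chooseScraper(scrapers, doc) {
  var fallback = null;
  for (var i = 0; i < scrapers.length; i++) {
    var s = scrapers[i];
    if (!s.detect) {
      fallback = s;   // used only if no detect matches
    } else if (s.detect(doc).length > 0) {
      return s;       // a matching detect wins (the framework leaves
                      // multiple matches undefined)
    }
  }
  return fallback;
}

var scrapers = [
  { itemType: "thesis",
    detect: function (d) { return d.isThesis ? [d] : []; } },
  { itemType: "journalArticle",
    detect: function (d) { return d.isArticle ? [d] : []; } }
];

var picked = chooseScraper(scrapers, { isThesis: true });
```

This is why, as Ben observed, a translator author never has to write the if/else dispatch by hand: each FW.Scraper call carries its own detect.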

best, Erik

Benjamin M. Miller

unread,
Jul 3, 2010, 1:10:42 AM7/3/10
to zoter...@googlegroups.com
Cool, thanks.

I've spent a couple of hours playing around with ProQuest Digital Dissertations, and I think I've discovered two hiccups in the framework code. First, the cleanAuthor function doesn't account for useComma; adding an extra parameter to the definition seems to work fine:

this.cleanAuthor = function(type, useComma) {
        return this.addFilter(function(s) { 
               return Zotero.Utilities.cleanAuthor(s, type, useComma); 
        });
};

Second, when FW._Scraper creates fields as a new Array, the fields it creates assume that we're looking at a journal article. What's the best way to make this more contingent on what item type we're actually detecting?

-Ben

Erik Hetzner

unread,
Jul 3, 2010, 3:02:37 AM7/3/10
to zoter...@googlegroups.com
At Sat, 3 Jul 2010 01:10:42 -0400,

Benjamin M. Miller wrote:
>
> Cool, thanks.
>
> I've spent a couple of hours playing around with ProQuest Digital
> Dissertations, and I think I've discovered two hiccups in the framework
> code. First, the cleanAuthor function doesn't account for useComma - just
> adding an extra parameter into the definition seems to work fine:
>
> this.cleanAuthor = function(type, useComma) {
>         return this.addFilter(function(s) {
>                 return Zotero.Utilities.cleanAuthor(s, type, useComma);
>         });
> };

Hi Ben,

Thanks for the tip. I have fixed this.

> Second, when FW._Scraper creates fields as a new Array, the fields it
> creates assume that we're looking at a journal article. What's the best way
> to make this more contingent on what item type we're actually detecting?

Sorry, I had not bothered to put together the list of all item fields.
This list is used to enumerate all the possible fields. If a scraper
does not use one, it is ignored. So it should not be necessary to
initialize it depending on the item type.

I have added all the fields I could find to the list, so they should
all work now.
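
Erik's "one master list" approach can be illustrated like this; the sketch below is not the framework source, and ALL_FIELDS is a tiny stand-in for the full Zotero field list:

```javascript
// Sketch of the single-field-list approach: iterate every known field
// name and copy only the values a scraper actually produced, so the
// list never needs to be initialized per item type. ALL_FIELDS is a
// small stand-in for the complete Zotero field list.
var ALL_FIELDS = ["title", "publicationTitle", "volume",
                  "pages", "university", "date"];

function populateItem(scraped) {
  var item = {};
  ALL_FIELDS.forEach(function (f) {
    if (scraped[f] !== undefined) {  // fields the scraper skipped are ignored
      item[f] = scraped[f];
    }
  });
  return item;
}

// A thesis scraper fills only thesis-relevant fields; journal-article
// fields like publicationTitle simply never appear on the item.
var thesisItem = populateItem({ title: "A Dissertation",
                                university: "Example U" });
```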

Thanks for the feedback!

best, Erik

Frank Bennett

unread,
Jul 3, 2010, 3:32:01 AM7/3/10
to zotero-dev
Erik,

A while back I put together a per-item-type listing of the mappings
between Zotero and CSL types and fields, covering everything but the
Creator vars. If you think the index would be useful to translator
authors, feel free to link to it in your docs:

http://gsl-nagoya-u.net/http/pub/csl-fields/

It's based on the mappings in current Zotero 2.0.

Frank

