... as well as my continued belief that writing translators is too damned hard.
Is there anything that can be done to make it much easier to:
1) share translators among different projects, and so avoid
duplication of effort
2) author translators (lower barrier to entry; the barrier is too high for me)
3) add Zotero-like import capabilities to other clients (other
browsers, but also mobile devices)
...?
Bruce
> Is there anything that can be done to make it much easier to ...
I guess to get a little more concrete, can people with translator
experience point me to three or four exemplars of common types of
translators? E.g. maybe a simple one that can just be coded with
simple CSS selectors to more complex ones that require regex
manipulation and such?
I'm still wondering if it's possible to do most of this with some sort
of simple definition that ties into some generic functions.
Bruce
And to cite the previous discussion on this (which never came to anything):
<http://groups.google.com/group/zotero-dev/browse_thread/thread/6da7483217c2b178>
I really want to make this stuff sharable across projects, and easier
to write and debug.
Bruce
Actually, ideally, not written in any programming language at all. But
if absolutely necessary, then at least requiring substantially less
code. The times when I've needed a translator, the scraping is
conceptually very simple. I could have written them in XSLT very
quickly.
And on the "other environments"; yes, including on servers.
> Re substantial simplification, most of the jiggery-pokery in screen-
> scraping translators does serve a purpose. In moving all of that
> logic to a higher level of abstraction, there would be a risk, at
> least, of ending up with a mix of syntactic sugar that is as
> complicated as the original. A shorter path to making things more
> accessible to potential translator authors might be to work up better
> documentation on the existing utility functions.
>
> It's a more modest step, but there's an additional discussion thread
> here, with some comments by Dan about plans for new utility functions
> with cleaner syntax:
>
> http://forums.zotero.org/discussion/12086/meta-request-easier-tools-to-write-translators-with/#Comment_58744
Ah, I never saw that thread.
Bruce
On Sun, Jun 20, 2010 at 6:57 PM, skornblith <si...@simonster.com> wrote:
> The types of translators are pretty closely related to the types of
> pages that databases provide. Off the top of my head, there are three
> major ways that sites provide search results:
>
> 1) Search results where one can check a bunch of check boxes, click a
> button, and download a file to get all the references (e.g., Voyager
> Library Catalog)
> 2) Search results where individual pages for each result have to be
> parsed before continuing (e.g., Google Books)
> 3) Search results that include direct links to reference data, or
> where reference data locations can be inferred from the URL (e.g.,
> NCBI PubMed, Google Scholar after some cookie manipulation)
>
> Then there are three major ways that one can parse information from
> individual pages:
>
> 1) Scrape information directly from page content or meta tags (e.g.,
> NYTimes.com)
> 2) Direct link to reference data (e.g., Nature)
> 3) Link to a link to reference data, or various other levels of
> indirection (e.g., EBSCOhost)
In all cases that I've come across where I thought I'd like to write a
translator, it's been of the second group, type 1. E.g. I'm reading a
page (at the Christian Science Monitor, or NPR, or the ACLU) and I
need to add the metadata. And as I've said, these are typically
trivial from just a conceptual perspective.
To me, the ability to pull in multiple item metadata from search
results is another order of complexity, and not even strictly
necessary for what I'm talking about.
> I'm skeptical that this can really be done efficiently without any
> programming code at all. It should be possible to simplify some things
> with better utility functions, but a large proportion of major sites
> require some non-trivial manipulations.
So I guess given all this I'm wondering if there's some low-hanging
fruit that could be greatly simplified?
Bruce
> In all cases that I've come across where I thought I'd like to write a
> translator, it's been of the second group, type 1. E.g. I'm reading a
> page (at the Christian Science Monitor, or NPR, or the ACLU) and I
> need to add the metadata. And as I've said, these are typically
> trivial from just a conceptual perspective.
To put a really fine point on this ...
I'm just now reading an article at the Atlantic. I realize Zotero
isn't detecting it, so I figure there's no translator.
Time for me to figure out how to parse the following ...
title = div.articleHead > h1
author = div.articleHead > h5.author > a.author
date = div.articleHead > h4.issueDetails > a.issueTitle [this one
needs more parsing though]
... = two minutes.
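For what it's worth, those three selectors translate almost mechanically into code. A hypothetical sketch (scrapeAtlantic and its helper are made-up names, and the selectors are just the guesses above, not a verified Atlantic translator):

```javascript
// Hypothetical sketch: turn the three selectors above into an item object.
// Not an existing Zotero function; class names are guesses from the page.
function scrapeAtlantic(doc) {
    function text(selector) {
        var node = doc.querySelector(selector);
        return node ? node.textContent.trim() : null;
    }
    return {
        itemType: "magazineArticle",
        title: text("div.articleHead > h1"),
        author: text("div.articleHead > h5.author > a.author"),
        // the issue title still needs extra parsing to become a clean date
        date: text("div.articleHead > h4.issueDetails > a.issueTitle")
    };
}
```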
Bruce
Maybe it's because I started programming in XSLT/XPATH, but I find the
above code heinously ugly/complex/confusing. Why can't it be as simple
as:
{
'title': '//div[@class="articleHead"]/h4[@class="issueDetails"]/a[@class="issueTitle"]'
}
E.g. just feed the parser a map of xpath (or similarly
simple-but-powerful) expressions?
Yes, I know things immediately get more complicated when you need to
parse the result, so some syntactic sugar on a helper function might
also be more appealing. Of course, xpath has functions as well that
work pretty nicely.
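To sketch the "map plus generic function" idea concretely: the site-specific part is nothing but a map of field names to XPath strings, and one generic function walks the map. All names here are hypothetical, and the XPath evaluation step is passed in as a function so the shape of the idea is visible without a browser DOM:

```javascript
// Hypothetical sketch, not Zotero API: one generic scraper driven by a map
// of field names to XPath expressions. xpathText(doc, expr) is whatever
// evaluates an XPath to a string in the host environment.
function scrapeWithMap(doc, fieldMap, xpathText) {
    var item = {};
    for (var field in fieldMap) {
        var value = xpathText(doc, fieldMap[field]);
        if (value !== null) item[field] = value;
    }
    return item;
}

// In a browser, xpathText would be roughly:
// function xpathText(doc, expr) {
//     var r = doc.evaluate(expr, doc, null, XPathResult.STRING_TYPE, null);
//     return r.stringValue || null;
// }
```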
> Then again, most translators are significantly more complicated than
> this because they also scrape search results.
"Most"? Does this include "most" of the yet-to-do translators as well
(examples like I mentioned: NPR, the Atlantic, etc.)?
For users, how important is the functionality compared to the single
page stuff? For me the answer is not much at all (I believe I've used
multiple import once in the past few years).
Is there a way the simpler stuff I'm looking for could complement the
more hairy search results stuff?
Bruce
...
> We can certainly write a helper function to take input that looks like
> that—I'll look into it. I think scraping multiple authors could be a
> sticky issue, though, and I'd welcome input on that.
Do we have any data on the sorts of contributor list cases we can expect?
I would guess there are some cases where each contributor is wrapped
in its own tag, such that could specific it with xpath of css
selectors, but that the more common case is free text like "John Doe
and Jane Smith"?
Let me stick to xpath to avoid reinventing the wheel and talk about
something that works now.
//div[@class="author"] pulls in all such elements anywhere in the
tree, and so works for the first case I note.
Something like tokenize(*/*/div[@class="author"], ' and ') would
work for a simple example of the second case.
Alternately, you could write a function for this such that you could
just do something like:
splitNames(*/*/div[@class="author"])
E.g. we know certain variables are always simple strings, but that
others are lists. That helps.
We further know that xpath (at least) can represent both. That helps some more.
We just need a way to say "do this magic on this string".
At least that's my idea. Does it make sense?
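A minimal sketch of what such a splitNames could do, assuming bylines of the form "First Last" joined by "and", commas, or "&" (the name comes from the message above; this is not an existing Zotero utility, and real bylines will need more cases):

```javascript
// Hypothetical splitNames: turn "John Doe and Jane Smith" into creator
// objects of the shape Zotero items use. Assumes "First Last" name order.
function splitNames(text) {
    return text.split(/\s+and\s+|\s*[,&]\s*/)
        .filter(Boolean)
        .map(function (name) {
            var parts = name.trim().split(/\s+/);
            return {
                firstName: parts.slice(0, -1).join(" "),
                lastName: parts[parts.length - 1],
                creatorType: "author"
            };
        });
}
// splitNames("John Doe and Jane Smith")
// → two creators: Doe, John and Smith, Jane
```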
Bruce
On Tue, Jun 22, 2010 at 9:12 AM, Bruce D'Arcus <bda...@gmail.com> wrote:
> I would guess there are some cases where each contributor is wrapped
> in its own tag, such that could specific it with xpath of css
> selectors ...
.. should be:
I would guess there are some cases where each contributor is wrapped
in its own tag, such that one could specify it with xpath or css
selectors ...
Bruce
Hi,
I think Bruce is right here. There is a lot of boilerplate code in
translators, from what I have seen. To quote from “How to write a
zotero translator” [1]:
The following section of code works for 99% of websites; therefore
you can use it as a template.
As a first step of the kind of thing that could be possible, I have
made a translator framework, which works for Digital Humanities
Quarterly and the Atlantic Monthly (though I have not done the
multiple results for the Atlantic, & it only works on the articles,
not the blog). The site specific code looks like this:
function mkScraper(type) {
    if (type == "magazineArticle") {
        return new Scraper({
            title : new Xpath('//head/meta[@name="title"]/@content').remove(/- Magazine - The Atlantic/).trim().first(),
            itemType : 'magazineArticle',
            publicationTitle : "The Atlantic Monthly",
            date : new Xpath('//div[@class="articleHead"]/h4[@class="issueDetails"]/a[@class="issueTitle"]').match(/([\/A-Za-z\/]+ [0-9]+)/).first(),
            creators : new Xpath('//div[@class="articleHead"]/h5[@class="author"]/a[@class="author"]').cleanAuthor("author")
        });
    }
}

function detectWeb(doc, url) {
    return "magazineArticle";
}
and for the more complicated DHQ:
function mkScraper(itemType) {
    if (itemType == "journalArticle") {
        return new Scraper({
            title : new Xpath('//h1[@class="articleTitle"]').first(),
            itemType : 'journalArticle',
            publicationTitle : "Digital Humanities Quarterly",
            date : new Xpath('//div[@id="pubInfo"]').match(/(.*)Volume\s+\d+\s+Number\s+\d+/).first(),
            volume : new Xpath('//div[@id="pubInfo"]').match(/.*Volume\s+(\d+)\s+Number\s+\d+/).first(),
            issue : new Xpath('//div[@id="pubInfo"]').match(/.*Volume\s+\d+\s+Number\s+(\d+)/).first(),
            creators : new Xpath('//div[@class="author"]/a[1]').cleanAuthor("author"),
            attachments : function (doc, url) { return [{ url: url, title: "DHQ Snapshot", mimeType: "text/html" }]; }
        });
    } else if (itemType == "multiple") {
        return new MultiScraper({
            itemTrans : mkScraper("journalArticle"),
            items : new Xpath('//div[@id="mainContent"]/div/p/a').raw()
        });
    }
}

function detectWeb(doc, url) {
    if (new Xpath('//div[@class="DHQarticle"]').evaluate(doc)) {
        return "journalArticle";
    } else if (new Xpath('//div[@id="mainContent"]/div/p').evaluate(doc)) {
        return "multiple";
    } else {
        return undefined;
    }
}
Most of the magic is in the expressions that look like:
new Xpath('//head/meta[@name="title"]/@content').remove(/- Magazine - The Atlantic/).trim().first()
This is evaluated later to do the following:
1. Select an xpath, map it into an array.
2. Remove instances of the regex in each string in the array
3. Call Zotero.Utilities.trim() on each member of the array.
4. Get the first element of the array, since there should only be one.
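The chaining itself can be sketched independently of the DOM step (Filter is a made-up name; in Erik's framework the Xpath expression is evaluated lazily against the document first, and then a pipeline like this runs on the resulting array of strings):

```javascript
// Sketch of the remove().trim().first() pipeline over an array of strings.
// Each step returns a new Filter, so calls chain like in the framework.
function Filter(values) { this.values = values; }
Filter.prototype.remove = function (re) {
    return new Filter(this.values.map(function (s) { return s.replace(re, ""); }));
};
Filter.prototype.trim = function () {
    return new Filter(this.values.map(function (s) { return s.trim(); }));
};
Filter.prototype.first = function () { return this.values[0]; };

// The Atlantic title example, minus the XPath evaluation:
var title = new Filter(["My Article - Magazine - The Atlantic"])
    .remove(/- Magazine - The Atlantic/)
    .trim()
    .first();
// title === "My Article"
```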
I think that a system like this would be easier both for experienced
authors of translators, to avoid boilerplate code, and for new authors
of translators, who would need to learn less (I would hope).
This is only a first stab at an idea. If others like it, I can put it
up somewhere. Something like this could be cut & pasted into every
translator file if Zotero proper does not want to do something along
these lines.
best, Erik
1. http://niche-canada.org/member-projects/zotero-guide/chapter15.html
> ... The site specific code looks like this:
Definitely closer to what I was looking for. Death to boilerplate!
> This is only a first stab at an idea. If others like it, I can put it
> up somewhere. Something like this could be cut & pasted into every
> translator file if Zotero proper does not want to do something along
> these lines.
I'd say definitely "put it somewhere."
Bryan, how does this fit with any ideas you had?
Two other questions:
1) so what about my somewhat provocative claim that parsing search
results is not an important feature, and so adds unnecessary
complexity?
2) what is the output of these translators? Is it some Zotero-specific
object, or generic JSON that Zotero ingests? I ask this in part
because we've got a few different JSON bib representations, and some
impetus to bring them together (certainly on the CSL side, but likely
more broadly).
Bruce
OK, it is here:
http://e6h.org/~egh/hg/zotero-transfw
> Bryan, how does this fit with any ideas you had?
>
> Two other questions:
>
> 1) so what about my somewhat provocative claim that parsing search
> results is not an important feature, and so adds unnecessary
> complexity?
As I understand it, search results means parsing multiple items per
page & presenting a selection dialog? My version of the Digital Humanities
Quarterly
translator does this. I am currently looking at the Google Scholar
translator to see how it could fit in - it will certainly require some
changes. From what I have seen I don’t see search results as being
particularly more complicated than single results.
For what it’s worth, I use the Google Scholar feature a lot.
It seems to me that there are a number of reasonably simple
translators that could have boilerplate removed, and a few complex
ones that cannot (MARC, etc.). It would be nice to make the simpler
ones simpler.
> 2) what is the output of these translators? Is it some Zotero-specific
> object, or generic JSON that Zotero ingests? I ask this in part
> because we've got a few different JSON bib representations, and some
> impetus to bring them together (certainly on the CSL side, but likely
> more broadly).
That could be possible. There is no generic output, but different
functions could be used to build the Zotero item, or a generic JSON
representation, from the same Scraper object that is created by the
author.
best, Erik
Hi Simon,
Thanks. That change seems sensible; all we need is:
function doWeb(doc, url) { fwDoWeb(doc, url); }
for the simple case.
I have made some more changes to try to get Google Scholar to work.
Everything except attachments is currently working in my Google
Scholar version. It is a little stranger than a simple scraper, but I
think it makes sense.
I am sure that as I look at more translators, holes in the
abstractions will appear. I have, however, added as an example a San
Francisco Chronicle scraper, which was very easy to add with the framework.
best, Erik
PS: Code is available here:
> This looks great, and I would love to integrate it into Zotero. I
> think I would prefer it if the framework could be called from the
> doWeb() function, although it's not absolutely essential.
Can I just start from the beginning and ask: what does "doWeb" mean,
and is it really specific/descriptive enough for utmost clarity? I
mean, the documentation page says:
"Once Zotero displays an icon in the browser's address bar, it is
ready to run the piece of code that will create a new Zotero item and
populate its fields with metadata. This function is called doWeb, and
it is usually significantly more complicated than detectWeb."
This kind of leaves me wondering why we don't call it "parseContent"
or something, and describe it (to tie to my previous questions about
sharing this infrastructure) as:
"Once Zotero displays an icon in the browser's address bar, it is
ready to run the piece of code that will create a JSON object that can
be loaded into Zotero. This function is called parseContent, and it is
usually significantly more complicated than detectWeb."
Bruce
> Like Trevor, I'm a lurker around here, but I'd like to chime in. I
> concur with Trevor that there really are good reasons to keep parsing
> search results.
>
> I use Zotero mostly to import search results from my own library
> catalog as well as resources like JSTOR and EBSCO. This is especially
> handy when I'm starting a research project and want to dump a number
> of records into my Zotero store before I go to pull the items from the
> shelf or download the PDFs for reading.
>
> The amount of added work involved if I had to import those records one-
> by-one would render Zotero practically unusable for me.
>
> I'm just guessing, but we probably could write some nice helper
> functions to eliminate the boilerplate code for translator authors,
> but I don't want to lose search results parsing along the way.
That's fine. My primary argument is that the needs for the complex
case should not hamstring ease-of-use of the simple case.
Bruce
Hi Ben,
Yes, that is correct. The detect is used for both detectWeb and doWeb
so that the right icon is displayed & the right scraper is used.
More exactly: if the detect criterion evaluates to a non-empty array,
that scraper is used for the page. If there is one scraper without a
detect, it is always used. If there are multiple scrapers whose
detects evaluate to non-empty or which have no detect, the behavior is
undefined.
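That selection rule can be sketched in a few lines (pickScraper is a made-up name, and this sketch resolves the "undefined" overlap case by letting the first scraper in the list win):

```javascript
// Sketch of the scraper-selection rule described above: the first scraper
// whose detect yields a non-empty array wins; a scraper with no detect is
// always eligible; overlapping detects are left as first-wins here.
function pickScraper(scrapers, doc) {
    for (var i = 0; i < scrapers.length; i++) {
        var s = scrapers[i];
        if (!s.detect || s.detect(doc).length > 0) return s;
    }
    return null;
}
```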
Thanks for asking about this. I have added this information to the
README file.
best, Erik
Hi Ben,
Thanks for the tip. I have fixed this.
> Second, when FW._Scraper creates fields as a new Array, the fields it
> creates assume that we're looking at a journal article. What's the best way
> to make this more contingent on what item type we're actually detecting?
Sorry, I had not bothered to put together the list of all item fields.
This list is used to enumerate all the possible fields. If a scraper
does not use one, it is ignored. So it should not be necessary to
initialize it depending on the item type.
I have added all the fields I could find to the list, so they should
all work now.
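The behavior described above can be sketched like this (buildItem is a made-up name, and the field list here is truncated for illustration; the real framework enumerates every Zotero item field):

```javascript
// Sketch: one master list of known item fields; anything the scraper did
// not set is simply skipped, so nothing is initialized per item type.
var ITEM_FIELDS = ["title", "date", "volume", "issue",
                   "publicationTitle", "url", "abstractNote"];

function buildItem(scraped) {
    var item = {};
    ITEM_FIELDS.forEach(function (field) {
        if (scraped[field] !== undefined) item[field] = scraped[field];
    });
    return item;
}
```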
Thanks for the feedback!
best, Erik