Need Help with Translator


maxximuscool

Jul 21, 2010, 6:04:50 PM
to zotero-dev
Hello guys, I've edited the translator for NZherald.co.nz because it was
broken, and I got it to work as it should. The only remaining problem
is that the regular expression does not work with multiple authors at all.
I've tried everything and still have no luck.

Below is the page from which I would like to get multiple authors into
the Zotero database.
http://www.nzherald.co.nz/nz/news/article.cfm?c_id=1&objectid=10651102

Below is the code:

function detectWeb(doc, url) {
    if (doc.title.indexOf("Search Results") != -1) {
        return "multiple";
    } else if (doc.location.href.indexOf("/news") != -1) {
        return "newspaperArticle";
    }
}

function scrape(url) {
    Zotero.Utilities.HTTP.doGet(url, function(text) {
        var newItem = new Zotero.Item("newspaperArticle");
        newItem.url = url;
        newItem.publicationTitle = "New Zealand Herald";

        //How to get first author? This grabs the last author instead???
        var aut = /<span class=\"credits\">[^<]*<a href=\".*\">(.*)\<\/a>/;

        if (text.match(aut)) {
            var author = text.match(aut)[1];
            newItem.creators.push(Zotero.Utilities.cleanAuthor(author, "author"));
        }

        //Get second author
        var aut2 = /<span class=\"credits\">[^<]*<a href=\".*\">(.*)\<\/a>/;

        if (text.match(aut2)) {
            var author2 = text.match(aut2)[1];
            newItem.creators.push(Zotero.Utilities.cleanAuthor(author2, "contributor"));
        }

        //abstract
        var a = /meta name=\"description\" content=\"([^&]*)/;
        newItem.abstractNote = text.match(a)[1];

        //title and section
        var t = /<title>(.*)<\/title>/;
        var result = text.match(t)[1].split(" - ");
        newItem.title = result[0];
        newItem.section = result[1];

        //keywords
        var k = /<meta name=\"keywords\" content=\"(.*)\"/;
        var kwords = Zotero.Utilities.cleanString(text.match(k)[1]).split(", ");
        for (var i = 0 ; i < kwords.length ; i++) {
            newItem.tags.push(kwords[i]);
        }

        //-------------------------DATE
        var s = /<div class=\"tools\">[^<]*<span>(.*)<\/span>/;
        newItem.date = text.match(s)[1];

        /*
        var s = /class=\"current\"><.*><span>(.*)<\/span>/;
        newItem.date= text.match(s)[1];
        */
        //--------------------
        newItem.complete();
        Zotero.debug(newItem);

        Zotero.done();
    }, function() {});
}

function doWeb(doc, url) {
    var articles = new Array();
    var names = new Array();
    if (doc.title.indexOf("Search Results:") != -1) {
        var URLS = new Array();
        var titles = new Array();
        var xpath = '//p[@class="g"]/a';
        var links = doc.evaluate(xpath, doc, null, XPathResult.ANY_TYPE, null);
        var link = links.iterateNext();

        while (link) {
            URLS.push(link.href);
            titles.push(link.textContent);
            link = links.iterateNext();
        }

        Zotero.debug(titles);
        Zotero.debug(URLS);

        var newItems = new Object();

        for (var i = 0 ; i < titles.length ; i++) {
            newItems[URLS[i]] = titles[i];
        }

        newItems = Zotero.selectItems(newItems);

        Zotero.debug(newItems);

        for (var i in newItems) {
            articles.push(i);
            names.push(newItems[i]);
        }
    } else {
        articles.push(doc.location.href);
        names.push(Zotero.Utilities.cleanString(doc.title.split("-")[0]));
    }

    Zotero.debug(articles);

    Zotero.Utilities.HTTP.doPost(articles, "", function(text) {
        for (var i = 0 ; i < articles.length ; i++) {
            scrape(articles[i]);
        }
    });

    Zotero.wait();
}


Can anyone help me out?
Thank you.

Best regards
Maxx

skornblith

Jul 22, 2010, 4:54:33 PM
to zotero-dev
Running a regexp on the raw HTML is probably not the best way of doing
this; unless you have to make cross-domain requests, it might be
easier to use XPaths.
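
For illustration, here is a rough, untested-against-the-live-site sketch of what the XPath route could look like. The XPath '//span[@class="credits"]/a' is only a guess based on the markup implied by the regexps in this thread, and xpathTexts and assignRoles are hypothetical helper names, not part of the Zotero API. The doc argument is the page document that doWeb(doc, url) already receives.

```javascript
// Collect the text of every node matched by an XPath expression.
// `doc` is the page's DOM document, as passed to doWeb(doc, url).
function xpathTexts(doc, xpath) {
  // 0 is the numeric value of XPathResult.ANY_TYPE, which the
  // translator in this thread already uses for its search results.
  var nodes = doc.evaluate(xpath, doc, null, 0, null);
  var texts = [];
  var node;
  while ((node = nodes.iterateNext())) {
    // Collapse runs of whitespace and trim the ends.
    texts.push(node.textContent.replace(/\s+/g, " ").trim());
  }
  return texts;
}

// Pair each name with a creator type: the first name becomes the
// "author", every later one a "contributor".
function assignRoles(names) {
  return names.map(function (name, i) {
    return { name: name, role: i === 0 ? "author" : "contributor" };
  });
}

// Inside a translator this might be used roughly as follows
// (the XPath is an assumption about the page markup):
//
//   var names = xpathTexts(doc, '//span[@class="credits"]/a');
//   assignRoles(names).forEach(function (c) {
//     newItem.creators.push(Zotero.Utilities.cleanAuthor(c.name, c.role));
//   });
```

Because the XPath returns one node per link, multiple authors fall out of the loop naturally, with no need to worry about greedy matching.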

If you are sure you want to play with regexps, your regular expression
probably does not work because (.*) is greedy, and so will match
everything between <span class="credits"> and the last occurrence of
</a> on the same line (in JavaScript, . does not match line breaks),
rather than the next occurrence of </a>, which is probably not what you
want. Instead of (.*), try ([^<]*), which will match until the next <
character, or (.*?), which will match as little as possible (but can be
slower). This also seems to be an issue with a few other regexps here.
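
To make the difference concrete, here is a small sketch that runs the three variants against a made-up line of markup. The sample string is hypothetical (two author links on one line, which is only my guess at what the page looks like), and the variable names are just for illustration.

```javascript
// Hypothetical markup: two author links on a single line.
var html = '<span class="credits">By <a href="/a/1">Jane Doe</a>' +
           ' and <a href="/a/2">John Roe</a></span>';

// Greedy: each (.*) grabs as much as it can, so the match runs to the
// LAST </a> on the line and the name group ends up as the last author.
var greedy = /<a href="(.*)">(.*)<\/a>/;

// Negated character classes: stop at the next quote / next "<".
var negated = /<a href="([^"]*)">([^<]*)<\/a>/;

// Lazy quantifiers: match as little as possible.
var lazy = /<a href="(.*?)">(.*?)<\/a>/;

var greedyName = html.match(greedy)[2];
var negatedName = html.match(negated)[2];
var lazyName = html.match(lazy)[2];

console.log(greedyName);  // John Roe
console.log(negatedName); // Jane Doe
console.log(lazyName);    // Jane Doe
```

The greedy version reproduces the "grabs the last author" symptom from the original post, while both alternatives stop at the first link.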

To get multiple matches, add "g" after the final delimiter for your
regexp, and use the exec method of the regexp object rather than the
match method of the text.

Try something like this (untested):

var aut = /<span class=\"credits\">[^<]*<a href=\"[^\"]*\">([^<]*)<\/a>/g;

// find first match and assign as author
var m;
if (m = aut.exec(text)) {
    newItem.creators.push(Zotero.Utilities.cleanAuthor(m[1], "author"));
}

// find subsequent matches and assign as contributors
while (m = aut.exec(text)) {
    newItem.creators.push(Zotero.Utilities.cleanAuthor(m[1], "contributor"));
}
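
One caveat, offered as a guess since I have not checked the page source: if all the author links sit inside a single <span class="credits">, a pattern that starts at the span can only match once. A possible workaround is to isolate the span first and then loop over the links inside it with a "g" regexp and exec. The sample text below is made up, and the plain objects stand in for what a translator would pass to Zotero.Utilities.cleanAuthor.

```javascript
// Hypothetical page text: one credits span holding two author links.
var text = '<div><span class="credits">By <a href="/a/1">Jane Doe</a>' +
           ' and <a href="/a/2">John Roe</a></span></div>';

var creators = [];

// Step 1: isolate the credits block. [\s\S]*? lazily matches any
// characters, including line breaks, up to the first </span>.
var block = /<span class="credits">([\s\S]*?)<\/span>/.exec(text);

if (block) {
  // Step 2: with the "g" flag, each exec call resumes where the
  // previous match ended, so the loop visits every link in turn.
  var link = /<a href="[^"]*">([^<]*)<\/a>/g;
  var m;
  while ((m = link.exec(block[1])) !== null) {
    creators.push({
      name: m[1].trim(),
      // First match is the author, later ones are contributors.
      // In a translator this would instead be something like:
      //   newItem.creators.push(
      //       Zotero.Utilities.cleanAuthor(m[1], role));
      role: creators.length === 0 ? "author" : "contributor"
    });
  }
}

console.log(creators);
```

Keeping the span lookup separate from the link loop also means a page with no credits block simply yields an empty list instead of throwing.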

On Jul 21, 3:04 pm, maxximuscool <maxximusc...@gmail.com> wrote:
> Hello guys, I've edited a translator for NZherald.co.nz because it is
> broken and I got it to work like it should now. But the only problem
> is the regular Expression will not work with multiple authors at all.
> I've tried everything and still no luck.
>
> Below is the site where I would like to get multiple authors into
> Zotero database.
> http://www.nzherald.co.nz/nz/news/article.cfm?c_id=1&objectid=10651102

maxximuscool

Jul 23, 2010, 6:37:03 PM
to zotero-dev
Thank you for your help. I've tried the XPath method, but somehow I can't
seem to get it to work. Can you show me how it is done with XPath? Sorry,
I'm still learning to write translators at the moment.

Thank you very much.

Best regards,
Maxx

Tom Roche

Jul 27, 2010, 12:14:35 PM
to zoter...@googlegroups.com

maxximuscool Fri, 23 Jul 2010 15:37:03 -0700 (PDT)

> Thank you for your help. I've tried the XPath method, but somehow I
> can't seem to get it to work. Can you show me how it is done with XPath?

See links to the translator docs (such as they are) from

http://www.zotero.org/support/dev/creating_translators_for_sites

What would probably most help you:

http://niche-canada.org/member-projects/zotero-guide/chapter5.html
http://niche-canada.org/member-projects/zotero-guide/chapter11.html

Unfortunately that guide has suffered from a lack of maintenance (it is
not a wiki resource). If you want to work through the examples with
up-to-date tools (notably Scaffold 2.0), try

http://www.zotero.org/support/dev/how_to_write_a_zotero_translator_plusplus#chapter_5xpath_directions
http://www.zotero.org/support/dev/how_to_write_a_zotero_translator_plusplus#chapter_11xpath_containers

noting that

http://www.zotero.org/support/dev/how_to_write_a_zotero_translator_plusplus#chapter_0introduction
> This page (aka HWZT++) updates and wikifies HWZT. For the moment, it
> is merely a list of deltas to HWZT, organized by HWZT chapter: for
> each HWZT chapter, you must read it, then read the delta(s) if any,
> then execute appropriately.

Also, both are sequential and somewhat cumulative, so it may help to
read from the beginning.

HTH, Tom Roche <Tom_...@pobox.com>
