The main aim of the new branch is to improve the way we handle srcids,
the unique IDs we derive from article URLs to determine whether we've
already got the articles in the database or not.
The current system has a couple of problems:
1) it sometimes gets articles mixed up when multiple papers use the same
website (eg guardian and observer), as in those cases you can't always
determine which newspaper it is just from the URL.
2) we have no uniform way to map an arbitrary url to a srcid (currently
the srcids are just a byproduct of the individual scrapers, and there is
no interface to say "give me the srcid for this url"). This is really
important as we start trying to match up the articles we've got to other
references on the net - eg on digg, del.icio.us, technorati etc...
To address these, I'm going to rejig the scrapers a little.
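For anyone who'd like a picture of the "give me the srcid for this url"
interface from (2), here's a rough sketch. It is only an illustration and
not the actual scraper code: the function name, the rule table, the
example hostnames and the srcid format are all made up. It also assumes
the srcid is keyed on the website rather than the newspaper, which is one
way to sidestep problem (1).

```python
import re
from urllib.parse import urlparse

# Hypothetical per-site rules: (hostname suffix, regex that pulls a stable
# id out of the URL path). The patterns are invented for illustration and
# don't describe any real site's URL layout.
SRCID_RULES = [
    ("example-paper.co.uk", re.compile(r"/article/(?P<id>\d+)")),
    ("example-news.com", re.compile(r"/story/(?P<id>[a-z0-9-]+)")),
]


def srcid_from_url(url):
    """Return a srcid for url, or None if no rule recognises it."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    for suffix, pattern in SRCID_RULES:
        if host == suffix or host.endswith("." + suffix):
            match = pattern.search(parts.path)
            if match:
                # Assumption: key on the site, not the newspaper, so two
                # papers sharing one website can't get mixed up here.
                return "%s_%s" % (suffix, match.group("id"))
    return None
```

With something like this, each scraper would only need to contribute its
own rule, and anything else (eg matching links found on other sites) could
call the one function without knowing about individual scrapers.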
If anyone wants messy details, let me know :-)
Ben.
The upshot is that it's now much easier to find an article in our
database from a given URL.
The plan is now to link up information from other sites (eg technorati,
digg, del.icio.us etc) to articles we've got - eg "which blogs link to
this article?", kind of thing.
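To give an idea of what that linking might look like, here's a small
sketch. Everything in it is hypothetical: srcid_from_url() is a simplified
stand-in for the real URL-to-srcid mapping, the articles dict stands in
for the database, and the (blog_url, linked_url) pairs stand in for
whatever an external site would actually give us.

```python
from collections import defaultdict
from urllib.parse import urlparse


def srcid_from_url(url):
    """Stand-in for the real url -> srcid mapping (made-up rule:
    hostname without a leading 'www.' plus the path)."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def blogs_linking_to(articles, links):
    """Group external links by the article they point at.

    articles: dict of srcid -> headline (stand-in for the database).
    links: iterable of (blog_url, linked_url) pairs from an external site.
    Returns a dict of srcid -> list of blog urls linking to that article.
    """
    found = defaultdict(list)
    for blog_url, linked_url in links:
        srcid = srcid_from_url(linked_url)
        if srcid in articles:  # skip links to things we haven't scraped
            found[srcid].append(blog_url)
    return dict(found)
```

The point is just that once any URL can be turned into a srcid, variations
of the same link (with or without "www.", trailing slashes, tracking
parameters) all end up pointing at the same article.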
A nice side effect is that the scrapers no longer have to figure out
which newspaper an article is from _before_ scraping it. This was
causing some mixups on sites which host multiple newspapers (eg the
Guardian site also hosts the Observer).
Ben.