My main aims with my osjava time are to get the SVN, JIRA, site and
dist rehosted by the end of Jan.
Either to Robert's machine if he has time, or possibly looking at a
Google or SourceForge project - but neither of those seems very geared
towards a community rather than a single project. I did hear that
SourceForge does allow SVN imports of dumps, so that would be a plus.
Hen
1. My first suggestion: now that dbutils-1.1 is out, can we change the
dependency to it? That was my motivation in pushing for the release of
dbutils-1.1.
2. I have a number of patches to AbstractHttpFetcher. The basic thing
I'd like to see is factoring out some of the details of the main routine
into protected methods so that they can be overridden...and with that in
place, I have some suggested revisions for those methods. I'll try to
comment on them soon, when I get time. It's been a while since I wrote
this code.
3. Right now I'm thinking about the scrape-for-urls-then-scrape-the-urls
functionality that's now supported with Page.fetch(), which can only
easily be called before the Store takes control. But what if I want to
scrape a list of reports, go all the way to the Store to put them into a
database, then scrape only the indicated resources that aren't already
represented in the database? I think it would be nice for the Store to
be able to get a handle on the engine, and simple-schedule the secondary
scrapes on the fly. Possible?
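The factoring-out in (2) sounds like a template method: keep the main
routine fixed, but make each step a protected hook. A minimal sketch of
the idea (class and method names here are illustrative, not the actual
AbstractHttpFetcher API):

```java
// Hypothetical sketch of the proposed refactor: the monolithic fetch
// routine is split into protected hook methods subclasses can override.
abstract class AbstractFetcherSketch {
    // Template method: the overall algorithm stays fixed here...
    public final String fetch(String url) {
        String request = buildRequest(url);
        String raw = execute(request);
        return postProcess(raw);
    }
    // ...while each step is a protected hook a subclass may replace.
    protected String buildRequest(String url) { return "GET " + url; }
    protected abstract String execute(String request);
    protected String postProcess(String raw) { return raw.trim(); }
}

// Example subclass overriding one step without touching the main routine.
class EchoFetcher extends AbstractFetcherSketch {
    protected String execute(String request) {
        return "  response-to:" + request + "  ";
    }
}
```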
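For (3), one shape this could take is handing the Store a small callback
interface on the engine, so it can queue secondary scrapes as it decides
which resources are new. All names below are hypothetical, not the real
scraping-engine API:

```java
import java.util.LinkedList;
import java.util.Queue;

// Hypothetical handle the Store could be given on the engine.
interface EngineHandle {
    void scheduleScrape(String url);
}

// Toy engine that just queues the requested secondary scrapes.
class QueueingEngine implements EngineHandle {
    final Queue<String> pending = new LinkedList<String>();
    public void scheduleScrape(String url) { pending.add(url); }
}

class ReportStore {
    private final EngineHandle engine;
    ReportStore(EngineHandle engine) { this.engine = engine; }

    // After storing the report list, schedule a scrape only for URLs
    // that were not already represented in the database.
    void store(String reportUrl, boolean alreadyInDatabase) {
        if (!alreadyInDatabase) {
            engine.scheduleScrape(reportUrl);
        }
    }
}
```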
Alan
Alan B. Canon
Senior Java Developer
Genscape, Inc.
(502) 292-5334
I think at the moment it would mean writing a replacement for
http://svn.osjava.org/cgi-bin/viewcvs.cgi/trunk/simple-jndi/src/java/org/osjava/sj/loader/JndiLoader.java?rev=2221&view=markup
Ideally by creating an abstract parent that handles the general
algorithm, with subclasses that worry about the resource-specific
stuff (file/db/url).
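A rough sketch of that abstract-parent shape (names are illustrative,
not JndiLoader's real API): the general parse-and-bind algorithm lives
in the parent, and a FileLoader, DbLoader or UrlLoader would only have
to supply the resource-specific read.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: the general algorithm is fixed in the parent; subclasses only
// implement how to read their resource type (file, database, or URL).
abstract class AbstractLoaderSketch {
    // General algorithm: read raw "key=value" lines, then bind each one.
    public final Map<String, String> load(String source) {
        Map<String, String> bound = new HashMap<String, String>();
        for (String line : readSource(source)) {
            String[] kv = line.split("=", 2);
            bound.put(kv[0].trim(), kv[1].trim());
        }
        return bound;
    }
    // Resource-specific step a file/db/url subclass would implement.
    protected abstract List<String> readSource(String source);
}

// Stand-in subclass that "reads" from an in-memory string, where a real
// FileLoader would read lines from disk.
class InMemoryLoader extends AbstractLoaderSketch {
    protected List<String> readSource(String source) {
        return Arrays.asList(source.split(";"));
    }
}
```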
Alternatively - should Scraping-Engine be looking to move to Spring or
HiveMind and away from Oscube? Or, as Oscube is really only using
Simple-JNDI through gj-config, we could move to using Commons
Configuration, which has many more options. I've been using it a bit
recently and it seems to happily replace gj-config [no great shock,
commons things tend to replace the gj things eventually].
Hen
I also switched over to support HttpClient 3.0.1, specifically in how
timeouts are set.
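For reference, in HttpClient 3.x the timeouts go through the
connection-manager parameters; something like the following (the
millisecond values are illustrative, not what the patch actually uses):

```java
import org.apache.commons.httpclient.HttpClient;

public class TimeoutSetup {
    // Sketch of HttpClient 3.0.1-style timeout configuration.
    public static HttpClient newClient() {
        HttpClient client = new HttpClient();
        // Time allowed to establish the TCP connection (ms).
        client.getHttpConnectionManager().getParams().setConnectionTimeout(5000);
        // Time allowed waiting for data once connected (ms).
        client.getHttpConnectionManager().getParams().setSoTimeout(10000);
        return client;
    }
}
```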
I know Henri likes to see patches, but this is enough of a refactor that
I thought that sending the whole source was warranted.
Alan B. Canon
Senior Java Developer
Genscape, Inc.
(502) 292-5334
-----Original Message-----
From: osj...@googlegroups.com [mailto:osj...@googlegroups.com] On Behalf
Of Henri Yandell
Sent: Monday, December 04, 2006 1:50 AM
To: osj...@googlegroups.com
Subject: Re: hello
The use case is the following: I scrape a page that has a list of
reports. I parse the reports and store them using a JdbcStore subclass
into a database table of available reports. If the report has been
retrieved before, a uniqueness constraint results in a SQLException. If
the report has not been retrieved before, there is no SQLException, and
then I want to trigger a whole new fetch-parse-store cycle with the URL
of the new report.
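One wrinkle in that flow: the store needs to tell "already retrieved"
apart from genuine failures. Integrity-constraint violations carry a
standard SQLState in class "23", so a check along these lines (a sketch,
not anything in the scraping engine today) would let it only trigger the
new cycle for genuinely fresh reports:

```java
import java.sql.SQLException;

class DuplicateCheck {
    // SQLState class "23xxx" = integrity constraint violation in the SQL
    // standard, which is what a uniqueness violation reports.
    static boolean isUniquenessViolation(SQLException e) {
        return e.getSQLState() != null && e.getSQLState().startsWith("23");
    }
}
```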
I am thinking about something like the "at" job scheduling command in
UNIX: is there a way to provide the engine, on the fly, with a new
configuration that it's expected to run immediately?
Done.
I've been pondering oscube, the framework (for want of a better word)
that the scraping engine sits on top of. I've been learning more about
Quartz, and it offers more in the way of framework than I realised: a
lot of oscube could vanish and be replaced by what Quartz already has.
Spring is another option. Or... Spring + Quartz.
> 3. Right now I'm thinking about the scrape-for-urls-then-scrape-the-urls
> functionality that's now supported with Page.fetch(), which can only
> easily be called before the Store takes control. But what if I want to
> scrape a list of reports, go all the way to the Store to put them into a
> database, then scrape only the indicated resources that aren't already
> represented in the database? I think it would be nice for the Store to
> be able to get a handle on the engine, and simple-schedule the secondary
> scrapes on the fly. Possible?
I played a fair bit with a CheckScraper concept (see the checking/
subpackage). I had a scraper that needed to look in the database to see
if it was in there before scraping. It never felt right. I don't know
how much of that overlaps with your use case.
Hen
Quartz has this - but I suspect that oscube has hidden that
functionality away, so currently there's nothing. If we dumped oscube and
went to a Spring/Quartz approach it might still feel pretty natural and
yet also give us access to all the power.
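From memory of the Quartz 1.x API (so treat this as a sketch, not
tested against oscube): a SimpleTrigger constructed with just a name
and group fires once, immediately, which is the "at"-style submission
Alan asked about. The job class here is hypothetical.

```java
import org.quartz.Job;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.SimpleTrigger;

// Hypothetical job that would run one fetch-parse-store cycle.
public class FetchReportJob implements Job {
    public void execute(JobExecutionContext ctx) {
        String url = ctx.getJobDetail().getJobDataMap().getString("url");
        // ... fetch, parse and store the report at 'url' ...
    }

    // A SimpleTrigger with only a name/group fires once, right away.
    public static void scheduleNow(Scheduler sched, String url)
            throws Exception {
        JobDetail job = new JobDetail("fetch-" + url,
                Scheduler.DEFAULT_GROUP, FetchReportJob.class);
        job.getJobDataMap().put("url", url);
        sched.scheduleJob(job,
                new SimpleTrigger("now-" + url, Scheduler.DEFAULT_GROUP));
    }
}
```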
Hen