hello


John Gant

Nov 30, 2006, 5:41:11 PM
to osjava
When I talked to Hen recently, he seemed surprised that I still use
ScrapingEngine. I do, and I'd like to contribute to its future
development. If you'd like me to contribute, please reply with
instructions or start a discussion.

Henri Yandell

Dec 4, 2006, 1:50:22 AM
to osj...@googlegroups.com
I know Alan Canon was also emailing with thoughts on improvements.
Maybe you'd like to repeat them on here, Alan, now that we have a
working list etc?

My main aims with my osjava time are to get the SVN, JIRA, site and
dist rehosted by the end of Jan.

Either to Robert's machine if he has time, or possibly looking at a
Google or SourceForge project - but neither of those seems very geared
towards a community rather than a single project. I did hear that
SourceForge does allow SVN imports of dumps, so that would be a plus.

Hen

Alan Canon

Dec 5, 2006, 10:20:09 AM
to osj...@googlegroups.com, David Petersheim
Henri asked about my experiences with and ideas regarding osjava
improvements.

1. My first suggestion: now that dbutils-1.1 is out, can we change the
dependency to dbutils-1.1? That was my motivation for pushing for the
dbutils-1.1 release.

2. I have a number of patches to AbstractHttpFetcher. The basic thing
I'd like to see is factoring out some of the details of the main routine
into protected methods so that they can be overridden...and with that in
place, I have some suggested revisions for those methods. I'll try to
comment on them soon, when I get time. It's been a while since I wrote
this code.
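
Roughly the shape I'm after (a sketch only - the method names here are
illustrative, not the actual scraping-engine API):

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public abstract class AbstractHttpFetcherSketch {

    // The former monolithic fetch routine becomes a skeleton that
    // delegates each step to an overridable method.
    public final void fetch(String url) throws IOException {
        HttpURLConnection conn = openConnection(url);
        if (acceptContentType(conn.getContentType())) {
            handleResponse(conn);
        }
    }

    // Subclasses can override any individual step.
    protected HttpURLConnection openConnection(String url) throws IOException {
        return (HttpURLConnection) new URL(url).openConnection();
    }

    protected boolean acceptContentType(String contentType) {
        return contentType != null && contentType.startsWith("text/");
    }

    protected abstract void handleResponse(HttpURLConnection conn) throws IOException;
}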

3. Right now I'm thinking about the scrape-for-urls-then-scrape-the-urls
functionality that's now supported with Page.fetch(), which can only
easily be called before the Store takes control. But what if I want to
scrape a list of reports, go all the way to the Store to put them into a
database, then scrape only the indicated resources that aren't already
represented in the database? I think it would be nice for the Store to
be able to get a handle on the engine, and simply schedule the secondary
scrapes on the fly. Possible?

Alan

Alan Canon

Dec 5, 2006, 4:00:06 PM
to osj...@googlegroups.com
Another thing I'd like to be able to do is have the scraper load its
configuration from a URL instead of from a directory of property files.
That way I could have a web application that produces the required
configuration on the fly (from a database of scraper configurations).
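
Something as simple as java.util.Properties over a URL would be enough
for what I mean; this is just a sketch (the URL and class name are made
up):

import java.io.InputStream;
import java.net.URL;
import java.util.Properties;

public class UrlConfigLoader {

    // Same .properties format as the directory-based config, but fetched
    // from a URL so a web app can generate it on the fly.
    public static Properties load(String configUrl) throws Exception {
        Properties props = new Properties();
        InputStream in = new URL(configUrl).openStream();
        try {
            props.load(in);
        } finally {
            in.close();
        }
        return props;
    }

    public static void main(String[] args) throws Exception {
        Properties config = load("http://example.com/scrapers/report-scraper.properties");
        System.out.println(config);
    }
}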

Alan B. Canon
Senior Java Developer
Genscape, Inc.
(502) 292-5334

Henri Yandell

Dec 5, 2006, 6:51:06 PM
to osj...@googlegroups.com
Simple-JNDI used to support URLs, but that got dropped in the rewrite
because it now preloads rather than reacting to user input. So it went
from not really having to know how to parse the server, to having to
parse the server up front. It's not that hard to write such a thing,
but it would be quite server-specific.

I think at the moment it would mean writing a replacement for
http://svn.osjava.org/cgi-bin/viewcvs.cgi/trunk/simple-jndi/src/java/org/osjava/sj/loader/JndiLoader.java?rev=2221&view=markup

Ideally by creating an abstract parent that handles the general
algorithm, with subclasses that worry about the resource-specific stuff
(file/db/url).
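
A quick sketch of what I mean (class and method names are just
illustrative, not the real JndiLoader API):

import java.util.Iterator;
import java.util.Map;
import java.util.Properties;

public abstract class AbstractLoaderSketch {

    // The general algorithm lives in the parent: read name/value pairs
    // from wherever the subclass finds them, then bind each one.
    public final void load(Map context) throws Exception {
        Properties props = readProperties();
        for (Iterator it = props.keySet().iterator(); it.hasNext();) {
            String key = (String) it.next();
            bind(context, key, props.getProperty(key));
        }
    }

    // Resource-specific subclasses (file, database, URL) implement only this.
    protected abstract Properties readProperties() throws Exception;

    protected void bind(Map context, String name, String value) {
        context.put(name, value);
    }
}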

Alternatively - should Scraping-Engine be looking to move to Spring or
HiveMind and away from Oscube? Or, as Oscube is really only using
Simple-JNDI through gj-config, we could move to using Commons
Configuration which has many more options. I've been using it a bit
recently and it seems to happily replace gj-config [no great shock,
commons things tend to replace the gj things eventually].
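
For example, with Commons Configuration 1.x it's roughly this (property
names made up for illustration):

import java.net.URL;
import org.apache.commons.configuration.Configuration;
import org.apache.commons.configuration.PropertiesConfiguration;

public class CommonsConfigExample {
    public static void main(String[] args) throws Exception {
        // Load from a file...
        Configuration config = new PropertiesConfiguration("scraper.properties");

        // ...or straight from a URL, which would also cover the config-over-HTTP idea.
        Configuration remote = new PropertiesConfiguration(
                new URL("http://example.com/scraper.properties"));

        String startPage = config.getString("page.url");
        int timeout = config.getInt("fetch.timeout", 30000);
        System.out.println(startPage + " " + timeout + " " + remote.getString("page.url"));
    }
}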

Hen

Alan Canon

Dec 11, 2006, 2:15:13 PM
to osj...@googlegroups.com
Here's the biggest change I've made to scraping-engine, a patched
version of AbstractHttpFetcher. It factors some of the features of the
old one into separate methods so that these can be overridden. It adds
support for caching the HTTP status code, status text, and all of the
HTTP response headers on the session, so that these can be stored if
desired. For POST requests, it adds support for excluding certain HTTP
request parameters from the rewrite that ordinarily subtracts them from
the URL and adds them to the method body - for situations where one
must post to a process that expects some parameters on the URL and
others in the request body. It allows a configurable set of content
types to be parsed: for example, the older version would not accept RTF
(Rich Text Format) documents.

I also switched over to support HttpClient 3.0.1, specifically in how
timeouts are set.
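
For reference, the 3.0.1 timeout API looks roughly like this (URL and
values are just examples, this isn't lifted from my patch):

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;

public class TimeoutExample {
    public static void main(String[] args) throws Exception {
        HttpClient client = new HttpClient();
        // In HttpClient 3.x the timeouts live on the connection manager's params.
        client.getHttpConnectionManager().getParams().setConnectionTimeout(30000);
        client.getHttpConnectionManager().getParams().setSoTimeout(30000);

        GetMethod get = new GetMethod("http://example.com/report.html");
        try {
            int status = client.executeMethod(get);
            System.out.println(status + ": " + get.getResponseBodyAsString().length() + " chars");
        } finally {
            get.releaseConnection();
        }
    }
}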

I know Henri likes to see patches, but this is enough of a refactor
that I thought sending the whole source was warranted.

Alan B. Canon
Senior Java Developer
Genscape, Inc.
(502) 292-5334

AbstractHttpFetcher.java

Alan Canon

Dec 11, 2006, 2:24:50 PM
to osj...@googlegroups.com
I know we have page.fetch() but is it possible to get the osjava
framework to do the equivalent of an "at" schedule, to get a complete
new fetch-parse-store chain of scrapers going in response to conditions
that are only apparent at the "Store" phase?

The use case is the following: I scrape a page that has a list of
reports. I parse the reports and store them using a JdbcStore subclass
into a database table of available reports. If the report has been
retrieved before, a uniqueness constraint results in a SQLException. If
the report has not been retrieved before, there is no SQLException, and
then I want to trigger a whole new fetch-parse-store cycle with the URL
of the new report.

I am thinking about something like the "at" job scheduling command in
UNIX: is there a way to provide the engine, on the fly, with a new
configuration that it's expected to run immediately?
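
The detection half is straightforward plain JDBC; it's the "now go run
a new scrape" half that I don't have a hook for. A sketch of the
detection (table and column names invented):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class ReportStoreSketch {

    // Returns true if the report URL was newly inserted, i.e. not seen before.
    public static boolean storeIfNew(Connection conn, String reportUrl) throws SQLException {
        PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO available_reports (report_url) VALUES (?)");
        try {
            ps.setString(1, reportUrl);
            ps.executeUpdate();
            return true;   // new report: this is where I'd want to trigger a new cycle
        } catch (SQLException e) {
            // Uniqueness constraint violation: report already stored.
            return false;
        } finally {
            ps.close();
        }
    }
}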

flam...@gmail.com

Jan 22, 2007, 1:10:12 AM
to osjava

Alan Canon wrote:
> Henri asked about my experiences with and ideas regarding osjava
> improvements.
>
> 1. My first suggestion: now that dbutils-1.1 is out, can we change the
> dependency to dbutils-1.1? That was my motivation for pushing for the
> dbutils-1.1 release.

Done.

I've been pondering oscube, the framework (for want of a better word)
that scraping-engine sits on top of. I've been learning more about
Quartz, and it has more in the way of framework than I knew it had; a
lot of oscube could vanish and be replaced by what Quartz already has.

Spring is another option. Or... Spring + Quartz.
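
To give a feel for it, a scrape as a bare Quartz 1.x job would be
roughly this (ScrapeJob is made up, not anything in scraping-engine):

import org.quartz.CronTrigger;
import org.quartz.Job;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;
import org.quartz.Scheduler;
import org.quartz.impl.StdSchedulerFactory;

public class ScrapeJob implements Job {

    public void execute(JobExecutionContext ctx) throws JobExecutionException {
        // fetch -> parse -> store would go here
        System.out.println("scraping " + ctx.getJobDetail().getJobDataMap().getString("url"));
    }

    public static void main(String[] args) throws Exception {
        Scheduler sched = new StdSchedulerFactory().getScheduler();
        JobDetail job = new JobDetail("report-index", "scrapers", ScrapeJob.class);
        job.getJobDataMap().put("url", "http://example.com/reports.html");
        // Run on the hour, every hour.
        CronTrigger trigger = new CronTrigger("hourly", "scrapers", "0 0 * * * ?");
        sched.scheduleJob(job, trigger);
        sched.start();
    }
}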

> 3. Right now I'm thinking about the scrape-for-urls-then-scrape-the-urls
> functionality that's now supported with Page.fetch(), which can only
> easily be called before the Store takes control. But what if I want to
> scrape a list of reports, go all the way to the Store to put them into a
> database, then scrape only the indicated resources that aren't already
> represented in the database? I think it would be nice for the Store to
> be able to get a handle on the engine, and simply schedule the secondary
> scrapes on the fly. Possible?

I played a fair bit with a CheckScraper concept (see the checking/
subpackage). I had a scraper that needed to look in the database to see
if it was in there before scraping. It never felt right. I don't know
how much of that overlaps with your use case.

Hen

flam...@gmail.com

Jan 22, 2007, 1:12:11 AM
to osjava

Alan Canon wrote:
> I know we have page.fetch() but is it possible to get the osjava
> framework to do the equivalent of an "at" schedule, to get a complete
> new fetch-parse-store chain of scrapers going in response to conditions
> that are only apparent at the "Store" phase?
>
> The use case is the following: I scrape a page that has a list of
> reports. I parse the reports and store them using a JdbcStore subclass
> into a database table of available reports. If the report has been
> retrieved before, a uniqueness constraint results in a SQLException. If
> the report has not been retrieved before, there is no SQLException, and
> then I want to trigger a whole new fetch-parse-store cycle with the URL
> of the new report.
>
> I am thinking about something like the "at" job scheduling command in
> UNIX: is there a way to provide the engine, on the fly, with a new
> configuration that it's expected to run immediately?

Quartz has this - but I suspect that oscube has hidden that
functionality away, so currently there's nothing. If we dumped oscube and
went to a Spring/Quartz approach it might still feel pretty natural and
yet also give us access to all the power.
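
In Quartz terms the on-the-fly bit is just handing a running Scheduler
a new job with a trigger that fires immediately; a sketch (reusing the
made-up ScrapeJob from my other mail):

import java.util.Date;
import org.quartz.JobDetail;
import org.quartz.Scheduler;
import org.quartz.SimpleTrigger;

public class ScheduleNow {

    // Called when the Store discovers a report it hasn't seen before.
    public static void scheduleImmediateScrape(Scheduler sched, String reportUrl)
            throws Exception {
        JobDetail job = new JobDetail("scrape-" + reportUrl.hashCode(), "adhoc", ScrapeJob.class);
        job.getJobDataMap().put("url", reportUrl);
        // Fires once, as soon as the scheduler can run it.
        SimpleTrigger trigger = new SimpleTrigger("now-" + reportUrl.hashCode(), "adhoc", new Date());
        sched.scheduleJob(job, trigger);
    }
}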

Hen
