stripping away ads and such before check


gaute

Apr 16, 2009, 5:34:10 AM
to Specto
Hi.
Just discovered Specto while looking for alternatives to the Firefox
plugins "Update Scanner" and "SiteDelta", neither of which works
sensibly with the pages I am watching the most.

Example: http://www.finn.no/finn/bap/free/result?TYPE=2&SEGMENT=1

I've had a peek at the code, and noticed you trying to be clever with
page sizes, md5 sums and so on in
/usr/share/pyshared/spectlib/watch_web_static.py->update
However, I strongly suspect Specto too will give me lots of false
positives.

If I were to manipulate the page before checking for differences, say
with BeautifulSoup and soupselect, where would I best do that?
In update, in _writeContent?

Is there some other parsing engine included I could use, so as not to
introduce a new dependency?
One that supports css selectors or xpaths?
I guess regexps would be a bit exotic to expect most people to handle?
Or perhaps a simple "search only between start text and stop text"
would be better?
Harder usability-wise than technically, I guess.

What are your thoughts?

Regards Gaute
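
A minimal sketch of the "manipulate the page before checking" idea, assuming a hypothetical filter step that runs on the downloaded HTML before it is hashed. The names strip_noise and content_digest are made up for illustration and are not part of Specto, and the regular expression is only a crude stand-in for a real parser such as BeautifulSoup.

```python
import hashlib
import re


def strip_noise(html):
    """Hypothetical filter: drop <script> blocks and collapse whitespace."""
    html = re.sub(r"<script\b.*?</script>", "", html, flags=re.I | re.S)
    return re.sub(r"\s+", " ", html).strip()


def content_digest(html, page_filter=strip_noise):
    """Return the MD5 hex digest of the page after it has been filtered."""
    filtered = page_filter(html)
    return hashlib.md5(filtered.encode("utf-8")).hexdigest()
```

Two downloads of a page would then count as unchanged when content_digest returns the same value for both, no matter what happened inside the filtered-out parts.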


Wout Clymans

Apr 16, 2009, 5:46:27 AM
to spe...@googlegroups.com
Hello,

Manipulating the page is not supported, but you can set an "error margin" as a percentage, so that small changes such as ads are not reported as an update.
If you are using Specto 0.3 and have debugging enabled, you will see the change percentage in the error log window.
That way you can find a good "error margin" value for the ads.

Best regards,
Wout

Jeff Fortin

Apr 16, 2009, 11:08:49 AM
to spe...@googlegroups.com
...Though, technically, I'd be quite interested if there was a feasible way (if you want to implement it) to do that ad removal. But it may cause a lot of issues, I don't know; at least, as you mentioned, the problem of how the user is supposed to interact with this.

The best logical "workaround" I could find at the time (years ago) was to implement an "error margin" (which, ideally, should be possible to "auto-calibrate", see issue #36 in our issue tracker), supposing that "real content updates" would have a significantly higher percentage of change than random ads or timestamps...
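
To make the "error margin" idea concrete, here is a rough sketch, assuming the change percentage is derived from a similarity ratio between the old and new page text. How Specto actually computes its percentage is not shown in this thread, so change_percentage and is_real_update are hypothetical names and difflib is a swapped-in technique.

```python
import difflib


def change_percentage(old_text, new_text):
    """Return how much two page versions differ, as a percentage (0 to 100)."""
    matcher = difflib.SequenceMatcher(None, old_text, new_text, autojunk=False)
    return (1.0 - matcher.ratio()) * 100.0


def is_real_update(old_text, new_text, error_margin=5.0):
    """Report an update only when the change exceeds the error margin."""
    return change_percentage(old_text, new_text) > error_margin
```

With a margin of 5, a rotating ad that alters 1% of the page is ignored, but so is any real update that happens to be just as small, which is the weakness of this approach.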

Gaute Amundsen

Apr 16, 2009, 11:16:38 AM
to spe...@googlegroups.com, Wout Clymans
Please read again.
I am talking about ME ADDING THIS FEATURE, and supplying you with the patch if
you are interested.

Gaute

Gaute Amundsen

Apr 16, 2009, 11:45:47 AM
to spe...@googlegroups.com
On Thursday 16 April 2009 17:08:49 Jeff Fortin wrote:
> ...Though, technically, I'd be quite interested if there was a feasible
> way (if you want to implement it) to do that ad removal. But it may
> cause a lot of issues, I don't know; at least, as you mentioned, the
> problem of how the user is supposed to interact with this.

Ad removal is perhaps not exactly what I had in mind, come to think of it.
Rather, I am talking about selecting only particular parts of the page for
comparison. Extract the good bits, rather than remove the bad bits, you might
say.
But combining usability with power is the problem (as always, I guess :)
Personally I would prefer css selector or xpath since firebug provides those
very neatly, but most users would maybe not know what to do with those.
Having some feedback that you selected the right parts is an issue too.
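
As a rough illustration of "extract the good bits", the sketch below keeps only the text inside one element, picked by its id attribute. It uses the standard html.parser module of current Python 3, so no new dependency is needed, but it is a much cruder tool than CSS selectors or XPath. It assumes reasonably well-formed HTML (unclosed tags will throw off the depth counting), and IdExtractor and extract_text_by_id are hypothetical names, not Specto code.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class IdExtractor(HTMLParser):
    """Collect the text found inside the element with a given id."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.depth = 0  # nesting depth inside the target element; 0 means outside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.element_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_text_by_id(html, element_id):
    """Return the whitespace-normalized text of the element with this id."""
    parser = IdExtractor(element_id)
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())
```

Showing the returned text to the user would be one cheap form of feedback that the right part was selected.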

Then again, perhaps most users of Specto are more technically inclined
than the average...?

> The best logical "workaround" I could find at the time (years ago) was
> to implement an "error margin" (which, ideally, should be possible to
> "auto-calibrate", see issue #36 in our issue tracker), supposing that
> "real content updates" would have a significantly higher percentage of
> change than random ads or timestamps...

That's precisely my problem. The real changes are often smaller than the site's
"featured shit" section.

The simplest solution I can think of is the two fields:
skip everything before xx
skip everything after xx
perhaps with an option for xx to be a regexp.

That would work for me. Only question is whether it would work for enough
pages to be worth including for all users.
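
A small sketch of the two-field idea, assuming the markers themselves are excluded from the result and that a marker which is empty or not found leaves that side of the page unclipped. The function name clip_between and the use_regex switch are hypothetical, not an existing Specto option.

```python
import re


def clip_between(text, start, stop, use_regex=False):
    """Return the text after the first `start` match and before the next `stop` match."""
    start_pat = start if use_regex else re.escape(start)
    stop_pat = stop if use_regex else re.escape(stop)

    begin = 0
    if start:
        match = re.search(start_pat, text)
        if match:
            begin = match.end()  # skip everything up to and including the marker

    end = len(text)
    if stop:
        match = re.compile(stop_pat).search(text, begin)
        if match:
            end = match.start()  # skip the stop marker and everything after it

    return text[begin:end]
```

In plain-string mode the markers are escaped, so a user who knows nothing about regexps can type a literal snippet of the page source and never notice the difference.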

Perhaps I should just hack it. That would save me from installing bazaar :)

Gaute

Jeff Fortin

Apr 16, 2009, 12:15:20 PM
to spe...@googlegroups.com
I can't really imagine a way for the typical user to actually select part of a page (except showing a webkit browser window and asking the user to "select" (highlight) the relevant parts). But then I feel this may become a rocket science project :) and I think the typical Specto userbase is not necessarily a fan of regular expressions (even I can't understand them!)

Maybe an easier/more transparent solution would be to reuse the filterset.g lists of adblock plugins to take out parts of the page using BeautifulSoup before comparing, or something like that...

Wout Clymans

Apr 16, 2009, 12:24:04 PM
to spe...@googlegroups.com
There is a Windows application that can check web pages and also parts of them. I don't remember what it is called right now, but maybe you can search for it and see how they do it?

Google for "specto windows alternative" or something and maybe it will be in the list.

Wout Clymans

Apr 16, 2009, 12:29:18 PM
to spe...@googlegroups.com

Gaute Amundsen

Apr 16, 2009, 1:11:01 PM
to spe...@googlegroups.com, Jeff Fortin
On Thursday 16 April 2009 18:15:20 Jeff Fortin wrote:
> I can't really imagine a way for the typical user to actually select
> part of a page (except showing a webkit browser window and asking the
> user to "select" (highlight) the relevant parts). But then I feel this
> may become a rocket science project :)
I quite agree!

> and I think the typical specto
> userbase is not necessarily a fan of regular expressions (even I can't
> understand them!)

When talking about the "skip before" and "skip after" idea, I only mentioned
regexps because they are a sort of "unobtrusive powertool": you can _almost_
just pretend it's a simple string match. POSIX regexps even more so.


> Maybe an easier/more transparent solution would be to reuse filterset.g
> lists of adblock plugins to take out parts of the page using
> beautifulsoup before comparing, or something like that...

Uggh. me not like.
Depending on some arbitrary list of filters that may or may not get updated..
Avoiding that is the whole point of the "pick out the good parts" strategy.
And anyway, in my case, the often-changing crap is not ads, but the site's
self-promotion.

Perhaps Wout's idea to copy the "watch_web_static.py" is better.
Then one could have an "advanced web watch" for the brave :)
Why is it called "static", btw? Is there a "dynamic" in the works?
(I am using Intrepid's 0.2.2.)

Gaute

Wout Clymans

Apr 16, 2009, 2:15:55 PM
to spe...@googlegroups.com
I think it is called static because it is not able to log in to a web page, but I am not sure.

You should really use version 0.3rc1...there is a deb and a tar.gz available on our homepage.
The new version is almost a complete rewrite of the core and the plugins so be sure to make your changes to the new version!

Gaute Amundsen

Apr 17, 2009, 12:47:14 AM
to spe...@googlegroups.com, Wout Clymans
On Thursday 16 April 2009 20:15:55 Wout Clymans wrote:
> I think it is called static because it is not able to log in to a web page,
> but I am not sure.
>
Plausible

> You should really use version 0.3rc1...there is a deb and a tar.gz
> available on our homepage.
> The new version is almost a complete rewrite of the core and the plugins so
> be sure to make your changes to the new version!
>

Ah. good to know.

Thanks.

Gaute

Jeff Fortin

Apr 17, 2009, 11:27:53 AM
to spe...@googlegroups.com

> I think it is called static because it is not able to log in to a web page, but I am not sure.

Actually no, I think the initial reason I called it that years ago is that it was in opposition to, say, a syndication feed watch... or maybe to highlight its inherent limitation: that it's made to watch static pages rather than dynamic (ever-changing) pages, hence the error margin setting...


> You should really use version 0.3rc1...there is a deb and a tar.gz available on our homepage.

Well, technically, RC1 is old and buggy if you use Jaunty, and the Bazaar version would be better :) It's pretty much solid, and the main thing keeping me from turning it into a final release is that I'm busy with school at the moment.