[exercise] Detecting web page changes

Matthew Fernandez

Jan 14, 2012, 7:14:51 PM
to code-p...@googlegroups.com
Thanks to the festive season, I missed this month's tools
discussion. Instead of bumping the automated testing discussion to
next month, I thought I'd discuss something a bit more interesting:
roaming home directories. Like most of us, I have multiple machines
I work on, and I like to have the same files available everywhere.
Next month I'll go into detail about a homebrew solution I use for
keeping everything versioned and in sync.

I tried to get some work done on this reference database over the
break, but instead was distracted by things like writing about why
software should be free
(http://wikiblog.jugglethis.net/2012/01/Techonomics) and developing a
SQLite server (https://github.com/Smattr/alfred). The reference
database is back on my agenda, but it's unlikely to make its way to
the front of my queue until February.

For this month's exercise I wanted to look at something a bit less
technically challenging and a bit more practical than usual. There
are some websites I read that rarely change and don't offer an RSS
feed, and visiting them every day is an irritating task. There are
also certain sites whose changes I need to know about immediately.
To manage these tasks I currently use some cron jobs that rely on a
script (https://github.com/Smattr/mattutils/blob/master/has-changed.sh).
The script is very simple and has several weaknesses. Notably, it's
useless for checking websites that embed data like the current time,
which changes each time you access the page.
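
The check amounts to fetching the page and comparing it byte-for-byte
against the previous copy. A minimal Python sketch of that idea (not
the actual has-changed.sh; the URL and cache path are placeholders):

# Minimal sketch of a plain fetch-and-compare check; this is not
# has-changed.sh itself, and the URL and cache path are placeholders.
import urllib.request

URL = "http://example.com/page-to-watch"
CACHE = "/tmp/page-to-watch.cache"

def has_changed(url=URL, cache=CACHE):
    new = urllib.request.urlopen(url).read()
    try:
        with open(cache, "rb") as f:
            old = f.read()
    except FileNotFoundError:
        old = None
    with open(cache, "wb") as f:
        f.write(new)
    # A straight byte-for-byte comparison: any difference at all, including
    # an embedded timestamp, counts as a change.
    return old is not None and old != new

if __name__ == "__main__":
    print("changed" if has_changed() else "unchanged")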

So I was hoping we could have a bit of a brainstorm about better ways
to do this. Implementations are welcome, but I was mostly hoping for
some innovative ideas about solving this problem.

Andrew Hill

Jan 16, 2012, 8:13:51 PM
to code-p...@googlegroups.com
I'll go for creative... I don't expect this to be super reliable or optimal (better than a straight diff, but it will likely have false positives and false negatives), but it also shouldn't be too hard to implement or maintain.

The general idea is a machine learning approach, but I'm going to suggest the very simplest version to implement (using basic unix command-line tools). The idea is that you don't want to have to specify which fields or content actually change, because maintaining rules for each site would suck. At most you want a threshold for "changiness" per site (really only to be used for sites that give too many false positives/negatives). But there may be a bunch of rules that add up to this "changiness" metric.

(I'm not going to spell out a full implementation, but this algorithm could probably just be called from line 30 of your current script, replacing the straightforward diff.)

One very simple approach copies a spam filter implementation I learnt about at some point, which uses gzip (or any compression algorithm, really) to look for similarities between data sets (new emails vs known good/bad sample sets).
The process is quite simple.
1. Start with sample sets of known spam or non-spam (or for your problem, with the previous version of the page).
2. Gzip just the samples to get baseline compression ratios.
3. Cat the new data (new version of the page) onto the old one, then gzip the result to get the combined compression ratio.
4. Compare the baseline vs combined compression ratio:
  - if the ratio went down (combined compressed better) then they're basically the same data
  - if the ratio went up (combined compressed worse) then they're very different
If it stayed about the same (or went up/down a small amount) then you'll probably need to find a nice threshold.
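
In Python it might look something like this (zlib standing in for gzip; the file names and the 0.9 threshold are placeholders you'd tune per site):

# Rough sketch of steps 1-4 above. zlib stands in for gzip; the 0.9
# threshold is a made-up starting point for the per-site "changiness" knob.
import zlib

def ratio(data):
    # compressed size / uncompressed size
    return len(zlib.compress(data, 9)) / max(len(data), 1)

def looks_changed(old, new, threshold=0.9):
    baseline = ratio(old)        # step 2: compress just the old sample
    combined = ratio(old + new)  # step 3: cat new onto old, then compress
    # step 4: if the new page mostly repeats the old one, the combined blob
    # compresses far better (roughly half the baseline ratio); genuinely new
    # content keeps the combined ratio close to the baseline.
    return combined > baseline * threshold

with open("page.old", "rb") as f:
    old = f.read()
with open("page.new", "rb") as f:
    new = f.read()
print("changed" if looks_changed(old, new) else "unchanged")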

For something more advanced:
* Add in some other metrics besides compression ratio, e.g.
  - ( combined compressed size – baseline compressed size ) / ( size of new data alone ) = a slightly different compression ratio
  - run a binary diff between old and new and compare the ratio of the diff size to the original (old or new) size (though this will have more false positives for template/structural changes)
  - use multiple compression algorithms
  - run through some kind of text extractor first ('strings' is probably a bit simple, you probably want to do what the 'reader' apps do and search for the largest block of text in the page... not sure if there's a nice python/ruby/shell module for this...), and do a straight diff or compression comparison on that
  - take the binary diff, then compare that against previous sample data sets of known "interesting" and "non-interesting" diffs in the same way you compare the whole page...

* Combine metrics by... weighted addition / maximum / voting (N metrics exceeding threshold) / mean, perhaps rejecting outliers (see the sketch after this list)

* Add in some learning (manual or automatic...)
  - record when the sites actually change (you have to do this manually, so cache all checks of the site to let you look for the definite changes)
  - log all rule/metric outputs for all sites each time you check them for changes
  - manually look at the thresholds for each rule to see whether they actually correlate well with change events
    - this could just be plotting the metrics and marking change points, and seeing if there's an acceptable threshold between 'change' and 'non-change'
  - or, for the super keen, implement some machine learning algorithm to learn the appropriate metric weighting for each rule, and/or the overall combined metric threshold for each site, based on your training data (of whether it actually changed/didn't each time)
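
To make the "combine metrics" and "find a threshold" ideas concrete, here's a hypothetical sketch; the weights and numbers are invented, and difflib stands in for a real binary diff:

# Hypothetical sketch of combining two of the metrics above by weighted
# addition, plus picking a per-site threshold from logged observations.
# The weights and sample numbers are invented; difflib stands in for a
# proper binary diff.
import difflib
import zlib

def delta_ratio(old, new):
    # (combined compressed size - baseline compressed size) / size of new data
    baseline = len(zlib.compress(old, 9))
    combined = len(zlib.compress(old + new, 9))
    return (combined - baseline) / max(len(new), 1)

def diff_ratio(old, new):
    # fraction of the page that differs, according to a generic sequence diff
    return 1.0 - difflib.SequenceMatcher(None, old, new).ratio()

def changiness(old, new):
    # weighted addition; maximum, voting or a mean would slot in here instead
    return 0.6 * delta_ratio(old, new) + 0.4 * diff_ratio(old, new)

def best_threshold(log):
    # log holds (changiness score, did-the-site-really-change) pairs recorded
    # over time; keep the cut-off that misclassifies the fewest of them.
    def errors(t):
        return sum((score > t) != changed for score, changed in log)
    return min((score for score, _ in log), key=errors)

# e.g. best_threshold([(0.05, False), (0.07, False), (0.40, True), (0.55, True)])
# returns 0.07, so flag a change when changiness() exceeds about 0.07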

Caveats / Limitations:
- if the files are quite small, you'll probably get quite a bit of noise from the compression algorithm's overheads. Some algorithms will probably be better/worse in this limiting case
- if subsequent posts are quite similar in content, even though technically very different posts, or if there's just a small "update" line added to a post, it's still going to compare quite well... that's where tuning gets harder, or where you need the "compare the diffs" rule/metric as well
- if the server goes down for maintenance, you'll probably get a notification when it goes down and when it comes back up... so you might want to add in some extra code to compare against special versions of the site to ignore (i.e. temporary outage / maintenance pages)

I think you'll get about 80% of the way there just by changing your straight-up diff to a really basic gzip compression-ratio comparison.

cheers
Frog

Patrick Coleman

Jan 16, 2012, 9:30:00 PM
to code-p...@googlegroups.com
Nice question, a very tough problem to solve :)
I worked on something like this aaaages ago as part of an RSS tracking site; not wanting to spend much time on it,
I went with 'changed' = length has changed and md5 has changed - obviously flawed, but it worked for the sites I was following.
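
In effect it was something like this (a sketch, not the original code):

# Sketch of the "length changed and md5 changed" test; not the original
# tracker's code.
import hashlib

def changed(old, new):
    return (len(old) != len(new) and
            hashlib.md5(old).hexdigest() != hashlib.md5(new).hexdigest())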

If I did it again, it'd really depend on what sites are being tracked.
Assuming stuff like news sites, and that the page can be split into areas like heading / ads / widgets / content,
I'd probably use a headless browser to render the same page a number of times (DOM + image) and diff to detect
which parts change on each render (e.g. ads, which change even when the content is the same), then ignore those sections.
This'd leave the parts that should be static (e.g. header + article), and then I'd apply the same process semi-regularly and diff only those parts.
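
A much-simplified, text-only sketch of that idea (no headless browser or image diff here; it just probes the
page a few times, treats lines that vary between probes as volatile, and compares only the stable lines against
a saved snapshot; the URL, probe count and delay are placeholders):

# Much-simplified, text-only version of the idea: probe the page a few
# times, treat lines that vary between probes (ads, counters, timestamps)
# as volatile, and compare only the stable lines against a saved snapshot.
# No headless browser or image diff; the URL, probe count and delay are
# placeholders.
import time
import urllib.request

URL = "http://example.com/article"

def fetch_lines(url):
    page = urllib.request.urlopen(url).read().decode("utf-8", "replace")
    return set(page.splitlines())

def stable_lines(url, probes=3, delay=10.0):
    stable = fetch_lines(url)
    for _ in range(probes - 1):
        time.sleep(delay)
        stable &= fetch_lines(url)  # keep only lines present in every probe
    return stable

# First run: save stable_lines(URL) somewhere. On later runs, report a
# change when the stable portion no longer matches the saved snapshot.
def content_changed(url, snapshot):
    return stable_lines(url) != snapshot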

Random bit of trivia, but the Google Eng boss here in Sydney (Alan Noble) also started out his
career writing code that did this: http://en.wikipedia.org/wiki/NetMind

Matthew Fernandez

Jan 17, 2012, 5:51:47 PM
to code-p...@googlegroups.com
On 17 January 2012 12:13, Andrew Hill <and...@thefrog.net> wrote:
...

> One very simple approach copies a spam filter implementation I learnt about
> at some point, which uses gzip (or any compression algorithm, really) to
> look for similarities between data sets (new emails vs known good/bad sample
> sets).
I quite like this approach, but it seems like there will be a lot of
false positives and false negatives.

> - if the server goes down for maintenance, you'll probably get a
> notification when it goes down and when it comes back up... so you might
> want to add in some extra code to compare against special versions of the
> site to ignore (i.e. temporary outage / maintenance pages)

Actually, this is a case I do not want to ignore. In some cases I use
this script to monitor sites that I need to fix when they go down.

On 17 January 2012 13:30, Patrick Coleman <padst...@gmail.com> wrote:
> Nice question, a very tough problem to solve :)
> I worked on something like this aaaages as part of a RSS tracking site; not
> wanting to spend much time on it,
> I went with 'changed' = length has changed and md5 has changed - obviously
> flawed, but worked for the sites I was following.

Why do the MD5 as well? Under what circumstances would the length
change, but the MD5 not?

> If I did it again, it'd really depend on what sites are being tracked.
> Assuming stuff like news sites, and that the page can be split into areas
> like heading / ads / widgets / content,
> I'd probably use a headless browser to render the same a number of times
> (DOM + image) and diff to detect
> which parts change each time (e.g. like ads, even when the content is the
> same) then ignore these sections.
> This'd leave the parts that should be static (e.g. header + article) and
> then I'd apply the same process semi-regularly and diff only those parts.

This is a great approach that I hadn't thought of, and definitely
something I'll consider adding to my script.

While we're on the topic, what is Google's policy on automated
connections? Like most people, I use "ping google.com" as a synonym
for "is my internet working?". I also sometimes find I need to fake
my user agent when fetching pages with wget. I assume they don't
care because I don't do it often or for malicious purposes, but is
some Google lawyer going to come knocking on my door someday?
