I tried to get some work done on this reference database over the
break, but instead was distracted by things like writing about why
software should be free
(http://wikiblog.jugglethis.net/2012/01/Techonomics) and developing a
SQLite server (https://github.com/Smattr/alfred). The reference
database is back on my agenda, but it's unlikely to make its way to
the front of my queue until February.
For this month's exercise I wanted to look at something a bit less
technically challenging and a bit more practical than usual. There are
some websites I read that rarely change and don't offer an RSS feed.
Visiting these sites every day is an irritating task. There are also
certain sites whose changes I need to know about immediately.
To manage these tasks currently I use some cron jobs that rely on a
script (https://github.com/Smattr/mattutils/blob/master/has-changed.sh).
The script is very simple and has several weaknesses. Notably, it's
useless for checking for changes to websites that contain data like
the current time, which changes each time you access the page.
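To make that weakness concrete, here is a minimal Python sketch of the
hash-the-whole-page approach (my own illustration, not the actual logic
of has-changed.sh): any volatile fragment, such as an embedded
timestamp, flips the digest on every fetch.

```python
import hashlib

def digest(content: bytes) -> str:
    """Hash the raw page bytes; any difference at all changes the digest."""
    return hashlib.md5(content).hexdigest()

# Two fetches of a page that embeds the current time differ only in the
# timestamp, yet produce entirely different digests, so the page looks
# "changed" on every single check:
snapshot_1 = b"<html><body>Article text. Rendered at 10:30:01</body></html>"
snapshot_2 = b"<html><body>Article text. Rendered at 10:30:02</body></html>"
```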
So I was hoping we could have a bit of a brainstorm about better ways
to do this. Implementations are welcome, but I was mostly hoping for
some innovative ideas about solving this problem.
> - if the server goes down for maintenance, you'll probably get a
> notification when it goes down and when it comes back up... so you might
> want to add in some extra code to compare against special versions of the
> site to ignore (i.e. temporary outage / maintenance pages)
Actually this is a case I do not want to ignore. I use this script in
some cases to monitor sites that I need to fix when they go down.
On 17 January 2012 13:30, Patrick Coleman <padst...@gmail.com> wrote:
> Nice question, a very tough problem to solve :)
> I worked on something like this aaaages ago as part of an RSS tracking
> site; not wanting to spend much time on it, I went with 'changed' =
> length has changed and md5 has changed - obviously flawed, but it
> worked for the sites I was following.
Why do the MD5 as well? Under what circumstances would the length
change, but the MD5 not?
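For reference, a length-plus-hash fingerprint along those lines might be
sketched as follows (names are mine). Comparing the pair flags a change
when either component differs, which sidesteps the question above; the
length then only earns its keep as a cheap first-pass check before
hashing.

```python
import hashlib

def fingerprint(content: bytes) -> tuple[int, str]:
    """Pair the content length with its MD5 digest."""
    # Length is a cheap first-pass check; the hash catches edits that
    # happen to preserve the length.
    return (len(content), hashlib.md5(content).hexdigest())

def changed(old: bytes, new: bytes) -> bool:
    """Report a change if either the length or the digest differs."""
    return fingerprint(old) != fingerprint(new)
```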
> If I did it again, it'd really depend on what sites are being tracked.
> Assuming stuff like news sites, and that the page can be split into areas
> like heading / ads / widgets / content,
> I'd probably use a headless browser to render the same a number of times
> (DOM + image) and diff to detect
> which parts change each time (e.g. like ads, even when the content is the
> same) then ignore these sections.
> This'd leave the parts that should be static (e.g. header + article) and
> then I'd apply the same process semi-regularly and diff only those parts.
That's a great approach I hadn't thought of! It's definitely something
I'll consider adding to my script.
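A rough sketch of that calibrate-then-diff idea, assuming a naive
line-by-line alignment (a real implementation would want DOM-aware
diffing as Patrick describes, and all names here are mine): fetch the
page several times up front, record which lines never vary, and on later
checks compare only those stable lines.

```python
def stable_mask(snapshots: list[list[str]]) -> list[int]:
    """Indices of lines that were identical across all calibration fetches."""
    first = snapshots[0]
    return [i for i, line in enumerate(first)
            if all(i < len(s) and s[i] == line for s in snapshots[1:])]

def content_changed(baseline: list[str], current: list[str],
                    mask: list[int]) -> bool:
    """Diff only the lines that were stable during calibration,
    ignoring regions (ads, timestamps) that vary on every fetch."""
    return any(i >= len(current) or baseline[i] != current[i] for i in mask)
```

For example, if the ad slot on line 1 varies between calibration fetches,
it drops out of the mask, and later changes to it are ignored while
changes to the headline or article body are still caught.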
While we're on the topic, what is Google's policy on automated
connections? Like most people, I use "ping google.com" as a synonym for
"is my internet working?". I also sometimes find I need to fake my user
agent when fetching pages with wget. I assume they don't care because I
don't do it often or for malicious purposes, but is some Google lawyer
going to be knocking on my door someday?