On 5/2/2013 5:06 PM, frou wrote:
> Nope, so thanks for the link. I'll probably just use that then. It could
> still be fun to try, but probably just as a toy that expects all feeds to
> be perfectly formed and served, because I suspect that coping with all the
> subtleties of feeds in the wild is the hard part.
It is. I've had to do that in a Python program.
If you want to poll an RSS feed periodically and get new
items, that often doesn't work the way you would expect.
The "ETag" value from the previous response is supposed to be passed
back on requests after the first (as an If-None-Match header), and if
the server returns a 304 status, nothing has changed. Some servers do
that right, some don't implement it at all, and on some servers it
works only some of the time.
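For reference, the conditional request described above looks roughly like this with just the standard library. This is a minimal sketch, not the code from my program: the function name fetch_if_changed and the injectable opener argument are made up for illustration. Note that urllib reports a 304 by raising HTTPError, so the "nothing changed" case is handled in the except branch.

```python
import urllib.error
import urllib.request


def fetch_if_changed(url, etag=None, opener=urllib.request.urlopen):
    """Fetch url, sending a saved ETag back as If-None-Match.

    Returns (body, etag). body is None when the server answers 304,
    meaning the feed has not changed since the ETag was issued.
    """
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    req = urllib.request.Request(url, headers=headers)
    try:
        with opener(req) as resp:
            # Remember the new ETag (if any) for the next poll.
            return resp.read(), resp.headers.get("ETag")
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Not modified: keep using the ETag we already have.
            return None, etag
        raise
```

On a server that ignores If-None-Match you simply get the full body back every time, so the caller still has to be prepared to see items it has already processed.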
The "some of the time" case comes up with sites that have
multiple RSS servers and a load balancer. The servers may
not be in sync with respect to etag values, guid values, or
pubDate.
Reuters is reasonably well behaved, although when
a story is revised, it gets a new guid even if the
synopsis on the RSS feed didn't change.
Twitter (yes, every Twitter feed has a matching RSS feed, although
Twitter doesn't publicize it much) is awful. Nothing short of
comparing the content will remove duplicates. I had to hash the
content fields and keep a map of previously seen items.
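The content-hash approach can be sketched like this. It is an illustration under assumptions, not my actual code: items are assumed to be already parsed into dicts, the field names "title" and "description" are a guess at which fields count as content, and item_key and new_items are hypothetical names. The guid and pubDate are deliberately left out of the hash, since those are the values that can't be trusted.

```python
import hashlib


def item_key(item, fields=("title", "description")):
    """Hash only the content fields of a feed item."""
    h = hashlib.sha256()
    for name in fields:
        # Collapse whitespace so trivial reformatting does not defeat the match.
        value = " ".join((item.get(name) or "").split())
        h.update(value.encode("utf-8"))
        h.update(b"\x00")  # separator, so field boundaries affect the hash
    return h.hexdigest()


def new_items(items, seen):
    """Return the items whose content hash is not yet in seen; record them."""
    fresh = []
    for item in items:
        key = item_key(item)
        if key not in seen:
            seen[key] = item
            fresh.append(item)
    return fresh
```

The seen map grows without bound as written; a long-running poller would want to expire old entries somehow.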
John Nagle