feature request - RSS feed of new data items


Anna Powell-Smith

Aug 22, 2011, 8:48:11 AM
to scrap...@googlegroups.com, william perrin
Hello

It would be lovely if each scraper had an RSS feed of new data items -
I have two use cases for this where I'm using my own code to generate
RSS feeds from ScraperWiki datasets - licensing applications to
councils, where you want to see new applications as they appear, and
houses for sale, ditto.

thanks
Anna

Henare Degan

Aug 22, 2011, 7:08:53 PM
to scrap...@googlegroups.com
On Mon, Aug 22, 2011 at 22:48, Anna Powell-Smith
<annapow...@gmail.com> wrote:
> It would be lovely if each scraper had an RSS feed of new data items

Although it's not a core feature, you can get an RSS feed for any
scraper using this view[0] (I think you already know that but for
others on the list...).

Cheers,

Henare

[0] https://scraperwiki.com/views/rss_2/
--
Henare Degan

e » henare...@gmail.com
t » @henaredegan
w » www.henaredegan.com

Francis Irving

Aug 22, 2011, 7:34:31 PM
to scrap...@googlegroups.com
Anna - what would the RSS feed show by default? Just all the fields
presented in some arbitrary way?

Would this be better/worse than a very slick interface for making an
RSS view using a template (such as below) that you can then at least
edit and customise?

Francis

Anna Powell-Smith

Aug 23, 2011, 6:11:45 AM
to scrap...@googlegroups.com
The RSS view that Henare suggested is pretty much perfect in terms of
parameters etc - I don't find that I need to customize it at all.

But there are two problems with it:

1. It's too slow to render - e.g. the W3C validator times out before
it renders.
2. It's a view, so it has the "Powered by ScraperWiki" HTML pane, so
it isn't valid RSS.

One or other of these means that Google Reader can't cope with the RSS
it generates [0].

On reflection what I'd like is a ScraperWiki command that I can call
from within the scraper, that actually generates a static RSS file
when it runs.

Just like the RSS view, it would be ideal to specify title, link,
description, ordering etc.

Does that sound sensible?

[0] https://views.scraperwiki.com/run/rss_2/?scraper=communication_log&link=uri&date=date_submitted_c&title=behalf&table=contact&order=date_submitted_c

Francis Irving

Aug 23, 2011, 8:25:32 AM
to scrap...@googlegroups.com
On Tue, Aug 23, 2011 at 11:11:45AM +0100, Anna Powell-Smith wrote:
> The RSS view that Henare suggested is pretty much perfect in terms of
> parameters etc - I don't find that I need to customize it at all.

Great!

Can you give a URL of it in use with one of your scrapers, where you
like the output?

> But there are two problems with it:
>
> 1. It's too slow to render - e.g. the W3C validator times out before
> it renders.

:( Am just taking a look at this.

> 2. It's a view, so it has the "Powered by ScraperWiki" HTML pane, so
> it isn't valid RSS.

I've added a line to set the mimetype, which stops ScraperWiki
adding that pane. (In https://scraperwiki.com/views/rss_2/)

ScraperWiki.httpresponseheader('Content-Type', 'application/rss+xml')

I've added an FAQ about this too - not deployed yet.
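
For the curious, a view doing this looks roughly like the sketch below -
from memory, with made-up names (the attach/select details may differ,
and real code should XML-escape the values):

require 'time'

# the Content-Type header is what stops the HTML pane being added
ScraperWiki.httpresponseheader('Content-Type', 'application/rss+xml')

ScraperWiki.attach('some_scraper')  # hypothetical scraper to read from

puts '<?xml version="1.0" encoding="UTF-8"?>'
puts '<rss version="2.0"><channel><title>Scraper feed</title>' +
     '<link>https://scraperwiki.com/</link><description>New items</description>'
ScraperWiki.select("title, link, date from swdata order by date desc limit 20").each do |row|
  puts "<item><title>#{row['title']}</title><link>#{row['link']}</link>" +
       "<pubDate>#{Time.parse(row['date']).rfc2822}</pubDate></item>"
end
puts '</channel></rss>'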

> One or other of these means that Google Reader can't cope the RSS it
> generates [0].
>
> On reflection what I'd like is a ScraperWiki command that I can call
> from within the scraper, that actually generates a static RSS file
> when it runs.
>
> Just like the RSS view, it would be ideal to specify title, link,
> description, ordering etc.
>
> Does that sound sensible?

The trouble is that it needs somewhere to put the file, and to have it
served from.

Best bet is really for us to fix the underlying speed problem...

Francis

Alexander Harrowell

Aug 24, 2011, 11:12:13 AM
to scrap...@googlegroups.com


I'd just like to +1 the whole issue - RSS recent items is a fairly common use case for scraping public data.

Anna Powell-Smith

Aug 29, 2011, 5:30:43 PM
to scrap...@googlegroups.com
On 23 August 2011 13:25, Francis Irving <fra...@scraperwiki.com> wrote:
> On Tue, Aug 23, 2011 at 11:11:45AM +0100, Anna Powell-Smith wrote:
>> The RSS view that Henare suggested is pretty much perfect in terms of
>> parameters etc - I don't find that I need to customize it at all.
>
> Great!
>
> Can you give a URL of it in use with one of your scrapers, where you
> like the output?

http://scraperwikiviews.com/run/rss_2/?scraper=islington_business_licences&table=swdata&title=licence_type&link=url&desc=licence_for&date=date_scraped&order=date_scraped&limit=100

NB in the original scraper, I have to use API calls to save the
date_scraped field between runs. It'd be nice to fix this too, at
least if you have other users who want RSS feeds.

>> But there are two problems with it:
>>
>> 1. It's too slow to render - e.g. the W3C validator times out before
>> it renders.
>
> :( Am just taking a look at this.

Thanks, that would be good.

>> 2. It's a view, so it has the "Powered by ScraperWiki" HTML pane, so
>> it isn't valid RSS.
>
> I've added a line to set the mimetype, which stops ScraperWiki
> adding that pane. (In  https://scraperwiki.com/views/rss_2/)
>
>    ScraperWiki.httpresponseheader('Content-Type', 'application/rss+xml')
>
> I've added an FAQ about this too - not deployed yet.

Thanks!

Max Ogden

Aug 30, 2011, 10:46:46 PM
to scrap...@googlegroups.com, annapow...@gmail.com
a CouchDB-style _changes feed would also be nice, e.g. http://datacouch.com/db/dcb189ceda9bc479d4ce997b840a7007f2/_changes

or http://datacouch.com/db/dcb189ceda9bc479d4ce997b840a7007f2/_changes?feed=continuous&since=400 so you can do real-time JSON sync as opposed to having to write RSS scrapers
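
a consumer could then just poll it - sketch below, assuming the endpoint
returns CouchDB's usual {"results": [...], "last_seq": n} shape:

require 'net/http'
require 'json'

since = 0  # sequence number we've seen up to
url = "http://datacouch.com/db/dcb189ceda9bc479d4ce997b840a7007f2/_changes?since=#{since}"
feed = JSON.parse(Net::HTTP.get(URI(url)))
feed['results'].each { |change| puts change['id'] }  # ids of new/changed rows
since = feed['last_seq']  # carry over for the next poll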

:)

Francis Irving

Sep 9, 2011, 12:12:32 PM
to scrap...@googlegroups.com
Julian has now added RSS support to the External API.

It's much neater than I would have expected - you can construct an
arbitrary RSS feed using just SQL.

So for example this URL is a feed of Islington Business Licenses:

https://api.scraperwiki.com/api/1.0/datastore/sqlite?format=rss2&name=islington_business_licences&query=select%20licence_for%20as%20description%2C%20applicant%20as%20title%2C%20url%20as%20link%2C%20date_scraped%20as%20date%20from%20swdata%20order%20by%20date_scraped%20desc%20limit%2010

Short link: http://bit.ly/pinSSf

SQL query: select licence_for as description, applicant as title, url as link,
date_scraped as date from swdata order by date_scraped desc limit 10

Of course, you can use fancier queries to concatenate strings and so
on if you need to.
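
For instance, something like this (illustrative - SQLite concatenates
with the || operator) would build the title from two columns:

SQL query: select applicant || ' - ' || licence_for as title, url as link,
date_scraped as date from swdata order by date_scraped desc limit 10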

To make one, go to the Web API page for a scraper, and choose "rss2"
as the format. Some help appears to tell you what fields you need to
include.
https://scraperwiki.com/docs/api?name=islington_business_licences#sqlite

Anna, can you have a play with this and see if it meets your needs?

If it seems to be working well, I'll blog about it next week.

Francis

Anna Powell-Smith

Sep 9, 2011, 12:20:45 PM
to scrap...@googlegroups.com
Hooray! Being able to construct it in SQL should be really useful too.
I'll try it out and let you know.

Francis Irving

Sep 9, 2011, 4:52:59 PM
to scrap...@googlegroups.com
Will link to any example you make in the blog post!

Paul Bradshaw

Sep 10, 2011, 6:06:27 AM
to scrap...@googlegroups.com
This is very significant IMHO - will blog about it too.

Paul Bradshaw

http://twitter.com/paulbradshaw
http://onlinejournalismblog.com
http://helpmeinvestigate.com

Sent from my phone

Anna Powell-Smith

Sep 10, 2011, 7:26:17 PM
to scrap...@googlegroups.com, william perrin
It works!

At least, the Islington licensing applications RSS feed [1] showed two
new items this morning. Thank you!

The scraper code [2] still relies on a bit of a hack though. The
pubDate field needs to be set to the date we *first scraped* each
item, and so I need to save that between runs. To do that, I load all
the existing data via the API at the start of the run.

So if you blog about it, I'd mention that this is a necessary step. At
least, I think it's a necessary step. Julian may have suggestions for
improving it.

Also, if it is necessary, I wonder whether I'm doing it the safest way
- would it be better to make all the calls to scraperwiki.sqlite.save
at the very end of the scraper? I want to avoid ending up with
half-written data if the scraper crashes half-way through. I can't
remember if ScraperWiki actually executes calls to the database live,
or queues them all till the scraper finishes.

[1] http://bit.ly/pinSSf
[2] https://scraperwiki.com/scrapers/islington_business_licences/edit/

william perrin

Sep 11, 2011, 11:16:50 AM
to annapow...@gmail.com, scrap...@googlegroups.com
thanks people esp anna and julian

although i only understand the small words like 'and' and 'to' below, the
end result is really important for me as an activist who can't/won't code.

as you know from planning alerts etc, with these systems that pile up
applications from businesses the important thing is to know when something
has been changed. councils, largely because they don't see things from a
customer viewpoint, tend not to produce change alerts on their data sets
(there are exceptions). the officer's job is to put data in, not take it out.

when you need to keep tabs on several application processes it's really
useful to have these as an RSS feed, or to drop that into feedburner to set
up an alerts email

to see the sort of thing the kings cross community tackles using the underlying data here

or
or
one question - is this 'finished' and can i buy anyone a pint to thank them?

cheers


w

Anna Powell-Smith

Sep 11, 2011, 5:02:13 PM
to william perrin, scrap...@googlegroups.com
William - the RSS bit seems to be working well, but it depends on some
code in my scraper that's basically held together with baler twine -
so I would wait for Francis/Julian to confirm that bit is okay before
regarding it as finished :)

http://ifttt.com/wtf might be useful in this context - it can turn new
RSS items into pretty much any action you like, so you could create a
Twitter feed of new applications, for example.

pezholio

Sep 12, 2011, 2:07:39 AM
to ScraperWiki
This is great stuff. I'm having a few problems with my scraper,
however - I get the following error:

"Date conversion error: Required argument 'day' (pos 3) not found"

Any ideas? Here's my API call:

https://api.scraperwiki.com/api/1.0/datastore/sqlite?format=rss2&name=walsall_warwickshire_food_safety_inspections&query=select%20name%20||%20%22%2C%20%22%20||%20address1%20||%20%22%2C%20%22%20||%20address3%20as%20title%2C%20rating%20||%20%22%20stars%22%20as%20description%2C%20url%20as%20link%2C%20date%20as%20pubDate%20from%20swdata%20limit%2010
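
Decoded for readability, the query is:

SQL query: select name || ", " || address1 || ", " || address3 as title,
rating || " stars" as description, url as link, date as pubDate
from swdata limit 10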

Also, any plans to do GeoRSS too?

Cheers :)


Alexander Harrowell

Sep 12, 2011, 5:26:17 AM
to scrap...@googlegroups.com

On Monday 12 Sep 2011 07:07:39 pezholio wrote:

> This is great stuff. I'm having a few problems with my scraper,
> however - I get the following error:
>
> "Date conversion error: Required argument 'day' (pos 3) not found"

I'm guessing you need to use the strftime/strptime function to format the date and pass a format string. SQLite provides a surprisingly limited range of date/time transforms compared to [your favourite programming language's date-time lib]. This is something SW users in general should be aware of as it's annoying to have to re-scrape later if you stored dates in some way that SQLite doesn't like. FYI, valid RSS requires dates to be in RFC822 format.
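
E.g. a sketch (field names are whatever your scraper uses) that stores both a sortable date and an RFC 822 pubDate at save time:

require 'time'

now = Time.now
record = {
  'id'      => 'example-123',                     # your unique key
  'date'    => now.strftime('%Y-%m-%d %H:%M:%S'), # sorts correctly in SQL
  'pubDate' => now.rfc2822                        # RFC 822, what RSS wants
}
ScraperWiki.save_sqlite(['id'], record)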


As for GeoRSS, the only problem I can think of is that it requires a different DTD at the top. Otherwise you could presumably just select the lat and long fields and alias them as required.

--

The only thing worse than e-mail disclaimers...is people who send e-mail to lists complaining about them


Stuart Harrison

Sep 12, 2011, 5:40:51 AM
to scrap...@googlegroups.com
Ah, cool, gotcha. I assumed Scraperwiki did the relevant conversion at the scraper end. A simple fix though! :)

Nice one on the GeoRSS thing, will give it a try :)

Cheers

pezholio

Sep 12, 2011, 4:48:01 PM
to ScraperWiki
I've made the changes to the scraper now, but I still get the same
error. Seems like there's some caching going on?


Ross Jones

Sep 12, 2011, 4:58:35 PM
to scrap...@googlegroups.com
Hi,

I think the problem is that the scraperwiki api code expects an ISO 8601 date in the YYYY-MM-DD format. Looks like currently it is using something like Friday, 15th Jun instead of 2011-06-15. Is the data stored in your scraper in ISO 8601 format?

Ross

Stuart Harrison

Sep 12, 2011, 5:39:00 PM
to scrap...@googlegroups.com
Ah, that could be part of the problem, but I changed the date format to RFC-822 to work with the RSS feed, and the API seems to still be pulling in the old data, which is the raw date, as scraped from the source pages.

That said, I guess the RFC-822 format may cause problems when ordering by date, so the solution may be to have two date fields - one in one format and one in t'other. Would I be correct in that assumption?

--------------
Sent from my Mark II Colossus

Aidan Hobson Sayers

Sep 12, 2011, 6:04:28 PM
to scrap...@googlegroups.com
If you download the CSV file you will notice that you have two different
formats of date in the same column, changing at item 580. This is
presumably from before and after you changed your date formatting in
your scraper.
Did you clear the data of your scraper/rescrape all existing data before
trying to use the API again? If not, that's why you're observing the old
date format - because it's old data that hasn't been modified.

Aidan

Tobias Escher

Sep 13, 2011, 7:02:47 AM
to scrap...@googlegroups.com
On Sun, Sep 11, 2011 at 11:02 PM, Anna Powell-Smith
<annapow...@gmail.com> wrote:
>
> http://ifttt.com/wtf might be useful in this context - it can turn new
> RSS items into pretty much any action you like, so you could create a
> Twitter feed of new applications, for example.

What a great tool - and really simple to use! Thanks for sharing this!

best, tobias

pezholio

Sep 13, 2011, 7:13:14 AM
to ScraperWiki
Excellent, that seems to work now:

https://api.scraperwiki.com/api/1.0/datastore/sqlite?format=rss2&name=walsall_warwickshire_food_safety_inspections&query=select%20name%20||%20%22%2C%20%22%20||%20address1%20||%20%22%2C%20%22%20||%20address2%20as%20title%2C%20rating%20||%20%22%20stars%22%20as%20description%2C%20url%20as%20link%2C%20latlng_lng%20||%20%22%20%22%20||%20latlng_lat%20as%20%22georss%3Apoint%22%2C%20date%20from%20swdata%20%20order%20by%20date%20desc%20limit%2010
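
Decoded, the query is:

SQL query: select name || ", " || address1 || ", " || address2 as title,
rating || " stars" as description, url as link,
latlng_lng || " " || latlng_lat as "georss:point", date
from swdata order by date desc limit 10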

I've got it running through the aforementioned ifttt and tweeting
inspections to @EatSafeWalsall - thanks to Anna for giving me the
idea!

Only problem I'm discovering now is that the pubDates are in the past,
as the inspection date is not necessarily the same as the date they're
put on the website. Any ideas how I can get around this?


Anna Powell-Smith

Sep 13, 2011, 1:18:29 PM
to scrap...@googlegroups.com
On 13 September 2011 12:13, pezholio <pezh...@gmail.com> wrote:
> Excellent, that seems to work now:
>
> https://api.scraperwiki.com/api/1.0/datastore/sqlite?format=rss2&name=walsall_warwickshire_food_safety_inspections&query=select%20name%20||%20%22%2C%20%22%20||%20address1%20||%20%22%2C%20%22%20||%20address2%20as%20title%2C%20rating%20||%20%22%20stars%22%20as%20description%2C%20url%20as%20link%2C%20latlng_lng%20||%20%22%20%22%20||%20latlng_lat%20as%20%22georss%3Apoint%22%2C%20date%20from%20swdata%20%20order%20by%20date%20desc%20limit%2010
>
> I've got it running through the aforementioned ifttt and tweeting
> inspections to @EatSafeWalsall - thanks the Anna for giving me the
> idea!
>
> Only problem I'm discovering now if that the pubDates are in the past,
> as the inspection date is not necessarily the same as the date they're
> put on the website. Any ideas how I can get around this?

I would do this: add a field called date_scraped, and set it to the
current datetime when you run the scraper.

Then, before you insert a new record, check whether you've scraped it
before - this is how the Australian planning scrapers do it:

# das is the array of records scraped this run
das.each do |record|
  if ScraperWiki.select("* from swdata where `id`='#{record['id']}'").empty?
    ScraperWiki.save_sqlite(['id'], record)
  else
    puts "Skipping already saved record " + record['id']
  end
end

When you generate the RSS, set date to date_scraped and order by
date_scraped, so that more recently scraped items come top. (You can
still output your "date of inspection" field in the description if you
want.)

And then start lobbying for the "insert if new record" SQLite call
that Henare suggested :)

Francis Irving

Sep 14, 2011, 3:23:49 AM
to scrap...@googlegroups.com
Stuart - any idea why no tweets have shown up in the @eatsafewalsall
account yet?

https://twitter.com/#!/eatsafewalsall
@eatsafewalsall hasn't tweeted yet.

Francis

pezholio

Sep 14, 2011, 4:13:31 AM
to ScraperWiki
Hi Francis,

I originally thought it was due to the pubDates being in the past, but
ifttt have assured me that they don't pay any attention to the
pubDate, as long as the urls are unique (thanks for the assistance
with this by the way Anna - I'll add my hat into the ring for
'official' date_scraped functionality). So I guess there have been no
inspections since I first set up the task. To get things moving I'm
going to clear the datastore, set up the ifttt task again (so it sees
a blank RSS feed), and then run the scraper again (I've also noticed
that I need to decode the HTML entities in the urls). This means we
should at least get the latest 10 inspections as a kick-off.

Cheers


pezholio

Sep 15, 2011, 5:13:52 AM
to ScraperWiki
Right. All should be working now! :)


Francis Irving

Sep 15, 2011, 7:13:46 AM
to scrap...@googlegroups.com
Awesome, thank you!

Julian Todd

Sep 15, 2011, 11:56:11 AM
to scrap...@googlegroups.com
Anna,

This is a common use case. When things have settled down we'll be looking at database triggers to implement exactly this type of feature. It's pretty well-developed technology:
    http://www.sqlite.org/lang_createtrigger.html
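
For instance, a trigger could log each newly inserted row into a table
that a feed then reads from - a sketch only, assuming an id column:

-- record every insert into swdata so a feed can pick out new items
CREATE TABLE IF NOT EXISTS new_items (id TEXT, seen_at TEXT);

CREATE TRIGGER IF NOT EXISTS log_new_items AFTER INSERT ON swdata
BEGIN
  INSERT INTO new_items VALUES (NEW.id, datetime('now'));
END;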

Julian.

Francis Irving

Sep 16, 2011, 2:28:27 PM
to scrap...@googlegroups.com
I've done a blog post about it, and about Stuart's food safety inspections.

Will probably be posted on Wednesday!

Could you add a mention of ScraperWiki and ifttt on the bio of
@EatSafeWalsall, so people know how it was made?

Francis

Stuart Harrison

Sep 17, 2011, 2:22:26 AM
to scrap...@googlegroups.com
No probs Francis - all done!

william perrin

Sep 23, 2011, 5:09:34 AM
to annapow...@gmail.com, scrap...@googlegroups.com
hi - i saw paul's post - is this robust enough now to share?  will break it to islington gently

cheers

w

Francis Irving

Sep 23, 2011, 10:10:35 AM
to scrap...@googlegroups.com, annapow...@gmail.com
Yes, definitely!

Francis

william perrin

Nov 7, 2011, 5:26:52 PM
to ScraperWiki
hi folks finally got around to setting this up for real world use (got
distracted replatforming my blog). am writing a piece on licensing a
local pub. ideally would use feedburner, which has a handy feed-to-email
feature which kings cross users are happy to use. it's also too late at
night for me to figure out how to use ifttt

any suggestions for getting around the Feedburner error?

>>> Sorry

This feed does not validate.

  line 2, column 119: link must be a full and valid URL:
  /scrapers/islington_business_licences/ [help]

  ... k>/scrapers/islington_business_licences/</link><description></descriptio ...

In addition, interoperability with the widest range of feed readers
could be improved by implementing the following recommendations.

  Your feed appears to be encoded as "iso-8859-1", but your server
  is reporting "utf-8" [help]

  line 2, column 8879: Missing atom:link with rel="self" [help]

  ... 5 Nov 2011 06:44:37 GMT</pubDate></item></channel></rss>




Anna Powell-Smith

Nov 9, 2011, 4:51:32 AM
to scrap...@googlegroups.com
On 7 November 2011 22:26, william perrin <wil...@cankfarm.com> wrote:
> hi folks finally got around to setting this up for real world use (got
> distracted replatforming my blog). am writing a piece on licensing a
> local pub. ideally would use feedburner, which has a handy feed-to-email
> feature which kings cross users are happy to use. it's also too late at
> night for me to figure out how to use ifttt

it's worth it, honest!

> any suggestions for getting around the Feedburner error?
>
>>>>Sorry
>
> This feed does not validate.
>
>    line 2, column 119: link must be a full and valid URL: /scrapers/
> islington_business_licences/ [help]

ScraperWiki people - please would you fix <link> to include the domain?

http://bit.ly/vSIljK shows the error.

Ross Jones

Nov 9, 2011, 5:13:45 AM
to scrap...@googlegroups.com
Sorry for the delay on this, slipped through the net.

Have fixed this now - http://bit.ly/vSIljK should show it is valid RSS. I'll check the charset encoding problem.

Ross.