It would be lovely if each scraper had an RSS feed of new data items.
I have two use cases for this, where I'm currently using my own code
to generate RSS feeds from ScraperWiki datasets: licensing
applications to councils, where you want to see new applications as
they appear, and houses for sale, ditto.
thanks
Anna
Although it's not a core feature, you can get an RSS feed for any
scraper using this view[0] (I think you already know that but for
others on the list...).
Cheers,
Henare
[0] https://scraperwiki.com/views/rss_2/
--
Henare Degan
e » henare...@gmail.com
t » @henaredegan
w » www.henaredegan.com
Would this be better/worse than a very slick interface for making an
RSS view using a template (such as below) that you can then at least
edit and customise?
Francis
But there are two problems with it:
1. It's too slow to render - e.g. the W3C validator times out before
it renders.
2. It's a view, so it has the "Powered by ScraperWiki" HTML pane, so
it isn't valid RSS.
One or other of these means that Google Reader can't cope with the
RSS it generates [0].
On reflection what I'd like is a ScraperWiki command that I can call
from within the scraper, that actually generates a static RSS file
when it runs.
Just like the RSS view, it would be ideal to specify title, link,
description, ordering etc.
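Roughly this sort of thing, say (every name below is made up - it's
just to give the flavour of the call I mean):

import scraperwiki

# hypothetical call - this API doesn't exist, it's what I'm proposing
scraperwiki.rss.publish(
    title="Islington Business Licences",
    link="https://scraperwiki.com/scrapers/islington_business_licences/",
    description="New business licence applications in Islington",
    query="select * from swdata order by date_scraped desc limit 10")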
Does that sound sensible?
Can you give a URL showing it in use with one of your scrapers, where
you like the output?
> But there are two problems with it:
>
> 1. It's too slow to render - e.g. the W3C validator times out before
> it renders.
:( Am just taking a look at this.
> 2. It's a view, so it has the "Powered by ScraperWiki" HTML pane, so
> it isn't valid RSS.
I've added a line to set the mimetype, which stops ScraperWiki
adding that pane. (In https://scraperwiki.com/views/rss_2/)
ScraperWiki.httpresponseheader('Content-Type', 'application/rss+xml')
I've added an FAQ about this too - not deployed yet.
> One or other of these means that Google Reader can't cope with the
> RSS it generates [0].
>
> On reflection what I'd like is a ScraperWiki command that I can call
> from within the scraper, that actually generates a static RSS file
> when it runs.
>
> Just like the RSS view, it would be ideal to specify title, link,
> description, ordering etc.
>
> Does that sound sensible?
The trouble is that it needs somewhere to put the file, and have it
served from.
Best bet is really for us to fix the underlying speed problem...
Francis
NB in the original scraper, I have to use API calls to carry the
date_scraped field over between runs. It'd be nice to fix this too, at
least if you have other users who want RSS feeds.
>> But there are two problems with it:
>>
>> 1. It's too slow to render - e.g. the W3C validator times out before
>> it renders.
>
> :( Am just taking a look at this.
Thanks, that would be good.
>> 2. It's a view, so it has the "Powered by ScraperWiki" HTML pane, so
>> it isn't valid RSS.
>
> I've added a line to set the mimetype, which stops ScraperWiki
> adding that pane. (In https://scraperwiki.com/views/rss_2/)
>
> ScraperWiki.httpresponseheader('Content-Type', 'application/rss+xml')
>
> I've added an FAQ about this too - not deployed yet.
Thanks!
It's much neater than I would have expected - you can construct an
arbitrary RSS feed using just SQL.
So for example this URL is a feed of Islington Business Licenses:
Short link: http://bit.ly/pinSSf
SQL query: select licence_for as description, applicant as title, url as link,
date_scraped as date from swdata order by date_scraped desc limit 10
Of course, you can use fancier queries to concatenate strings and so
on if you need to.
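For example, something like this (an untested sketch - SQLite's ||
operator does the concatenation):

SQL query: select applicant || ': ' || licence_for as title, url as link,
date_scraped as date from swdata order by date_scraped desc limit 10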
To make one, go to the Web API page for a scraper, and choose "rss2"
as the format. Some help text appears, telling you which fields you
need to include.
https://scraperwiki.com/docs/api?name=islington_business_licences#sqlite
Anna, can you have a play with this and see if it meets your needs?
If it seems to be working well, I'll blog about it next week.
Francis
At least, the Islington licensing applications RSS feed [1] showed two
new items this morning. Thank you!
The scraper code [2] still relies on a bit of a hack though. The
pubDate field needs to be set to the date we *first scraped* each
item, and so I need to save that between runs. To do that, I load all
the existing data via the API at the start of the run.
So if you blog about it, I'd mention that this is a necessary step. At
least, I think it's a necessary step. Julian may have suggestions for
improving it.
Also, if it is necessary, I wonder whether I'm doing it the safest way
- would it be better to make all the calls to scraperwiki.sqlite.save
at the very end of the scraper? I want to avoid ending up with
half-written data if the scraper crashes half-way through. I can't
remember if ScraperWiki actually executes calls to the database live,
or queues them all till the scraper finishes.
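Something like this is the sort of thing I mean (a rough sketch - I
haven't checked whether it actually changes the failure behaviour,
and scrape_pages here stands in for the real scraping code):

import scraperwiki

def scrape_pages():
    # stand-in for the real scraping code - returns a list of row dicts
    return [{"url": "http://example.com/1", "title": "First item"}]

# collect rows in memory rather than saving as we go
new_rows = []
for row in scrape_pages():
    new_rows.append(row)

# one save at the very end, so a crash during scraping leaves the
# datastore untouched ('url' stands in for whatever the unique key is;
# I believe save() accepts a list of rows as well as a single dict)
scraperwiki.sqlite.save(["url"], new_rows)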
[1] http://bit.ly/pinSSf
[2] https://scraperwiki.com/scrapers/islington_business_licences/edit/
http://ifttt.com/wtf might be useful in this context - it can turn new
RSS items into pretty much any action you like, so you could create a
Twitter feed of new applications, for example.
On Monday 12 Sep 2011 07:07:39 pezholio wrote:
> This is great stuff, I'm having a few problems with my scraper
> however, I get the following error:
>
> "Date conversion error: Required argument 'day' (pos 3) not found"
>
I'm guessing you need to use the strftime/strptime function to format the date and pass a format string. SQLite provides a surprisingly limited range of date/time transforms compared to [your favourite programming language's date-time lib]. This is something SW users in general should be aware of as it's annoying to have to re-scrape later if you stored dates in some way that SQLite doesn't like. FYI, valid RSS requires dates to be in RFC822 format.
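For example, something along these lines (guessing at the stored
format - adjust the strptime pattern to match whatever your scraper
actually saves):

import re
from datetime import datetime

raw = "Friday, 17th Jun 2011"                        # hypothetical stored value
cleaned = re.sub(r"(\d+)(st|nd|rd|th)", r"\1", raw)  # strip the ordinal suffix
d = datetime.strptime(cleaned, "%A, %d %b %Y")
print(d.strftime("%Y-%m-%d"))                        # ISO 8601: 2011-06-17
print(d.strftime("%a, %d %b %Y %H:%M:%S +0000"))     # RFC 822, for RSS pubDate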
As for GeoRSS, the only problem I can think of is that it needs an extra namespace declaration at the top of the feed. Otherwise you could presumably just select the lat and long fields and alias them as required.
--
The only thing worse than e-mail disclaimers...is people who send e-mail to lists complaining about them
I think the problem is that the ScraperWiki API code expects an ISO 8601 date in the YYYY-MM-DD format. It looks like yours currently contains something like Friday, 15th Jun instead of 2011-06-15. Is the data stored in your scraper in ISO 8601 format?
Ross
That said, I guess the RFC-822 format may cause problems when ordering by date, so the solution may be to have two date fields - one in one format and one in t'other. Would I be correct in that assumption?
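Something like this at save time, I mean (field names made up):

from datetime import datetime
import scraperwiki

record = {"id": "abc123", "title": "Some inspection"}  # hypothetical row
now = datetime.utcnow()
record["date_scraped"] = now.strftime("%Y-%m-%d %H:%M:%S")        # sorts correctly
record["pub_date"] = now.strftime("%a, %d %b %Y %H:%M:%S +0000")  # valid pubDate
scraperwiki.sqlite.save(["id"], record)

Then in the API query you'd select pub_date as date, but order by
date_scraped desc.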
--------------
Sent from my Mark II Colossus
Aidan
What a great tool - and really simple to use! Thanks for sharing this!
best, tobias
I would do this: add a field called date_scraped, and set it to the
current datetime when you run the scraper.
Then, before you insert a new record, check whether you've scraped it
before - this is how the Australian planning scrapers do it:
das.each do |record|
  # only save the record if we haven't seen this id before
  if ScraperWiki.select("* from swdata where `id`='#{record['id']}'").empty?
    # (for the RSS use case, this is also where you'd set record['date_scraped'])
    ScraperWiki.save_sqlite(['id'], record)
  else
    puts "Skipping already saved record " + record['id']
  end
end
When you generate the RSS, set date to date_scraped and order by
date_scraped, so that more recently scraped items come top. (You can
still output your "date of inspection" field in the description if you
want.)
And then start lobbying for the "insert if new record" SQLite call
that Henare suggested :)
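(In raw SQLite that's more or less what INSERT OR IGNORE does, e.g.

insert or ignore into swdata (id, title) values ('abc123', 'Some title')

relying on the unique index that save_sqlite creates on the key
column - so presumably a wrapper for it wouldn't be hard.)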
https://twitter.com/#!/eatsafewalsall
@eatsafewalsall hasn't tweeted yet.
Francis
Will probably be posted on Wednesday!
Could you add a mention of ScraperWiki and ifttt on the bio of
@EatSafeWalsall, so people know how it was made?
it's worth it, honest!
Francis
On Thu, Sep 15, 2011 at 02:13:52AM -0700, pezholio wrote:
> any suggestions for getting around the Feedburner error
>
> Sorry
>
> This feed does not validate.
>
> line 2, column 119: link must be a full and valid URL:
> /scrapers/islington_business_licences/ [help]
ScraperWiki people - please would you fix <link> to include the domain?
http://bit.ly/vSIljK shows the error.
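(Presumably the generated channel element just needs the domain
prepending, i.e. something like

<link>https://scraperwiki.com/scrapers/islington_business_licences/</link>

rather than the relative path.)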