Re: [guardian-api-talk] 24 hours Guardian data storage limit


Michael Brunton-Spall

Mar 25, 2013, 7:11:29 AM
to guardian...@googlegroups.com
Hey Ramiro,

I'm afraid that, for legal reasons, we need to ensure that deletions or
amendments of content are reflected by all our clients within 24
hours.
That includes any text from the API, including headlines and URLs.

We therefore require that you regenerate the lists of content at least
once every 24 hours.

The two most common ways of achieving this are:

1. Run a regularly scheduled job that, once every 24 hours, processes all
of the article IDs you have stored and checks whether any of the
metadata has changed. If so, delete or update as necessary.

2. (For less regularly visited sites or pages.) When rendering
your page, read the Content API data through a cache. Set a maximum
age on the cache of 24 hours. That way you'll never fetch more than
once every 24 hours, and you'll only fetch data on demand, not
unnecessarily.
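
Both approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Guardian-provided code: `fetch` is a hypothetical function you would write yourself that asks the Content API for one article's metadata and returns `None` if the article no longer exists. The class and function names are made up for this example.

```python
import time
from typing import Callable, Dict, Optional, Tuple

MAX_AGE_SECONDS = 24 * 60 * 60  # the 24-hour limit described above

# Hypothetical signature: article id -> metadata dict, or None if deleted.
Fetch = Callable[[str], Optional[dict]]


class MaxAgeCache:
    """Option 2: fetch on demand, never serve an entry older than max_age."""

    def __init__(self, fetch: Fetch, max_age: float = MAX_AGE_SECONDS,
                 clock: Callable[[], float] = time.time) -> None:
        self._fetch = fetch
        self._max_age = max_age
        self._clock = clock  # injectable so the expiry logic can be tested
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, article_id: str) -> Optional[dict]:
        now = self._clock()
        entry = self._entries.get(article_id)
        if entry is not None and now - entry[0] < self._max_age:
            return entry[1]  # still fresh, no API request needed
        fresh = self._fetch(article_id)
        if fresh is None:
            # Article was removed upstream, so drop any stale copy.
            self._entries.pop(article_id, None)
            return None
        self._entries[article_id] = (now, fresh)
        return fresh


def refresh_stored(stored: Dict[str, dict], fetch: Fetch) -> Dict[str, dict]:
    """Option 1: body of a daily job that re-checks every stored article id.

    Returns a new mapping in which deleted articles are dropped and
    changed metadata is replaced by the freshly fetched version.
    """
    refreshed: Dict[str, dict] = {}
    for article_id in stored:
        fresh = fetch(article_id)
        if fresh is not None:
            refreshed[article_id] = fresh
    return refreshed
```

`refresh_stored` would be called from whatever scheduler you have available (cron or similar), while `MaxAgeCache.get` would sit in the page-rendering path.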

I hope that helps

Michael Brunton-Spall

On Sat, Mar 23, 2013 at 03:50:27PM -0700, Ramiro Gómez wrote:
> Hi,
>
> I'm currently working on a visualization project that shows the evolution
> of certain topics covered in the Guardian over time.
>
> In some parts of the visualization I want to show a list of article titles
> and thumbnail images for a particular topic. The headlines would link to
> the original article on the Guardian website.
>
> To do so I intended to store meta information for those articles, but I've
> read that storing Guardian content for more than 24 hours does not comply
> with the API terms of service.
>
> I'm not sure whether storing article headlines for more than 24 hours would
> be regarded as a breach of the API terms or if by content you mean whole
> article texts. I hope someone from the Guardian team can clarify this and
> in case I have to change my plans, advise on possible workarounds.
>
> TIA Ramiro
>

Michael Brunton-Spall

Mar 28, 2013, 10:33:03 AM
to guardian...@googlegroups.com
Hey Ramiro,

Sorry for the delay, I've been out of the office for a few days.

It would probably be acceptable to cache the counts, although it's a bit of a grey area, so I can't definitively say whether you can or would need to re-request the counts.  It would depend on the words, the context they are displayed in and other issues that might normally affect the updating or adjustment of the content in question.

Hope that helps

Michael Brunton-Spall
Developer Advocate
guardian.co.uk


On 25 March 2013 11:29, Ramiro Gómez <gro...@ramiro.org> wrote:
Thanks for the reply Michael!

I'll see if I can update the data I need via JavaScript without including credentials. I can't use method one right now because of hosting limitations.

One more follow-up question. In the visualization I aggregate counts of articles containing certain words or phrases over time. These aggregated counts could of course be affected by changes to the content. Fetching this data on demand is not feasible because it would take too long and updating all data to show change over time every 24 hrs is not an option either because of too many requests.

Say I include a statement that the aggregated numbers shown are a snapshot taken at a certain time. Would it be okay to calculate these numbers just once?
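
A minimal sketch of the snapshot idea, assuming the counts are already computed and held as a mapping from a period (for example a year) to a number of matching articles. `CountSnapshot` and `take_snapshot` are hypothetical names for this illustration. It only shows how the snapshot time could be stored and displayed, not whether doing so satisfies the API terms.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Optional


@dataclass(frozen=True)
class CountSnapshot:
    counts: Dict[str, int]  # period -> number of matching articles
    taken_at: datetime      # when the counts were computed (UTC)

    def label(self) -> str:
        # Statement to show next to the visualization.
        return f"Counts are a snapshot taken on {self.taken_at:%Y-%m-%d %H:%M} UTC"


def take_snapshot(counts: Dict[str, int],
                  now: Optional[datetime] = None) -> CountSnapshot:
    # Copy the counts so later changes to the input do not alter the snapshot.
    return CountSnapshot(dict(counts), now or datetime.now(timezone.utc))
```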

Best Ramiro





Ramiro Gómez

Apr 5, 2013, 5:36:01 AM
to guardian...@googlegroups.com
Thanks for the info Michael, I just saw your reply. I published a version where article listings are fetched live and counts are cached. Maybe you can find the time to review it; I hope it's okay the way I've implemented it: http://exploringdata.github.com/vis/climate-changes-decade/

Best Ramiro