Query on quantity of content


klostar7777

Mar 15, 2016, 7:41:11 AM
to Guardian Open Platform API Forum

Hiya,

I'm a researcher at a UK university looking for a large quantity of textual data (with as much variety as possible) to evaluate machine learning algorithms. Text from Guardian articles would be ideal for this. Ideally I need 80 to 100GB of text. Is that the sort of quantity I might be able to get from the Guardian Open Platform (across all article genres)? If so, that would be great news and I'll proceed with the Scala API.

Many thanks for advice,

Karen

RevDanCatt

Mar 15, 2016, 8:25:51 AM
to Guardian Open Platform API Forum
Hey Karen,

From having played with the API a lot, my gut feeling is that it's not even close to 80GB, sadly.

A single request of 200 articles gives you body text of around 1.8MB when saved as a text file. 200 articles is the maximum allowed per request ("page"), and at that page size the API reports 9,258 "pages" (requests) of data. Back-of-napkin calculation: 9,258 "pages" of 200 articles each = 9,258 x 1.8MB = 16,664MB = roughly 16.6GB.
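The napkin estimate can be reproduced with a few lines of Python. This is only a sketch: the three constants are the figures quoted in this post, the function name `estimate_corpus` is made up for illustration, and 1GB is taken as 1,000MB.

```python
# Figures quoted in the post; adjust them if the API reports something else.
PAGES = 9258             # "pages" reported at the maximum page size
ARTICLES_PER_PAGE = 200  # maximum articles per request
MB_PER_PAGE = 1.8        # approximate body text per 200-article page


def estimate_corpus(pages=PAGES, per_page=ARTICLES_PER_PAGE, mb_per_page=MB_PER_PAGE):
    """Return the estimated article count and total body-text size."""
    total_mb = pages * mb_per_page
    return {
        "articles": pages * per_page,
        "total_mb": total_mb,
        "total_gb": total_mb / 1000,  # assumes 1GB = 1,000MB
    }


if __name__ == "__main__":
    estimate = estimate_corpus()
    print(f"{estimate['articles']:,} articles, about {estimate['total_mb']:,.0f}MB")
```

Changing `MB_PER_PAGE` is the quickest way to see how sensitive the total is to the size of a typical page, since 1.8MB was measured from a single request.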

With the developer level of access you're allowed to make 5,000 calls per day. So on the upside it would only take you 2 days to fetch all the articles, which comes to somewhere in the region of 1.8 million articles.
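As a rough sketch of how the download could be planned around the daily limit, the Python below splits the pages into per-day batches and builds one request URL per page. The page count and call limit are the figures from this post; the endpoint and the query parameter names (`api-key`, `page`, `page-size`, `show-fields`) are assumptions here and should be checked against the Open Platform documentation, and the helper names are made up for illustration.

```python
import math
from urllib.parse import urlencode

TOTAL_PAGES = 9258       # from the post
DAILY_CALL_LIMIT = 5000  # developer-tier limit quoted in the post

# Assumed endpoint; verify against the official documentation.
SEARCH_ENDPOINT = "https://content.guardianapis.com/search"


def days_needed(total_pages=TOTAL_PAGES, daily_limit=DAILY_CALL_LIMIT):
    """Number of days required at one call per page."""
    return math.ceil(total_pages / daily_limit)


def daily_batches(total_pages=TOTAL_PAGES, daily_limit=DAILY_CALL_LIMIT):
    """Yield one range of page numbers (1-based) per day."""
    for start in range(1, total_pages + 1, daily_limit):
        yield range(start, min(start + daily_limit, total_pages + 1))


def page_url(page, api_key, page_size=200):
    """Build the request URL for one page (parameter names are assumptions)."""
    params = {
        "api-key": api_key,
        "page": page,
        "page-size": page_size,
        "show-fields": "bodyText",
    }
    return SEARCH_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(days_needed(), "days")
    for day, pages in enumerate(daily_batches(), start=1):
        print(f"day {day}: pages {pages.start} to {pages.stop - 1}")
```

Fetching is left out on purpose; each URL from `page_url` would be requested once, with the responses saved to disk so that a failed run can resume from the last completed page.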

Hope this helps.
-D

klostar7777

Mar 15, 2016, 8:38:48 AM
to Guardian Open Platform API Forum

Hi D, 

Useful to know. 16GB might be worth the effort to retrieve, as I may end up adding it to other collections to eventually make up 80GB.

Many thanks for info,
Karen

