Running out of batches for my thesis research


luuk van der Hout

Apr 23, 2024, 5:36:52 AM
to Guardian Open Platform API Forum
Hi, 

I'm currently working on my master's thesis and using large language models to make predictions. For this, I want to extract all news articles about (US) stocks published from 2022-01-01 to 2023-12-31. I have successfully connected to the API, but I am running out of batches. Could someone please assist me?

Kind regards, 

Luuk van der Hout

Jonathon Herbert

Apr 23, 2024, 5:44:01 AM
to guardian...@googlegroups.com
Hi Luuk,

Thanks for your e-mail! Happy to see you're using our API.

When you say 'running out of batches' – can you provide more information about the problem you're having? Ideally, this would be something we can easily replicate, for example, a query string and its actual response, along with a description of what you'd expect to see.

Best,

Jon.



--

Jonathon Herbert · he/him

Senior Developer, Guardian News and Media

jonathon...@theguardian.com


-----

Kings Place, 90 York Way,

London N1 9GU

theguardian.com

-----



luuk van der Hout

Apr 24, 2024, 11:18:23 AM
to Guardian Open Platform API Forum
Hi Jon,

Thank you for the quick response. I am not very good at coding so I will try to elaborate further (but I'm sorry if I'm being vague). I'm trying to run the following code in R:

# Initialise the objects the loop relies on
search_pages <- 300            # Number of pages to request per term
iteration_counter <- 0         # Counts completed iterations
all_results <- data.frame()    # Collects the results of every term

# Iterate over each search term in the list
for (i in seq_along(list$list)) {
  iteration_counter <- iteration_counter + 1            # Increment the iteration counter
  cat("Iteration:", iteration_counter, "\n")            # Print the iteration number to the console
  term <- list$list[i]                                  # Get the term for the current iteration
  to_search <- gd_search(term, pages = search_pages)    # Perform search for the current term
  results <- gd_call(to_search)                         # Retrieve search results
  results$search_term <- term                           # Add a column to indicate the term searched
  all_results <- rbind(all_results, results)            # Append results to the data frame
}


Where "list" refers to the terms I want to search for with your API, for example "apple inc", "tesla motors inc", etc. The list contains the names of all stocks listed on NASDAQ, NYSE and AMEX with share code 10 or 11, which comes to 8456 names.
"search_pages" is the number of pages I want to retrieve for each term. This number is somewhat arbitrary, but I have to make sure I get everything from 2022 and 2023, so I overshot somewhat and it is currently set to 300. I do this because specifying a start and end date did not work when I tried it.

My goal is to collect all news articles about the stocks on my list published between January 2022 and December 2023. With these, I want to form a consensus view of each stock and make stock predictions. The response I need is therefore the webTitle, type, id, and webPublicationDate in a data frame. So if I iterate through the whole list with search_pages set to 300, the data frame should have up to 8456*300*10 observations of 4 variables.

Because this is a lot of data, I think your platform does not allow me to run the amount of batches needed. Could you please help me get this to work? (Also, if my code is not efficient, I'm open to new ideas.)

Kind regards, 

Luuk van der Hout


On Tuesday, April 23, 2024 at 11:44:01 UTC+2, jonathon...@guardian.co.uk wrote:

Jonathon Herbert

Apr 24, 2024, 11:43:48 AM
to guardian...@googlegroups.com
Hi Luuk,

Your explanation 'I think your platform does not allow me to run the amount of batches needed' may be correct – could you share more information about the exact error message you are receiving? This will help us confirm.

Best,

Jon.


luuk van der Hout

Apr 30, 2024, 6:28:22 AM
to Guardian Open Platform API Forum
Hi Jon,

I do not get an error. The code will keep running but after a certain amount of time, it stops retrieving the news items. Then when I try it a few days later (without changing the code) it works again. 

Kind regards, 

Luuk van der Hout

On Wednesday, April 24, 2024 at 17:43:48 UTC+2, jonathon...@guardian.co.uk wrote:

Jonathon Herbert

Apr 30, 2024, 11:40:16 AM
to guardian...@googlegroups.com
Hi Luuk,

If you are being rate limited, the Content API will be returning HTTP status code 429, but sadly I don't know enough R to be helpful here! Perhaps another user or an R-specific forum could help with that.
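
For readers working in R, here is a minimal sketch of how a 429 response could be detected and retried. The names `call_with_backoff` and `guardian_request` are hypothetical helpers written for this illustration, and the httr package is an assumed choice of HTTP client; neither comes from the code posted earlier in the thread.

```r
# Sketch: retry a request while the API answers HTTP 429 (rate limited).
# `request_fn` is a hypothetical zero-argument function that returns
# list(status = <HTTP status code>, body = <parsed response>).
call_with_backoff <- function(request_fn, max_tries = 5, base_wait = 1,
                              sleep_fn = Sys.sleep) {
  for (attempt in seq_len(max_tries)) {
    res <- request_fn()
    if (res$status != 429) {
      return(res)  # Success or a different error: hand it back to the caller
    }
    if (attempt < max_tries) {
      wait <- base_wait * 2^(attempt - 1)  # Wait 1, 2, 4, ... seconds
      message("Rate limited (HTTP 429); waiting ", wait, " s before retrying")
      sleep_fn(wait)
    }
  }
  stop("Still rate limited after ", max_tries, " attempts")
}

# Assumed request function built on httr; it is only defined here, not called.
guardian_request <- function(query) {
  function() {
    resp <- httr::GET("https://content.guardianapis.com/search", query = query)
    list(status = httr::status_code(resp),
         body = httr::content(resp, as = "parsed", type = "application/json"))
  }
}
```

A call might look like `call_with_backoff(guardian_request(list(q = "apple inc", "api-key" = my_key)))`, where `my_key` is a placeholder for your own key. Short waits like these only help with a per-minute limit; if a daily allowance is exhausted, retrying within the same day is unlikely to succeed.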

If I understand your current scenario correctly, were we to have 300 pages of content for each of your search terms, you would need to make a lot of requests – 8456 * 300 = 2,536,800, returning up to 8456 * 300 * 10 = 25,368,000 items. Your current rate limit is 60 requests a minute and 500 requests a day, making this impractical.
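
To make the scale concrete, the arithmetic can be sketched in R. It assumes one request per page of 10 items and uses only figures quoted in this thread; `days_needed` is a hypothetical helper named for this illustration.

```r
n_terms        <- 8456  # Search terms in the list
pages_per_term <- 300   # Pages requested per term
page_size      <- 10    # Default number of items per page

requests <- n_terms * pages_per_term  # One request per page
items    <- requests * page_size      # Upper bound on items returned

# Days required if the daily request limit is the only constraint
days_needed <- function(requests, per_day) ceiling(requests / per_day)

requests                     # 2536800
items                        # 25368000
days_needed(requests, 500)   # 5074 days at 500 requests a day
days_needed(requests, 5000)  # 508 days at 5000 requests a day
```

Even at ten times the default daily limit, the run would take well over a year, so cutting the number of requests matters more than raising the limit.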

A few things that might help –
  1. We can raise your request limit. It's worth noting that there are two limits, requests per minute (default 60) and requests per day (default 500). Introducing a delay between requests to ensure that you're not asking for too much data at once will help you avoid triggering limits.
  2. You can fetch bigger pages. The default page size is 10 content items, but it can be increased to 200. This will reduce the number of requests you need to make.
  3. You can fetch fewer pages. There are only 20 pages (200 items) of content for "apple inc", and 1 page of content for "tesla motors inc". You mention overfetching – are you certain you are asking for pages that exist for each query? Perhaps you could take the total number of pages to fetch from the `pages` property of the first response, and use this as a limit when iterating through pages. (It's worth noting the Content API will return a 400 error code if it is asked for a page that doesn't exist.)
With regard to 1., I've raised your request limit to 720 requests a minute and 5000 a day. I hope that helps!
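
The three suggestions above could be combined roughly as follows in R. This is a sketch under assumptions, not a tested client: `fetch_all_pages`, `items_to_df` and `guardian_fetch_page` are hypothetical helpers, httr is an assumed HTTP client, and the query uses the Content API search parameters `page`, `page-size`, `from-date` and `to-date`, so the date range is filtered by the API rather than by overfetching.

```r
# Fetch every page that exists for one term, and no more.
# `fetch_page(term, page)` must return the parsed `response` object,
# which carries `pages` (total page count) and `results` (list of items).
fetch_all_pages <- function(term, fetch_page, delay = 1, sleep_fn = Sys.sleep) {
  first <- fetch_page(term, 1)
  total_pages <- first$pages
  if (is.null(total_pages)) total_pages <- 1
  results <- list(first$results)
  if (total_pages > 1) {
    for (page in 2:total_pages) {
      sleep_fn(delay)  # Pause between requests to stay under the limits
      results[[page]] <- fetch_page(term, page)$results
    }
  }
  do.call(c, results)  # One flat list of content items
}

# Keep only the four fields asked for in the thread, plus the search term.
# Assumes every item carries all four fields.
items_to_df <- function(items, term) {
  pick <- function(field) {
    vapply(items, function(x) as.character(x[[field]]), character(1))
  }
  data.frame(
    id = pick("id"),
    type = pick("type"),
    webTitle = pick("webTitle"),
    webPublicationDate = pick("webPublicationDate"),
    search_term = rep(term, length(items)),
    stringsAsFactors = FALSE
  )
}

# Assumed page fetcher built on httr; it is only defined here, not called.
guardian_fetch_page <- function(api_key, from_date = "2022-01-01",
                                to_date = "2023-12-31") {
  function(term, page) {
    resp <- httr::GET(
      "https://content.guardianapis.com/search",
      query = list(q = term, `from-date` = from_date, `to-date` = to_date,
                   `page-size` = 200, page = page, `api-key` = api_key)
    )
    httr::stop_for_status(resp)  # Turn 4xx/5xx responses (e.g. 429) into R errors
    httr::content(resp, as = "parsed", type = "application/json")$response
  }
}
```

Usage might look like `df <- items_to_df(fetch_all_pages("apple inc", guardian_fetch_page(my_key)), "apple inc")`, with `my_key` standing in for your own API key. Passing the page fetcher in as an argument keeps the paging logic independent of the HTTP client, so it can be tried out with a fake fetcher before spending any of the daily request allowance.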

Best,

Jon.

luuk van der Hout

Apr 30, 2024, 3:54:04 PM
to Guardian Open Platform API Forum
Thanks a lot, Jon!

On Tuesday, April 30, 2024 at 17:40:16 UTC+2, jonathon...@guardian.co.uk wrote: