Looping Advice - Scraping Data from webpages


Cara Daneel

Oct 26, 2015, 11:53:36 AM10/26/15
to cambridge-r...@googlegroups.com
Dear All

I am working for The Nature Conservancy, based in the Cambridge Zoology Department, and am currently scraping TripAdvisor reviews for a study of mangrove tourism worldwide.

In my code I have a 'for' loop which lets me paginate through the multiple pages of each TripAdvisor attraction to collect all of its reviews. Each page's URL varies by '-or10-', '-or20-', '-or30-', etc., so it was simple to modify the URL using a sequence.
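The pagination pattern described above can be sketched roughly as follows. The base URL, the upper offset, and the scraping step are all placeholders; only the '-orNN-' URL construction is taken from the thread.

```r
# Sketch of the pagination loop described above. The attraction URL and the
# upper limit (here 100) are hypothetical and were being set by hand.
base_url <- "https://www.tripadvisor.com/Attraction_Review-g123-d456"  # hypothetical
offsets  <- seq(from = 10, to = 100, by = 10)

# First page has no offset; later pages append "-or10", "-or20", ...
page_urls <- c(base_url, paste0(base_url, "-or", offsets))

for (url in page_urls) {
  # reviews <- scrape_reviews(url)  # placeholder for the actual scraping step
}
```

The difficulty raised below is that the right upper limit for `offsets` differs for every attraction.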

Until now I have manually entered every new URL and the upper limit of the looping sequence when moving from one attraction to the next. However, the project has now grown and I would have to do this 3,500 times. I am hoping to automate entering each new attraction URL when the previous one is finished (a problem for another time).

This has introduced a complication: each attraction has a different number of pages of reviews, so I would need to continually update the upper limit of the pagination loop. If the upper limit is too high, the site goes back to the first page and I get repetitions!

Rather than continually updating the upper limit, I was wondering whether I could use a different sort of loop with conditions. Since each scraped review has a unique ID number, I thought I could tell the code to stop looping through the pages when it pulls a review ID it has already extracted. Does anyone know if this is possible? If so, what terminology would be appropriate? If not, is there a better way to do it?

I apologize if I have not been clear. I truly appreciate your time and am happy to clarify anything if it will help.

Many thanks,

Cara Daneel

Andrew Caines

Oct 27, 2015, 7:03:05 AM10/27/15
to cambridge-r...@googlegroups.com
Hi Cara,
It sounds to me like you need to keep a list of seen IDs, then check that your current ID is not in this list before proceeding. But this is all a bit pseudo, and if you want practical help please post some reproducible code.
best wishes, Andrew
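One way to sketch Andrew's suggestion in R: keep a vector of seen review IDs and stop paginating as soon as a page returns an ID already in that vector. `get_review_ids()` and the `pages` data are mocks standing in for the real scraping step; the mock wraps back to the first page to mimic the repetition behaviour described above.

```r
# Mock of three pages of review IDs that wrap around when the offset is too
# high, as TripAdvisor does. Replace get_review_ids() with real scraping.
pages <- list(c("r1", "r2"), c("r3", "r4"), c("r5", "r6"))

get_review_ids <- function(offset) {
  i <- (offset / 10) %% length(pages) + 1  # wrap around like the real site
  pages[[i]]
}

seen_ids <- character(0)
offset   <- 0

repeat {
  ids <- get_review_ids(offset)        # IDs on the current "-orNN-" page
  if (any(ids %in% seen_ids)) break    # a repeated ID means we've wrapped
  seen_ids <- c(seen_ids, ids)
  offset   <- offset + 10              # advance to the next page
}
# seen_ids now holds each review ID exactly once, with no upper limit needed
```

The `repeat`/`break` form avoids having to know the number of pages in advance, which was the original problem.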


Michael Moffat

Oct 27, 2015, 9:53:12 AM10/27/15
to Cambridge R user group
If the first ID encountered will always be the first ID to be repeated, then you need only store that one ID and check against it each time. That saves you from having to either grow a vector or know the number of IDs before starting.

To stay closest to what it sounds like you've been doing, you could combine the ID check with a while loop, looping through your pages while the current ID differs from the original ID, although there may be nicer alternatives.
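Michael's variant could be sketched like this, again with mock pages standing in for real scraping (the `pages` data and `first_id_on()` helper are hypothetical). Only the first review ID is stored, and the while loop runs until a page's first ID matches it.

```r
# Mock pages that wrap back to page one when the offset runs past the end,
# mimicking TripAdvisor's behaviour; replace with real scraping calls.
pages <- list(c("r1", "r2"), c("r3", "r4"), c("r5", "r6"))
first_id_on <- function(offset) pages[[(offset / 10) %% length(pages) + 1]][1]

first_id <- first_id_on(0)   # remember only the very first review ID
all_ids  <- pages[[1]]       # scrape page one
offset   <- 10

# Keep paginating while the current page does not start with the first ID
while (first_id_on(offset) != first_id) {
  all_ids <- c(all_ids, pages[[(offset / 10) %% length(pages) + 1]])
  offset  <- offset + 10
}
```

Compared with keeping every seen ID, this stores a single value per attraction, at the cost of assuming the wrap-around always returns to the very first page.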
