Dear All
I work for The Nature Conservancy, based in the Cambridge Zoology Department, and am currently scraping TripAdvisor reviews for a study of mangrove tourism worldwide.
In my code I have a 'for' loop which allows me to paginate through multiple pages for each given TripAdvisor attraction to get all of the reviews for that attraction. The page URLs differ only by an offset token ('-or10-', '-or20-', '-or30-', etc.), so it was simple to modify the URL using a sequence.
So far I have manually entered each new URL and the upper limit of the looping sequence when changing from one attraction to another. However, my project has now grown and I would have to do this 3,500 times. I am hoping to automate the input of each new attraction URL once the previous one is finished (a problem for another time).
This has introduced a complication: each attraction will have a different number of pages of reviews, so I would need to continually update the upper limit of the pagination loop sequence. If the upper limit is too high, the site goes back to the first page and I get repeated reviews!
Rather than continually updating the upper limit I was wondering if it would be possible to use a different sort of loop with conditions. I thought I could use the fact that each review scraped has a unique ID number to tell the code to stop looping through the pages when it pulls a Review ID which it has already extracted. Does anyone know if this is possible? If so, what sort of terminology would be appropriate? If not, is there a better way to do it?
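To make the idea concrete, here is a minimal sketch of the kind of loop I have in mind, written in Python purely for illustration. Everything named here is an assumption rather than my real code: `fetch_page` is a hypothetical function that takes a page offset (0, 10, 20, ...) and returns a list of (review ID, review text) pairs for that page, and `max_pages` is just a safety cap. The loop stops as soon as it meets a Review ID it has already stored, or when a page comes back empty.

```python
from typing import Callable, Dict, List, Tuple

Review = Tuple[str, str]  # (review_id, review_text)


def collect_reviews(
    fetch_page: Callable[[int], List[Review]],  # hypothetical page scraper
    step: int = 10,         # offset increment between pages (-or10-, -or20-, ...)
    max_pages: int = 1000,  # safety cap so the loop can never run forever
) -> Dict[str, str]:
    """Page through reviews until a Review ID repeats or a page is empty."""
    seen: Dict[str, str] = {}  # review_id -> review_text, in scrape order
    offset = 0
    pages_read = 0
    while pages_read < max_pages:
        page = fetch_page(offset)
        pages_read += 1
        if not page:
            break  # no reviews on this page: nothing left to read
        for review_id, text in page:
            if review_id in seen:
                # The site has wrapped back to reviews we already hold.
                return seen
            seen[review_id] = text
        offset += step
    return seen
```

With something like this, no upper limit would need to be set per attraction: the condition on the Review ID ends the loop, and the cap only guards against a page that never repeats.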
I apologize if I have not been clear. I truly appreciate your time and am happy to clarify anything if it will help.
Many thanks,
Cara Daneel