Data Scraping with R (TripAdvisor)


Cara Daneel

Jun 23, 2015, 1:33:14 PM
to cambridge-r...@googlegroups.com
I am currently working with Mark Spalding of the Nature Conservancy at the University of Cambridge Zoology department.

In our quest to research tourism in the world's mangroves, we are attempting to mine reviews of mangrove tourism activities from online sources; at the moment that means TripAdvisor.

I am currently using code found online here: (https://github.com/hadley/rvest/blob/master/demo/tripadvisor.R)
I have modified it by adding a loop that steps through the subsequent pages (the reviews run over more than one page) and by making small tweaks so that it works for my needs, for example scraping review data from attractions like this one, where ",i," marks the spot at which my loop variable is pasted into the URL (http://www.tripadvisor.co.uk/Attraction_Review-g4156412-d628396-Reviews-",i,"Paradise_Island_The_Mangroves_Cayo_Arena-Punta_Rucia_Puerto_Plata_Province_Domini.html#REVIEWS").
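For anyone following along, the paging loop described above can be sketched roughly as below. This is only an illustration: the "or10-", "or20-" offset tokens are an assumption about how the site numbers its review pages, not something taken from the attached script, and the names prefix, suffix, offsets and page_tokens are made up for the example.

```r
# URL pieces taken from the attraction link above
prefix <- "http://www.tripadvisor.co.uk/Attraction_Review-g4156412-d628396-Reviews-"
suffix <- "Paradise_Island_The_Mangroves_Cayo_Arena-Punta_Rucia_Puerto_Plata_Province_Domini.html#REVIEWS"

# ASSUMPTION: later pages are addressed by an offset token such as "or10-",
# "or20-", ... and the first page has no token at all.
offsets <- seq(0, 20, by = 10)
page_tokens <- ifelse(offsets == 0, "", paste0("or", offsets, "-"))

# One URL per page; a scraping loop would then visit each in turn,
# ideally pausing between requests (e.g. with Sys.sleep()).
urls <- paste0(prefix, page_tokens, suffix)
print(urls)
```

Building all the URLs up front makes it easy to check them by eye before any page is actually requested.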

For those who are interested I will attach a copy of my current script to this thread.  My immediate problem is that it is not recognizing the date as my page's source code differs to his and I am unsure of how to customize this aspect.

While I would be very excited to receive direct advice on altering the code, I would also be grateful just to hear whether anyone is doing similar work. I am relatively new to R and very new to the source code behind websites and how to use it.

I promise to keep any questions I have clear in content and limited in number but would truly appreciate some feedback from relevant parties who think they could be of assistance.


I truly appreciate your time.

Regards,

Cara

test_Tripadvisor_ShorterCode.R

Andrew Caines

Jun 24, 2015, 4:13:18 AM
to cambridge-r...@googlegroups.com
Hi Cara,
I'm not working specifically with Trip Advisor data, but mining the web for texts is certainly an interest of mine.
Can you describe the date problem in a little more detail? What do you expect, what happens, can you provide a snippet of code to test just this bit?
regards, Andrew
-- 
Research Associate
ALTA Institute / Dept of Theoretical and Applied Linguistics
University of Cambridge
apc38; 01223 (3)60812

--
You received this message because you are subscribed to the Google Groups "Cambridge R user group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cambridge-r-user-...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Cara Daneel

Jun 24, 2015, 7:48:42 AM
to cambridge-r...@googlegroups.com
Dear Max and Andrew

Thank you for replying.

Max: Thank you so much for the offer. That would be extremely helpful. I wish I had happened upon this group sooner as I am off on holiday late next week and this work is part of a part time job for me so I am not available everyday. I can possibly pop around next week Wednesday (1st July) before I fly out on Thursday; but, of course, this is at your convenience. Let me know which hours or days will suit you best. If not Wednesday then I could contact you when I return in two weeks? Should we hash out logistics via personal email? I can be found at cl...@cam.ac.uk. Again, my sincere thanks.

Andrew: Your web expertise would be much appreciated. I was able to get the original script to run on the TripAdvisor page that the original script-writer used (link in my original message). For my page, I just get NAs for the date information, while all the other vectors (e.g. rating, review extract) come out fine. I am not sure how to customize the code to recognize date data that is stored in a different format in the page source. The whole script is attached to my first message; the date-scraping section is as follows:

date <- reviews %>%
  html_node(".rating .ratingDate") %>%
  html_attr("title") %>%
  strptime("%b %d, %Y") %>%
  as.POSIXct()

I think the 'attribute' call (3rd line) is the problem but I am not sure of the necessary modification. Any advice would be truly appreciated. Thank you for your time!
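To illustrate the symptom described above, here is a small self-contained sketch using rvest on made-up HTML (the markup is hypothetical, not copied from TripAdvisor). It shows html_attr() returning NA when a node has no title attribute, and one possible fallback to the node text. The names from_title, from_text and raw_date are invented for the example.

```r
library(rvest)

# HYPOTHETICAL markup: the first review keeps the date in a title attribute,
# the second keeps it only in the node text.
page <- read_html('
<div id="REVIEWS">
  <div class="innerBubble"><div class="rating">
    <span class="ratingDate" title="June 20, 2015">Reviewed 3 days ago</span>
  </div></div>
  <div class="innerBubble"><div class="rating">
    <span class="ratingDate">Reviewed 12 June 2015</span>
  </div></div>
</div>')

reviews <- html_nodes(page, "#REVIEWS .innerBubble")
date_nodes <- html_node(reviews, ".rating .ratingDate")

from_title <- html_attr(date_nodes, "title")  # NA where there is no title attribute
from_text  <- html_text(date_nodes)           # always the visible text

# Prefer the title attribute, fall back to the text when it is missing
raw_date <- ifelse(is.na(from_title), from_text, from_title)
print(raw_date)
```

Keeping both vectors as plain strings makes it easy to see which reviews store the date where before any date parsing is attempted.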


Andrew Caines

Jun 24, 2015, 8:45:43 AM
to cambridge-r...@googlegroups.com
I haven't looked into what is different about the TripAdvisor page you were mining, but you were right: the html_attr() call is where the input went to 'null'. So here's another way to get your formatted dates: insert commands [1] to [3] inside your loop in place of the current 'date' line. Hope it helps.

# load package
library(rvest)

# test URL
url <- 'http://www.tripadvisor.co.uk/Attraction_Review-g4156412-d628396-Reviews-Paradise_Island_The_Mangroves_Cayo_Arena-Punta_Rucia_Puerto_Plata_Province_Domini.html'

# fetch html (read_html() replaces html(), which is deprecated in newer rvest versions)
reviews <- url %>%
  read_html() %>%
  html_nodes("#REVIEWS .innerBubble")

## EXTRACT REVIEW DATES

# [1] get text from 'ratingDate' node(s)
rawdates <- reviews %>% html_node(".rating .ratingDate") %>% html_text()

# [2] split date strings on spaces and \n
datebits <- lapply(rawdates, function(x) unlist(strsplit(x, " |\n"))[2:4])

# [3] format as required
date <- unlist(lapply(datebits, function(x) paste0(paste(x[1:2], collapse=" "), paste(",", x[3]))))
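As a quick check of what steps [2] and [3] produce, they can be run on a couple of made-up strings without touching the website. The sample values below are an assumption about what the 'ratingDate' text looks like (a leading word, then day, month and year, then a newline), not real scraped output.

```r
# ASSUMED shape of the text returned in step [1]; values are invented
rawdates <- c("Reviewed 12 June 2015\n", "Reviewed 3 May 2014\n")

# [2] split on spaces and newlines, keep pieces 2 to 4 (day, month, year)
datebits <- lapply(rawdates, function(x) unlist(strsplit(x, " |\n"))[2:4])

# [3] glue the pieces back together as "day month, year"
date <- unlist(lapply(datebits, function(x) paste0(paste(x[1:2], collapse = " "), paste(",", x[3]))))

print(date)
```

If the text on the live page has a different shape (for example a relative date such as "3 days ago"), pieces 2 to 4 will not be day, month and year, so it is worth printing rawdates first.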



Laura Hiemer

Apr 11, 2016, 5:17:26 AM
to Cambridge R user group
Hi Cara,

As I am currently trying to scrape TripAdvisor data as well, I wanted to ask whether you have managed to solve the problem of getting NA for the date?

I have tried a lot of different code but it has never worked...

Thanks,

Laura

Max Conway

Apr 11, 2016, 8:06:22 AM
to Cambridge R user group

Hey Laura,
I've found that when scraping with R it's often best to scrape everything as character strings, then convert after. So scrape just the character string itself, to check that the actual retrieval works, then you can try to find the best way to parse it at your leisure.
I'd recommend using lubridate for parsing.
Hope that helps,
Max
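A rough sketch of that two-step approach, for illustration: the strings below are invented rather than scraped, the "Reviewed" prefix is an assumption about the page text, and month names are assumed to be in English. dmy() is lubridate's day-month-year parser.

```r
library(lubridate)

# Step 1: keep whatever was scraped as plain character strings.
# These values are made up for illustration.
raw <- c("Reviewed 12 June 2015\n", "Reviewed 3 days ago\n")

# Step 2: tidy the strings, then parse. Anything that is not a
# day-month-year date (such as a relative date) comes back as NA.
cleaned <- trimws(sub("^Reviewed\\s+", "", raw))
parsed <- suppressWarnings(dmy(cleaned))

# Keep raw and parsed side by side so failed parses are easy to spot
result <- data.frame(raw = cleaned, date = parsed, stringsAsFactors = FALSE)
print(result)
```

Keeping the raw column next to the parsed one means the rows that came back as NA can be inspected and handled separately later.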


Cara Daneel

Apr 12, 2016, 5:47:41 AM
to cambridge-r...@googlegroups.com
Dear Laura,

I am currently at my other job so I do not have my notes with me. I will check my code and notes and reply when I am in the Zoology department tomorrow. Because TripAdvisor changed recently, I had to scrape both the relative date and the actual date and then simply sort them out in Excel. I will double-check the specifics and get back to you.

Sorry I am not of immediate help.

Best,

Cara


Laura Hiemer

Apr 15, 2016, 12:04:13 PM
to Cambridge R user group
Hi,

thank you for your kind replies.

It is working now with:

date <- reviews %>%
  html_node(".innerBubble, .ratingDate") %>%
  html_text()


Greetings,

Laura

Cara Daneel

Apr 18, 2016, 4:48:17 AM
to cambridge-r...@googlegroups.com
Great, sorry I wasn't much help. I just came online to send you what I used. Glad you have solved it.

For interest's sake, can I ask what you are working on? I found very few people using R to scrape TripAdvisor reviews.

Best wishes, 

Cara

Laura Hiemer

Apr 21, 2016, 5:43:33 AM
to Cambridge R user group
Hi,

I'm working on an image analysis for the Bavarian destination Konigssee, so it is very useful to have the online reviews.

Just one more question: were you able to solve the problem of getting the full review?


Kind regards,

Laura

シ YUASA

Apr 22, 2016, 5:25:52 AM
to Cambridge R user group
Hi, I am a student studying statistics in Japan.
I am interested in your R code, but I have a problem. I want to scrape the full text of the reviews, but long reviews are cut off and end in "...more".
Please tell me if you have solved it.
Thank you.

Cara Daneel

Apr 27, 2016, 6:56:44 AM
to cambridge-r...@googlegroups.com
Dear Laura,

Unfortunately I did not. To get at the full review, the code needs to interact with the website's JavaScript, which rvest is not equipped to deal with (as far as I could tell). I spent a lot of time researching alternative ways of doing this in R, and the best option seems to be combining rvest with another package, RSelenium. RSelenium lets you give commands such as following links or 'click this', but it requires you to download a Selenium Standalone Server and a browser driver (such as the Opera or Chrome driver). I tried this (even though the coding and computer expertise needed were more than I possess) but I could not get it to work, so for time-efficiency I had to move on. However, there do seem to be ways of doing it, so don't give up. I am a zoologist and my computer programming knowledge isn't great.
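For anyone who wants to try the RSelenium route, a rough and untested-against-TripAdvisor sketch of the idea is below. It assumes a Selenium server is already running on localhost port 4444 with a Chrome driver available. The function name fetch_expanded_page and the CSS selector for the 'more' link are hypothetical and would need to be checked against the real page source.

```r
# Sketch only: open a browser through Selenium, click the first "more" link
# so the truncated reviews expand, then hand the resulting page to xml2/rvest.
# ASSUMPTIONS: Selenium server on localhost:4444; the selector is a guess.
fetch_expanded_page <- function(url, more_selector = ".partial_entry .moreLink") {
  remDr <- RSelenium::remoteDriver(
    remoteServerAddr = "localhost",
    port = 4444L,
    browserName = "chrome"
  )
  remDr$open(silent = TRUE)
  on.exit(remDr$close(), add = TRUE)  # always close the browser session

  remDr$navigate(url)

  # Click the first matching link, if there is one, and give the page
  # a moment to load the expanded text
  more_links <- remDr$findElements(using = "css selector", value = more_selector)
  if (length(more_links) > 0) {
    more_links[[1]]$clickElement()
    Sys.sleep(2)
  }

  # Return the rendered page as a parsed document for html_nodes() etc.
  page_source <- remDr$getPageSource()[[1]]
  xml2::read_html(page_source)
}
```

The returned document can be passed to the same html_nodes() and html_text() calls used earlier in the thread, so only the fetching step changes.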

Here are some of the links I read...

Good luck, I hope some of this is helpful to you!

Let me know if you are able to work through it!

Best,

Cara


Cara Daneel

Apr 27, 2016, 6:58:33 AM
to cambridge-r...@googlegroups.com
Dear Yuasa,

The rvest code actually isn't mine - it was written and provided by someone in America called Hadley Wickham. But I have made use of it.

Unfortunately I did not find a way to get the full review beyond 'read more'. To get at the full review, the code needs to interact with the website's JavaScript, which rvest is not equipped to deal with (as far as I could tell). I spent a lot of time researching alternative ways of doing this in R, and the best option seems to be combining rvest with another package, RSelenium. RSelenium lets you give commands such as following links or 'click this', but it requires you to download a Selenium Standalone Server and a browser driver (such as the Opera or Chrome driver). I tried this (even though the coding and computer expertise needed were more than I possess) but I could not get it to work, so for time-efficiency I had to move on. However, there do seem to be ways of doing it, so don't give up. I am a zoologist and my computer programming knowledge isn't great.
