How to crawl YouTube data recursively?


JoeJoe

Mar 1, 2017, 4:38:43 PM
to scrapy-users

Hi guys. First and foremost, I am a total newbie to Scrapy and web crawling in general. I am working on a crawler that extracts static and dynamic data from YouTube videos (such as the uploader's name, the date the video was published, the number of views, likes and dislikes, etc.). The crawler does that perfectly, but I am stuck on how to make it crawl continuously without breaking. As you might know from YouTube's layout, when viewing a video there is a list of "related" videos on the right side of the main window.

[Screenshot: YouTube watch-page HTML with the container id="watch-related" highlighted]

From the image above, the highlighted line is the container I am using, with the id "watch-related", to capture the links to those videos so I can extract the same data from them. From the results I am getting, the crawler is evidently not picking up all the links on the current seed URL, and it finishes crawling after a while. The only way I have so far succeeded in getting it to crawl recursively is by using the dont_filter=True option, but after a while that starts crawling the same pages indefinitely, which is not what I need the crawler to do. Below is my very simple crawler. Again, I am not good at this, so my apologies for my poor coding skills. If someone could show me a simple way to get the crawler to scrape recursively while skipping already-extracted URLs, I'll be forever grateful. Thank you in advance.
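The original attachment does not survive in this archive, so what follows is only a rough sketch of the kind of spider described above; the start URL, item fields, and CSS selectors are illustrative guesses for the 2017-era watch page, not the poster's actual code (and YouTube's current, JavaScript-rendered pages will not match them):

import scrapy


class YoutubeSpider(scrapy.Spider):
    name = 'youtube'
    # Placeholder seed video; the original start URL is not shown in the thread.
    start_urls = ['https://www.youtube.com/watch?v=dQw4w9WgXcQ']

    def parse(self, response):
        # Static fields from the old watch-page markup (selectors are assumptions).
        yield {
            'title': response.css('#eow-title::text').extract_first(default='').strip(),
            'uploader': response.css('.yt-user-info a::text').extract_first(),
            'views': response.css('.watch-view-count::text').extract_first(),
        }

        # Follow every related-video link inside the #watch-related container.
        for href in response.css('#watch-related a::attr(href)').extract():
            yield scrapy.Request(
                response.urljoin(href),
                callback=self.parse,
                # dont_filter=True  # re-crawls already-seen pages forever; leave it off
            )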



 

Message has been deleted

JoeJoe

Mar 2, 2017, 4:58:04 AM
to scrapy-users
Sorry guys, this is how the code should look.




Message has been deleted

JoeJoe

Mar 4, 2017, 12:42:27 PM
to scrapy-users
youtube.py

Parth Verma

Mar 4, 2017, 11:44:38 PM
to scrapy-users
One change that I made was:

RELATED_SELECTOR = '#watch-related a ::attr(href)'

for article in response.css(RELATED_SELECTOR).extract():
    if article:
        yield scrapy.Request(
            response.urljoin(article),
            callback=self.parse,
            # dont_filter=True
        )

Leaving dont_filter=True in is what throws the spider into an infinite loop, so it is commented out here; with it off, Scrapy drops requests for URLs it has already crawled.
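For context, that deduplication comes from Scrapy's scheduler: when dont_filter is left at its default, every request is checked against the built-in duplicate filter (RFPDupeFilter), so already-visited URLs are dropped instead of re-crawled. If the memory of visited URLs also needs to survive a restart, one option is to run the crawl with a job directory; the spider name below is only assumed from the youtube.py attachment:

scrapy crawl youtube -s JOBDIR=crawls/youtube-1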



JoeJoe

Mar 5, 2017, 12:29:27 AM
to scrapy-users
I have made the changes and it is indeed working. Thank you very much for your help. Much appreciated.