Rule in LinkExtractor CrawlerSpider


fabian wolfmann

Jul 14, 2016, 5:51:04 PM
to scrapy-users
Hi, I was scraping a page while reading the Learning Scrapy book.
My issue is with going to the next page: there is no link for it. The site just changes a page-number attribute and uses that in JavaScript to show other elements.
The page is:
http://www.fravega.com/tv-y-video/tv#3. At the bottom of the page there is a list of page numbers, and choosing one only changes the end of the URL, for example to /tv#2.

If anyone knows how to solve this, it would be great for me!
Thanks
Fabian

Travis Leleu

Jul 14, 2016, 6:09:55 PM
to scrapy-users
Fabian,

That's likely a JavaScript-triggered data request or page load.  It may be that the website loads all the data up front, and appending the anchor (#2) just swaps out what is displayed.  If that's the case, look for the element that holds all your data and you're set.

Otherwise, you'll need to reverse engineer the AJAX call that gets triggered to load that data.  The nice thing is that if this works, you don't really need to scrape/extract the page, since the JSON returned from the AJAX call will already have some structure to it.  (No need for HTML -> object, because the AJAX call will return JSON, which is easy to turn into an object.)
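
To illustrate the "JSON is easy to turn into an object" point, here is a minimal sketch using only the standard library. The response body and the field names ("products", "name", "price") are made up for the example; the real ones would have to be read from the actual AJAX response.

```python
import json

# Hypothetical response body, for illustration only. The real structure and
# field names must be taken from the site's actual AJAX response.
SAMPLE_BODY = (
    '{"products": ['
    '{"name": "TV 32", "price": 4999.0}, '
    '{"name": "TV 40", "price": 7999.0}]}'
)


def parse_products(body):
    """Turn a JSON response body into a list of plain dicts."""
    data = json.loads(body)
    # .get() with a default keeps the parser from failing on an empty page.
    return [
        {"name": p.get("name"), "price": p.get("price")}
        for p in data.get("products", [])
    ]


items = parse_products(SAMPLE_BODY)
```

In a Scrapy callback the same function could be fed response.text, and each dict yielded as an item.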




fabian wolfmann

Jul 19, 2016, 10:11:56 AM
to scrapy-users, m...@travisleleu.com
Thanks Travis,
I looked for an element with all the data and it isn't on the page.
Now, how can I reverse engineer the AJAX call and get the JSON from the page?
Do you know of any tool or any way to do it?

Thanks very much!!

Rolando Espinoza

Jul 19, 2016, 10:17:36 PM
to scrapy...@googlegroups.com
Here are a few StackOverflow answers that may help you:




In short, if the content is retrieved via an XHR request, you have to figure out how to reproduce that request in Scrapy (i.e., building the URL and payload manually) and read the output. Sometimes it's a nice JSON object, other times it's HTML text, and in extreme cases it's obfuscated output.
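
As a rough sketch of "building the URL manually": the request itself can usually be spotted in the Network tab of the browser's developer tools while clicking a page number. The endpoint and the parameter names below (category, page, size) are assumptions made up for the example, not the site's real API.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names, for illustration only. The real
# ones have to be copied from the request the browser actually sends.
AJAX_ENDPOINT = "http://www.example.com/ajax/products"


def build_page_url(page, category="tv", page_size=20):
    """Build the URL for one page of results."""
    query = urlencode({"category": category, "page": page, "size": page_size})
    return f"{AJAX_ENDPOINT}?{query}"


def page_urls(last_page):
    """Return the URLs for pages 1..last_page, in order."""
    return [build_page_url(n) for n in range(1, last_page + 1)]


urls = page_urls(3)
```

In a spider, each of these URLs could be yielded as scrapy.Request(url, callback=...), and the callback could decode response.text with json.loads when the output turns out to be JSON.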

If reverse engineering the JavaScript interactions is too much work, an alternative approach is to use Splash to get the JS-rendered content: https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash/
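
A minimal sketch of the Splash route, assuming a Splash instance is running locally on its default port (8050). It only builds the URL for Splash's render.html endpoint; whether the rendered HTML then shows the third page depends on how the site's script reads the #3 fragment, which is not verified here.

```python
from urllib.parse import urlencode

# Assumption: Splash is listening locally on its default port.
SPLASH_RENDER = "http://localhost:8050/render.html"


def splash_url(page_url, wait=1.0):
    """Build a render.html URL asking Splash to load and render page_url."""
    # urlencode percent-encodes the target URL, including its '#', so the
    # fragment is passed to Splash as part of the 'url' argument.
    return f"{SPLASH_RENDER}?{urlencode({'url': page_url, 'wait': wait})}"


rendered = splash_url("http://www.fravega.com/tv-y-video/tv#3")
```

The resulting URL can be requested from Scrapy like any other page. The scrapy-splash plugin's SplashRequest is another way to do the same thing without building the URL by hand.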

You can also do the latter with Selenium or PhantomJS.

Best,

Rolando