PhantomJS Downloader Middleware


David Fishburn
May 13, 2015, 3:32:57 PM
to scrapy...@googlegroups.com
I am new to Scrapy and Python.

I have a site I need to scrape, but it is all AJAX driven, so I will need something like PhantomJS to yield the final page rendering.

I have been searching in vain really for a simple example of a downloader middleware which uses PhantomJS.  It has been around long enough that I am sure someone has already written one.  I can find complete projects for Splash and others, but I am on Windows.

It doesn't need to be fancy, just take the Scrapy request and return the PhantomJS page (most likely using the WaitFor.js, which the PhantomJS dev team wrote, to only return the page after it has stopped making AJAX calls).

I am completely lost trying to get started.  The documentation (http://doc.scrapy.org/en/latest/topics/downloader-middleware.html) describes the APIs, but it doesn't give a basic application which I could begin modifying to plug in the PhantomJS calls I have shown below (which are very simple).

Anyone have something I can use?

This code does what I want when using the Scrapy shell:



>>> from selenium import webdriver
>>> driver = webdriver.PhantomJS()
>>> driver.set_window_size(1024, 768)
>>> driver.get(request.url)  # request is available in the Scrapy shell
# wait here for 30 seconds and let the AJAX calls finish
>>> driver.save_screenshot('screen.png')
>>> print driver.page_source
>>> driver.quit()


The screen shot contains a properly rendered browser.
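
Rather than sleeping a fixed 30 seconds, I assume Selenium's explicit waits could do the same job. A rough sketch (the URL and the 'results' element id are only placeholders for whatever the real page uses):

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.PhantomJS()
    driver.set_window_size(1024, 768)
    driver.get('http://example.com/ajax-page')  # placeholder URL

    # block until the AJAX-populated element appears (or 30 seconds pass)
    WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.ID, 'results'))  # 'results' is a made-up id
    )

    print driver.page_source
    driver.quit()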


Thanks for any advice you can give.
David



Travis Leleu
May 13, 2015, 3:55:15 PM
to scrapy...@googlegroups.com
Hi David,

Honestly, I have yet to find a good integration between Scrapy and a JS browser.  The current methods all seem to download the basic page via urllib3, then send that HTML off to be rendered and to fetch the other resources.

This causes a bottleneck -- the browser process, usually exposed via an API, takes a lot of CPU / time to render the page.  It also doesn't easily use proxies, which means that all subsequent requests will be from one IP address.

I think it would be a lot of work to build this into scrapy.

In my work, I tend to just write my own (scaled down) scraping engine that works more directly with a headless js browser.


José Ricardo
May 14, 2015, 12:24:08 AM
to scrapy...@googlegroups.com
Hi David, have you given ScrapyJS a try?


Besides rendering the page, it can also take screenshots :)

Regards,

José

David Fishburn
May 14, 2015, 9:57:32 AM
to scrapy...@googlegroups.com
Thanks for the response José.  

That integrates Splash as the JS renderer.  From the documentation I have read, it looks like Splash does not support Windows.

David

Joey Espinosa
May 14, 2015, 10:13:09 AM
to scrapy...@googlegroups.com
David,

I've written middleware to intercept a JS-specific request before it is processed. I haven't used WaitFor.js, so I can't help you there, but I can help get you started with PhantomJS.

    class JSMiddleware(BaseMiddleware):
        def process_request(self, request, spider):
            if request.meta.get('js'):  # you probably want a conditional trigger
                driver = webdriver.PhantomJS()
                driver.get(request.url)
                body = driver.page_source
                url = driver.current_url
                driver.quit()  # shut PhantomJS down so the processes don't pile up
                return HtmlResponse(url, body=body, encoding='utf-8', request=request)
            return

That's the simplest approach. You may eventually want to add options to the webdriver.PhantomJS() call, such as desired_capabilities with SSL handling options or a user agent string. You may also want to wrap the driver.get() call in a try/except block, and you should do something with the cookies that come back from PhantomJS via driver.get_cookies(). A sketch of those extras is below.
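
Roughly like this (the user agent value, the service_args flags, the logging and the 'phantomjs_cookies' meta key are all just examples to adapt, not anything canonical):

    import logging

    from scrapy.http import HtmlResponse
    from selenium import webdriver
    from selenium.common.exceptions import WebDriverException
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

    class JSMiddleware(object):
        def process_request(self, request, spider):
            if not request.meta.get('js'):
                return

            caps = DesiredCapabilities.PHANTOMJS.copy()
            # pretend to be a regular browser (example value only)
            caps['phantomjs.page.settings.userAgent'] = 'Mozilla/5.0 (Windows NT 6.1)'

            driver = webdriver.PhantomJS(
                desired_capabilities=caps,
                service_args=['--ignore-ssl-errors=true', '--ssl-protocol=any'],
            )
            try:
                driver.get(request.url)
                body = driver.page_source
                url = driver.current_url
                # stash the cookies somewhere useful if you need the session later
                request.meta['phantomjs_cookies'] = driver.get_cookies()
            except WebDriverException:
                logging.warning('PhantomJS failed on %s', request.url)
                return  # fall back to Scrapy's normal download path
            finally:
                driver.quit()

            return HtmlResponse(url, body=body, encoding='utf-8', request=request)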

Also, if you want every request to go through JS, you can remove the request.meta['js'] conditional. Otherwise, you could set that flag on the initial requests in a spider.make_requests_from_url override, or give the spider an instance method like spider.run_js(request) that looks at the request and decides whether it needs JS based on whatever criteria you come up with.
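
For example, tagging the start requests from the spider could look something like this (the spider name and URL are made up):

    from scrapy import Request, Spider

    class UI5Spider(Spider):
        name = 'ui5'
        start_urls = ['http://example.com/']

        def make_requests_from_url(self, url):
            # flag every start URL so JSMiddleware renders it with PhantomJS
            return Request(url, meta={'js': True}, dont_filter=True)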

There are a lot of options for you with PhantomJS, so it's really up to you, but this should be a decent starting point. I hope this answers your question.

--
Respectfully,

Joey Espinosa

Joey Espinosa
May 14, 2015, 10:16:21 AM
to scrapy...@googlegroups.com
Sorry, if you want the import for that example I just sent (I often hate it when tutorials leave that out), here you go:

    from scrapy.http import HtmlResponse
    
Also, I realize that you don't need to subclass BaseMiddleware for your middleware class. That's an artifact of my own boilerplate code, because I usually have a base middleware class that contains common things I want all middleware to have. You can just do this in the previous example (my bad):

    class JSMiddleware(object):

--
Respectfully,

Joey Espinosa

Joey Espinosa
May 14, 2015, 10:17:46 AM
to scrapy...@googlegroups.com
Crap, and obviously (I need more coffee):

    from selenium import webdriver

--
Respectfully,

Joey Espinosa

José Ricardo
May 14, 2015, 1:10:48 PM
to scrapy...@googlegroups.com
David, it seems there shouldn't be any problem running Splash from Docker on Windows :)

David Fishburn
May 14, 2015, 1:38:11 PM
to scrapy...@googlegroups.com
Thanks Joey.

Since I am new to Python and Scrapy, if I run:

d:\python27\scripts\scrapy startproject ui5 

Which files do I have to create (and in which directory should I put the code you supplied), and which files do I have to modify (e.g. settings.py) so that it is called automatically from the outset?

I hope that is not too much to ask.

Thanks,
David

David Fishburn
May 14, 2015, 1:59:28 PM
to scrapy...@googlegroups.com, ro...@josericardo.eti.br
Thanks again José.

I did some more Googling around.  Didn't know what Docker was, but found it here:

The Splash instructions I saw always referenced Linux paths, and I didn't find any Windows references.

So Docker is essentially a virtual machine: if Docker runs on your platform (and it does on Windows), it will be able to run the Python / Scrapy / Splash code.

Thank you.
David

Joey Espinosa
May 14, 2015, 4:30:21 PM
to scrapy...@googlegroups.com
Inside your project directory (in this case, it looks like "ui5"), create a middleware directory to hold all of your middleware modules, and give it an empty __init__.py so Python treats it as a package. Then create a file for this purpose and name it something relevant (like ui5\middleware\javascript.py). Now you have your middleware module; the resulting layout is sketched below.
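
With the default startproject layout, that ends up looking roughly like this (only the relevant files shown):

    ui5/
        scrapy.cfg
        ui5/
            __init__.py
            settings.py
            spiders/
                __init__.py
            middleware/
                __init__.py
                javascript.py    # JSMiddleware goes here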

Next, you have to let Scrapy know about this middleware. On to settings.py!

    DOWNLOADER_MIDDLEWARES = {
        'ui5.middleware.javascript.JSMiddleware': 99
    }

That's assuming, of course, that you use all the same naming that I suggested in the earlier examples. Change it as appropriate. Also, since you mentioned you're new to Python: that string inside DOWNLOADER_MIDDLEWARES needs to resolve to a class that Python can "see", so "ui5" has to be importable (i.e. on the PYTHONPATH). Just a heads up in case you try this and get an error like "No module named 'ui5'."

That should be it.

A side note about Docker (since you mentioned it in another response)... Windows doesn't have the necessary features to support what Docker is actually doing, so in order to get Docker working on Windows, it actually creates a lightweight Linux VM and runs Docker within it. So when you run processes within Docker on Windows, you're running an abstraction within an abstraction within an abstraction. Not tremendously efficient simply to make up for a lack of knowledge regarding Scrapy.

If you're intent on making use of a project only supported on Linux, I'd rather suggest a VM with Ubuntu, since you'd already be part of the way there with attempting Docker anyway. I don't use Windows at all, but one of my colleagues does, and he set up a cheap Ubuntu computer and uses NX to connect to it from Windows and do development within Docker.

I'm not knocking Docker (I use it myself quite heavily), but I'm just cautioning against throwing too many I-need-to-learn-this-from-scratch things at yourself all at once. You'll be overwhelmed. Just my two cents.

sara
Aug 20, 2015, 9:53:02 AM
to scrapy-users
Hi David, 

I just implemented the approach you mentioned, as I have a similar requirement (I need to scrape dynamic data as well), and I am using the driver's execute_script() method to do some processing. Scrapy is set up to keep crawling the extracted links until the CLOSESPIDER_TIMEOUT limit.
This works just fine, but sometimes the driver does not respond (the middleware gets stuck at random). I am not sure whether the driver has issues or whether Scrapy becomes unable to handle the requests/responses being passed through and processed in the middleware.
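
One thing I still have to try is giving the driver explicit timeouts so a hung page load cannot block the middleware forever; roughly (the 30-second value is arbitrary):

    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException

    driver = webdriver.PhantomJS()
    driver.set_page_load_timeout(30)  # give up on pages that never finish loading

    try:
        driver.get(url)  # url comes from the Scrapy request
        title = driver.execute_script('return document.title;')
    except TimeoutException:
        driver.quit()  # make sure the PhantomJS process doesn't linger
        raise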
Here is the setting of the spider:

DOWNLOADER_MIDDLEWARES = {
    'keywords.phantomMiddleware.PhantomJsMiddleware': 99
}

# AutoThrottle settings
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 3
AUTOTHROTTLE_DEBUG = True
AUTOTHROTTLE_MAX_DELAY = 60
CLOSESPIDER_TIMEOUT = 21600

Any thoughts?