scrapy xpath not working.

Gaurang shah

Dec 29, 2014, 7:38:24 AM
to scrapy...@googlegroups.com

Hi Guys, 

I am trying to scrape the YouTube site, and somehow the XPath which fetches the video src is not working in Scrapy.



The following XPaths are not working:
//video 
//video[contains(@class,'html5-main-video')]/@src

I am able to retrieve nodes up to //div[@id='player-api'], but after that it's a dead end: Scrapy is not able to find any more nodes inside it, even though there are nodes inside it in the browser.

bruce

unread,
Dec 29, 2014, 11:11:09 AM12/29/14
to scrapy-users
Are you able to effectively create an XPath using your browser's XPath/dev tools?

In Firefox you can use the DOM Inspector; there are others as well, not sure of your browser.

In other words, is the issue with the "video" element, or with something else in your XPath?

If you can resolve the XPath with a separate tool, that should give you direction to solve the issue.
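
One quick way to make that comparison (a sketch, using the standard scrapy shell helpers; the URL is a placeholder):

scrapy shell "https://www.youtube.com/watch?v=<video-id>"
>>> response.xpath("//video")   # test the selector against what Scrapy actually downloaded
>>> view(response)              # open the downloaded HTML in your browser and compare with the live page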




Gaurang shah

Dec 29, 2014, 11:32:27 AM
to scrapy...@googlegroups.com
Sorry guys, forgot to mention: all these XPaths are able to identify the element using the FirePath add-on for Firefox.

//video 
//video[contains(@class,'html5-main-video')]/@src
//div[@class='html5-video-container']/video/@src
//div[@id='movie_player']/div[1]/video/@src
//div[@id='player-api']/div[1]/div[1]/video/@src

However, none of them works in Scrapy?!
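
For reference, a quick way to test all of these expressions against what Scrapy sees (a sketch; the URL is a placeholder):

scrapy shell "https://www.youtube.com/watch?v=<video-id>"
>>> for xp in ["//video",
...            "//video[contains(@class,'html5-main-video')]/@src",
...            "//div[@class='html5-video-container']/video/@src"]:
...     print("%s -> %r" % (xp, response.xpath(xp).extract()))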

Gaurang Shah
Blog: qtp-help.blogspot.com
Mobile: +91 738756556


bruce

Dec 29, 2014, 2:54:07 PM
to scrapy-users
Hey Gaurang,

What OS, Python version, and Scrapy version are you using?

Does Scrapy use urllib? Or better, if you know, what lib does Scrapy use for the URL/XPath processing?


Gaurang shah

Dec 30, 2014, 12:49:51 AM
to scrapy...@googlegroups.com
Following are the details:
OS: Windows 7, 64-bit
Python 2.7
Scrapy 0.25.1

I don't understand the last question. I am using the Selector provided by Scrapy to get the nodes using XPath. Following is the code:

selector = Selector(response)
# this one works: the view count is in the static HTML
view_count = selector.xpath("//div[@class='watch-view-count']/text()")[0].extract().strip()
# this one returns an empty list
video_url = selector.xpath("//video[contains(@class,'html5-main-video')]/@src").extract()
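
(As an aside: if I remember correctly, since Scrapy 0.24 the response exposes the selector directly, so the explicit Selector wrapper shouldn't be needed:

video_url = response.xpath("//video[contains(@class,'html5-main-video')]/@src").extract()
)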


Gaurang Shah
Blog: qtp-help.blogspot.com
Mobile: +91 738756556

Paul Tremberth

Dec 30, 2014, 9:30:30 AM
to scrapy...@googlegroups.com
YouTube pages rely on Javascript to create the <video> element,
and your browser's XPath tool works because it operates on the rendered page, after Javascript has done its work.

Scrapy itself does not interpret Javascript; it's not a browser,
so it can only work on what's inside the HTML source code when the web page is fetched.

You can see, for example, that the element with ID "player-api", which contains "movie_player" in your screenshot,
is empty in the source code:

 <div id="player-api" class="player-width player-height off-screen-target player-api"></div>
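
You can confirm this from a scrapy shell session (a minimal check, my own sketch):

>>> response.xpath("//div[@id='player-api']/node()").extract()
[]   # the div really is empty in the fetched HTML
>>> response.xpath("//video")
[]   # and there is no <video> element at all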

What you can also see is that this #player-api element is followed by <script> elements.
And while it is not straightforward to read what this Javascript code is about,
you can use js2xml (disclaimer: I wrote and maintain js2xml).

Below is an example of using js2xml in a scrapy shell session:
it parses the Javascript statements from the <script> elements in #player, and then extracts dicts.
There's an "args" key in the main script, which itself contains a url_encoded_fmt_stream_map key with some URLs for the video you may be after.

I'm using urlparse to decode what looks like a query string.

(the full scrapy shell session is https://gist.github.com/redapple/8269818915cc2c337dc2)

2014-12-30 15:18:09+0100 [default] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=1EFnX1UkXVU> (referer: None)

In [1]: import js2xml
In [2]: import urlparse
In [3]: import pprint

In [4]: for script in response.css('#player script').xpath('string()').extract():
   ...:     jstree = js2xml.parse(script)
   ...:     data = js2xml.jsonlike.getall(jstree)
   ...:     for d in data:
   ...:         pprint.pprint(d)
   ...:
{}
{'args': {'account_playback_token': 'QUFFLUhqa0sweExRZno5OHZEaGcwWVVQaXAxVWh0NUNFZ3xBQ3Jtc0tseE9DRUw3cFVRbkFGN1hub2VmQlNERGl3WjFIQV84aTI0b0lxZnhwdDZKRl96N1g5eWN3dkZER1pFbVM4dS1FeWJoc1FJeTBXdS0tbU5LY1NsWngtSHY1R0hoTl9xdy1iWUNoam1nRFM2czEweVdMNA==',
          'adaptive_fmts': 'size=1280x720&clen=51269588&fps=15&itag=136&init=0-709...bitrate=80798',
          'allow_embed': '1',
          'allow_ratings': '1',
          'atc': 'a=3&b=nhjwMM7ySu8wj8OhutnokFK8Dvs&c=1419949090&d=1&e=1EFnX1UkXVU&c3a=28&c1a=1&hh=hKbH2J9f2WwblpFs2hvo0H17oZo',
          'author': 'Michael Herman',
          'avg_rating': '4.948387146',
          'c': 'WEB',
          'cc3_module': '1',
          'cc_asr': '1',
          'cc_font': 'Arial Unicode MS, arial, verdana, _sans',
          'cc_load_policy': '2',
          'cl': '82697338',
          'cr': 'FR',
          'csi_page_type': 'watch,watch7',
          'dash': '1',
          'dashmpd': 'http://manifest.googlevideo.com/api/...',
          'enablecsi': '1',
          'enablejsapi': 1,
          'eventid': 'IrSiVP-kC4v4cKrwgRg',
          'fexp': '900718,927622,931342,932404,938809,9405699,9406022,940927,940940,941004,943917,947209,947218,948124,952302,952605,952901,955110,955301,957103,957105,957201',
          'fmt_list': '22/1280x720/9/0/115,43/640x360/99/0/0,18/640x360/9/0/115,5/426x240/7/0/0,36/426x240/99/1/0,17/256x144/99/1/0',
          'hl': 'en_US',
          'host_language': 'en',
          'idpj': '-6',
          'iv3_module': '1',
          'iv_load_policy': '1',
          'keywords': 'Scrapy,Python,scraping,python scrapy,web scraping',
          'ldpj': '-25',
          'length_seconds': '717',
          'loaderUrl': 'https://www.youtube.com/watch?v=1EFnX1UkXVU',
          'no_get_video_log': '1',
          'of': 'lNeUuIm8BRrYa4UFYW3Vbw',
          'plid': 'AAULb6kfjbEHoNwt',
          'pltype': 'contentugc',
          'ptk': 'youtube_none',
          'ssl': '1',
          't': '1',
          'thumbnail_url': 'https://i.ytimg.com/vi/1EFnX1UkXVU/default.jpg',
          'timestamp': '1419949090',
          'title': 'Scraping Web Pages with Scrapy',
          'tmi': '1',
          'token': '1',
          'ttsurl': 'https://www.youtube.com/api/timedtext?...',
          'ucid': 'UCt7yOnL7bI7yCa1Xe_GTjJQ',
          'url_encoded_fmt_stream_map': 'fallback_host=tc.v18.cache4.googlevideo.com&quality=hd720...',
          'video_id': '1EFnX1UkXVU',
          'view_count': '52035',
          'vq': 'auto',
          ...},
 'attrs': {'id': 'movie_player'},
 'html': '/html5_player_template',
 'html5': False,
 'messages': {'player_fallback': ['Adobe Flash Player or an HTML5 supported browser is required for video playback.<br><a href="http://get.adobe.com/flashplayer/">Get the latest Flash Player </a><br><a href="/html5">Learn more about upgrading to an HTML5 browser</a>']},
 'min_version': '8.0.0',
 'params': {'allowfullscreen': 'true',
            'allowscriptaccess': 'always',
            'bgcolor': '#000000'},
 'sts': 16427,
 ...}
[]

In [5]: for script in response.css('#player script').xpath('string()').extract():
   ...:     jstree = js2xml.parse(script)
   ...:     data = js2xml.jsonlike.getall(jstree)
   ...:     for d in data:
   ...:         try:
   ...:             if d:
   ...:                 pprint.pprint(urlparse.parse_qsl(d.get("args", {}).get("url_encoded_fmt_stream_map", "")))
   ...:         except AttributeError:
   ...:             # some extracted items are lists, not dicts; skip those
   ...:             pass
   ...:
[('fallback_host', 'tc.v18.cache4.googlevideo.com'),
 ('quality', 'hd720'),
 ('itag', '22'),
 ('type', 'video/mp4; codecs="avc1.64001F, mp4a.40.2"'),
 ('url', ...),
 ('quality', 'medium'),
 ('itag', '43'),
 ('type', 'video/webm; codecs="vp8.0, vorbis"'),
 ('url', ...),
 ('quality', 'medium'),
 ('itag', '18'),
 ('type', 'video/mp4; codecs="avc1.42001E, mp4a.40.2"'),
 ('url', ...),
 ('quality', 'small'),
 ('itag', '5'),
 ('type', 'video/x-flv'),
 ('url', ...),
 ('quality', 'small'),
 ('itag', '36'),
 ('type', 'video/3gpp; codecs="mp4v.20.3, mp4a.40.2"'),
 ('url', ...),
 ('quality', 'small'),
 ('itag', '17'),
 ('type', 'video/3gpp; codecs="mp4v.20.3, mp4a.40.2"'),
 ('url', ...)]



bruce

Dec 30, 2014, 2:17:50 PM
to scrapy-users
Hey Paul,

Good catch, I totally missed/forgot to ask whether the source displayed the "video" element, or just how he got the "source" he listed.

By the way, the app you mentioned for the js-to-xml conversion: I'm assuming it only works where it can really rip apart existing dicts to extract certain data.

It's not really a "jscript parser/compiler" for Python, is it?!

That would be wishful thinking!!

I'm looking at different dynamic sites that require a combination of straight static parsing as well as dynamic CasperJS parsing.

Thanks

Paul Tremberth

Dec 30, 2014, 2:28:17 PM
to scrapy...@googlegroups.com
js2xml is based on slimit (https://github.com/rspivak/slimit), so it does parse Javascript code (but does not compile it).
It's on PyPI (https://pypi.python.org/pypi/js2xml), so you can pip install js2xml.

The js2xml.jsonlike module contains methods that proved useful for quickly getting things that can be represented as dicts in Python (also arrays, strings, etc.) when they are used as init values for variables or as function arguments.

But in fact js2xml.parse() builds an (lxml) parse tree of the code, so you can use XPath to dig into the Javascript source.
You can, for example, get the arguments for a specific function by its name.

Here are a few examples of what you get:
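
For instance, parsing a small made-up snippet (a minimal sketch; the exact XML shape may differ from what the comments show):

import js2xml

jscode = "var config = {'id': 'movie_player', 'sts': 16427};"
tree = js2xml.parse(jscode)

# the parse tree is an lxml document, so you can print it
# or query it with tree.xpath(...)
print(js2xml.pretty_print(tree))

# pull out anything dict/list-like used as an init value or function argument
print(js2xml.jsonlike.getall(tree))
# e.g. [{'id': 'movie_player', 'sts': 16427}]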

I find it so much easier than regexps (but the performance can be improved).
I should really publish docs to readthedocs.
