Scrapy shell returns empty list!?

970 views

DataScience

Mar 16, 2015, 11:14:20 AM
To: scrapy...@googlegroups.com
Hi Scrapy Guys,

Scrapy returns an empty list when I use the shell to pick a simple "title" field from this web page: http://goo.gl/dBR8P4
I've used:
  • sel.xpath('//div[@id="results-rail"]/ul[@class="jobs"]/li[1]/div[@class="content"]/span/a[@class="title"]/text()').extract()
  • sel.xpath('html/body/div[3]/div/div[2]/div[2]/div[1]/ul/li[1]/div/span/a').extract()
  • ...
I checked the POST/XHR traffic with Firebug, and I don't think the content I want is generated by JS code (what do you think?).

Can you please help me figure out this problem?
Thank you in advance.

Best Regards,
K.

Travis Leleu

Mar 16, 2015, 11:26:41 AM
To: scrapy...@googlegroups.com
LinkedIn can be a tough site to scrape, as they generally don't want their data in other people's hands. You will need to use a user-agent switcher (you don't mention what UA you are sending), and will most likely need a proxy as well.
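A user-agent override can be set right in Scrapy's settings; a minimal sketch, where the UA string and delay value below are just illustrative assumptions, not recommendations:

```python
# settings.py sketch -- send a browser-like User-Agent instead of Scrapy's default.
# The exact UA string here is illustrative only.
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; rv:36.0) Gecko/20100101 Firefox/36.0"

# Slowing requests down also helps avoid being flagged.
DOWNLOAD_DELAY = 2

# A proxy is usually set per-request via request.meta['proxy'] in the spider
# (handled by Scrapy's built-in HttpProxyMiddleware); not shown here.
```

For a one-off test, the same override works from the command line: scrapy shell -s USER_AGENT="..." <url>.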

If you are looking to scrape the entirety of LinkedIn, it's over 30 million profiles. I've found it more economical to purchase a LinkedIn data dump from scrapinghub.com than to scrape it myself.

--
You received this message because you are subscribed to the Google Groups "scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scrapy-users...@googlegroups.com.
To post to this group, send email to scrapy...@googlegroups.com.
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.

DataScience

Mar 16, 2015, 12:02:23 PM
To: scrapy...@googlegroups.com, m...@travisleleu.com
Thank you, Travis, for your quick feedback.

I am testing Scrapy on this specific web page and trying to get the job offers (not profiles).
I read in some forums that it may be because the site uses JavaScript to build most of the page, so the elements I want would not appear in the HTML source. I checked by disabling JavaScript and reloading the page, but the results were still displayed (I also checked the network tab in Firebug, filtered on XHR, and looked into the POST requests... nothing).

Any help would be more than welcome.
Thank you.

Travis Leleu

Mar 16, 2015, 12:15:14 PM
To: scrapy...@googlegroups.com
It doesn't look to me like it's writing the HTML to the DOM with JS, as you noted.

The big concern I have is that you are assuming the HTML content in your browser is the same as in your code.  How have you asserted this?
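One way to run that check, as an offline sketch: dump the body Scrapy received to a file and compare it with the browser's "view source" (inside scrapy shell, the built-in view(response) helper does much the same, and the real bytes would come from response.body rather than the placeholder used here):

```python
# Sketch: persist what Scrapy actually fetched so it can be compared with
# the live page. `body` stands in for response.body so this runs offline.
body = b"<html><body><ul class='jobs'></ul></body></html>"

with open("fetched.html", "wb") as f:
    f.write(body)

# Re-read to confirm the dump round-trips intact before diffing it by hand.
with open("fetched.html", "rb") as f:
    saved = f.read()

print(saved == body)
```

Opening fetched.html in a browser makes any mismatch with the live page obvious immediately.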

DataScience

Mar 16, 2015, 12:19:01 PM
To: scrapy...@googlegroups.com, m...@travisleleu.com
Actually, I've checked response.body, and it doesn't match the content I see on the web page.
I am really confused. What can I do in this case?

Morad Edwar

Mar 17, 2015, 6:34:56 AM
To: scrapy...@googlegroups.com, m...@travisleleu.com
I used scrapy shell and your XPath worked fine!
And when I changed 'li[1]' to 'li', it scraped all the job titles.

Kais DAI

Mar 17, 2015, 7:13:15 AM
To: scrapy...@googlegroups.com, Travis Leleu, mo...@bkam.com
This is what I did:
  1. I opened the command line in Windows and ran the following command: scrapy shell https://www.linkedin.com/job/jobs-in-san-francisco-ca/?page_num=1&trk=jserp_pagination_1
  2. Then I ran this command: sel.xpath('//div[@id="results-rail"]/ul[@class="jobs"]/li[1]/div[@class="content"]/span/a[@class="title"]/text()').extract()
     In this case, an empty list [] is returned. The same thing happens with this XPath selection: sel.xpath('html/body/div[3]/div/div[2]/div[2]/div[1]/ul/li[1]/div/span/a').extract()
Did you obtain a result by following the same steps?
Thank you for your help.

Regards,
K.


Morad Edwar

Mar 17, 2015, 7:42:07 AM
To: Kais DAI, scrapy...@googlegroups.com, Travis Leleu

Please do it again, but after step one run the following code:
    print response.url
and give us the output.

Morad Edwar,
Software Developer | Bkam.com

Kais DAI

Mar 17, 2015, 7:52:20 AM
To: Morad Edwar, scrapy...@googlegroups.com, Travis Leleu

Morad Edwar

Mar 17, 2015, 7:59:19 AM
To: scrapy...@googlegroups.com, mo...@bkam.com, m...@travisleleu.com
Do you see the difference?
scrapy shell didn't receive the full URL because of the special characters in it. Try the following:
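What presumably happened, as a runnable sketch: cmd.exe (like POSIX shells) treats an unquoted & as a command separator, so scrapy shell only ever saw the URL up to the first &. Quoting the whole URL on the command line avoids this:

```python
# The URL from step 1, exactly as typed on the command line.
url = "https://www.linkedin.com/job/jobs-in-san-francisco-ca/?page_num=1&trk=jserp_pagination_1"

# Unquoted, the command interpreter splits the line at "&",
# so scrapy shell only receives this first part:
received = url.split("&", 1)[0]
print(received)  # https://www.linkedin.com/job/jobs-in-san-francisco-ca/?page_num=1

# The fix is to quote the argument:  scrapy shell "<full url>"
```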

Kais DAI

Mar 17, 2015, 8:54:21 AM
To: scrapy...@googlegroups.com, mo...@bkam.com, Travis Leleu
Yes, I saw the difference. I changed the URL to the one you suggested, then to another one (https://www.linkedin.com/job/all-jobs/?sort=date). I get the same output when I run print response.url, but sel.xpath still returns an empty list.
Please find a screenshot explaining the procedure I followed right here:

Regards, 
K.

Morad Edwar

Mar 17, 2015, 9:15:48 AM
To: scrapy...@googlegroups.com, mo...@bkam.com, m...@travisleleu.com
It's the same problem. Try response.url and you will see that it's another link because of the special characters.

Kais DAI

Mar 17, 2015, 9:50:21 AM
To: scrapy...@googlegroups.com, Morad Edwar, Travis Leleu
I've noticed two things:
  1. The presence of the 999 in the response output: <999 https://linkedin.com/job/jobs-in-san-fransisco-ca/page_num=1>
  2. Among the available Scrapy objects, the sel object has no XPath; it is shown as follows: <Selector xpath=None data=u'<html><head>\n<script type="text/javascri'>
Given this information, how can I address the problem?
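On the 999 status: that is LinkedIn's anti-bot/rate-limit response rather than a real page, which is why the selector has almost nothing to work with. In a spider, Scrapy's HttpError middleware filters out non-2xx responses by default; if you want your callback to at least see the 999 body, you can whitelist the code. A settings sketch (this only lets you inspect the block page; getting real content back usually takes a browser-like User-Agent, delays, and often proxies, as discussed above):

```python
# settings.py sketch -- let 999 responses reach the spider callback
# instead of being dropped by the HttpError middleware.
HTTPERROR_ALLOWED_CODES = [999]
```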

Regards,
K.