Table doesn't exist on shell or spider crawling - .asp website


ajrpc

Apr 7, 2014, 8:45:34 PM4/7/14
to scrapy...@googlegroups.com
Hello,

I'm trying to crawl website.com/places.asp?id=100

In the browser I can reach the <table> that holds the information with the XPath '//p[@class="txt"]/table/...' using the FirePath extension for Firebug.

But when I select it in the scrapy shell, or when the spider crawls, the <table> after the first p[@class="txt"] simply doesn't exist, as if it had never been created.



from scrapy.spider import BaseSpider
from scrapy.selector import Selector


class MySpider(BaseSpider):
    name = 'xpto'
    allowed_domains = ['website.com']
    start_urls = [
        'http://www.website.com/places.asp?id=100',
    ]

    def parse(self, response):
        sel = Selector(response)
        places = sel.xpath('//p[@class="txt"]/table//td[@class="txtm"]/a/@href').extract()
        for place in places:
            print place



I thought the table was created by an AJAX call, but there isn't one. Then I tried getting the HTML page with:

def parse(self, response):
        open('test.html', 'wb').write(response.body)

And the table exists!

How can I get it into the Selector?

Maybe an ASP thing?
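For what it's worth, one common cause of exactly this symptom (it may or may not be what this particular page does) is that HTML parsers repair invalid nesting: a <p> cannot legally contain a <table>, so lxml, which Scrapy's Selector is built on, can move the table out of the <p>, and the browser-derived path then matches nothing. A self-contained sketch with made-up markup:

```python
from lxml import html

# Made-up markup mimicking the structure described above;
# the real page's markup may well differ.
RAW = '''<html><body>
<p class="txt">intro
<table><tr><td class="txtm"><a href="/places.asp?id=1">one</a></td></tr></table>
</p>
</body></html>'''

doc = html.document_fromstring(RAW)

# The strict browser-derived path may match nothing here, because the
# parser is allowed to move the <table> out of the <p> while repairing
# the invalid nesting (a <p> cannot legally contain a <table>):
strict = doc.xpath('//p[@class="txt"]/table//td[@class="txtm"]/a/@href')

# A path that does not care where the table ended up still finds the link:
loose = doc.xpath('//td[@class="txtm"]/a/@href')
print(strict, loose)
```

Comparing the two results against the saved test.html shows whether the parser, rather than the site, is what moved the table.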

Bill Ebeling

Apr 8, 2014, 8:23:41 AM4/8/14
to scrapy...@googlegroups.com
Can you post the link to the actual page?

Without more information, any suggestions would just be guessing.  If you can't, I'd recommend loading the page in scrapy shell and trying to figure it out that way.

André Campos

Apr 8, 2014, 8:37:44 AM4/8/14
to scrapy...@googlegroups.com
Ok


You can see in the browser that '//p[@class="txtmedioazulb"]//td[@class="txtmedio"]/a' is available, but not in the shell or when the spider crawls.

I've tried loading it in the shell, and sel.xpath(..) doesn't retrieve anything. How can I use the shell to figure it out?

Thanks!
 

--
You received this message because you are subscribed to a topic in the Google Groups "scrapy-users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/scrapy-users/rvq9fGDPRWI/unsubscribe.
To unsubscribe from this group and all its topics, send an email to scrapy-users...@googlegroups.com.
To post to this group, send email to scrapy...@googlegroups.com.
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.

Bill Ebeling

Apr 8, 2014, 8:55:52 AM4/8/14
to scrapy...@googlegroups.com
I didn't drill all the way into the problem, so I don't know what reindeer games are being played, but I did have some luck with this:

//a[@target='_top']

I'll try to figure it out when I get some time later on.

Bill Ebeling

Apr 8, 2014, 9:02:27 AM4/8/14
to scrapy...@googlegroups.com
Also, this is how I got there using the scrapy shell.

command list:

'Abegoaria de Baixo' in response.body  # returned True, so I knew there was at least a mention somewhere in the page
hxs.select("//*[contains(.,'Abegoaria de Baixo')]").extract() # an attempt to get lucky on the first try...  didn't work out, produced too much
hxs.select("//a[contains(.,'Abegoaria de Baixo')]").extract() # my next attempt to get lucky, this one worked out
hxs.select("//a[@target='_top']").extract() # this produced a list of links about the same length as the list your xpath produced on the same page, so I figured it was about right


Hope that helps, too

André Campos

Apr 8, 2014, 10:41:14 AM4/8/14
to scrapy...@googlegroups.com
Thank you Bill !

I'm trying to get that list of links. If you look at the page in the browser, they're within a table inside <p class="txtmedioazulb">, right after "Lista completa das localidades (em <b>negrito</b>, com informações)".

That's the only way I figured out to specify that group of links. I've used your method to identify all <a target="_top"> elements, but the result includes some links from the rest of the page that are not in that table.

I can't understand why in the response they're not in a table inside that <p>, as in the browser. If I write the response to an .html file, the browser builds that table just fine.
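Since the parsed tree may not keep the table inside that <p>, one way to scope the link list is to anchor on something unique to the table's own cells instead of the <p>/<table> nesting. A rough, self-contained sketch: the class names are taken from this thread, everything else is invented markup.

```python
from lxml import html

# Invented markup: a listing table after the "Lista completa" paragraph,
# plus an unrelated target="_top" link elsewhere on the page (e.g. an ad).
RAW = '''<html><body>
<p class="txtmedioazulb">Lista completa das localidades</p>
<table width="25%"><tr>
<td class="txtmedio"><a target="_top" href="/places.asp?id=1">Abegoaria de Baixo</a></td>
</tr></table>
<div><a target="_top" href="/somewhere-else">ad link</a></div>
</body></html>'''

doc = html.document_fromstring(RAW)

# Matching on target="_top" alone also picks up links outside the table:
everything = doc.xpath('//a[@target="_top"]/@href')

# Anchoring on the cell class that only the listing table uses avoids
# depending on the <p>/<table> nesting the parser may have rewritten:
listing = doc.xpath('//td[@class="txtmedio"]/a[@target="_top"]/@href')
print(listing)
```

If td[@class="txtmedio"] only occurs in that table on the real page, the same expression works unchanged in a Scrapy Selector.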

ajrpc

Apr 9, 2014, 9:58:31 AM4/9/14
to scrapy...@googlegroups.com
Any ideas?


Bill Ebeling

Apr 9, 2014, 12:05:41 PM4/9/14
to scrapy...@googlegroups.com
I do.  I've attached an extremely simple spider that crawls those links.  Hopefully the code will answer your questions, if not, feel free to ask any more you may have.

As for why that particular xpath works on the page and not in the scrapy shell, my guess is that the data is loaded with the page itself, so no AJAX, and then some JS does something to the DOM. There are a lot of ads on those pages, so I wouldn't be surprised.
portugal.py

André Campos

Apr 10, 2014, 12:30:09 PM4/10/14
to scrapy...@googlegroups.com
Hi Bill 

Thank you very much! Very clever, the width="25%" anchor.

I think that's what's going on: some JS messing with the DOM after the content loaded.

Well, you've solved my problem. Thank you again!

