Table doesn't exist on shell or spider crawling - .asp website


ajrpc

Apr 7, 2014, 8:45:34 PM4/7/14
to scrapy...@googlegroups.com
Hello,

I'm trying to crawl website.com/places.asp?id=100

In the browser I can reach the <table> that holds the information with the XPath '//p[@class="txt"]/table/...' using the FirePath extension for Firebug.

But when I select it in the scrapy shell, or when the spider crawls, the <table> after the first p[@class="txt"] simply doesn't exist, as if it had never been created.



from scrapy.spider import BaseSpider
from scrapy.selector import Selector


class MySpider(BaseSpider):
    name = 'xpto'
    allowed_domains = ['website.com']
    start_urls = [
        'http://www.website.com/places.asp?id=100',
    ]

    def parse(self, response):
        sel = Selector(response)
        places = sel.xpath('//p[@class="txt"]/table//td[@class="txtm"]/a/@href').extract()
        for place in places:
            print place



I thought the table was created by an AJAX call, but there isn't one. Then I tried getting the HTML page with:

def parse(self, response):
        open('test.html', 'wb').write(response.body)

And the table exists!

How can I get it into the Selector?

Maybe an ASP thing?
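For what it's worth, one common cause of exactly this symptom (it may or may not be what this particular page does) is that HTML parsers repair invalid nesting: a <p> cannot legally contain a <table>, so lxml, which Scrapy's Selector is built on, can move the table out of the <p>, and the browser-derived path then matches nothing. A self-contained sketch with made-up markup:

```python
from lxml import html

# Made-up markup mimicking the structure described above;
# the real page's markup may well differ.
RAW = '''<html><body>
<p class="txt">intro
<table><tr><td class="txtm"><a href="/places.asp?id=1">one</a></td></tr></table>
</p>
</body></html>'''

doc = html.document_fromstring(RAW)

# The strict browser-derived path may match nothing here, because the
# parser is allowed to move the <table> out of the <p> while repairing
# the invalid nesting (a <p> cannot legally contain a <table>):
strict = doc.xpath('//p[@class="txt"]/table//td[@class="txtm"]/a/@href')

# A path that does not care where the table ended up still finds the link:
loose = doc.xpath('//td[@class="txtm"]/a/@href')
print(strict, loose)
```

Comparing the two results against the saved test.html shows whether the parser, rather than the site, is what moved the table.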

Bill Ebeling

Apr 8, 2014, 8:23:41 AM4/8/14
to scrapy...@googlegroups.com
Can you post the link to the actual page?

Without more information, any suggestions would just be guessing.  If you can't, I'd recommend loading the page in scrapy shell and trying to figure it out that way.

André Campos

Apr 8, 2014, 8:37:44 AM4/8/14
to scrapy...@googlegroups.com
Ok


You can see in the browser that '//p[@class="txtmedioazulb"]//td[@class="txtmedio"]/a' is available, but not in the shell or when the spider crawls.

I've tried loading it in the shell, and sel.xpath(..) doesn't retrieve anything. How can I use the shell to figure it out?

Thanks!
 

--
You received this message because you are subscribed to a topic in the Google Groups "scrapy-users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/scrapy-users/rvq9fGDPRWI/unsubscribe.
To unsubscribe from this group and all its topics, send an email to scrapy-users...@googlegroups.com.
To post to this group, send email to scrapy...@googlegroups.com.
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.

Bill Ebeling

Apr 8, 2014, 8:55:52 AM4/8/14
to scrapy...@googlegroups.com
I didn't drill all the way into the problem, so I don't know what reindeer games are being played, but I did have some luck with this:

//a[@target='_top']

I'll try to figure it out when I get some time later on.

Bill Ebeling

Apr 8, 2014, 9:02:27 AM4/8/14
to scrapy...@googlegroups.com
Also, this is how I got there using the scrapy shell.

command list:

'Abegoaria de Baixo' in response.body  # returned True, so I knew there was at least a mention somewhere in the page
hxs.select("//*[contains(.,'Abegoaria de Baixo')]").extract() # an attempt to get lucky on the first try...  didn't work out, produced too much
hxs.select("//a[contains(.,'Abegoaria de Baixo')]").extract() # my next attempt to get lucky, this one worked out
hxs.select("//a[@target='_top']").extract() # this produced a list of links about the same length as the list your xpath produced on the same page, so I figured it was about right


Hope that helps, too

André Campos

Apr 8, 2014, 10:41:14 AM4/8/14
to scrapy...@googlegroups.com
Thank you Bill !

I'm trying to get that list of links. If you look at the page in the browser, they're within a table inside <p class="txtmedioazulb">, right after "Lista completa das localidades (em <b>negrito</b>, com informações)".

That's the only way I figured out to specify that group of links. I've used your method to identify all <a target="_top"> elements, but the result includes some links from the rest of the page that are not in that table.

I can't understand why in the response they're not in a table inside that <p>, as in the browser. If I write the response to an .html file, the browser builds that table just fine.
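Since the parsed tree may not keep the table inside that <p>, one way to scope the link list is to anchor on something unique to the table's own cells instead of the <p>/<table> nesting. A rough, self-contained sketch: the class names are taken from this thread, everything else is invented markup.

```python
from lxml import html

# Invented markup: a listing table after the "Lista completa" paragraph,
# plus an unrelated target="_top" link elsewhere on the page (e.g. an ad).
RAW = '''<html><body>
<p class="txtmedioazulb">Lista completa das localidades</p>
<table width="25%"><tr>
<td class="txtmedio"><a target="_top" href="/places.asp?id=1">Abegoaria de Baixo</a></td>
</tr></table>
<div><a target="_top" href="/somewhere-else">ad link</a></div>
</body></html>'''

doc = html.document_fromstring(RAW)

# Matching on target="_top" alone also picks up links outside the table:
everything = doc.xpath('//a[@target="_top"]/@href')

# Anchoring on the cell class that only the listing table uses avoids
# depending on the <p>/<table> nesting the parser may have rewritten:
listing = doc.xpath('//td[@class="txtmedio"]/a[@target="_top"]/@href')
print(listing)
```

If td[@class="txtmedio"] only occurs in that table on the real page, the same expression works unchanged in a Scrapy Selector.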

ajrpc

Apr 9, 2014, 9:58:31 AM4/9/14
to scrapy...@googlegroups.com
Any ideas?


Bill Ebeling

Apr 9, 2014, 12:05:41 PM4/9/14
to scrapy...@googlegroups.com
I do.  I've attached an extremely simple spider that crawls those links.  Hopefully the code will answer your questions, if not, feel free to ask any more you may have.

As for why that particular xpath works on the page and not in the scrapy shell, my guess is that the data is loaded with the page itself, so no AJAX, and then some JS does something to the DOM. There are a lot of ads on those pages, so I wouldn't be surprised.
portugal.py

André Campos

Apr 10, 2014, 12:30:09 PM4/10/14
to scrapy...@googlegroups.com
Hi Bill 

Thank you very much! Very clever, the width="25%" anchor.

I think that's what's going on: some JS messing with the DOM after the content loaded.

Well, you've solved my problem. Thank you again!

