Matching elements at Xpath depth


uberpete

Jun 18, 2011, 12:28:12 AM
to scrapy-users
Hey guys, I'm having trouble with this piece of HTML

<dl class="inline dt40">
<dt>APIs</dt>
<dd><a href="/api/flickr">Flickr</a></dd> +
<dd><a href="/api/twitter">Twitter</a></dd>
</dl>
<dl class="inline dt40">
<dt>Added</dt>
<dd>14 Jun 2009</dd>
</dl>

I want to extract all <dd> elements that are inside a <dl> element
that contains a specific <dt> tag.

Looking at the documentation, I thought I could use a relative
XPath, but I don't know how to match the text inside the <dt>
element without a regex (which would prevent further navigation).

Am I missing any cool tricks that would make this possible?

Please note that I simplified this example; in reality there are more
elements (http://www.programmableweb.com/mashup/haiku). Also, the
pages I'm crawling are sometimes missing certain sections, so I can't
base my solution on the order of the <dl> elements.

Thanks, Pete

Максим Горковский

Jun 18, 2011, 12:50:37 AM
to scrapy...@googlegroups.com
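I don't see the problem:
dls = hxs.select('//dl')
for dl in dls:
    # relative XPath: searches only inside this <dl>
    dt = dl.select('.//dt/text()').extract()
    if dt and dt[0].strip() == 'APIs':  # 'APIs' is just an example label
        dds = dl.select('.//dd')  # process the <dd> tags here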

Or am I missing something?
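
For anyone who wants to try the loop-and-filter idea without a Scrapy shell, here is a minimal sketch using only the standard library's xml.etree.ElementTree instead of hxs. It assumes the markup is well-formed enough to parse as XML (the stray "+" from the sample is left out and a wrapper <div> is added), and the helper name dd_texts_for is made up for illustration. Because the match is on the <dt> text rather than on position, a missing section simply yields an empty list.

```python
import xml.etree.ElementTree as ET

# Sample markup from the first post, wrapped in a root element so it parses as XML.
HTML = """
<div>
<dl class="inline dt40">
<dt>APIs</dt>
<dd><a href="/api/flickr">Flickr</a></dd>
<dd><a href="/api/twitter">Twitter</a></dd>
</dl>
<dl class="inline dt40">
<dt>Added</dt>
<dd>14 Jun 2009</dd>
</dl>
</div>
"""


def dd_texts_for(root, label):
    """Return the text of every <dd> in each <dl> whose <dt> reads `label`."""
    results = []
    for dl in root.iter("dl"):
        dt = dl.find("./dt")  # relative path: looks only inside this <dl>
        if dt is not None and "".join(dt.itertext()).strip() == label:
            for dd in dl.findall("./dd"):
                results.append("".join(dd.itertext()).strip())
    return results


root = ET.fromstring(HTML)
apis = dd_texts_for(root, "APIs")
added = dd_texts_for(root, "Added")
missing = dd_texts_for(root, "Tags")  # a section that is absent from this page
print(apis, added, missing)
```

The key point is the same as in the Scrapy version: the paths inside the loop start with "./", so they are evaluated relative to the current <dl> and not against the whole document.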

2011/6/18 uberpete <pete...@gmail.com>


--
Best regards,
Максим Горковский

Peter Okma

Jun 18, 2011, 12:37:03 PM
to scrapy...@googlegroups.com
No, you didn't miss anything; that solved the problem. I was using relative paths incorrectly.

Thanks
Pete

2011/6/18 Максим Горковский <ragzo...@gmail.com>

Kazimir

Jun 18, 2011, 12:54:13 PM
to scrapy-users
I rock


Rolando Espinoza La Fuente

Jun 20, 2011, 11:31:48 AM
to scrapy...@googlegroups.com
On Sat, Jun 18, 2011 at 12:28 AM, uberpete <pete...@gmail.com> wrote:
> Hey guys, I'm having trouble with this piece of HTML
>
> <dl class="inline dt40">
>    <dt>APIs</dt>
>    <dd><a href="/api/flickr">Flickr</a></dd> +
>    <dd><a href="/api/twitter">Twitter</a></dd>
> </dl>
> <dl class="inline dt40">
>    <dt>Added</dt>
>    <dd>14 Jun 2009</dd>
> </dl>
>
> I want to extract all <dd> elements that are inside a <dl> element
> that contains a specific <dt> tag

>>> hxs.select('//dl/dt[normalize-space()="APIs"]/following-sibling::dd//text()').extract()
[u'Flickr', u'Twitter']

~Rolando
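
A rough standard-library counterpart to the one-liner above, for readers without Scrapy at hand. Note the swap: xml.etree.ElementTree supports only a subset of XPath, with no normalize-space() and no following-sibling axis, so this sketch puts a predicate on the parent <dl> instead. The names PAGE and dd_texts are hypothetical, and the markup is assumed to be well-formed XML (wrapper <div> added, stray "+" dropped).

```python
import xml.etree.ElementTree as ET

# Sample markup from the thread, wrapped in a root element so it parses as XML.
PAGE = """
<div>
<dl class="inline dt40">
<dt>APIs</dt>
<dd><a href="/api/flickr">Flickr</a></dd>
<dd><a href="/api/twitter">Twitter</a></dd>
</dl>
<dl class="inline dt40">
<dt>Added</dt>
<dd>14 Jun 2009</dd>
</dl>
</div>
"""


def dd_texts(root, label):
    """Select <dd> children of any <dl> that has a <dt> child reading `label`."""
    # [dt='...'] keeps only <dl> elements with a <dt> child whose full text
    # equals the label. The comparison is exact (no whitespace trimming), and
    # this simple formatting assumes the label contains no single quote.
    path = ".//dl[dt='{}']/dd".format(label)
    return ["".join(dd.itertext()).strip() for dd in root.findall(path)]


tree_root = ET.fromstring(PAGE)
api_names = dd_texts(tree_root, "APIs")
added_on = dd_texts(tree_root, "Added")
print(api_names, added_on)
```

With a full XPath 1.0 engine, such as the one behind hxs.select, the normalize-space() form above is the more forgiving choice because it tolerates whitespace around the <dt> text.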
