Matching elements at XPath depth

uberpete

Jun 18, 2011, 12:28:12 AM

to scrapy-users

Hey guys, I'm having trouble with this piece of HTML:

<dl class="inline dt40">
<dt>APIs</dt>
<dd><a href="/api/flickr">Flickr</a></dd> +
<dd><a href="/api/twitter">Twitter</a></dd>
</dl>
<dl class="inline dt40">
<dt>Added</dt>
<dd>14 Jun 2009</dd>
</dl>

I want to extract all <dd> elements that are inside a <dl> element
that contains a specific <dt> tag.

Looking at the documentation, I was thinking I could use a relative
XPath, but I don't know how I would match the text inside the <dt>
element without a regex (which would prevent further navigation).

Am I missing any cool tricks that would make this possible?

Please note that I simplified this example; in reality there are more
elements (http://www.programmableweb.com/mashup/haiku). Also, the
pages I'm crawling are sometimes missing certain sections, so I can't
base my solution on the order of the <dl> elements.

Thanks, Pete

Максим Горковский

Jun 18, 2011, 12:50:37 AM

to scrapy...@googlegroups.com

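I don't see the problem:

    dls = hxs.select('//dl')
    for dl in dls:
        # relative path: only the <dt> inside this <dl>
        dt = dl.select('.//dt/text()').extract()
        if dt and dt[0].strip() == 'APIs':  # 'APIs' stands in for "smth"
            dds = dl.select('.//dd')  # process the <dd> tags here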

Or am I missing something?


--
Best regards,
Максим Горковский

Peter Okma

Jun 18, 2011, 12:37:03 PM

to scrapy...@googlegroups.com

No, you didn't miss anything. That solved the problem; I was using relative paths incorrectly.

Thanks

Pete


Kazimir

Jun 18, 2011, 12:54:13 PM

to scrapy-users

I rock


Rolando Espinoza La Fuente

Jun 20, 2011, 11:31:48 AM

to scrapy...@googlegroups.com

On Sat, Jun 18, 2011 at 12:28 AM, uberpete <pete...@gmail.com> wrote:
> Hey guys, I'm having trouble with this piece of HTML
>
> <dl class="inline dt40">
> <dt>APIs</dt>
> <dd><a href="/api/flickr">Flickr</a></dd> +
> <dd><a href="/api/twitter">Twitter</a></dd>
> </dl>
> <dl class="inline dt40">
> <dt>Added</dt>
> <dd>14 Jun 2009</dd>
> </dl>
>
> I want to extract all <dd> elements that are inside a <dl> element
> that contains a specific <dt> tag

>>> hxs.select('//dl/dt[normalize-space()="APIs"]/following-sibling::dd//text()').extract()
[u'Flickr', u'Twitter']

~Rolando
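
For anyone without a Scrapy shell handy, the same idea can be tried with only the Python standard library. This is a rough sketch, not the code from the thread: xml.etree.ElementTree supports only a subset of XPath (no normalize-space() and no following-sibling:: axis), so it swaps in the child-text predicate dl[dt='APIs']/dd instead. The wrapping <div> root and the helper name dd_texts are assumptions added so the snippet parses as XML and runs on its own.

```python
import xml.etree.ElementTree as ET

# The HTML from the question, wrapped in a <div> (an assumption) so that it
# parses as a single well-formed XML tree.
SNIPPET = """
<div>
  <dl class="inline dt40">
    <dt>APIs</dt>
    <dd><a href="/api/flickr">Flickr</a></dd> +
    <dd><a href="/api/twitter">Twitter</a></dd>
  </dl>
  <dl class="inline dt40">
    <dt>Added</dt>
    <dd>14 Jun 2009</dd>
  </dl>
</div>
"""


def dd_texts(root, label):
    """Return the text of every <dd> in a <dl> whose <dt> text equals label.

    Hypothetical helper for illustration; label must not contain quotes.
    """
    # [dt='...'] keeps only <dl> elements that have a <dt> child whose
    # complete text equals the label; /dd then selects their <dd> children.
    dds = root.findall(".//dl[dt='%s']/dd" % label)
    # itertext() collects text from nested tags such as <a>.
    return ["".join(dd.itertext()).strip() for dd in dds]


root = ET.fromstring(SNIPPET)
apis = dd_texts(root, "APIs")
added = dd_texts(root, "Added")
missing = dd_texts(root, "Tags")  # no such <dt> in the snippet
```

Here apis is ['Flickr', 'Twitter'] and added is ['14 Jun 2009']. A label with no matching <dt> gives an empty list, so pages that lack a section need no special handling and the order of the <dl> elements does not matter. In full XPath 1.0 the same predicate would read //dl[dt="APIs"]/dd.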
