Trim xpath results


scr...@asia.com

Jul 14, 2010, 9:02:15 AM
to scrapy...@googlegroups.com

Hi again,

How can I trim the white space at the beginning of returned XPath results?

[u'\n\t\t', u'\n\t', u'\n\t\n\t\t\n\tMoto etc...

Here is my code:

item['content'] = hxs.select('.//*[@id="desc"]/text()').extract()

Is there a strip or trim function? How do I use it?

Thanks for your help


Rishi Singh

Jul 14, 2010, 9:10:40 AM
to scrapy...@googlegroups.com
hxs.select('.//*[@id="desc"]/text()').extract().strip()

.strip() is a built-in method of strings in Python.

Best,
Rishi

Rishi Singh

Jul 14, 2010, 9:11:42 AM
to scrapy...@googlegroups.com
Apologies, extract() returns a list, so just .strip() every item in the returned results.

scr...@asia.com

Jul 14, 2010, 10:07:17 AM
to scrapy...@googlegroups.com
Thanks for your answer.

It's a solution that works, but when the items are put back together, some words get merged after trimming item by item.

Example:

u' aaa ', u'bbb ' => aaabbb and it should be: aaa bbb

My code is:

sequence = list(item['content'])
i = 0
while i < len(sequence):
    it = sequence[i]
    item['content'][i] = it.strip()
    i += 1
 
What about normalize-space()? Can't I use it?





Rishi Singh

Jul 14, 2010, 11:49:44 AM
to scrapy...@googlegroups.com
Hm, that's a good question. Try it out. I believe the syntax is:

hxs.select('normalize-space(.//*[@id="desc"]/text())').extract()

Also, to add a space between individual items, do this:

sequence = list(item['content'])
i = 0
while i < len(sequence):
    it = sequence[i]
    item['content'][i] = it.strip() + " "
    i += 1
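
A note on normalize-space(): in XPath 1.0 it strips leading and trailing whitespace and collapses internal runs of whitespace to a single space. When it is given a node-set it converts only the first node to a string, so with several text() nodes only the first one is normalized; passing the element itself, as in normalize-space(.//*[@id="desc"]), covers all of its text. A rough Python equivalent of that normalization is sketched below. The helper names normalize_space and join_text_nodes are made up for illustration and are not part of Scrapy.

```python
import re

# XPath 1.0 treats space, tab, carriage return and line feed as whitespace.
_XPATH_WS = re.compile(r"[ \t\r\n]+")


def normalize_space(text):
    """Collapse whitespace runs to one space and trim both ends."""
    return _XPATH_WS.sub(" ", text).strip(" ")


def join_text_nodes(nodes):
    """Join extracted text nodes with a space, then normalize the result."""
    return normalize_space(" ".join(nodes))


nodes = [u"\n\t\t", u"\n\t", u"\n\t\n\t\t\n\tMoto etc", u" aaa ", u"bbb "]
print(join_text_nodes(nodes))
```

Joining with a space before normalizing keeps 'aaa' and 'bbb' apart, which is the problem reported above with stripping item by item.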

Steven Almeroth

Jul 16, 2010, 2:41:17 AM
to scrapy-users
I use str.join to combine all the strings from a list:

>>> " ".join([' aaa ', u'bbb '])
u' aaa bbb '

and if you want to strip each element you can use a generator expression
(a close cousin of Python's cool list comprehension):

>>> " ".join(x.strip() for x in [' aaa ', u'bbb '])
u'aaa bbb'

and as Pablo pointed out, you can use a Scrapy processor:

>>> from scrapy.contrib.loader.processor import Join
>>> proc = Join()
>>> proc(x.strip() for x in [' aaa ', u'bbb '])
u'aaa bbb'

and if you have any blank items you can strip the final string as
well:

>>> proc(x.strip() for x in [u'\n\t\t', u'\n\t', u'\n\t\n\t\t\n\tMoto']).strip()
u'Moto'
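
One caveat with the approach above: when blank items sit between non-blank ones, stripping each element and then joining leaves doubled spaces in the middle of the result. A small sketch that drops empty items before joining follows; the function name strip_join is hypothetical and not part of Scrapy.

```python
def strip_join(values, separator=u" "):
    """Strip each value, drop the ones that end up empty, then join."""
    stripped = (v.strip() for v in values)
    return separator.join(s for s in stripped if s)


values = [u"\n\t\t", u" aaa ", u"\n\t", u"bbb "]

# Plain strip-then-join keeps an empty string where the blank item was.
naive = u" ".join(v.strip() for v in values).strip()

print(repr(naive))
print(repr(strip_join(values)))
```

The filter costs nothing extra and makes the final .strip() unnecessary.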



Pablo Hoffman

Jul 16, 2010, 10:30:20 AM
to scrapy...@googlegroups.com
On Thu, Jul 15, 2010 at 11:41:17PM -0700, Steven Almeroth wrote:
> and as Pablo pointed out, you can use a Scrapy processor:
>
> >>> from scrapy.contrib.loader.processor import Join
> >>> proc = Join()
> >>> proc(x.strip() for x in [' aaa ', u'bbb '])
> u'aaa bbb'

And you could also write a StripJoin processor that combines both
functionalities into one processor, for convenience.

Example code (untested):

class StripJoin(Join):
    def __call__(self, values):
        # A super() object is not itself callable, so call __call__ explicitly.
        return super(StripJoin, self).__call__(x.strip() for x in values)


Pablo.
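
To try the StripJoin idea without Scrapy installed, here is a self-contained sketch. The Join class below is a stand-in written for this example, on the assumption that Scrapy's Join processor simply calls separator.join(values) with a single space as the default separator; in a real project you would import Join from Scrapy instead.

```python
class Join(object):
    """Stand-in for Scrapy's Join processor (assumed behaviour, see above)."""

    def __init__(self, separator=u" "):
        self.separator = separator

    def __call__(self, values):
        return self.separator.join(values)


class StripJoin(Join):
    """Strip every value before handing the sequence to Join."""

    def __call__(self, values):
        # Call the parent's __call__ explicitly; super() objects are not callable.
        return super(StripJoin, self).__call__(x.strip() for x in values)


proc = StripJoin()
print(repr(proc([u" aaa ", u"bbb "])))
```

Because it subclasses Join, a separator passed to the constructor, for example StripJoin(u", "), carries over unchanged.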
