Trim xpath results

scr...@asia.com

Jul 14, 2010, 9:02:15 AM

to scrapy...@googlegroups.com

Hi again,

How can I trim the whitespace at the beginning of the returned XPath results?

[u'\n\t\t', u'\n\t', u'\n\t\n\t\t\n\tMoto etc...

Here is my code:

item['content'] = hxs.select('.//*[@id="desc"]/text()').extract()

Is there a strip or trim function? How do I use it?

Thanks for your help

Rishi Singh

Jul 14, 2010, 9:10:40 AM

to scrapy...@googlegroups.com

hxs.select('.//*[@id="desc"]/text()').extract().strip()

.strip() is native to strings in python.

Best,

Rishi

Rishi Singh

Jul 14, 2010, 9:11:42 AM

to scrapy...@googlegroups.com

Apologies, just .strip() every item in the returned results.
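
A minimal sketch of what "strip every item" could look like, assuming extract() returned a plain list of strings. The sample list is made up for illustration and is not taken from a real page:

```python
# Hypothetical sample of what extract() might return.
results = [u'\n\t\t', u'\n\t', u'\n\t\n\t\t\n\tMoto etc']

# .strip() is a string method, so it is called on each item, not on the list.
stripped = [text.strip() for text in results]
print(stripped)
```

Items that contain only whitespace become empty strings, so they may still need to be filtered out afterwards.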

scr...@asia.com

Jul 14, 2010, 10:07:17 AM

to scrapy...@googlegroups.com

Thanks for your answer.

It's a solution that works, but when the items are put back together, some words get merged after trimming item by item.

Example:

u' aaa ', u'bbb ' => aaabbb and it should be: aaa bbb

My code is:

sequence = list(item['content'])
i = 0
while i < len(sequence):
    it = sequence[i]
    item['content'][i] = it.strip()
    i += 1

What about normalize-space()? Can't I use it?

Rishi Singh

Jul 14, 2010, 11:49:44 AM

to scrapy...@googlegroups.com

Hm, that's a good question. Try it out. I believe the syntax is:

hxs.select('normalize-space(.//*[@id="desc"]/text())').extract()

Also, to add the space between individual items do this:

sequence = list(item['content'])
i = 0
while i < len(sequence):
    it = sequence[i]
    item['content'][i] = it.strip() + " "
    i += 1
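
One caveat worth checking before relying on normalize-space(): in XPath 1.0, a node-set argument is converted to the string value of its first node only, so the expression above would likely return just the first text node. Its whitespace handling can be approximated in plain Python after joining the extracted items. This is only a sketch; normalize_space is a hypothetical helper name, not a Scrapy function:

```python
def normalize_space(text):
    # str.split() with no arguments splits on runs of whitespace and drops
    # leading and trailing whitespace; joining with a single space
    # collapses whatever is left in between.
    return " ".join(text.split())

# Sample values taken from the example earlier in the thread.
parts = [u' aaa ', u'bbb ']
joined = normalize_space(" ".join(parts))
print(joined)
```

Joining first and normalizing afterwards keeps the words separated, which avoids the "aaabbb" problem.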

Steven Almeroth

Jul 16, 2010, 2:41:17 AM

to scrapy-users

I use str.join to combine all the strings in a list into one:

>>> " ".join([' aaa ', u'bbb '])
u' aaa  bbb '

and if you want to strip each element you can use one of Python's cool
generator expressions (a close cousin of the list comprehension):

>>> " ".join(x.strip() for x in [' aaa ', u'bbb '])
u'aaa bbb'

and as Pablo pointed out, you can use a Scrapy processor:

>>> from scrapy.contrib.loader.processor import Join
>>> proc = Join()
>>> proc(x.strip() for x in [' aaa ', u'bbb '])
u'aaa bbb'

and if you have any blank items you can strip the final string as
well:

>>> proc(x.strip() for x in [u'\n\t\t', u'\n\t', u'\n\t\n\t\t\n\tMoto']).strip()
u'Moto'
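
Stripping the final string only removes blanks at the ends. If a whitespace-only item sits in the middle of the list, the joined result would still contain a double space. A minimal sketch that drops empty items before joining; strip_join is a hypothetical helper name, not part of Scrapy:

```python
def strip_join(values, separator=u" "):
    # Strip each item, then keep only the non-empty ones before joining.
    stripped = (value.strip() for value in values)
    return separator.join(s for s in stripped if s)

# The middle item is whitespace only and is dropped.
cleaned = strip_join([u' aaa ', u'\n\t', u'bbb '])
print(cleaned)
```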

Pablo Hoffman

Jul 16, 2010, 10:30:20 AM

to scrapy...@googlegroups.com

On Thu, Jul 15, 2010 at 11:41:17PM -0700, Steven Almeroth wrote:
> and as Pablo pointed out, you can use a Scrapy processor:
>
> >>> from scrapy.contrib.loader.processor import Join
> >>> proc = Join()
> >>> proc(x.strip() for x in [' aaa ', u'bbb '])
> u'aaa bbb'

And you could also write a StripJoin processor that combines both
functionalities into one processor, for convenience.

Example code (untested):

class StripJoin(Join):
    def __call__(self, values):
        # A super() proxy is not callable itself, so invoke __call__ explicitly.
        return super(StripJoin, self).__call__(x.strip() for x in values)

Pablo.
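
To try the idea without Scrapy installed, here is a self-contained sketch. The Join class below is a stand-in written for this example, on the assumption that the real processor simply joins its values with a separator (a single space by default, as the outputs earlier in the thread suggest). It is not the Scrapy source:

```python
class Join(object):
    # Stand-in for Scrapy's Join processor (an assumption, see above).
    def __init__(self, separator=u" "):
        self.separator = separator

    def __call__(self, values):
        return self.separator.join(values)


class StripJoin(Join):
    def __call__(self, values):
        # Strip every value, then let the parent class do the joining.
        return super(StripJoin, self).__call__(x.strip() for x in values)


proc = StripJoin()
result = proc([u' aaa ', u'bbb '])
print(result)
```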
