Outputting to CSV, 1 row per element

ara

Apr 28, 2011, 4:19:32 PM

to scrapy-users

Hi,

I am new to Scrapy & am amazed at how quickly I could manage to scrape
a certain website. In my Items file, I have the following fields:
name, description, image (path), nlabel (path to an image), ingredients
(comma-separated values).

When I try to output as CSV using FEED_FORMAT = 'csv' and
FEED_URI = 'products.csv' in the settings.py file, all the names
(8 items in all) end up in one Excel cell; ditto for
each of the other columns. Can anyone explain in clear steps what I
need to do to get some usable output?

Thanks.
-a

Pablo Hoffman

Apr 29, 2011, 9:55:48 AM

to scrapy...@googlegroups.com

You may need to select which separator to use in Excel when you open the file.
At least that's how it works in OpenOffice; I don't use Excel much.

Anjali Arora

Apr 29, 2011, 11:06:44 AM

to scrapy...@googlegroups.com

Hi,

That's right, but I'd like to know how to insert the separator between each item. What should the code be? Below is my spider.py:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from dmoz.items import PamelasItem

class PamelasSpider(BaseSpider):
    name = "pamelas"
    allowed_domains = ["pamelasproducts.com"]
    start_urls = [
        "http://www.pamelasproducts.com/ProductsTRAD.html"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//table')
        items = []
        for site in sites:
            item = PamelasItem()
            item['name'] = site.select('tr/td/p/a/text()').extract()
            item['image'] = site.select('tr/td/a/img/@src').extract()
            item['nlabel'] = site.select('tr/td/img/@src').extract()
            item['desc'] = site.select('tr/td/p/text()').extract()
            item['ingredients'] = site.select('tr/td/p/text()').extract()
            items.append(item)
        return items

This fetches all the data fine; now it's just a matter of getting it to output correctly. Thanks.
-a

Anjali Arora

Apr 29, 2011, 11:16:04 AM

to scrapy...@googlegroups.com

Actually, the output does seem to have comma separators. However, the output in my Terminal window appears odd; it goes like this: item[name1], item[name2], ... item[name9], item[image1], item[image2], ... item[image9], item[nlabel1], item[nlabel2], ...

I'd like the output order to be item[name1], item[image1], item[nlabel1], item[desc1], item[ingredients1], item[name2], ...

Any help is appreciated.
Thanks.

Pablo Hoffman

Apr 29, 2011, 10:12:28 PM

to scrapy...@googlegroups.com

The output is one column per field, so 5 columns in the final CSV (name, image,
nlabel, desc, ingredients). If any value contains a comma, it's quoted
(this logic is all in the csv module from the Python stdlib).
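
A minimal stdlib sketch of that quoting behaviour. It assumes a list-valued field is joined into one comma-separated string before it reaches the csv module; the product names and image paths are made up for illustration.

```python
import csv
import io

# Hypothetical data: a single item whose fields each hold several values,
# as happens when extract() returns a list for the whole table.
names = ["Bread Mix", "Pancake Mix", "Brownie Mix"]
images = ["a.jpg", "b.jpg", "c.jpg"]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["name", "image"])
# Assumption: each multi-valued field is joined with commas into one string.
# The csv module then quotes it, so it stays a single cell.
writer.writerow([",".join(names), ",".join(images)])

csv_text = buf.getvalue()
print(csv_text)

# Reading the text back shows one data row with two cells, not three rows.
rows = list(csv.reader(io.StringIO(csv_text)))
print(len(rows), len(rows[1]))
```

This is why all the names land in one spreadsheet cell: the item holds a list per field, and the whole list is written as one quoted value.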

If you want to store the first value of each field, you have several options to
choose from:

1. Add [0] after the extract() call. This is the easiest, but it fails with an
IndexError if the list is empty.

2. Use loaders. The TakeFirst() processor does what you need.
http://doc.scrapy.org/topics/loaders.html

3. Add a serializer for the field that returns the first value.
http://doc.scrapy.org/topics/exporters.html#declaring-a-serializer-in-the-field
For example: Field(serializer=lambda x: x[0])
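
All three options come down to taking the first element of the extracted list. Below is a plain-Python sketch, with no Scrapy imports, of a guarded version that avoids the empty-list failure mentioned in option 1. The name first_or_default is a hypothetical helper, not part of Scrapy, and the sample values are invented.

```python
def first_or_default(values, default=""):
    """Return the first element of a list, or `default` if the list is empty."""
    return values[0] if values else default


# Hypothetical extract() results for one product.
extracted = {
    "name": ["Bread Mix"],
    "image": ["images/bread.jpg"],
    "nlabel": [],  # nothing matched; a bare [0] would raise IndexError here
}

# Collapse every list-valued field to a single value.
row = {field: first_or_default(values) for field, values in extracted.items()}
print(row)
```

The same function could presumably be used in place of the lambda in option 3, as Field(serializer=first_or_default), assuming the serializer is called with the field's list value.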
