Where do I define my output format for item dictionaries

Sayth Renshaw

Jan 24, 2016, 1:02:32 AM

to scrapy-users

Hi all

Currently, when I export to CSV with scrapy runspider myxml.py -o ~/items.csv -t csv, I get the header fields I defined in settings under feed export. However, the values collected for each field are dumped together in a single cell instead of one row per element.

Where do I define that each element (e.g. dict[0]) should go on its own line?

So at the moment this is my output

id,num,dist

"209165,209166,209167,209168,209169,209170,209171,209172,209173","1,2,3,4,5,6,7,8,9","1000,1000,1400,1200,1200,1600,1600,1000,2000"

I would want it as

id,num,dist

209165,1,1000

209166,2,1000

...

I have been looking through the feed exporters section of the docs, but I feel I should just be writing a custom function to tidy it up. Is that what I should do, and if so, where does it go? Scrapy seems to have thought of most things, so I expect this is already handled; I am just not sure what it is called.

My current code.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
from scrapy.selector import XmlXPathSelector
from conv_xml.items import ConvXmlItem
# http://stackoverflow.com/a/27391649/461887
import json


class MyxmlSpider(scrapy.Spider):
    name = "myxml"

    start_urls = (
        ["file:///home/sayth/Downloads/20160123RAND0.xml"]
    )

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//meeting')
        items = []

        for site in sites:
            item = ConvXmlItem()
            # item['venue'] = site.xpath('.//@venue').extract()
            item['id'] = site.xpath('.//race/@id').extract()
            item['num'] = site.xpath('.//race/@number').extract()
            item['dist'] = site.xpath('.//race/@distance').extract()
            items.append(item)

        return items

Thanks Sayth
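
For reference, the column-style output in the question can be turned into the desired row-per-race layout with plain Python. This is only a minimal sketch, not a Scrapy feature: the helper name rows_from_columns is hypothetical, and the item is a plain dict shortened from the output shown above. Note that zip() stops at the shortest list, so it assumes every field has one value per race.

```python
import csv
import io

# One item as the spider above produces it: each field holds a list with
# one value per <race>. Values are shortened from the output in the question.
item = {
    "id": ["209165", "209166", "209167"],
    "num": ["1", "2", "3"],
    "dist": ["1000", "1000", "1400"],
}


def rows_from_columns(item, fields):
    """Hypothetical helper: turn parallel per-field lists into row tuples."""
    # zip() pairs up the n-th value of every field; it stops at the shortest list.
    return list(zip(*(item[field] for field in fields)))


fields = ["id", "num", "dist"]
rows = rows_from_columns(item, fields)

# Write header and rows to an in-memory buffer instead of a real file.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(fields)
writer.writerows(rows)
print(buffer.getvalue())
```

This fixes the output after the fact; it does not change how the spider builds its items.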

Valdir Stumm Junior

Jan 25, 2016, 6:40:32 AM

to scrapy...@googlegroups.com

Looks like the XPath selectors you are using are returning more than one item for each page, e.g. site.xpath('.//race/@id'). The xpath() call returns a SelectorList, and calling extract() on it returns a list with all the matching values.

Can you paste an excerpt of the XML file that you are parsing?


--

Valdir Stumm Junior
Developer Evangelist, Scrapinghub

Sayth Renshaw

Jan 25, 2016, 7:11:39 AM

to scrapy-users

Hi

> Looks like the XPath selectors you are using are returning more than one item for each page, e.g. site.xpath('.//race/@id').

Yes, it does. That selector mostly returns 8 occurrences, though it can occasionally vary; other selectors could return up to 24 items.

An excerpt may be messy; I will try to edit it down to a small copy. The originals are posted on a public website. This is a link (don't click unless you are happy to download, as it's a direct link): http://old.racingnsw.com.au/Site/_content/racebooks/20160130RHIL0.xml

This is an id by itself, and yes, they love attributes. In the example above, though, I am trying to filter so that for each .//race/@id I extract, I can output the desired attributes alongside it. The goal is a CSV or JSON file that has all the ids together with the descriptors from their attributes.

Thanks

Sayth

Paul Tremberth

Jan 25, 2016, 7:21:32 AM

to scrapy-users

Thanks for the sample.

I would suggest that you loop on <race> elements, not <meeting>s.

Something like this will help you work and extract on individual race snippets:

class MyxmlSpider(scrapy.Spider):
    name = "myxml"

    start_urls = (
        ["file:///home/sayth/Downloads/20160123RAND0.xml"]
    )

    def parse(self, response):
        sel = Selector(response)

        items = []
        for race in sel.xpath('//meeting//race'):
            item = ConvXmlItem()
            item['id'] = race.xpath('@id').extract_first()
            item['num'] = race.xpath('@number').extract_first()
            item['dist'] = race.xpath('@distance').extract_first()
            items.append(item)

        return items

Hope this helps.

Paul.
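
The effect of looping on <race> rather than <meeting> can be checked without running a spider. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's Selector, so it is not the spider code itself. The XML excerpt is hypothetical: only the id, number and distance attributes come from this thread, and the nesting and venue value are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt; the real file has many more elements and attributes.
SAMPLE = """
<meeting venue="Example">
  <races>
    <race id="209165" number="1" distance="1000"/>
    <race id="209166" number="2" distance="1000"/>
    <race id="209167" number="3" distance="1400"/>
  </races>
</meeting>
"""


def race_rows(xml_text):
    """Return one dict per <race>, mirroring one item per race."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    # iter("race") visits every <race> below the root, similar to //meeting//race.
    for race in root.iter("race"):
        rows.append({
            "id": race.get("id"),          # one value, not a list of all ids
            "num": race.get("number"),
            "dist": race.get("distance"),
        })
    return rows


rows = race_rows(SAMPLE)
for row in rows:
    print(row["id"], row["num"], row["dist"])
```

Because each dict holds single values rather than lists, a CSV export of such items gets one line per race.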

Sayth Renshaw

Jan 27, 2016, 1:16:44 AM

to scrapy-users

Thank you that does help.

A further implementation question: if I want to perform 3 queries on the XML, as in the above, and create relationships between the results, I assume I would write 3 parse methods like the one above. Instead of saving to a file, would pipelines provide a mechanism to send the scraped data with a relation?
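
One possible shape for this, offered only as a rough sketch: Scrapy passes every item to the process_item(item, spider) method of each pipeline enabled in the ITEM_PIPELINES setting, and also calls open_spider and close_spider if they are defined. A pipeline could therefore route items into related tables. Everything else here is an assumption for illustration: the class name RelationalPipeline, the "kind" and "meeting_id" fields, the meeting id "M1", the table layout, and the use of an in-memory SQLite database. Items are plain dicts, and the pipeline is driven by hand rather than by a crawl.

```python
import sqlite3


class RelationalPipeline:
    """Hypothetical item pipeline storing meetings and races in linked tables."""

    def open_spider(self, spider):
        # In-memory database for the sketch; a real project would use a file or server.
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE meeting (id TEXT PRIMARY KEY, venue TEXT)")
        self.db.execute(
            "CREATE TABLE race (id TEXT PRIMARY KEY, num TEXT, dist TEXT, "
            "meeting_id TEXT REFERENCES meeting(id))"
        )

    def process_item(self, item, spider):
        # "kind" is an assumed field telling the pipeline which table to use.
        if item["kind"] == "meeting":
            self.db.execute(
                "INSERT INTO meeting (id, venue) VALUES (?, ?)",
                (item["id"], item["venue"]),
            )
        elif item["kind"] == "race":
            self.db.execute(
                "INSERT INTO race (id, num, dist, meeting_id) VALUES (?, ?, ?, ?)",
                (item["id"], item["num"], item["dist"], item["meeting_id"]),
            )
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        self.db.commit()
        self.db.close()


# Drive the pipeline by hand with made-up items instead of a real crawl.
pipeline = RelationalPipeline()
pipeline.open_spider(spider=None)
for item in [
    {"kind": "meeting", "id": "M1", "venue": "Example"},
    {"kind": "race", "id": "209165", "num": "1", "dist": "1000", "meeting_id": "M1"},
    {"kind": "race", "id": "209166", "num": "2", "dist": "1000", "meeting_id": "M1"},
]:
    pipeline.process_item(item, spider=None)

# Join races to their meeting to show that the relation survives.
joined = pipeline.db.execute(
    "SELECT race.id, race.dist, meeting.venue FROM race "
    "JOIN meeting ON race.meeting_id = meeting.id ORDER BY race.id"
).fetchall()
print(joined)
pipeline.close_spider(spider=None)
```

The relation itself has to be carried by the items (here through meeting_id); the pipeline only decides where each item is stored.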