Where do I define my output format for item dictionaries


Sayth Renshaw

Jan 24, 2016, 1:02:32 AM
to scrapy-users

Hi all

Currently, when I export to CSV with scrapy runspider myxml.py -o ~/items.csv -t csv, I get the header fields I defined in settings under feed export. However, the values collected for each field are dumped together as one list in a single cell.

Where do I define that each element of those lists (dict[0], then the next, and so on) should go on its own line?

So at the moment this is my output

id,num,dist
"209165,209166,209167,209168,209169,209170,209171,209172,209173","1,2,3,4,5,6,7,8,9","1000,1000,1400,1200,1200,1600,1600,1000,2000"

I would want it as

id,num,dist
209165,1,1000
209166,2,1000
...

I have been looking at the feed exporters section of the docs for information, but I feel I should just be writing a custom function to tidy it up. Is that what I should do, and if so, where? Scrapy seems to have thought of most things, so I expect this is already handled; I am just not sure what it is called.

My current code.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
from scrapy.selector import XmlXPathSelector
from conv_xml.items import ConvXmlItem
import json


class MyxmlSpider(scrapy.Spider):
    name = "myxml"

    start_urls = (
        ["file:///home/sayth/Downloads/20160123RAND0.xml"]
    )

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//meeting')
        items = []

        for site in sites:
            item = ConvXmlItem()
            # item['venue'] = site.xpath('.//@venue').extract()
            item['id'] = site.xpath('.//race/@id').extract()
            item['num'] = site.xpath('.//race/@number').extract()
            item['dist'] = site.xpath('.//race/@distance').extract()
            items.append(item)

        return items


Thanks Sayth

Valdir Stumm Junior

Jan 25, 2016, 6:40:32 AM
to scrapy...@googlegroups.com
It looks like the XPath selectors you are using are returning more than one match for each page, e.g. site.xpath('.//race/@id'). The xpath() call returns a SelectorList with all the matching nodes, and calling extract() on it returns a list containing every matching value.
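
As a rough illustration of that behaviour outside Scrapy, the sketch below uses Python's standard-library xml.etree.ElementTree as a stand-in for Scrapy selectors. The XML is a made-up, cut-down sample (only the id, number and distance attributes, with values taken from the output above); the structure of the real file is an assumption here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, cut-down sample: one <meeting> holding several <race> elements.
SAMPLE = """
<meeting>
  <race id="209165" number="1" distance="1000"/>
  <race id="209166" number="2" distance="1000"/>
  <race id="209167" number="3" distance="1400"/>
</meeting>
"""

meeting = ET.fromstring(SAMPLE)

# Querying from the <meeting> level matches every <race> below it,
# so a single field ends up holding a list of values.
ids_per_meeting = [race.get("id") for race in meeting.findall(".//race")]
print(ids_per_meeting)

# Querying from each <race> in turn gives exactly one value per field.
rows = [(race.get("id"), race.get("number"), race.get("distance"))
        for race in meeting.findall(".//race")]
print(rows)
```

The first result is the shape the CSV exporter currently receives (one list per field); the second is one record per race.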

Can you paste an excerpt of the XML file that you are parsing?

--
Valdir Stumm Junior 
Developer Evangelist, Scrapinghub 


Sayth Renshaw

Jan 25, 2016, 7:11:39 AM
to scrapy-users

Hi

> Looks like the XPath selectors you are using are returning more than one item for each page, e.g. site.xpath('.//race/@id').

Yes, it does. That selector usually returns 8 matches, though the number can occasionally vary; other selectors could return up to 24 items.

An excerpt may be messy, so I will try to edit it down to a small copy. The originals are posted on a public website. Here is a link (don't click unless you are happy to download the file, as it is a direct link): http://old.racingnsw.com.au/Site/_content/racebooks/20160130RHIL0.xml

Below is a single race element by itself (and yes, they love attributes). In my example above, I am trying to filter so that for each .//race/@id I extract, I can output the desired attributes alongside it. The goal is a CSV or JSON file that has all the ids and the descriptors taken from their attributes.

<race id="209165" number="1" nomnumber="2" division="0" name="SCHWEPPES QUALITY" mediumname="WILKES" shortname="WILKES" stage="Acceptances" distance="1000" minweight="55" raisedweight="1" class="~         " age="3         " grade="4" weightcondition="QLT       " trophy="0" owner="0" trainer="0" jockey="0" strapper="0" totalprize="85000" first="48750" second="16750" third="8350" fourth="4150" fifth="2000" time="2016-01-23T12:40:00" bonustype="BOB7      " nomsfee="0" acceptfee="0" trackcondition="          " timingmethod="          " fastesttime="          " sectionaltime="          " formavailable="0" racebookprize="Of $85000. First $48750, second $16750, third $8350, fourth $4150, fifth $2000, sixth $1000, seventh $1000, eighth $1000, ninth $1000, tenth $1000">

Thanks
Sayth

Paul Tremberth

Jan 25, 2016, 7:21:32 AM
to scrapy-users
Thanks for the sample.

I would suggest that you loop over the <race> elements, not the <meeting> elements.
Something like this will let you work on individual race snippets and extract from each one:

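import scrapy
from scrapy.selector import Selector
from conv_xml.items import ConvXmlItem


class MyxmlSpider(scrapy.Spider):
    name = "myxml"

    start_urls = (
        ["file:///home/sayth/Downloads/20160123RAND0.xml"]
    )

    def parse(self, response):
        sel = Selector(response)
        items = []
        # Loop over each <race>, so every item holds the values of one race only.
        for race in sel.xpath('//meeting//race'):
            item = ConvXmlItem()
            # extract_first() returns a single string (or None), not a list.
            item['id'] = race.xpath('@id').extract_first()
            item['num'] = race.xpath('@number').extract_first()
            item['dist'] = race.xpath('@distance').extract_first()
            items.append(item)

        return items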
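
If the per-meeting lists are all you have, another option is to transpose them into rows before exporting. The sketch below is plain Python, not a Scrapy feature: it assumes an item shaped like the one in the original spider (three lists of equal length) and uses zip() plus the standard csv module. The variable names are illustrative only.

```python
import csv
import io

# Item as produced by the original spider: one list per field (values from the thread).
item = {
    "id": ["209165", "209166", "209167"],
    "num": ["1", "2", "3"],
    "dist": ["1000", "1000", "1400"],
}
fields = ["id", "num", "dist"]

# zip() pairs up the first entries, then the second entries, and so on,
# turning three column lists into one tuple per race.
rows = list(zip(*(item[name] for name in fields)))

# Write to an in-memory buffer; a real script would open a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(fields)
writer.writerows(rows)
print(buffer.getvalue())
```

Note that zip() stops at the shortest list, so this only lines up correctly when every race carries all three attributes; looping over <race> elements avoids that risk.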



Hope this helps.

Paul.

Sayth Renshaw

Jan 27, 2016, 1:16:44 AM
to scrapy-users
Thank you that does help.

A further implementation question: if I want to perform 3 queries on the XML like the one above and create relationships between the results, would I create 3 parse methods like the one above? And instead of saving to a file, would pipelines provide a mechanism to send the scraped data along with its relations?
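
One possible shape for this, offered as a sketch rather than a confirmed answer: a Scrapy item pipeline is a plain class with a process_item(self, item, spider) method (plus optional open_spider and close_spider hooks), so it can write related records to a database instead of a file. Everything specific below is an assumption for illustration: the RelationalPipeline name, the meetings and races tables, a 'venue' field carried on each race item, and the in-memory SQLite database.

```python
import sqlite3


class RelationalPipeline:
    """Hypothetical pipeline storing each race with a link to its meeting."""

    def open_spider(self, spider):
        # In-memory database for the sketch; a real project would use a file or server.
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE meetings (id INTEGER PRIMARY KEY, venue TEXT UNIQUE)")
        self.db.execute(
            "CREATE TABLE races (id TEXT PRIMARY KEY, num TEXT, dist TEXT, "
            "meeting_id INTEGER REFERENCES meetings(id))")

    def process_item(self, item, spider):
        # Create the parent row once, then look up its key for the relation.
        self.db.execute(
            "INSERT OR IGNORE INTO meetings (venue) VALUES (?)", (item["venue"],))
        meeting_id = self.db.execute(
            "SELECT id FROM meetings WHERE venue = ?", (item["venue"],)).fetchone()[0]
        self.db.execute(
            "INSERT OR REPLACE INTO races VALUES (?, ?, ?, ?)",
            (item["id"], item["num"], item["dist"], meeting_id))
        return item  # pass the item on to any later pipeline or exporter

    def close_spider(self, spider):
        self.db.commit()
        self.db.close()
```

In a project, a class like this would be enabled through the ITEM_PIPELINES setting, for example ITEM_PIPELINES = {'conv_xml.pipelines.RelationalPipeline': 300}, where the module path is assumed.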

Thanks

Sayth
