Re: using list as an item field?


shadow

Jun 27, 2012, 6:14:01 AM
to scrapy...@googlegroups.com
Hasn't anyone encountered this problem before, or is it just me? I still haven't figured it out... please help me out. Thanks.

ScrapMe

Jun 27, 2012, 11:43:49 AM
to scrapy...@googlegroups.com
Hi,

I'm new at this but here's a shot.

Have you done this at the prompt?

scrapy crawl your_spiders_name -o airports.json -t json

This runs your spider and outputs airports.json.



On Saturday, June 16, 2012 2:18:09 PM UTC-4, shadow wrote:
Hi all,
I am scraping some data with complex hierarchical info; say I want the result to be JSON like:
family =
{
  "position": "grandpa",
  "name": "Dave",
  "sons": [
    {
      "name": "John",
      "grandsons": [
        .......
      ]
    },
    {
      "name": "Matt",
      "grandsons": [
        .......
      ]
    }
  ]
}
FYI, this is just some random data I made up, so please don't chase the logic and necessity here... :)

So I defined the items in items.py as:

from scrapy.item import Item, Field

class FamilyItem(Item):
    name = Field()
    sons = Field()

class SonsItem(Item):
    name = Field()
    grandsons = Field()

class GrandsonsItem(Item):
    name = Field()
    ......

Then I scraped, got the data, and populated it into the items. In the end, when I tried to export the result to JSON (--set FEED_URI=airports.json --set FEED_FORMAT=json), I always got exceptions saying it is "not JSON serializable".

This has been puzzling me for days... sorry if it's a stupid question.

shadow

Jun 27, 2012, 2:19:59 PM
to scrapy...@googlegroups.com
Thanks, but I've already tried that...

jsonlines, json, and xml don't work; only csv can export something without error, though the CSV output is obviously useless.

Roberto Fuentes

Jun 27, 2012, 2:35:19 PM
to scrapy...@googlegroups.com
Can you post your parse function?

In the tutorial it's pretty straightforward.


shadow

Jun 29, 2012, 6:15:55 AM
to scrapy...@googlegroups.com
I just made a test bot based on dirbot from https://github.com/scrapy/dirbot. Code as follows:

# items.py

from scrapy.item import Item, Field


class DirbotItem(Item):

    name = Field()
    description = Field()

class Category(Item):

    name = Field()
    sites = Field()

class Website(DirbotItem):

    url = Field()

    def __str__(self):
        return "Website: name=%s url=%s" % (self.get('name'), self.get('url'))

# dirbot/spider/dmoz.py
from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

from dirbot.items import Website, Category


class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
    ]
    base_uri = "http://www.dmoz.org"

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        cats = hxs.select('//ul[contains(@class, "directory")][2]/li/a')
        for cat in cats:
            link = cat.select('@href').extract()[0]
            meta = {}
            meta['cat'] = cat.select('text()').extract()[0]
            yield Request(self.base_uri+link, meta=meta, callback=self.parseSites)

    def parseSites(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul[contains(@class, "directory-url")]/li')
        catname = response.meta['cat']
        items = []
        cat = Category()
        cat['name'] = catname
        print cat

        for site in sites:
            item = Website()
            item['name'] = site.select('a/text()').extract()
            item['url'] = site.select('a/@href').extract()
            item['description'] = "".join(site.select('text()').extract()).strip()
            items.append(item)
        cat['sites'] = items
        return cat

# comment out the pipeline in settings.py:
# ITEM_PIPELINES = ['dirbot.pipelines.FilterWordsPipeline']

Run the spider and you will see the error.

Steven Almeroth

Jun 29, 2012, 10:49:01 AM
to scrapy...@googlegroups.com
Works for me (no errors):

2012-06-29 09:46:34-0500 [dmoz] INFO: Dumping spider stats:
{'downloader/request_bytes': 482,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 13063,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2012, 6, 29, 14, 46, 34, 824402),
'item_scraped_count': 43,
'scheduler/memory_enqueued': 2,
'start_time': datetime.datetime(2012, 6, 29, 14, 46, 34, 226671)}
2012-06-29 09:46:34-0500 [dmoz] INFO: Spider closed (finished)

Roberto Fuentes

Jun 29, 2012, 11:45:18 AM
to scrapy...@googlegroups.com
I get the errors...

not sure why...

I'll get back to you in a bit


Roberto Fuentes

Jun 29, 2012, 5:25:00 PM
to scrapy...@googlegroups.com
The pipeline in the dirbot example is:

from scrapy.exceptions import DropItem
class FilterWordsPipeline(object):
    """A pipeline for filtering out items which contain certain words in their
description"""
    # put all words in lowercase
    words_to_filter = ['politics', 'religion']
    def process_item(self, item, spider):
        for word in self.words_to_filter:
            if word in unicode(item['description']).lower():
                raise DropItem("Contains forbidden word: %s" % word)
        else:
            return item


The test bot does not work because cat is not just a flat dictionary.
It's a dictionary that contains a list of dictionaries.

cat = {
        'name': u'Personal Pages',
        'sites': [
                    {
                        'description': u'- News and information related to Python and its promotion as "most powerful language you can still read."',
                        'name': [u'Altis, Kevin'],
                        'url': [u'http://radio-weblogs.com/0102677/categories/python/']
                    },
                    {
                        'description': u"- A Python programmer's weblog.",
                        'name': [u'Anand Pillai -  Random bytes on technology and open source'],
                        'url': [u'http://randombytes.blogspot.com/']
                    },
                    {   'description': u'- Programming weblog for a small, newly formed, and curious company; covers mostly Python.',
                        'name': [u'Blended Technologies'],
                        'url': [u'http://www.blendedtechnologies.com/']
                    }
                ]
    }

When you return cat, instead of the pipeline processing one item at a time, you are 'clogging' it (no pun intended)
with one big nested dictionary-list.

In the dirbot example, when return items is executed in the spider, each item in the list is sent through the pipeline one at a time. The list looks like so:

[
        {
            'description': [u'- News and information related to Python and its promotion as "most powerful language you can still read."'],
            'name': [u'Altis, Kevin'],
            'url': [u'http://radio-weblogs.com/0102677/categories/python/']
        },
        {
            'description': [u"- A Python programmer's weblog."],
            'name': [u'Anand Pillai -  Random bytes on technology and open source'],
            'url': [u'http://randombytes.blogspot.com/']
        },
        {  
            'description': [u'- Programming weblog for a small, newly formed, and curious company; covers mostly Python.'],
            'name': [u'Blended Technologies'],
            'url': [u'http://www.blendedtechnologies.com/']
        }
]

And the pipeline goes through each dictionary's item['description'] and filters it.

Your test bot produces a nested dictionary-list that the pipeline doesn't know how to process.

That's the problem with the test bot. 

I'm working on a solution, but it's taking a while.
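
In the meantime, here is a rough sketch of the direction I mean (untested, spider name made up): have the spider yield each Website on its own instead of returning one nested Category, so the pipeline and the feed exporter only ever see one flat item at a time.

# flat_spider.py -- untested sketch, not the actual test bot
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider

from dirbot.items import Website


class FlatDmozSpider(BaseSpider):
    name = "dmoz_flat"          # made-up name
    allowed_domains = ["dmoz.org"]
    start_urls = []             # same start URLs as the test bot
    base_uri = "http://www.dmoz.org"

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for cat in hxs.select('//ul[contains(@class, "directory")][2]/li/a'):
            link = cat.select('@href').extract()[0]
            meta = {'cat': cat.select('text()').extract()[0]}
            yield Request(self.base_uri + link, meta=meta, callback=self.parseSites)

    def parseSites(self, response):
        hxs = HtmlXPathSelector(response)
        # yield one flat Website per scraped site instead of nesting them
        for site in hxs.select('//ul[contains(@class, "directory-url")]/li'):
            item = Website()
            item['name'] = site.select('a/text()').extract()
            item['url'] = site.select('a/@href').extract()
            item['description'] = "".join(site.select('text()').extract()).strip()
            # the category is still in response.meta['cat']; carrying it on the
            # item would need an extra Field that items.py doesn't define yet
            yield item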



shadow

Jun 30, 2012, 12:22:41 AM
to scrapy...@googlegroups.com
Exactly, that's where the problem lies. Actually, I'm currently using a crude fallback: a pipeline that just does json.loads(json.dumps(item)).
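
Roughly, the idea is something like this (just a sketch with made-up names, not my exact code): keep the top-level Item so the exporter is happy, but flatten any nested Items inside its fields into plain dicts and lists before they reach the feed export.

# pipelines.py -- untested sketch of the fallback
from scrapy.item import BaseItem


def _plain(value):
    # recursively convert nested Items (and dicts/lists) into plain
    # Python dicts and lists that json.dumps can handle
    if isinstance(value, (BaseItem, dict)):
        return dict((key, _plain(val)) for key, val in value.items())
    if isinstance(value, (list, tuple)):
        return [_plain(val) for val in value]
    return value


class FlattenNestedItemsPipeline(object):

    def process_item(self, item, spider):
        for key in item.keys():
            item[key] = _plain(item[key])
        return item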

I don't really know how to solve this in Scrapy itself, though, which is where I think the problem should really be solved, not in each specific bot implementation. As I said before, CSV output doesn't raise any errors (though the result is useless), so I think the feed exporter for JSON/XML is the one that should be revisited. But that is difficult for me.

Or maybe there is a better way of handling nested items like this in Scrapy?

Roberto Fuentes

Jun 30, 2012, 10:46:46 AM
to scrapy...@googlegroups.com
One item per spider... that's what it appears to be on first pass.

I'm still testing some stuff out.  I think I'll post a similar question to SOF.

I was thinking of using scrapy inside another python script under its own class.

There are a lot of nuances in scrapy I'm beginning to learn. 


shadow

Jul 1, 2012, 2:40:40 AM
to scrapy...@googlegroups.com
I happen to be thinking of something similar too, though it may not be as complicated as yours.

Please drop me a note when you work something out.

Roberto Fuentes

Jul 2, 2012, 3:33:59 PM
to scrapy...@googlegroups.com
Shadow,

take a look at the solution to this problem.
The end result is a .json output that can be saved.
It's a simple 'hack' that gets the job done. 
I'm sure there are cleaner, more sophisticated solutions, but check it out and let me know your thoughts.

Remember, this solution is specific to the website being crawled in this problem, but the end result
is what you were looking for (JSON output).

Walk-Through:


# items_solution.py
from scrapy.item import Item, Field


class DirbotItem(Item):
    name = Field()
    description = Field()


class Website(DirbotItem):
    url = Field()



class Category(Item):
    name = Field()
    url = Field()
    description = Field()

# I use XPathItemLoader() to pack the items
# the_spider.py

from scrapy.contrib.loader import XPathItemLoader
from scrapy.spider import BaseSpider
from dirbot.items_solution import Website
from scrapy import log


class DmozSpider(BaseSpider):
    name = "dmoz_solution"

    allowed_domains = ["dmoz.org"]
    start_urls = [
                "http://www.dmoz.org/Computers/Programming/Languages/Python/Personal_Pages/"
                ]

    def parse(self, response):
        sites = XPathItemLoader(item=Website(), response=response)

        log.msg('Adding SITES', level=log.INFO)
        sites.add_xpath('name', "//ul[contains(@class, 'directory-url')]/li/a/text()")
        sites.add_xpath('description', "//ul[contains(@class, 'directory-url')]/li/text()")
        sites.add_xpath('url', '//ul[contains(@class, "directory-url")]/li/a/@href')

        log.msg('Loading Items', level=log.INFO)
        # items comes back as a dictionary mapping each field name to a list of extracted values
        items = sites.load_item()

        # specific to this example
        # need to eliminate empty '' scraped items

        for i in items['description']:
            if i.strip() == '':
                items['description'].pop(items['description'].index(i))

        # if you want to see the items going into the pipelines file
        # print items


        log.msg("returning items", level=log.INFO)
        return items

# the pipeline is specific to this problem and would need to be recreated for other crawling projects
# pipelines.py
from dirbot.items_solution import Category
from scrapy import log



class FilterWordsPipeline(object):
    """A pipeline for filtering out items which contain certain words in their
    description"""
    log.msg('Checking the pipes', level=log.INFO)

    # put all words in lowercase
    # this example can only take one word to filter.  If you want more words you must edit the
    # process_item()
    words_to_filter = ['weblog']


    '''
        The process_item() is specific to this example.
        Depending on what you want to do you will have to adjust it accordingly.
        In this case, we want to process a dictionary-list as a result of the XPathItemLoader() call.
        We then want .json type output from the filtered results
    '''
    def process_item(self, item, spider):
            log.msg('Items in Pipe: %s' % item, level=log.INFO)

            # set of indices that will be used to mark the items that pass the filter
            keep = set(range(0, len(item['description'])))

            # construct lists to pass to the final dictionary output
            description = []
            url = []
            name = []
            cat = Category()

            for word in self.words_to_filter:

                # identify indices to filter out that contain the word
                # make a set to figure out which ones are out
                to_pop = set([i for i, j in enumerate(item['description']) if word in unicode(j).lower()])

                # select which items to keep
                keep = list(to_pop.symmetric_difference(keep))
                log.msg('keeping: %s' % keep, level=log.INFO)

                for i in keep:
                    description.append(item['description'][i])
                    url.append(item['url'][i])
                    name.append(item['name'][i])

                # populate Category()
                cat['description'] = description
                cat['url'] = url
                cat['name'] = name

            log.msg('Returning Item from Pipe', level=log.INFO)
            log.msg('%s' % cat, level=log.INFO)
            return cat

Notice that in the pipeline I import the Category class from items.py.
This acts just like the other items classes in the spider file.
As the data flows through the pipeline, each word is filtered out of item['description']
and the results are repackaged into the Category() instance cat.

The output is then a dictionary of lists, just like you wanted, which can be saved as .json.

What I learned is that the pipeline is specific to the type of solution you are looking for.
Since you want a specific output, the pipeline has to be constructed accordingly.
The item loaders and pipelines are still confusing to me, so this is what it is: a 'hack'.
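
For completeness, the crawl is run the same way as mentioned earlier in the thread (the output filename here is just an example):

scrapy crawl dmoz_solution -o output.json -t json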


shadow

Jul 3, 2012, 8:46:39 AM
to scrapy...@googlegroups.com
Hi, I've tested out your code, and it works great! Looking back at the tens of other spiders I've been working on that will go into production later, though, I actually think a "universally implemented" feed exporter class would be cleaner. Don't you think?
Unfortunately, I haven't got the time to read through the relevant logic in Scrapy and develop it.

Please let me know your thoughts

Roberto Fuentes

Jul 3, 2012, 12:30:27 PM
to scrapy...@googlegroups.com
After going through the motions of constructing a pipeline, the main issues I encountered were how the spider sends the data down the pipeline, and how the pipeline is constructed to deal with that data for output.

The spider should be used to extract the data and package it as simply as possible, with no nesting of any kind (beyond what the item loader's load_item() gives you).
The pipeline is then used to deal with the data coming from the spider, and customizable outputs are configured there.

The feed exports are pretty general in my opinion, but if the data isn't packaged a certain way, they won't help you much.
In other words, I don't think the feed exports are the problem; it's that the user needs to know to package the data a certain way.
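
For example (just an illustration, untested): if a field holds only plain dicts, lists and strings, instead of nested Item instances, there is nothing in it that the JSON feed export can't serialize.

# untested illustration of "packaging" the data with plain dicts
from scrapy.item import Item, Field


class Category(Item):
    name = Field()
    sites = Field()


cat = Category()
cat['name'] = u'Personal Pages'
# plain dicts instead of nested Website items
cat['sites'] = [
    {'name': u'Altis, Kevin',
     'url': u'http://radio-weblogs.com/0102677/categories/python/'},
]
# every value in cat is now a dict, list or string, so the exporter's
# JSON encoder has nothing it cannot handle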

Scrapy has a lot of nuances, and this is just one of them.
I'm still learning it, and the best way to test Scrapy is just to scrape and scrape the web.


shadow

Jul 5, 2012, 11:29:13 PM
to scrapy...@googlegroups.com
I see, this makes sense. I still find it hard to believe that such a seemingly simple problem hasn't been solved properly in Scrapy after so many years; it can't be that everyone is scraping "linear" data...

Pablo Hoffman

Aug 22, 2012, 12:53:30 PM
to scrapy...@googlegroups.com
This is now fixed in the master branch, with this change.