Django/Scrapy Model foreign keys?

Paul

Jun 18, 2013, 2:28:00 AM
to scrapy...@googlegroups.com
Currently I can't figure out a way to handle Django model foreign keys. 

I set up my django models in a fashion similar to this: http://blog.just2us.com/2012/07/setting-up-django-with-scrapy/

Now when I try to "yield" a course item that is dependent on a Django model as the foreign key, I get:
#items.py
from mydjangoapp.models import School

class School(DjangoItem):
    django_model = University

#spider code 
course = CourseItem(name=course_name, dept_name=dept_name, professors=course_professor, url=url, school=School.objects.filter(name=response.meta['school'])[0])
yield course

#result
exceptions.TypeError: <University: University object> is not JSON serializable

I have also tried yielding the actual Django object instead of using a DjangoItem, but as you might imagine I also get an error.
#spider code
from mydjangoapp.models import School, Course
course = Course(name=course_name, dept_name=dept_name, professors=course_professor, url=url, school=School.objects.filter(name=response.meta['school'])[0])

#result
2013-06-18 01:23:20-0500 [my_spider] ERROR: Spider must return Request, BaseItem or None, got 'Course' in (url)

I have been wrestling with this problem for quite a while. Any ideas? There isn't much that I could find about using DjangoItem (or raw Django models) with Scrapy and foreign keys.

Thank you!
Paul

Paul Tremberth

Jun 18, 2013, 5:12:29 AM
to scrapy...@googlegroups.com
Hi,
I don't see the definition of University (nor where it's imported from).
Also, I would rename School(DjangoItem) to SchoolItem(DjangoItem) to avoid clashing with your Django model.
In my (little) experience with Django and Scrapy, the tricky thing is to configure access to your Django models inside Scrapy.
I haven't played with DjangoItem much, though.
Does your
    School.objects.filter(name=response.meta['school'])[0]
give you the objects you want?

Could you share more code and/or logs perhaps? (remove all sensitive/proprietary code and anonymize at will)

Paul.

Paul

Jun 18, 2013, 10:02:41 AM
to scrapy...@googlegroups.com
University should be "School", sorry, I was in the process of switching the name. 

So this should be:
from mydjangoapp.models import School
class School(DjangoItem):
    django_model = School

And the error is actually:
#result
exceptions.TypeError: <School: School object> is not JSON serializable

I am actually intentionally using my Django model for "School" in this case because that is how I retrieve the school object as the foreign key. If I use a DjangoItem in that field as the key, it won't let me retrieve the item from the database ("SchoolItem has no property 'objects'").

Here is the Django Model definition for School:
class School(models.Model):
    #id is implicit
    name = models.CharField(max_length=100)
    ...(more fields)
    date_updated = models.DateTimeField(default=datetime.now)

And for Course:
class Course(models.Model):
    #id is implicit
    school = models.ForeignKey(School, max_length=100)  #notice the foreign key attribute
    name = models.CharField(max_length=100)
    ...(other)
    date_updated = models.DateTimeField(default=datetime.now)

School.objects.filter(name=response.meta['school'])[0] does in fact give me the object I need. I pass the current school name via response.meta, and retrieve it using Django's syntax for object lookup.

As you can see, School is a foreign key field within Course. 

I have full access to my Django models and have imported my Django settings correctly as far as I can tell (meaning that I can say from mydjangoapp.models import Course within my Scrapy app, and it succeeds). 

Does that help? Thank you so much for your time!
Paul

Paul Tremberth

Jun 18, 2013, 10:06:19 AM
to scrapy...@googlegroups.com
Would renaming to SchoolItem change anything?

class SchoolItem(DjangoItem):
    django_model = School
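
One thing the rename does address is name shadowing. A minimal sketch, using stand-in classes instead of the real Django model and Scrapy's DjangoItem (both stand-ins, and the `objects` placeholder, are assumptions made only so the snippet runs on its own): once the item class reuses the name `School`, that name no longer points at the model, so a lookup such as `School.objects` has nothing to find.

```python
class School(object):
    """Hypothetical stand-in for the Django model; ``objects`` mimics a manager."""

    objects = ["manager placeholder"]


class DjangoItem(object):
    """Hypothetical stand-in for Scrapy's DjangoItem base class."""


model_class = School  # keep a handle on the model before the name is reused


class School(DjangoItem):  # rebinds the module-level name ``School``
    # Inside the class body the name still resolves to the model,
    # because the new class has not been bound to ``School`` yet.
    django_model = School


# After the class statement, ``School`` is the item class, not the model.
shadowed_has_objects = hasattr(School, "objects")
wrapped_model_is_original = School.django_model is model_class
```

Giving the item its own name (SchoolItem) keeps `School` bound to the model, so both can be used side by side in the spider.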

Paul

Jun 18, 2013, 10:06:25 AM
to scrapy...@googlegroups.com
Note: my problem is pretty much the same as this one:

Paul Tremberth

Jun 18, 2013, 10:10:26 AM
to scrapy...@googlegroups.com
Sorry, I hadn't fully read "If I use a DjangoItem in that field as the key, it won't let me retrieve the item from the database ("SchoolItem has no property 'objects'")."

Could you post your (partial) spider code to http://pastebin.com or a similar paste site?


Paul

Jun 18, 2013, 10:13:04 AM
to scrapy...@googlegroups.com
I get the same issue when I rename to "SchoolItem".

Note that I have to use 'School' (which is a django model, not a scrapy DjangoItem) to retrieve the foreign key object from my database in Django.
course = CourseItem(name=course_name, ...., school=School.objects.filter(name=response.meta['school'])[0])

Doing that yields the "School is not JSON serializable" error. The full trace is:
2013-06-18 09:07:39-0500 [spider] ERROR: Error caught on signal handler: <bound method ?.item_scraped of <scrapy.contrib.feedexport.FeedExporter object at 0xa5bbd0c>>
        Traceback (most recent call last):
          File "/usr/lib/python2.6/dist-packages/twisted/internet/defer.py", line 371, in _runCallbacks
            self.result = callback(self.result, *args, **kw)
          File "/usr/local/lib/python2.6/dist-packages/scrapy/core/scraper.py", line 213, in _itemproc_finished
            item=output, response=response, spider=spider)
          File "/usr/local/lib/python2.6/dist-packages/scrapy/signalmanager.py", line 23, in send_catch_log_deferred
            return signal.send_catch_log_deferred(*a, **kw)
          File "/usr/local/lib/python2.6/dist-packages/scrapy/utils/signal.py", line 53, in send_catch_log_deferred
            *arguments, **named)
        --- <exception caught here> ---
          File "/usr/lib/python2.6/dist-packages/twisted/internet/defer.py", line 117, in maybeDeferred
            result = f(*args, **kw)
          File "/usr/local/lib/python2.6/dist-packages/scrapy/xlib/pydispatch/robustapply.py", line 47, in robustApply
            return receiver(*arguments, **named)
          File "/usr/local/lib/python2.6/dist-packages/scrapy/contrib/feedexport.py", line 191, in item_scraped
            slot.exporter.export_item(item)
          File "/usr/local/lib/python2.6/dist-packages/scrapy/contrib/exporter/__init__.py", line 110, in export_item
            self.file.write(self.encoder.encode(itemdict))
          File "/usr/local/lib/python2.6/dist-packages/scrapy/utils/serialize.py", line 89, in encode
            return super(ScrapyJSONEncoder, self).encode(o)
          File "/usr/lib/python2.6/json/encoder.py", line 367, in encode
            chunks = list(self.iterencode(o))
          File "/usr/lib/python2.6/json/encoder.py", line 309, in _iterencode
            for chunk in self._iterencode_dict(o, markers):
          File "/usr/lib/python2.6/json/encoder.py", line 275, in _iterencode_dict
            for chunk in self._iterencode(value, markers):
          File "/usr/lib/python2.6/json/encoder.py", line 317, in _iterencode
            for chunk in self._iterencode_default(o, markers):
          File "/usr/lib/python2.6/json/encoder.py", line 323, in _iterencode_default
            newobj = self.default(o)
          File "/usr/local/lib/python2.6/dist-packages/scrapy/utils/serialize.py", line 109, in default
            return super(ScrapyJSONEncoder, self).default(o)
          File "/usr/lib/python2.6/json/encoder.py", line 344, in default
            raise TypeError(repr(o) + " is not JSON serializable")
        exceptions.TypeError: <School: School object> is not JSON serializable
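
The traceback above ends in the feed exporter's JSON encoder, not in the database layer: the item dict still holds a model instance, and the encoder has no rule for it. A minimal sketch of the same failure with only the standard library, plus one possible workaround (encoding the object as its primary key). The `School` class here is a hypothetical stand-in, and `PrimaryKeyEncoder` is an illustrative name, not something Scrapy provides; how to plug such an encoder into the feed export is not covered in this thread.

```python
import json


class School(object):
    """Hypothetical stand-in for the Django model; only pk and name matter here."""

    def __init__(self, pk, name):
        self.pk = pk
        self.name = name


class PrimaryKeyEncoder(json.JSONEncoder):
    """Encode any object that carries a ``pk`` attribute as that primary key."""

    def default(self, o):
        if hasattr(o, "pk"):
            return o.pk
        # Fall back to the stock behaviour, which raises TypeError.
        return super(PrimaryKeyEncoder, self).default(o)


item = {"name": "Intro to Scraping", "school": School(pk=5, name="Example U")}

try:
    json.dumps(item)
    plain_failed = False
except TypeError:
    # Same failure mode as in the traceback: no encoding rule for School.
    plain_failed = True

# With the custom encoder the nested object is reduced to its primary key.
encoded = json.dumps(item, cls=PrimaryKeyEncoder, sort_keys=True)
```

The simpler alternative is to keep model instances out of the item entirely and store only plain values such as the primary key.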

Paul

Jun 18, 2013, 10:19:30 AM
to scrapy...@googlegroups.com
I don't think the problem is so much with the spider (there's a lot of code, but it worked until this foreign key issue) as with handling foreign keys in Scrapy models in general.


And I have the same error. 

Mainly, I just need to know: if I have two Django models, and one references the other through a foreign key, how can I insert into them from Scrapy? Maybe there is some way to override the Course after I have created it, and make the foreign key object JSON serializable before yielding it?

If you want to chat offline/screenshare for speed purposes let me know. Thank you for your time and quick replies!

Paul Tremberth

Jun 18, 2013, 10:21:00 AM
to scrapy...@googlegroups.com
I don't know how DjangoItem handles that, but what about passing the school_id in CourseItem?
CourseItem(name=course_name, ...., school_id=School.objects.filter(name=response.meta['school'])[0].id)

You probably need to tweak the Pipeline after that

Paul

Jun 18, 2013, 10:31:10 AM
to scrapy...@googlegroups.com
Should I forgo the use of DjangoItems altogether? I was looking for a way but couldn't figure out how to do that.

My current pipeline is:

class DjangoPipeline(object):
    def process_item(self, item, spider):
        item.save()
        return item

I'm not sure how I would change that though.

Unfortunately, although school_id is the primary key of the referenced School, Django expects an object instead of an ID, so I get:
exceptions.ValueError: Cannot assign "5L": "Course.school" must be a "School" instance.

Paul Tremberth

Jun 18, 2013, 10:41:38 AM
to scrapy...@googlegroups.com
Normally you can create Django objects by passing the primary key of a referenced object under the field name plus the "_id" suffix:
Course(name=..., school_id=5)

If that doesn't work with DjangoItem, you should look into its implementation.

Otherwise, you could indeed do the insert in a second stage processing, having the school_id in your regular Item CourseItem
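
A minimal sketch of that second-stage idea, under stated assumptions: `Course` and `CourseItem` below are hypothetical stand-ins (a real project would import the Django model and a regular Scrapy Item), and `CoursePipeline` is an illustrative name. The item carries only a plain integer `school_id`, so it stays serializable, and the model object is built in the pipeline.

```python
class Course(object):
    """Hypothetical stand-in for the Django Course model; saves go to a list."""

    saved = []

    def __init__(self, **fields):
        self.fields = fields

    def save(self):
        Course.saved.append(self.fields)


class CourseItem(dict):
    """Hypothetical stand-in for a regular Scrapy item with simple values only."""


class CoursePipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, CourseItem):
            # school_id is a plain integer, so nothing unserializable ever
            # sits in the item; the model is only built here.
            Course(**dict(item)).save()
        # Always hand the item on so later pipelines and exporters see it.
        return item


item = CourseItem(name="Intro to Scraping", school_id=5)
result = CoursePipeline().process_item(item, spider=None)
```

With the real model, the constructor call corresponds to the Course(name=..., school_id=5) form shown above.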

Paul

Jun 18, 2013, 10:52:24 AM
to scrapy...@googlegroups.com
I'm going to need to do the second-stage processing; unfortunately, the other two options did not pan out (though I had not known about the _id suffix).

Can you suggest how I would do the second stage of processing? I know that I could use a pipeline, but given that many more objects than just "Course" will be passing through it, I am hesitant to use it.

I assume what I would do is create a DjangoItem called "SchoolItem", override the "school_id" to just a Field, and then when it hits the pipeline the pipeline would convert it to a django object and save it? My main question is how to do this efficiently, knowing that it will need to be done with other foreign key dependent objects in my project.

I feel close!
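
One way the efficiency concern could be approached, sketched with stand-ins only: a single pipeline that routes each item to a handler registered for its class, and a small cache so each school name is resolved to an ID once. Every name here (`DispatchPipeline`, `CachedLookup`, `ProfessorItem`, `fetch_school_id`) is hypothetical, and the dictionary lookup stands in for a real database query.

```python
class CourseItem(dict):
    """Hypothetical stand-in for a Scrapy item describing a course."""


class ProfessorItem(dict):
    """Hypothetical stand-in for another foreign-key-dependent item."""


class CachedLookup(object):
    """Remember name -> id results so each name is fetched only once."""

    def __init__(self, fetch_id):
        self._fetch_id = fetch_id  # callable that performs the real query
        self._ids = {}

    def __call__(self, name):
        if name not in self._ids:
            self._ids[name] = self._fetch_id(name)
        return self._ids[name]


class DispatchPipeline(object):
    """Route each item to the handler registered for its exact class."""

    def __init__(self, handlers):
        self._handlers = handlers

    def process_item(self, item, spider):
        handler = self._handlers.get(type(item))
        if handler is not None:
            handler(item)
        return item  # unregistered items pass through untouched


queries = []  # records every simulated database hit
rows = []     # records what would have been saved


def fetch_school_id(name):
    queries.append(name)
    return {"Example U": 5}[name]


school_id_for = CachedLookup(fetch_school_id)


def save_course(item):
    rows.append(("course", item["name"], school_id_for(item["school"])))


def save_professor(item):
    rows.append(("professor", item["name"], school_id_for(item["school"])))


pipeline = DispatchPipeline({CourseItem: save_course, ProfessorItem: save_professor})
for scraped in [
    CourseItem(name="Intro to Scraping", school="Example U"),
    ProfessorItem(name="Prof. Example", school="Example U"),
    {"name": "unregistered"},
]:
    pipeline.process_item(scraped, spider=None)
```

Adding another foreign-key-dependent item type then means registering one more handler rather than editing the pipeline itself.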