Bulk Insert Broken for Polymorphism?


Alex Hewson

Feb 29, 2016, 5:38:22 PM
to sqlalchemy
Hello All,

I'm trying to use the new bulk_save_objects() to improve performance on bulk inserts, and have run into a problem.  If bulk_save_objects() is used to save objects of a polymorphic class:
  1. They are created correctly in the DB, with the polymorphic type column populated correctly.
  2. BUT queries for the new objects will return objects of the incorrect type.  In my case I'm getting instances of Child1 back when I would expect to get a Child2.
The following code demonstrates the problem:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from sqlalchemy import create_engine
from sqlalchemy import Column, Integer, SmallInteger, String, ForeignKey
from sqlalchemy.orm import sessionmaker
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Entity(Base):
  __tablename__   = 'Entity'
  Id              = Column(Integer, primary_key=True, nullable=False)
  Content         = Column(String)
  _polytype       = Column(SmallInteger, nullable=False)

  __mapper_args__ = {
    'polymorphic_identity': 1,
    'polymorphic_on': _polytype
  }

class Child1(Entity):
  __tablename__   = 'Child1'
  MyId            = Column(ForeignKey("Entity.Id"), primary_key=True)
  __mapper_args__ = {'polymorphic_identity': 11}

class Child2(Entity):
  __tablename__   = 'Child2'
  MyId            = Column(ForeignKey("Entity.Id"), primary_key=True)
  __mapper_args__ = {'polymorphic_identity': 12}


if __name__ == '__main__':
  # engine = create_engine('sqlite:///:memory:', echo=False)
  engine = create_engine('sqlite:///test.db', echo=False)
  Session = sessionmaker(bind=engine)
  sess = Session()
  Base.metadata.create_all(engine)
  c1_many = [Child1(Content="c1inst_%d" % i) for i in range(0, 1000)]
  c2_many = [Child2(Content="c2inst_%d" % i) for i in range(0, 1000)]
  sess.bulk_save_objects(c1_many)
  sess.bulk_save_objects(c2_many)
  # sess.add_all(c1_many)
  # sess.add_all(c2_many)
  sess.flush()
  sess.commit()
  for c in sess.query(Child1):
    assert isinstance(c, Child1)
  for c in sess.query(Child2):
    assert isinstance(c, Child2)


All the calls to assert isinstance(c, Child1) complete successfully.  But once we start checking for Child2 - boom, we are still getting back Child1 instances.

At first I wondered if I was misunderstanding SA's implementation of polymorphism, so I tried inserting rows the traditional way with sess.add_all().  But that works fine, so I think I've exposed a bug in the new bulk_save_objects() code.

My environment is Python 3.5.1, SQLAlchemy==1.0.12, SQLite 3.8.10.2 on OSX.

Mike Bayer

Feb 29, 2016, 6:05:26 PM
to sqlal...@googlegroups.com


On 02/29/2016 05:38 PM, Alex Hewson wrote:
> Hello All,
>
> I'm trying to use the new bulk_save_objects() to improve performance on
> bulk inserts, and have run into a problem. If bulk_save_objects() is
> used to save objects of a polymorphic class..
>
> 1. They are created correctly in the DB, with polymorphic type column
> populated correctly
> 2. BUT queries for the new objects will return one of incorrect type.
> In my case I'm getting instances of Child1 back when I would expect
> to get a Child2.


turn on echo=True, and you'll see this:


INSERT INTO "Child1" DEFAULT VALUES
2016-02-29 17:48:11,349 INFO sqlalchemy.engine.base.Engine ((), (), (),
(), (), (), (), () ... displaying 10 of 1000 total bound parameter sets
... (), ())

what you will notice here is that this is the Child1 table receiving
entirely empty rows; the primary key values from Entity are nowhere to
be found. SQLite does not enforce foreign keys by default so it's just
auto-generating identifiers here, something that wouldn't happen on most
other databases where this column wouldn't work as an autoincrement by
default. If you run it on Postgresql you get:

sqlalchemy.exc.IntegrityError: (psycopg2.IntegrityError) null value in
column "MyId" violates not-null constraint
DETAIL: Failing row contains (null).
[SQL: 'INSERT INTO "Child1" DEFAULT VALUES'] [parameters: ({}, {}, {},
{}, {}, {}, {}, {} ... displaying 10 of 1000 total bound parameter sets
... {}, {})]

What's happening here is documented; however, the documentation for the bulk
operations is a little long, and the part referring to this use case might
benefit from a little more boldface and probably should be more clearly
listed as "will not work", instead of a somewhat casual "however". At
http://docs.sqlalchemy.org/en/rel_1_0/orm/persistence_techniques.html#orm-compatibility
(emphasis added):

Multi-table mappings, such as joined-inheritance - **however**, an
object to be inserted across multiple tables either **needs to have
primary key identifiers fully populated ahead of time**, else the
Session.bulk_save_objects.return_defaults flag must be used, which will
greatly reduce the performance benefits

what we mean here is this:

c1_many = [Child1(Id=i+1, MyId=i+1, Content="c1inst_%d" % i)
           for i in range(0, 1000)]
c2_many = [Child2(Id=i+1001, MyId=i+1001, Content="c2inst_%d" % i)
           for i in range(0, 1000)]
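The other route the documentation above mentions is the return_defaults flag, which makes the bulk path fetch each generated Entity.Id so the child-table rows can be populated automatically. A condensed sketch of that (slower, but no manual key bookkeeping; the try/except import is only there to keep the sketch runnable across SQLAlchemy versions):

```python
from sqlalchemy import create_engine, Column, Integer, SmallInteger, String, ForeignKey
from sqlalchemy.orm import sessionmaker
try:
    from sqlalchemy.orm import declarative_base           # SQLAlchemy 1.4+
except ImportError:
    from sqlalchemy.ext.declarative import declarative_base  # 1.0 - 1.3

Base = declarative_base()

class Entity(Base):
    __tablename__ = 'Entity'
    Id = Column(Integer, primary_key=True)
    Content = Column(String)
    _polytype = Column(SmallInteger, nullable=False)
    __mapper_args__ = {'polymorphic_identity': 1, 'polymorphic_on': _polytype}

class Child2(Entity):
    __tablename__ = 'Child2'
    MyId = Column(ForeignKey('Entity.Id'), primary_key=True)
    __mapper_args__ = {'polymorphic_identity': 12}

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
sess = sessionmaker(bind=engine)()

c2_many = [Child2(Content='c2inst_%d' % i) for i in range(100)]
# return_defaults=True forces a per-row fetch of the generated Entity.Id,
# which is then used to populate Child2.MyId for the second INSERT --
# correct, but it gives up most of the bulk-insert speedup.
sess.bulk_save_objects(c2_many, return_defaults=True)
sess.commit()
```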


In SQLAlchemy 1.1, things are much easier to spot, even if you're using
a non-FK/non-autoincrement enforcing database like SQLite; running this
program immediately catches the problem on the Python side:

sqlalchemy.exc.CompileError: Column 'Child1.MyId' is marked as a
member of the primary key for table 'Child1', but has no Python-side or
server-side default generator indicated, nor does it indicate
'autoincrement=True' or 'nullable=True', and no explicit value is
passed. Primary key columns typically may not store NULL.

This is because 1.1 has changed the logic of the "autoincrement" flag
and adds deeper checks for NULL primary key values as described at
http://docs.sqlalchemy.org/en/latest/changelog/migration_11.html#the-autoincrement-directive-is-no-longer-implicitly-enabled-for-a-composite-primary-key-column.


So the mitigations for the fact that your specific test case silently
fails include:

1. this whole issue only silently passes on SQLite, not on any of the
higher volume databases where you'd want to use bulk operations in the
first place

2. documentation here should be spruced up to list this practice as a
**warning**, including a boldface note up in the earlier paragraph about
fetching of inserted primary keys being disabled (this is the slowest part
of the INSERT so has no place within bulk inserts; hence you must populate
columns dependent on a PK up front, which means the PK itself needs to be
populated up front in those cases where you need it)

3. SQLAlchemy 1.1 won't let these INSERTs proceed when a primary key value
is missing and the column is not configured as an "autoincrement" column
Thanks for the clear test case here.






Alex Hewson

Feb 29, 2016, 6:15:35 PM
to sqlalchemy
Hi Mike,

Thanks for the quick response.  If that's the intended behaviour I'll go back to non-bulk inserts for my inherited types.  Doubtless I could work around it by inserting N new Entities, fetching their autoincrement IDs and then using them to make Child1 and Child2 objects, but I don't trust myself with the added complexity.


Cheers,
Alex.

Mike Bayer

Feb 29, 2016, 6:31:23 PM
to sqlal...@googlegroups.com


On 02/29/2016 06:15 PM, Alex Hewson wrote:
> Hi Mike,
>
> Thanks for the quick response. If that's the intended behaviour I'll go
> back to non-bulk inserts for my inherited types. Doubtless I could work
> around it by inserting N new Entities, fetching their autoincrement ID's
> then using them to make Child1 and Child2's but I don't trust myself
> with the added complexity.

Well, it's the fetching of these autoincrement identifiers that takes all
the time, since you need to do it immediately on a per-row basis;
otherwise, at least on most databases besides SQLite, you can never know
whether other transactions have also inserted rows at the same time. Once
you're doing a SELECT or RETURNING for every row, you can no longer do an
efficient bulk insert.
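For comparison, when the identifiers are known up front the whole load can stay on the fast executemany path; a minimal Core-level sketch (table layout borrowed from the original example):

```python
from sqlalchemy import (create_engine, MetaData, Table, Column,
                        Integer, String, ForeignKey)

engine = create_engine('sqlite://')
meta = MetaData()
entity = Table('Entity', meta,
               Column('Id', Integer, primary_key=True),
               Column('Content', String))
child2 = Table('Child2', meta,
               Column('MyId', ForeignKey('Entity.Id'), primary_key=True))
meta.create_all(engine)

# Primary keys assigned up front, so each table needs only a single
# executemany-style INSERT -- no per-row SELECT or RETURNING.
rows = [{'Id': i + 1, 'Content': 'c2inst_%d' % i} for i in range(1000)]
with engine.begin() as conn:
    conn.execute(entity.insert(), rows)
    conn.execute(child2.insert(), [{'MyId': r['Id']} for r in rows])
```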

