attempting something like a bulk insert ignore


strattonbrazil

Oct 21, 2009, 2:28:56 PM10/21/09
to sqlalchemy
From what I've read of sqlalchemy, I originally wanted to have a main
table with one attribute foreign keyed to another table's
autoincremented integer primary key userId and a userName. I thought
I could join the tables together and set that as the mapper, then save
every object in my session.  I thought I had it all figured out until
I found out that specifically for sqlite, autoincrement (or
autogenerated primary keys) don't work with "composite" tables/using a
join.

I've tried to do it a number of different ways and was hoping to get a
better way to do it.

Option #1
Insert into the tables separately.  Without using a session, insert
every user name into the users table, catching every uniqueness error
and ignoring it.  Run a select to get the primary keys back, then look
up the userId before inserting an entry into the first table.  I've
tried something like this, but it was pretty slow; I think that was
because I was doing a selection every time instead of one big one at
the end.  Plus, catching unique errors seems like a hack, but I don't
see any support for an insert-ignore.  I've tried using a bulk insert,
but it seems that if there are any duplicates an error is thrown and
nothing is inserted.  I could filter out all the duplicates beforehand,
but I'd have assumed sqlalchemy/sqlite would be able to handle this for
me.  It might get hairy, too, with other instances of the program
running on the same database.  If I read back the names from the
database, another instance could add another name that might kill my
filtered bulk insert.
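[As a raw-SQL aside, outside of sqlalchemy: SQLite itself does provide an
insert-ignore.  INSERT OR IGNORE skips rows that would violate a unique
constraint rather than aborting the whole statement.  A minimal sketch
with the stdlib sqlite3 module, using a hypothetical users table like
the one described above:]

```python
import sqlite3

# In-memory database with a hypothetical users table mirroring the one
# described above: autoincrementing integer pk plus a unique user name.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users (
        userId   INTEGER PRIMARY KEY,
        userName TEXT UNIQUE
    )
""")

names = [("alice",), ("bob",), ("alice",)]  # contains a duplicate

# INSERT OR IGNORE skips rows that would violate the unique constraint
# instead of aborting the whole statement.
conn.executemany("INSERT OR IGNORE INTO users (userName) VALUES (?)", names)
conn.commit()

# Only two rows made it in; the duplicate "alice" was silently skipped.
rows = conn.execute("SELECT userId, userName FROM users ORDER BY userId").fetchall()
print(rows)  # [(1, 'alice'), (2, 'bob')]
```

[Whether that conflict clause is reachable through the 0.5 SQLA API is a
separate question, but it shows the database can do the skipping itself.]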

Option #2
Somehow get it to work using a session.  Since I can't do the join
like I wanted (including the sqlite autoincrement feature), add
everything per table as in option #1, creating a mapper for each table.
This seems like a hack, having to map and unmap, and I still have to
catch duplicate errors.

Is there a better way to go about doing this?

Michael Bayer

Oct 21, 2009, 2:35:01 PM10/21/09
to sqlal...@googlegroups.com
strattonbrazil wrote:
>
> From what I've read of sqlalchemy, I originally wanted to have a main
> table with one attribute foreign keyed to another table's
> autoincremented integer primary key userId and a userName. I thought
> I could join the tables together and set that as the mapper, then save
> every object in my session.  I thought I had it all figured out until
> I found out that specifically for sqlite, autoincrement (or
> autogenerated primary keys) don't work with "composite" tables/using a
> join.

as long as the primary key of table #1 is a single-column primary key,
SQLite autoincrements it. To get the pk into your joined table, set the
two columns to the same attribute. e.g. as in
http://www.sqlalchemy.org/docs/05/mappers.html#mapping-a-class-against-multiple-tables
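[For reference, a self-contained sketch of that doc recipe, written
against the modern SQLAlchemy 2.x API (registry.map_imperatively) rather
than the 0.5 mapper() used in this thread; the table and class names are
invented for illustration:]

```python
from sqlalchemy import (Column, ForeignKey, Integer, MetaData, String,
                        Table, create_engine)
from sqlalchemy.orm import Session, registry

metadata = MetaData()

users = Table("users", metadata,
              Column("userId", Integer, primary_key=True),
              Column("userName", String, unique=True))

jobs = Table("jobs", metadata,
             Column("jobId", Integer, primary_key=True),
             Column("userId", Integer, ForeignKey("users.userId")))

class JobWithUser:
    pass

# Both userId columns map to one attribute, so the value autogenerated
# for users.userId is copied into jobs.userId at flush time.
registry().map_imperatively(JobWithUser, jobs.join(users), properties={
    "userId": [jobs.c.userId, users.c.userId],
})

engine = create_engine("sqlite://")
metadata.create_all(engine)

with Session(engine) as session:
    obj = JobWithUser()
    obj.userName = "jack"
    session.add(obj)
    session.commit()
    job_id, user_id = obj.jobId, obj.userId

print(job_id, user_id)  # both autogenerated by SQLite
```

[The point is the two-element list for "userId": the autoincremented
users.userId flows into jobs.userId, which is the behavior the original
post was after.]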

Josh Stratton

Oct 21, 2009, 4:45:44 PM10/21/09
to sqlal...@googlegroups.com
>> From what I've read of sqlalchemy, I originally wanted to have a main
>> table with one attribute foreign keyed to another table's
>> autoincremented integer primary key userId and a userName.  I thought
>> I could join the tables together and set that as the mapper, then save
>> every object in my session.  I thought I had it all figured out until
>> I found out that specifically for sqlite, autoincrement (or
>> autogenerated primary keys) don't work with "composite" tables/using a
>> join.
>
> as long as the primary key of table #1 is a single-column primary key,
> SQLite autoincrements it.  To get the pk into your joined table, set the
> two columns to the same attribute.   e.g. as in
> http://www.sqlalchemy.org/docs/05/mappers.html#mapping-a-class-against-multiple-tables

Ah, I was going to reference another page, but I see you are the author. Heh.

http://www.mail-archive.com/sqlal...@googlegroups.com/msg09503.html

Anyway, isn't this the same problem? If I don't use foreign keys
pointing the main table to the user lookup, I get an error:

<class 'sqlalchemy.exceptions.ArgumentError'>: Can't find any foreign
key relationships between 'jobs' and 'users'

Maybe I'm explaining my problem wrong.

I have table A that has a few preassigned primary keys and
non-primary_key column 'userId', which references table B.
I have table B that has an autoincrementing primary key 'userId' and a
userName column that is unique.  Based on the linked message you wrote
earlier, it seems like what you're now proposing isn't possible in
sqlite.

Michael Bayer

Oct 21, 2009, 4:52:11 PM10/21/09
to sqlal...@googlegroups.com
Josh Stratton wrote:
>
>>> From what I've read of sqlalchemy, I originally wanted to have a main
>>> table with one attribute foreign keyed to another table's
>>> autoincremented integer primary key userId and a userName.  I thought
>>> I could join the tables together and set that as the mapper, then save
>>> every object in my session.  I thought I had it all figured out until
>>> I found out that specifically for sqlite, autoincrement (or
>>> autogenerated primary keys) don't work with "composite" tables/using a
>>> join.
>>
>> as long as the primary key of table #1 is a single-column primary key,
>> SQLite autoincrements it.  To get the pk into your joined table, set the
>> two columns to the same attribute.   e.g. as in
>> http://www.sqlalchemy.org/docs/05/mappers.html#mapping-a-class-against-multiple-tables
>
> Ah, I was going to reference another page, but I see you are the author.
> Heh.
>
> http://www.mail-archive.com/sqlal...@googlegroups.com/msg09503.html
>
> Anyway, isn't this the same problem?

that email refers to a composite primary key in a single table. this is a
primary key that consists of more than one column. your original email
didn't seem to describe that usage.

> If I don't use foreign keys
> pointing the main table to the user lookup, I get an error:
>
> <class 'sqlalchemy.exceptions.ArgumentError'>: Can't find any foreign
> key relationships between 'jobs' and 'users'

this is always easily worked around by specifying the "ON" clause to your
join(), as the second argument of table1.join(table2,
and_(table1.c.foo==table2.c.bar, table1.c.bat==table2.c.hoho, ...)).

> I have table A that has a few preassigned primary keys and

here's a terminology problem.  A table in relational databases can only
have one primary key.  The primary key may be composed of multiple
columns, in which case it's called a "composite primary key".  So maybe
that is what you're describing here.

> non-primary_key column 'userId', which references table B.
> I have table B that has an autoincrementing primary key 'userId' and a
> userName column is unique. Based on the linked message you wrote
> earlier, it seems like what you're now proposing isn't possible in
> sqlite.

as long as the sqlite table has only one column that is declared as
primary key in the CREATE TABLE statement, sqlite can autoincrement.  a
"userId" column declared as the PRIMARY KEY and a second "userName" column
that only has a unique constraint on it does not impact SQLite's
autoincrement capability.
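[A quick demonstration of that point with the stdlib sqlite3 module;
the schema names here are hypothetical:]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Single-column INTEGER PRIMARY KEY: SQLite autogenerates it.  The
# separate unique constraint on userName has no effect on that.
conn.execute("""
    CREATE TABLE users (
        userId   INTEGER PRIMARY KEY,
        userName TEXT NOT NULL UNIQUE
    )
""")

cur = conn.execute("INSERT INTO users (userName) VALUES (?)", ("totsuka",))
first_id = cur.lastrowid
cur = conn.execute("INSERT INTO users (userName) VALUES (?)", ("stratton",))
second_id = cur.lastrowid

print(first_id, second_id)  # 1 2 -- both keys were autogenerated
```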

Josh Stratton

Oct 21, 2009, 6:00:13 PM10/21/09
to sqlal...@googlegroups.com
> this is always easily worked around by specifing the "ON" clause to your
> join(), as the second argument of table1.join(table2,
> and_(table1.c.foo==table2.c.bar, table1.c.bat==table2.c.hoho, ...)).

Ah, okay. No foreign key, just a join. For reference, this was what
I ended up going with.
j = jobsTable.join(usersTable, jobsTable.c.userId.like(usersTable.c.userId))

> as long as the sqlite table has only one column that is declared as
> primary key in the CREATE TABLE statement, sqlite can autoincrement.  a
> "userId" column declared as the PRIMARY KEY and a second "userName" column
> that only has a unique constraint on it does not impact SQLite's
> autoincrement capability.

Ah, I see. I think the only thing I don't understand still is using
this mapping with a session and handling inserts that aren't unique.
Using a session, the problem presents itself at the commit, which
seems to abort the entire insertion instead of just the single
collision.  Maybe it's more of an SQL question, but when using a
session/mapper configuration like this, if one were continually trying
to insert data into the database that might have been added already,
how do you efficiently skip that insertion?

Michael Bayer

Oct 21, 2009, 6:11:00 PM10/21/09
to sqlal...@googlegroups.com
Josh Stratton wrote:
> Ah, I see. I think the only thing I don't understand still is using
> this mapping with a session and handling inserts that aren't unique.
> Using a session, the problem presents itself at the commit, which
> seems to mess up the entire insertion instead of the single collision.
> Maybe it's more of an SQL question, but when using a session/mapper
> configuration like this and one were continually trying to insert data
> into the database that might have been added already, how do you
> efficiently skip that insertion?

typically you have to work your code such that you've already loaded the
existing rows you'd care about. or you can do a load-each-row-at-a-time
type of thing. session.merge() can actually accomplish either result.

1. load everything first, then check for existing

my_stuff = session.query(MyStuff).filter(<criterion you care about>).all()

now everything is present in "my_stuff", as well as in the session. if
you then say:

obj = session.merge(MyStuff(id=12, username='jack'))

it will lookup MyStuff with that primary key (either "12" or "12, jack",
depending on what attributes comprise your primary key) in the session's
identity map. Since everything was already loaded, it shouldn't issue
any further database calls, and it will give you either the existing
"MyStuff" or return to you the new one you just made.

2. load-each-row-at-a-time

just remove the first query. the session will then look up things using
merge() as you ask.


There are other ways to do this on other databases, such as on PG you
can use savepoints to attempt the insert, fail, and then try again, or
MySQL provides a "REPLACE" construct (the SQLA ORM doesn't support that
directly); but the above approach tends to be the most straightforward.

Josh Stratton

Oct 21, 2009, 7:03:01 PM10/21/09
to sqlal...@googlegroups.com
> 2. load-each-row-at-a-time
>
> just remove the first query.  the session will then look up things using
> merge() as you ask.

But wouldn't this still cause an insertion error when I try merging a
job object whose userName is already in the table?

Michael Bayer

Oct 21, 2009, 8:34:17 PM10/21/09
to sqlal...@googlegroups.com
you'd need to establish the "primary key" from the mapper's point of
view in terms of both userId and userName. mapper() accepts a
"primary_key" argument for this purpose.

Alternatively you can just issue a query() for each object as you
insert, or as in the "load everything first" approach load them all into
a dict, i.e. dict(((e.userId, e.userName), e) for e in
sess.query(MyObject).all()).  Basically mimic the same functionality,
adding userName as part of the key.
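[In plain terms the dict is an application-side identity map.  A sketch
with the stdlib sqlite3 module and hypothetical names, keyed here by
userName alone since that's the unique column in this thread:]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (userId INTEGER PRIMARY KEY, userName TEXT UNIQUE)")
conn.execute("INSERT INTO users (userName) VALUES ('totsuka')")

# Load everything up front into a dict keyed by userName, mimicking the
# session's identity-map lookup.
existing = {name: uid for uid, name in
            conn.execute("SELECT userId, userName FROM users")}

def get_or_create(conn, existing, name):
    """Return the userId for name, inserting only when it isn't cached."""
    if name not in existing:
        cur = conn.execute("INSERT INTO users (userName) VALUES (?)", (name,))
        existing[name] = cur.lastrowid
    return existing[name]

ids = [get_or_create(conn, existing, n) for n in ("totsuka", "newuser", "totsuka")]
print(ids)  # [1, 2, 1] -- duplicates reuse the cached key
```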



Josh Stratton

Oct 22, 2009, 12:51:28 PM10/22/09
to sqlal...@googlegroups.com
> you'd need to establish the "primary key" from the mapper's point of
> view in terms of both userId and userName.   mapper() accepts a
> "primary_key" argument for this purpose.

It kind of surprises me that sqlalchemy isn't aware of what's a primary
key and what isn't.  Looking at the docs, it states mapper accepts a
primary_key argument as a list of Column objects.  I assume those are
the objects stored in the 'c' attribute.

mapper(Job, joinOfJobsAndUsers,
       properties={
           'jobId': jobsTable.c.farmJobId,  # part of the primary key of the jobs table
           'try': jobsTable.c.tryIndex,     # part of the primary key of the jobs table
           'userId': [jobsTable.c.userId, usersTable.c.userId],  # the 2nd is the primary key of usersTable
           'user': usersTable.c.userName
       },
       primary_key=[usersTable.c.userId, usersTable.c.userName])

Is this what you mean?  I've tried this, as well as
primary_key=[usersTable.c.userId] and the other attributes contributing
to the primary key, and I still see IntegrityErrors when inserting into
the users table.

<class 'sqlalchemy.exceptions.IntegrityError'>: (IntegrityError)
column userName is not unique u'INSERT INTO users ("userName") VALUES
(?)' ['totsuka']

According to the 0.5.6 docs, which I assume are pretty close to 0.4's,

primary_key
A list of Column objects which define the primary key to be used
against this mapper’s selectable unit. This is normally simply the
primary key of the local_table, but can be overridden here.

Looking at this, I'm not sure why userName has to be included at all.

Michael Bayer

Oct 22, 2009, 1:24:43 PM10/22/09
to sqlal...@googlegroups.com
Josh Stratton wrote:
>
> I've tried this and
> primary_key=[usersTable.c.userId] and all the attributes contributing
> to the primary key and still see IntegrityErrors when inserting into
> the users table.
>
> <class 'sqlalchemy.exceptions.IntegrityError'>: (IntegrityError)
> column userName is not unique u'INSERT INTO users ("userName") VALUES
> (?)' ['totsuka']

OK, you have the unique constraint on userName alone. that means that you
*cannot* have two rows like this:

userId userName
1 someuser
2 someuser

because unlike a primary key that consists of "userId" and "userName", the
distinctness here of "userName" is independent of that of userId.

Assuming this is what you want, please issue a query() beforehand for each
object you wish to create, asking for any existing object which has the
given username. If it already exists, populate that object with your new
data. Otherwise create a new object and add() it to your session.
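[That query-first pattern, sketched in raw SQL with the stdlib sqlite3
module; the schema is hypothetical, and in the ORM the SELECT would be a
session.query(...) and the INSERT a session.add():]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (userId INTEGER PRIMARY KEY, userName TEXT UNIQUE)")

def upsert_user(conn, name):
    """Query for an existing row first; insert only when none exists."""
    row = conn.execute(
        "SELECT userId FROM users WHERE userName = ?", (name,)).fetchone()
    if row is not None:
        return row[0]          # already present: reuse, no IntegrityError
    cur = conn.execute("INSERT INTO users (userName) VALUES (?)", (name,))
    return cur.lastrowid       # freshly inserted

a = upsert_user(conn, "totsuka")
b = upsert_user(conn, "totsuka")   # second call finds the existing row
print(a, b)  # 1 1
```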


Josh Stratton

Oct 22, 2009, 4:55:54 PM10/22/09
to sqlal...@googlegroups.com
> OK, you have the unique constraint on userName alone.  that means that you
> *cannot* have two rows like this:
>
> userId      userName
> 1           someuser
> 2           someuser

Right.

> because unlike a primary key that consists of "userId" and "userName", the
> distinctness here of "userName" is independent of that of userId.
>
> Assuming this is what you want, please issue a query() beforehand for each
> object you wish to create, asking for any existing object which has the
> given username.   If it already exists, populate that object with your new
> data.  Otherwise create a new object and add() it to your session.

Maybe my setup isn't correct, but if I do something like this:

session.add(myObj)
session.flush()

# try adding to the database here
# check if it's in there
matches = [obj for obj in session.query(MyObj)]

If matches comes back as an empty list, can I assume something is
seriously wrong?  Shouldn't this work even without the flush, since the
session keeps track of it?

Michael Bayer

Oct 22, 2009, 5:45:03 PM10/22/09
to sqlal...@googlegroups.com
Josh Stratton wrote:
>
> Maybe my setup isn't correct, but if I do something like this:
>
> session.add(myObj)
> session.flush()
>
> # try adding to the database here
> # check if it's in there
> matches = [obj for obj in session.query(MyObj)]
>
> And matches is an empty list, can I assume something is seriously
> wrong? Shouldn't this work even without the flush since the session
> keeps track of it?

if the flush succeeds and that's a MyObj, then yes, it should be coming
back from the MyObj query.

There's a deprecated setting on mapper() called allow_null_pks=True.
Since you're mapping to a join, if any of the values within the composite
primary key of a particular row are NULL you'll want to set this to True.
It's on by default in future versions of SQLA.  Not sure if this applies
to your specific situation, though.


