Difficulties with parallel data insertion into the same table

3,769 views
Skip to first unread message

SF Markus Elfring

unread,
Jan 22, 2015, 1:30:51 PM1/22/15
to sqlal...@googlegroups.com
Hello,

I try to write some data from a source code analysis
which is performed by a few processor cores simultaneously
into a single table.
Now I stumble on a message like the following again.

"…
sqlalchemy.exc.IntegrityError: (IntegrityError) duplicate key value violates unique constraint "pg_type_typname_nsp_index"
DETAIL: Key (typname, typnamespace)=(positions_parallel1_line_seq, 2200) already exists.
'\nCREATE TABLE positions_parallel1 (\n\tfunction […], pattern)\n)\n\n' {}
…"


The following software components were involved for this
application test on my openSUSE Tumbleweed system.

1. SQLAlchemy 0.9.8-78.1
2. Psycopg 2.5.2-3.1
3. PostgreSQL 9.3.5-3.1
4. Python 2.7.9-2.1
5. Coccinelle spatch 1.0.0-rc23
6. make 4.1-2.2


I have searched a bit on the internet. Are there any further
software development challenges to consider for the parallel
creation of database tables with the object-relational interface?

Regards,
Markus

Michael Bayer

unread,
Jan 22, 2015, 2:00:58 PM1/22/15
to sqlal...@googlegroups.com


SF Markus Elfring <elf...@users.sourceforge.net> wrote:

> Hello,
>
> I try to write some data from a source code analysis
> which is performed by a few processor cores simultaneously
> into a single table.
> Now I stumble on a message like the following again.
>
> "…
> sqlalchemy.exc.IntegrityError: (IntegrityError) duplicate key value violates unique constraint "pg_type_typname_nsp_index"
> DETAIL: Key (typname, typnamespace)=(positions_parallel1_line_seq, 2200) already exists.
> '\nCREATE TABLE positions_parallel1 (\n\tfunction […], pattern)\n)\n\n' {}
> …"
>
>
> The following software components were involved for this
> application test on my openSUSE Tumbleweed system.
>
> 1. SQLAlchemy 0.9.8-78.1
> 2. Psycopg 2.5.2-3.1
> 3. PostgreSQL 9.3.5-3.1
> 4. Python 2.7.9-2.1
> 5. Coccinelle spatch 1.0.0-rc23
> 6. make 4.1-2.2
>
>
> I have searched a bit on the internet.

Normally, this is an integrity constraint error which is raised by your database, which occurs when a process attempts to INSERT or UPDATE a row to include a key that already exists. When performing an INSERT or UPDATE you need to ensure that the row will not duplicate a uniquely-constraint value that is already present in the table.

However in this case, this appears to be a side effect of attempting to issue CREATE TABLE statements in parallel, and in particular attempting to create a table of the same name twice in two different processes. PG developers describe this expected behavior here: http://www.postgresql.org/message-id/CA+TgmoZAdYVtwBfp1FL2sMZb...@mail.gmail.com. I’m not surprised that PG does not prioritize attempting to make this error more user friendly as this is a very odd and probably unnecessary use case.


> Are there any further
> software development challenges to consider for the parallel
> creation of database tables
> with the object-relational interface?

this has nothing to do with the ORM nor does it really have much to do with SQLAlchemy as a whole. SQLAlchemy can be seen as a comprehensive system to automate the sending and receiving of messages along a database connection. You would have this issue in exactly the same way if you were emitting conflicting CREATE TABLE statements on the DBAPI cursor directly.

The concurrent creation of tables is an extremely unusual use case that I would not recommend, but especially when two different threads/processes are contending to create the same table, that suggests the system attempts to encode data within the database schema itself; that is, the existence of a table “XYZ” in fact represents the data “XYZ” being present. There are some high volume websites that do things like this, like Reddit, but I think they’re nuts. But certainly, for an application that isn’t handling millions of simultaneous users or hundreds of billions of rows, such an architecture is absolutely not called for. Data should be stored in rows, not in the names of tables.

Otherwise, the PG developers in that thread suggest this use case can be facilitated using postgres advisory locks. I have no experience with those, but see that thread for more details.










>
> Regards,
> Markus
>
> --
> You received this message because you are subscribed to the Google Groups "sqlalchemy" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to sqlalchemy+...@googlegroups.com.
> To post to this group, send email to sqlal...@googlegroups.com.
> Visit this group at http://groups.google.com/group/sqlalchemy.
> For more options, visit https://groups.google.com/d/optout.

SF Markus Elfring

unread,
Jan 22, 2015, 2:40:19 PM1/22/15
to Michael Bayer, sqlal...@googlegroups.com
> However in this case, this appears to be a side effect of attempting
> to issue CREATE TABLE statements in parallel, and in particular attempting
> to create a table of the same name twice in two different processes.

It seems that I dared this approach.


> PG developers describe this expected behavior here:
> http://www.postgresql.org/message-id/CA+TgmoZAdYVtwBfp1FL2sMZb...@mail.gmail.com

Does the dialogue on a topic like "Errors on CREATE TABLE IF NOT EXISTS"
show any remaining open issues?


> I’m not surprised that PG does not prioritize attempting to make
> this error more user friendly as this is a very odd and probably
> unnecessary use case.

I am surprised that this database software show such (unexpected) behaviour.


> this has nothing to do with the ORM nor does it really have much
> to do with SQLAlchemy as a whole.

It makes it more convenient to stress other software components,
doesn't it?


> You would have this issue in exactly the same way if you were emitting
> conflicting CREATE TABLE statements on the DBAPI cursor directly.

Thanks for your acknowledgement.


> The concurrent creation of tables is an extremely unusual use case
> that I would not recommend, but especially when two different
> threads/processes are contending to create the same table,

Should this use case become more known after the number of processor
cores grew in various computers through the years?


> that suggests the system attempts to encode data within the database
> schema itself; that is, the existence of a table “XYZ” in fact
> represents the data “XYZ” being present.

I do not try to achieve such a data encoding for my software application
at the moment.


> Otherwise, the PG developers in that thread suggest this use case
> can be facilitated using postgres advisory locks.

I am going to try another approach out.

* Will it be more useful here to extend a serial database preparation step
before records will be stored by background processes simultaneously?

* How often will corresponding implementation details need to be redesigned?

Regards,
Markus

Michael Bayer

unread,
Jan 22, 2015, 3:57:30 PM1/22/15
to SF Markus Elfring, sqlal...@googlegroups.com


SF Markus Elfring <elf...@users.sourceforge.net> wrote:

>
>> The concurrent creation of tables is an extremely unusual use case
>> that I would not recommend, but especially when two different
>> threads/processes are contending to create the same table,
>
> Should this use case become more known after the number of processor
> cores grew in various computers through the years?

not at all.

The schema of a relational database is fixed in a similar way as to the structure of the software itself. Just like it’s not a great idea to dynamically produce and compile new code within an application in order to solve an issue, it’s not a great idea to produce new schema structures on the fly. Decent relational database design will typically have as many as a few dozen tables in order to handle a particular component of a reasonably-designed application; databases that have hundreds of tables are typically dealing with multiple sub-components of an application. A schema that has “N” tables with no upper bound doesn’t actually have any schema design, the app is just generating tables in response to data input. So there is no valid use case for parallel generation of tables except in particular kinds of multi-tenancy situations.

Jonathan Vanasco

unread,
Jan 22, 2015, 4:08:27 PM1/22/15
to sqlal...@googlegroups.com, elf...@users.sourceforge.net
On Thursday, January 22, 2015 at 2:40:19 PM UTC-5, SF Markus Elfring wrote:
I am surprised that this database software show such (unexpected) behaviour.
 
Aside from what Michael wrote...

The reason why you're seeing an IntegrityError like that, is because Postgres is raising an integrity error on it's internal tables.  The `pg_` prefix is well known to relate to internal postgres system data.  PostgreSQL explicitly reserves the `pg_` prefix for it's own schemas and advises users against creating tables that begin with `pg_` as well (http://www.postgresql.org/docs/9.2/static/ddl-schemas.html).  So to those with experience in Postgres, this wouldn't be unexpected error and accurately explains what the underlying problem is.

SF Markus Elfring

unread,
Jan 22, 2015, 5:32:32 PM1/22/15
to Michael Bayer, sqlal...@googlegroups.com

> So there is no valid use case for parallel generation of tables
> except in particular kinds of multi-tenancy situations.

I find my use case simple enough (and therefore very valid).
I am going to manage two database tables by corresponding
Python classes with SQLAlchemy services.

I would appreciate if I can fill these data structures in parallel
without a serial database preparation step (table creation with
repeated meta-data specification).

Regards,
Markus

Michael Bayer

unread,
Jan 22, 2015, 5:55:34 PM1/22/15
to sqlal...@googlegroups.com


SF Markus Elfring <elf...@users.sourceforge.net> wrote:

>
You’d need to implement checks for this concurrency. A create table would need to be preceded by a check to see that it exists, for example, and that would likely need to be mutexed on that name so that no race condition occurs in between the time that the name is checked vs. the create table is emitted. if the mutex is on the database end, then you’d need to use Postgresql’s facilities; pg_advisory_lock() appears to be session level (http://www.postgresql.org/docs/9.1/static/functions-admin.html) , which is probably not sufficient if you are using multiple processes.



SF Markus Elfring

unread,
Jan 23, 2015, 2:21:09 AM1/23/15
to sqlal...@googlegroups.com
>> I would appreciate if I can fill these data structures in parallel
>> without a serial database preparation step (table creation with
>> repeated meta-data specification).
>
> You’d need to implement checks for this concurrency.

I would expect that database implementations will provide functionality
for parallel updates including concurrent creation of each table.


> A create table would need to be preceded by a check to see that it exists,
> for example, and that would likely need to be mutexed on that name so that
> no race condition occurs in between the time that the name is checked
> vs. the create table is emitted.

I am curious to clarify more software development challenges.
How many open issues are already known around such implementation details?

Regards,
Markus

Simon King

unread,
Jan 23, 2015, 5:01:27 AM1/23/15
to sqlal...@googlegroups.com
On Fri, Jan 23, 2015 at 7:20 AM, SF Markus Elfring
<elf...@users.sourceforge.net> wrote:
>>> I would appreciate if I can fill these data structures in parallel
>>> without a serial database preparation step (table creation with
>>> repeated meta-data specification).
>>
>> You’d need to implement checks for this concurrency.
>
> I would expect that database implementations will provide functionality
> for parallel updates including concurrent creation of each table.
>

What would you expect a database to do if it receives 2 "CREATE TABLE
my_table(...)" instructions simultaneously? What if the table
definitions are different? One of the instructions would *have* to
fail.

SF Markus Elfring

unread,
Jan 23, 2015, 6:45:46 AM1/23/15
to Simon King, sqlal...@googlegroups.com
>> I would expect that database implementations will provide functionality
>> for parallel updates including concurrent creation of each table.
>
> What would you expect a database to do if it receives 2 "CREATE TABLE
> my_table(...)" instructions simultaneously?

This depends on the passed parameters.


> What if the table definitions are different?

I am going to pass the same settings for the application I am developing
at the moment.


> One of the instructions would *have* to fail.

Not in every case.

The parameter "IF NOT EXISTS" could be passed to the SQL statement "CREATE TABLE".

* Is this setting already used by the class library "SQLAlchemy 0.9.8-78.1"?

* Does it really work in the current praxis?

Regards,
Markus

Simon King

unread,
Jan 23, 2015, 7:09:05 AM1/23/15
to SF Markus Elfring, sqlal...@googlegroups.com
On Fri, Jan 23, 2015 at 11:45 AM, SF Markus Elfring
<elf...@users.sourceforge.net> wrote:
>>> I would expect that database implementations will provide functionality
>>> for parallel updates including concurrent creation of each table.
>>
>> What would you expect a database to do if it receives 2 "CREATE TABLE
>> my_table(...)" instructions simultaneously?
>
> This depends on the passed parameters.
>

That makes the rules complicated. At the moment, I imagine the rules
are very simple - a "CREATE TABLE" statement will fail if the table
already exists, unless "IF NOT EXISTS" is passed. Changing the rules
to allow an identical CREATE TABLE statement to silently do nothing
sounds like a lot of work for very little benefit to me.

>
>> What if the table definitions are different?
>
> I am going to pass the same settings for the application I am developing
> at the moment.

(I was talking about the general case, not your specific case)

>
>
>> One of the instructions would *have* to fail.
>
> Not in every case.
>
> The parameter "IF NOT EXISTS" could be passed to the SQL statement "CREATE TABLE".
>
> * Is this setting already used by the class library "SQLAlchemy 0.9.8-78.1"?

From a quick scan of the docs it appears not.

>
> * Does it really work in the current praxis?
>

Are you asking if "IF NOT EXISTS" will work in a high-concurrency
context? I've no idea, but this is a very PostgreSQL-specific
question, so you'd probably get a better answer from them.

Simon

SF Markus Elfring

unread,
Jan 23, 2015, 7:17:55 AM1/23/15
to sqlal...@googlegroups.com, Simon King
>> The parameter "IF NOT EXISTS" could be passed to the SQL statement "CREATE TABLE".
>>
>> * Is this setting already used by the class library "SQLAlchemy 0.9.8-78.1"?
>
> From a quick scan of the docs it appears not.

How can parameter additions be achieved for this software?


> Are you asking if "IF NOT EXISTS" will work in a high-concurrency context?

Yes.

How many database implementations support the simultaneous table creation
according to the rules from transaction management?

Regards,
Markus

Simon King

unread,
Jan 23, 2015, 9:09:58 AM1/23/15
to SF Markus Elfring, sqlal...@googlegroups.com
On Fri, Jan 23, 2015 at 12:17 PM, SF Markus Elfring
<elf...@users.sourceforge.net> wrote:
>>> The parameter "IF NOT EXISTS" could be passed to the SQL statement "CREATE TABLE".
>>>
>>> * Is this setting already used by the class library "SQLAlchemy 0.9.8-78.1"?
>>
>> From a quick scan of the docs it appears not.
>
> How can parameter additions be achieved for this software?
>

The general mechanism for generating SQL that isn't directly supported
by SQLAlchemy is to use the "compiler" module:

http://docs.sqlalchemy.org/en/rel_0_9/core/compiler.html

I guess it's the CreateTable object you'd need to adapt:

http://docs.sqlalchemy.org/en/rel_0_9/core/ddl.html

Although this sounds to me like a reasonable flag to add to core (it's
supported by multiple database implementations), so perhaps if you
added support for it directly into SQLAlchemy, Mike might accept that
patch.

>
>> Are you asking if "IF NOT EXISTS" will work in a high-concurrency context?
>
> Yes.
>
> How many database implementations support the simultaneous table creation
> according to the rules from transaction management?
>

No idea. You'd need to ask the makers of all the databases you are
interested in.

Simon

Jonathan Vanasco

unread,
Jan 23, 2015, 11:14:28 AM1/23/15
to sqlal...@googlegroups.com, elf...@users.sourceforge.net
Using "IF NOT EXISTS" would not solve this problem in a high concurrency scenario.  

There would still be a race condition within the Postgres internal functions.  This is because of how Postgres checks for existing tables and creates new ones with its internal bookkeeping.  It's explained in the link that Mike shared above.

Have you tried using savepoints?  SqlAlchemy handles them within the "nested transaction" concept.  I'm not sure how Postgres would handle race conditions of table creation within a savepoint, but it might get you past this issue.

SF Markus Elfring

unread,
Jan 23, 2015, 11:30:33 AM1/23/15
to Jonathan Vanasco, sqlal...@googlegroups.com
> Using "IF NOT EXISTS" would not solve this problem
> in a high concurrency scenario.

Thanks for your feedback.


> There would still be a race condition within the
> Postgres internal functions.

Are there any chances that this database software
implementation will become robust and safe against
the discussed race condition?


> Have you tried using savepoints?

I am going to use a serial database preparation step
instead for my application so that the corresponding
tables will be explicitly deleted and created by
a small SQL script.

Regards,
Markus

Michael Bayer

unread,
Jan 23, 2015, 12:18:04 PM1/23/15
to sqlal...@googlegroups.com


SF Markus Elfring <elf...@users.sourceforge.net> wrote:

>
>
>> There would still be a race condition within the
>> Postgres internal functions.
>
> Are there any chances that this database software
> implementation will become robust and safe against
> the discussed race condition?


I wonder why you’re asking of the SQLAlchemy list about a specific
developmental goal of the Postgresql project? Wouldn’t you ask them about
this?

Jonathan Vanasco

unread,
Jan 23, 2015, 12:26:17 PM1/23/15
to sqlal...@googlegroups.com, jona...@findmeon.com, elf...@users.sourceforge.net

On Friday, January 23, 2015 at 11:30:33 AM UTC-5, SF Markus Elfring wrote:
Are there any chances that this database software
implementation will become robust and safe against
the discussed race condition?

I would not count on this happening in the near future as it doesn't seem to be slated as a bug to be worked on, but you can ask on one of the postgres lists (http://www.postgresql.org/list/). 

As others noted, dynamic table creation within an application's runtime isn't really a standard approach -- it usually only happens in specific situations (multi-tenancy platforms, admin interfaces, automated plugin installations [ie, wordpress, drupal, etc]).   Having multiple processes attempt to dynamically create tables in a high-concurrency situation is even less standard.  The status-quo is to have some sort of setup script, or a setup routine that happens before forking threads (or whatever).  If the table structure/name is known and expected to be used -- there's not really a good reason to defer creating it .  

SF Markus Elfring

unread,
Jan 23, 2015, 12:28:57 PM1/23/15
to sqlal...@googlegroups.com
> I wonder why you’re asking of the SQLAlchemy list about
> a specific developmental goal of the Postgresql project?

I hoped that some more corresponding experiences could
already be shared here.


> Wouldn’t you ask them about this?

That might follow ...


How should I add the parameter "IF NOT EXISTS" to Python
classes in the meantime eventually?

Regards,
Markus

SF Markus Elfring

unread,
Jan 23, 2015, 12:39:02 PM1/23/15
to Jonathan Vanasco, sqlal...@googlegroups.com
> If the table structure/name is known and expected to be used
> -- there's not really a good reason to defer creating it .

Is the reason good enough to avoid the repeated specification
of corresponding meta-data?
Is it safer to maintain and manage column attributes for some
tables only at a single place?

Regards,
Markus

Jonathan Vanasco

unread,
Jan 23, 2015, 12:42:49 PM1/23/15
to sqlal...@googlegroups.com, jona...@findmeon.com, elf...@users.sourceforge.net


On Friday, January 23, 2015 at 12:39:02 PM UTC-5, SF Markus Elfring wrote:
Is the reason good enough to avoid the repeated specification
of corresponding meta-data?  
Is it safer to maintain and manage column attributes for some
tables only at a single place?

 How are you currently specifying the meta-data?  Most SqlAlchemy implementations will only specify this once in the application.

SF Markus Elfring

unread,
Jan 23, 2015, 12:48:46 PM1/23/15
to Jonathan Vanasco, sqlal...@googlegroups.com
> How are you currently specifying the meta-data?

Should the Python class be sufficient for the definition
of a table structure?

Will the mapping interface work also without tables
that were created by other SQL scripts before?

Regards,
Markus

Jonathan Vanasco

unread,
Jan 23, 2015, 1:00:34 PM1/23/15
to sqlal...@googlegroups.com, elf...@users.sourceforge.net


On Friday, January 23, 2015 at 12:48:46 PM UTC-5, SF Markus Elfring wrote:
Should the Python class be sufficient for the definition
of a table structure?

If you're using the declarative syntax, yes.   It's common to have a `models.py` file that simply defines the classes in one place; then that is imported and metadata associated to the engine.  

 
Will the mapping interface work also without tables
that were created by other SQL scripts before?

SqlAlchemy is one of the very few ORMs that does not impose any design requirements.  Classes/Tables can be easily mapped to any legacy scenario.  The only issue that arises is if there are complex relationships/joins -- then special mappers may be needed.  There is an extended section in the docs for that, but it is a rare and advanced topic.

There are also 3rd party packages like sqlacodegen (https://pypi.python.org/pypi/sqlacodegen) that can query a legacy database for you and generate the corresponding SqlAlchemy python classes.

SF Markus Elfring

unread,
Jan 23, 2015, 1:10:37 PM1/23/15
to sqlal...@googlegroups.com
> Should the Python class be sufficient for the definition
> of a table structure?
>
> If you're using the declarative syntax, yes.

Thanks for your acknowledgement.


> It's common to have a `models.py` file that simply defines
> the classes in one place; then that is imported and metadata
> associated to the engine.

How often do you need to fill these data structures in
a concurrent way?

Does parallel table creation become more interesting then?

Regards,
Markus

Jonathan Vanasco

unread,
Jan 23, 2015, 1:18:59 PM1/23/15
to sqlal...@googlegroups.com, elf...@users.sourceforge.net

On Friday, January 23, 2015 at 1:10:37 PM UTC-5, SF Markus Elfring wrote:
How often do you need to fill these data structures in
a concurrent way? 

Does parallel table creation become more interesting then?


Often and Not at All.
Reply all
Reply to author
Forward
0 new messages