Multiple databases or multiple schemas?

Henry Precheur

Jun 13, 2010, 4:27:34 PM
to sqlal...@googlegroups.com
I'm starting a new project which is roughly hosting a bunch of sites. I
want the sites to be isolated, they'll all have the same schema (data
definition), but will store their data on different 'name-spaces'.

The front end will be a python web app. And I'd like to have this
application talk to all the 'name-spaces' at the same time:

With a request on example.com/site1, the web app will use the
'name-space' site1, with example.com/site2 it will use site2.

I am using Postgres. There are 2 options for the 'name-space': Database
or Schema [1]:
1. One database per site
2. One database for all the sites and 1 schema per site


Solution #1 would require maintaining one connection per site and per Python
process. That means lots of connections and lots of memory. On the
other hand, this solution is supported by SQLAlchemy out of the box.
I'll have a dictionary like this:

{'site1': create_engine('postgres://.../site1'),
 'site2': create_engine('postgres://.../site2'), ...}

And whenever a request comes in I get the right engine via this
dictionary.
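
Something like this, roughly (the connection URLs and the path-parsing
helper are made up for illustration):

from sqlalchemy import create_engine

engines = {
    'site1': create_engine('postgres://user@host/site1'),
    'site2': create_engine('postgres://user@host/site2'),
}

def site_from_path(path):
    # '/site1/some/page' -> 'site1'
    return path.lstrip('/').split('/', 1)[0]

def engine_for(path):
    # Pick the engine matching the site targeted by this request.
    return engines[site_from_path(path)]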


Solution #2 is not supported natively by SQLAlchemy. Each time a request
comes in I'll have to issue an additional query "SET search_path TO
MY_SITE" where MY_SITE is the schema associated with the site.


Solution #2 seems much more lightweight to me. The only problem is the
small overhead that might be created by the additional query.

What do you guys think? Will I get into trouble with solution #2?

If you have alternative suggestions I'd like to hear them :)


Regards,

[1]: http://www.postgresql.org/docs/8.4/interactive/ddl-schemas.html

--
Henry Prêcheur

Faheem Mitha

Jun 13, 2010, 5:23:22 PM
to sqlal...@googlegroups.com
On Sun, 13 Jun 2010 13:27:34 -0700, Henry Precheur <he...@precheur.org> wrote:
> I'm starting a new project which is roughly hosting a bunch of sites. I
> want the sites to be isolated, they'll all have the same schema (data
> definition), but will store their data on different 'name-spaces'.
>
> The front end will be a python web app. And I'd like to have this
> application talk to all the 'name-spaces' at the same time:
>
> With a request on example.com/site1, the web app will use the
> 'name-space' site1, with example.com/site2 it will use site2.
>
> I am using Postgres. There are 2 options for the 'name-space': Database
> or Schema [1]:
> 1. One database per site
> 2. One database for all the sites and 1 schema per site
>
>
> Solution #1 would require maintaining one connection per site and per Python
> process. That means lots of connections and lots of memory. On the
> other hand, this solution is supported by SQLAlchemy out of the box.
> I'll have a dictionary like this:
>
> {'site1': create_engine('postgres://.../site1'),
>  'site2': create_engine('postgres://.../site2'), ...}
>
> And whenever a request comes in I get the right engine via this
> dictionary.
>
>
> Solution #2 is not supported natively by SQLAlchemy. Each time a request
> comes in I'll have to issue an additional query "SET search_path TO
> MY_SITE" where MY_SITE is the schema associated with the site.

SQLAlchemy's Table can take the schema as an argument, e.g.

from sqlalchemy import (MetaData, Table, Column, ForeignKey,
                        String, SmallInteger)

metadata = MetaData()
schemaname = 'site1'  # the schema this table lives in

pheno_table = Table(
    'pheno', metadata,
    Column('patientid', String(60), primary_key=True),
    Column('famid', String(60), nullable=True),
    Column('sex_id', None, ForeignKey(schemaname + '.sex.val',
           onupdate='CASCADE', ondelete='CASCADE'), index=True),
    Column('race_id', None, ForeignKey(schemaname + '.race.val',
           onupdate='CASCADE', ondelete='CASCADE'), index=True),
    Column('phenotype', SmallInteger),
    schema=schemaname,
)

So I don't think you have to do that.

> Solution #2 seems much more lightweight to me. The only problem is the
> small overhead that might be created by the additional query.

I'm actually using multiple schemas in one db myself, and it seems to
me sqla supports this just fine. The only time I have to do

SET search_path TO MY_SITE

is when I access the db directly using psql. Of course, you might have
to worry about whether the web end of things supports schemas too.

Faheem.

> What do you guys think? Will I get into trouble with solution #2?
>
> If you have alternative suggestions I'd like to hear them :)
>
>
> Regards,
>
> [1]: http://www.postgresql.org/docs/8.4/interactive/ddl-schemas.html
>
> --

> Henry Prêcheur
>

Michael Bayer

Jun 14, 2010, 1:06:04 PM
to sqlal...@googlegroups.com

you can do the multi-schema thing using either a PoolListener, or a Table subclass that implements a @property for the "schema" attribute - in both cases these would be linked to a thread local variable that you would set per request. If the load per site is low, you can also forego the usage of pooling to eliminate the issue of dormant connections (possibly look into something like pgpool, perhaps they have some feature that optimizes this use case).
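
a rough, untested sketch of the PoolListener approach, using the
sqlalchemy.interfaces API - the thread local and names here are
placeholders you'd set at the start of each request:

import threading

from sqlalchemy import create_engine
from sqlalchemy.interfaces import PoolListener

current_site = threading.local()

class SearchPathListener(PoolListener):
    def checkout(self, dbapi_con, con_record, con_proxy):
        # Point every connection handed out by the pool at the
        # schema for the current request.
        schema = getattr(current_site, 'schema', 'public')
        cur = dbapi_con.cursor()
        cur.execute('SET search_path TO %s' % schema)
        cur.close()

engine = create_engine('postgres://user@host/sites',
                       listeners=[SearchPathListener()])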

As far as approach, consider the load each per-user site will have, and how you might scale out to include multiple database servers. I'd grep around the PG docs and wiki carefully to get a full grasp of the implications of each approach.


Henry Precheur

Jun 14, 2010, 2:29:40 PM
to sqlal...@googlegroups.com
On Mon, Jun 14, 2010 at 01:06:04PM -0400, Michael Bayer wrote:
> you can do the multi-schema thing using either a PoolListener, or a
> Table subclass that implements a @property for the "schema" attribute
> - in both cases these would be linked to a thread local variable that
> you would set per request. If the load per site is low, you can also
> forego the usage of pooling to eliminate the issue of dormant
> connections (possibly look into something like pgpool, perhaps they
> have some feature that optimizes this use case).

Both approaches would work. The PoolListener looks a bit better; the Table
subclass looks a bit hackish. BTW I am not planning to use threads, I'll
just use a dozen Python processes to make sure the site stays responsive
under heavy load. This should simplify pooling quite a bit: I'll just
use a StaticPool.
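
With one single-threaded process per worker, that would just be
something like this (the URL is a placeholder):

from sqlalchemy import create_engine
from sqlalchemy.pool import StaticPool

# StaticPool keeps exactly one connection per process, so a dozen
# worker processes means a dozen connections total.
engine = create_engine('postgres://user@host/sites', poolclass=StaticPool)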

> As far as approach, consider the load each per-user site will have,
> and how you might scale out to include multiple database servers.
> I'd grep around the PG docs and wiki carefully to get a full grasp of
> the implications of each approach.

Yes, scaling was one of the motivations for this approach; I definitely
need to learn more about how PG would handle this.

Simplifying the architecture was the main motivation. I don't want to
have a `site_id` column in every table. Also I think that putting each
site in its own schema would improve performance dramatically: instead
of big indexes covering all sites, there'll be small indexes for each
site.


Thank you so much; your reply helped a great deal.

--
Henry Prêcheur

Henry Precheur

Jun 14, 2010, 2:33:46 PM
to sqlal...@googlegroups.com
On Sun, Jun 13, 2010 at 09:23:22PM +0000, Faheem Mitha wrote:
> SQLAlchemy's Table can take the schema as an argument, e.g.
>
> pheno_table = Table(
>     'pheno', metadata,
>     Column('patientid', String(60), primary_key=True),
>     Column('famid', String(60), nullable=True),
>     Column('sex_id', None, ForeignKey(schemaname + '.sex.val',
>            onupdate='CASCADE', ondelete='CASCADE'), index=True),
>     Column('race_id', None, ForeignKey(schemaname + '.race.val',
>            onupdate='CASCADE', ondelete='CASCADE'), index=True),
>     Column('phenotype', SmallInteger),
>     schema=schemaname,
> )
>
> So I don't think you have to do that.

The thing is that each table will be present in every schema, so I can't
use a fixed schema parameter (unless I subclass Table like Mike suggested).
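
Something like this untested sketch, I guess (all names made up): make
`schema` a read-only property backed by a thread local, so a single
Table object renders against whichever schema the current request
selected.

import threading

from sqlalchemy import Table

current_site = threading.local()

class PerSiteTable(Table):
    @property
    def schema(self):
        # Read at compile time, so queries render as e.g. "site1.pheno".
        return getattr(current_site, 'schema', None)

    @schema.setter
    def schema(self, value):
        # Swallow the assignment made during Table construction.
        pass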


Cheers,

--
Henry Prêcheur
