Performance questions on semi-large result sets


Uri Okrent

Oct 21, 2015, 4:52:40 PM
to sqlalchemy
Hello, I'm trying to understand the various performance implications of working with the ORM layer, vs the SQL language layer, vs a raw dbapi connection.  Ideally, I would like to stick with the ORM layer.

First, the set up:
CentOS 6
SQLAlchemy 0.9.4
Psycopg2 2.5.3
Postgresql 9.3.9

My test is a simple query with a join, selecting three columns -- guid, name, and index; guid and name are Text and index is an Integer.  Note, in all cases I'm querying for individual columns, not ORM objects.

connection setup:
>>> engine = sqlalchemy.engine.create_engine("postgresql:///xmsdb", echo=True)
>>> Session = sqlalchemy.orm.sessionmaker(bind=engine)
>>> session=Session()


the query:

>>> q = session.query(Volume.guid, Volume.name, Volume.index)
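
(For reference, the profiled() helper used below is a simple cProfile context manager along the lines of the one in the SQLAlchemy performance FAQ -- roughly:)

>>> import contextlib, cProfile, pstats, StringIO
>>> @contextlib.contextmanager
... def profiled():
...     pr = cProfile.Profile()
...     pr.enable()
...     yield
...     pr.disable()
...     s = StringIO.StringIO()
...     pstats.Stats(pr, stream=s).sort_stats('cumulative').print_stats(25)
...     print(s.getvalue())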


First, query using the ORM layer:

>>> with profiled():
...   r = q.all()
...
2015-10-21 20:24:20,539 INFO sqlalchemy.engine.base.Engine SELECT volumes.guid AS volumes_guid, mom_objects.name AS mom_objects_name, mom_objects.index AS mom_objects_index
FROM mom_objects JOIN volumes ON mom_objects.guid = volumes.guid
2015-10-21 20:24:20,539 INFO sqlalchemy.engine.base.Engine {}

         
147901 function calls (147890 primitive calls) in 0.396 CPU seconds

Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.007    0.007    0.396    0.396 /usr/lib64/python2.6/site-packages/sqlalchemy/orm/query.py:2286(all)
     8191    0.055    0.000    0.357    0.000 /usr/lib64/python2.6/site-packages/sqlalchemy/orm/loading.py:26(instances)
    24570    0.060    0.000    0.123    0.000 /usr/lib64/python2.6/site-packages/sqlalchemy/orm/query.py:3468(proc)
        1    0.000    0.000    0.093    0.093 /usr/lib64/python2.6/site-packages/sqlalchemy/engine/result.py:778(fetchall)
        1    0.000    0.000    0.089    0.089 /usr/lib64/python2.6/site-packages/sqlalchemy/engine/result.py:747(_fetchall_impl)
        1    0.028    0.028    0.089    0.089 {method 'fetchall' of 'psycopg2._psycopg.cursor' objects}
     8190    0.043    0.000    0.085    0.000 /usr/lib64/python2.6/site-packages/sqlalchemy/util/_collections.py:56(__new__)
    24579    0.045    0.000    0.063    0.000 /usr/lib64/python2.6/site-packages/sqlalchemy/sql/annotation.py:85(__hash__)
    16380    0.033    0.000    0.061    0.000 /usr/lib64/python2.6/encodings/utf_8.py:15(decode)
        1    0.000    0.000    0.032    0.032 /usr/lib64/python2.6/site-packages/sqlalchemy/orm/query.py:2399(__iter__)
        1    0.000    0.000    0.031    0.031 /usr/lib64/python2.6/site-packages/sqlalchemy/orm/query.py:2413(_execute_and_instances)
        1    0.000    0.000    0.031    0.031 /usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py:652(execute)
        1    0.000    0.000    0.031    0.031 /usr/lib64/python2.6/site-packages/sqlalchemy/sql/elements.py:316(_execute_on_connection)
        1    0.000    0.000    0.031    0.031 /usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py:781(_execute_clauseelement)
        1    0.000    0.000    0.030    0.030 /usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py:868(_execute_context)
        1    0.000    0.000    0.030    0.030 /usr/lib64/python2.6/site-packages/sqlalchemy/engine/default.py:434(do_execute)
        1    0.030    0.030    0.030    0.030 {method 'execute' of 'psycopg2._psycopg.cursor' objects}
    16380    0.028    0.000    0.028    0.000 {_codecs.utf_8_decode}
    24579    0.018    0.000    0.018    0.000 {hash}
     8195    0.016    0.000    0.016    0.000 {method 'update' of 'dict' objects}
     8191    0.015    0.000    0.015    0.000 {zip}



The same query executed directly on the engine:

>>> with profiled():
...   r = engine.execute(str(q)).fetchall()
...
2015-10-21 20:29:32,336 INFO sqlalchemy.engine.base.Engine SELECT volumes.guid AS volumes_guid, mom_objects.name AS mom_objects_name, mom_objects.index AS mom_objects_index
FROM mom_objects JOIN volumes ON mom_objects.guid = volumes.guid
2015-10-21 20:29:32,337 INFO sqlalchemy.engine.base.Engine {}

         
33314 function calls (33303 primitive calls) in 0.116 CPU seconds

Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.092    0.092 /usr/lib64/python2.6/site-packages/sqlalchemy/engine/result.py:778(fetchall)
        1    0.000    0.000    0.087    0.087 /usr/lib64/python2.6/site-packages/sqlalchemy/engine/result.py:747(_fetchall_impl)
        1    0.028    0.028    0.087    0.087 {method 'fetchall' of 'psycopg2._psycopg.cursor' objects}
    16380    0.032    0.000    0.059    0.000 /usr/lib64/python2.6/encodings/utf_8.py:15(decode)
    16380    0.027    0.000    0.027    0.000 {_codecs.utf_8_decode}
        1    0.000    0.000    0.023    0.023 /usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py:1665(execute)
        1    0.000    0.000    0.023    0.023 /usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py:652(execute)
        1    0.000    0.000    0.023    0.023 /usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py:846(_execute_text)
        1    0.000    0.000    0.023    0.023 /usr/lib64/python2.6/site-packages/sqlalchemy/engine/base.py:868(_execute_context)
        1    0.000    0.000    0.022    0.022 /usr/lib64/python2.6/site-packages/sqlalchemy/engine/default.py:434(do_execute)
        1    0.022    0.022    0.022    0.022 {method 'execute' of 'psycopg2._psycopg.cursor' objects}
        1    0.004    0.004    0.004    0.004 /usr/lib64/python2.6/site-packages/sqlalchemy/engine/result.py:762(process_rows)
        1    0.000    0.000    0.001    0.001 /usr/lib64/python2.6/site-packages/sqlalchemy/orm/query.py:2923(__str__)
        1    0.000    0.000    0.001    0.001 /usr/lib64/python2.6/site-packages/sqlalchemy/sql/elements.py:476(__str__)
        1    0.000    0.000    0.001    0.001 <string>:1(<lambda>)



So, question #1:

Given that in both cases I'm querying columns and essentially just getting rows as a result, why is going through the ORM layer almost 4x slower?



Next, just for sanity's sake I compared the query done on the raw dbapi connection as obtained via sqlalchemy's engine, and on a new psycopg2 connection (which should theoretically be exactly the same):

>>> conn = engine.connect()
>>> raw_conn = conn.connection
>>> curs = raw_conn.cursor()
>>> with profiled():
...   curs.execute(str(q))
...   r = curs.fetchall()
...
         
33071 function calls (33060 primitive calls) in 0.113 CPU seconds

Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.029    0.029    0.091    0.091 {method 'fetchall' of 'psycopg2._psycopg.cursor' objects}
    16380    0.034    0.000    0.062    0.000 /usr/lib64/python2.6/encodings/utf_8.py:15(decode)
    16380    0.028    0.000    0.028    0.000 {_codecs.utf_8_decode}
        1    0.020    0.020    0.020    0.020 {method 'execute' of 'psycopg2._psycopg.cursor' objects}
        1    0.000    0.000    0.002    0.002 /usr/lib64/python2.6/site-packages/sqlalchemy/orm/query.py:2923(__str__)
        1    0.000    0.000    0.001    0.001 /usr/lib64/python2.6/site-packages/sqlalchemy/sql/elements.py:476(__str__)
        1    0.000    0.000    0.001    0.001 <string>:1(<lambda>)


>>> import psycopg2
>>> conn = psycopg2.connect("dbname=xmsdb")
>>> curs = conn.cursor()
>>> with profiled():
...   curs.execute(str(q))
...   r = curs.fetchall()
...

         
311 function calls (300 primitive calls) in 0.029 CPU seconds

Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.024    0.024    0.024    0.024 {method 'execute' of 'psycopg2._psycopg.cursor' objects}
        1    0.003    0.003    0.003    0.003 {method 'fetchall' of 'psycopg2._psycopg.cursor' objects}
        1    0.000    0.000    0.001    0.001 /usr/lib64/python2.6/site-packages/sqlalchemy/orm/query.py:2923(__str__)
        1    0.000    0.000    0.001    0.001 /usr/lib64/python2.6/site-packages/sqlalchemy/sql/elements.py:476(__str__)
        1    0.000    0.000    0.001    0.001 <string>:1(<lambda>)


These should theoretically be exactly the same, but it looks like the sqlalchemy version is spending some time doing utf-8 decoding, which the psycopg2 connection isn't.  It looks like with the default connection parameters, the psycopg2 connection just returns byte strings, while the sqlalchemy connection also decodes them to unicode.


I was able to make the call via sqlalchemy match the psycopg2 call time-wise by modifying the connection parameters:

>>> engine = sqlalchemy.engine.create_engine("postgresql:///xmsdb", echo=True, use_native_unicode=False, client_encoding='utf8')


That caused the query via sqlalchemy to return byte strings as well (and also reduced the query time to ~30ms).


So, question #2: given the performance impact of decoding all strings, what are the possible implications of simply leaving all strings as byte strings?  I do want my software to support the full utf-8 character set.


I'm also curious about the apparent difference in the default behavior (to decode vs. not decode).  The database itself does hold unicode strings -- for example, I changed the name field of one of the items to u'volßßßß'.  The response via the two different connections was:

(u'0984e1b061b248b3b4705b198cadc1eb', u'vol\xdf\xdf\xdf\xdf', 487)

vs

('0984e1b061b248b3b4705b198cadc1eb', 'vol\xc3\x9f\xc3\x9f\xc3\x9f\xc3\x9f', 487)

I.e., one is a unicode string and the other a utf-8 encoded bytestring.  The one major implication I see is that I can't compare the two without first putting them into a common form.  Is it worthwhile to defer decoding until I actually need to do a comparison, or should I just pass the bytestring on to the client and let it deal with decoding?
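
(For illustration, the comparison problem with those two values -- a minimal sketch:)

>>> decoded = u'vol\xdf\xdf\xdf\xdf'                # unicode string
>>> raw = 'vol\xc3\x9f\xc3\x9f\xc3\x9f\xc3\x9f'     # utf-8 encoded bytestring
>>> decoded == raw                                  # False (and a UnicodeWarning on Python 2)
False
>>> decoded == raw.decode('utf-8')                  # True once both are unicode
True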



That's a lot to digest...  Thanks in advance for the help, and for a great library!


Uri

Mike Bayer

Oct 21, 2015, 5:43:22 PM
to sqlal...@googlegroups.com


On 10/21/15 4:52 PM, Uri Okrent wrote:
> Hello, I'm trying to understand the various performance implications of
> working with the ORM layer, vs the SQL language layer, vs a raw dbapi
> connection.  Ideally, I would like to stick with the ORM layer.
>
> [...]
>
> So, question #1:
>
> Given that in both cases I'm querying columns and essentially just
> getting rows as a result, why is going through the ORM layer almost 4x
> slower?

Start with the explanation at
http://docs.sqlalchemy.org/en/rel_1_0/faq/performance.html#result-fetching-slowness-orm
and then work through the performance example suite, which should give more
of an idea. Also, 1.1 should improve upon these numbers a bit, as
does virtually every major SQLAlchemy version (see
http://techspot.zzzeek.org/2010/12/12/a-tale-of-three-profiles/ for a much
older but more in-depth history of this ongoing subject).

>
>
>
> Next, just for sanity's sake I compared the query done on the raw dbapi
> connection as obtained via sqlalchemy's engine, and on a new psycopg2
> connection (which should theoretically be exactly the same):

except, as you noticed, they are not; there are lots of options that can be
enabled or disabled on DBAPI connections.

>
> [...]
>
> These should theoretically be exactly the same, but it looks like the
> sqlalchemy version is spending some time doing utf-8 decoding, which the
> psycopg2 connection doesn't.

SQLAlchemy uses this subsystem of psycopg2 by default:

http://initd.org/psycopg/docs/usage.html#unicode-handling
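
Roughly, what that does per connection (per the psycopg2 docs on unicode handling) is:

import psycopg2
import psycopg2.extensions

conn = psycopg2.connect("dbname=xmsdb")

# register psycopg2's UNICODE typecasters on the connection so that text
# columns are decoded to unicode objects as rows are fetched; this is
# essentially what SQLAlchemy's psycopg2 dialect enables by default
psycopg2.extensions.register_type(psycopg2.extensions.UNICODE, conn)
psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY, conn)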


> It looks like using the default connection
> parameters, the psycopg2 connection just returns byte strings while the
> sqlalchemy connection decodes them too.
>
>
> I was able to make the call via sqlalchemy match the psycopg2 call
> time-wise, by modifying the connection parameters:
>
> >>> engine = sqlalchemy.engine.create_engine("postgresql:///xmsdb", echo=True,
> ...     use_native_unicode=False, client_encoding='utf8')
>
>
> That caused the query via sqlalchemy to return byte strings as well (and
> also reduced the query time to ~30ms).
>
>
> So, question #2, given the performance impact of decoding all strings,
> what are the possible implications of simply leaving all strings as byte
> strings? I do want my software to support the full utf-8 character set.
>
>
> I'm also curious about the apparent difference in the default behavior
> (to decode vs. not decode). The database itself does hold unicode
> strings -- for example i changed the name field of one of the items to
> u'volßßßß'. The response via the two different connections was:

You must consider the question of scale. If a large SELECT with the ORM
takes 5 seconds, a SELECT with Core takes 1.8 seconds, a SELECT with raw
psycopg2 with unicode handling takes .9 seconds, and a SELECT with raw
psycopg2 *without* unicode handling takes .75598 seconds, that strongly
suggests that turning off the unicode handling, especially given
that your app *wants* unicode, is not really where the primary
performance gain would be. If your app spends lots of time doing
various things with database result sets, the .15 sec difference would
be drowned out.

This is the reason the performance suite is there. I ran a few tests
with the large_resultsets example and it seems like at a Core / native
level, psycopg2's native unicode is a little pricey. Making a few
modifications to the suite to add an engine with use_native_unicode=False
and running with 100000 rows:


python -m examples.performance --dburl postgresql://scott:tiger@localhost/test \
    --num 100000 large_resultsets

test_core_fetchall : Load Core result rows using fetchall. (100000
iterations); total time 0.377719 sec
test_core_fetchall_no_unicode : Load Core result rows using fetchall,
don't use unicode. (100000 iterations); total time 0.171641 sec

test_dbapi_fetchall_no_object : Load rows using DBAPI fetchall(), don't
make any objects. (100000 iterations); total time 0.322190 sec
test_dbapi_fetchall_no_object_no_unicode : Load rows using DBAPI
fetchall(), don't make any objects, don't use unicode. (100000
iterations); total time 0.102264 sec
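
The "no unicode" engine added for those variants is essentially just (the names here are illustrative, not the suite's actual fixture):

from sqlalchemy import create_engine

# second engine for the no-unicode tests: psycopg2's native unicode
# decoding is off, so text columns come back as raw byte strings
engine_no_unicode = create_engine(
    "postgresql://scott:tiger@localhost/test",
    use_native_unicode=False,
    client_encoding="utf8",
)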


If we OTOH use use_native_unicode=False and use the Unicode type for the
columns we care about, that seems a lot faster, e.g. changing the
mapping to:

class Customer(Base):
    __tablename__ = "customer"
    id = Column(Integer, primary_key=True)
    name = Column(Unicode(255))
    description = Column(Unicode(255))

This will only take effect with ORM and Core loads, not raw psycopg2.
The core load of 100K rows now takes:

test_core_fetchall : Load Core result rows using fetchall. (100000
iterations); total time 0.182480 sec

this is better than the 0.377719 sec that psycopg2's native unicode handling took.

When we use the Unicode type and set use_native_unicode=False,
SQLAlchemy's own Python unicode conversion takes place; this happens
within the C extensions and has been highly optimized for speed in the
way the codec is invoked. We've already observed that cx_Oracle's
native unicode methodology turns out to be much slower than
SQLAlchemy's, which was unexpected since cx_Oracle is pure C code; in
this case psycopg2 is also pure C code, so it is again surprising that
SQLAlchemy's approach is a little faster.
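
A rough pure-Python illustration of the kind of optimization involved -- caching the codec's decode function rather than looking it up per value (just a sketch; the real version lives in the C extensions):

import codecs

# look the decoder up once and reuse it for every value, instead of
# calling value.decode("utf-8") (a codec registry lookup) per column
_utf8_decode = codecs.getdecoder("utf-8")

def decode_row(row):
    # row: a tuple of byte strings as returned by the DBAPI
    return tuple(_utf8_decode(value)[0] for value in row)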


Then again, this is Python. If you're looking to save milliseconds across
enormous amounts of data, you'd at the very least be using PyPy, or a
compiled language like Go.

>
> (u'0984e1b061b248b3b4705b198cadc1eb', u'vol\xdf\xdf\xdf\xdf', 487)
>
> vs
>
> ('0984e1b061b248b3b4705b198cadc1eb', 'vol\xc3\x9f\xc3\x9f\xc3\x9f\xc3\x9f', 487)
>
> I.e., one, a bytestring and the other a utf-8 encoded string. The one
> major implication I get is that I can't compare the two without first
> putting them into a common encoding. Is it worthwhile to defer decoding
> until I'd actually need to do a comparison or else just pass the
> bytestring on the client and let him deal with decoding it?

I would never deal with bytestrings that represent text within
application logic. They should be turned to Unicode as early as
possible and turned back to bytestrings as late as possible.



Uri Okrent

Oct 21, 2015, 7:27:34 PM
to sqlalchemy
Interesting...


On Wednesday, October 21, 2015 at 5:43:22 PM UTC-4, Michael Bayer wrote:
class Customer(Base):
    __tablename__ = "customer"
    id = Column(Integer, primary_key=True)
    name = Column(Unicode(255))
    description = Column(Unicode(255))

My declarative classes use Text() for all string columns.  This is because I *know* my backend is postgres and that's sort of what they recommend (the "tip": http://www.postgresql.org/docs/9.3/static/datatype-character.html).  However, I neglected to consider the consequences of that decision on the ORM.

It sounds like your recommendation is to disable unicode decoding at the engine level with use_native_unicode=False, and instead explicitly call out only the columns that contain unicode, so that sqlalchemy handles decoding for just those columns, using a mixture of (in the postgres case) Text() and UnicodeText() columns.
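
(A sketch of what that mixed mapping might look like -- simplified, ignoring the mom_objects join in my real schema:)

from sqlalchemy import Column, Integer, Text, UnicodeText
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Volume(Base):
    __tablename__ = "volumes"

    guid = Column(Text, primary_key=True)   # ascii-only guid: stays a bytestring
    name = Column(UnicodeText)              # may contain unicode: decoded by SQLAlchemy
    index = Column(Integer)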

Although, reading the docs, I got the feeling that you were discouraging people from using sqlalchemy's built-in decoding facilities in favor of the native facilities provided by the dbapi.

Mike Bayer

Oct 21, 2015, 9:18:03 PM
to sqlal...@googlegroups.com


On 10/21/15 7:27 PM, Uri Okrent wrote:
> [...]
>
> My declarative classes use Text() for all string columns. This is
> because I *know* my backend is postgres and that's sort of what they
> recommend (the "tip":
> http://www.postgresql.org/docs/9.3/static/datatype-character.html).
> However, I neglected to consider the consequences of that decision on
> the ORM.

Use UnicodeText() then. On PG, VARCHAR / CHAR / TEXT are all essentially
identical, so it doesn't matter. Also, the only difference between Unicode and
UnicodeText on that platform is what DDL is emitted in CREATE TABLE. At
the level of DML and queries, the data is treated the same across these types.
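
E.g., the only thing that changes is the rendered DDL (a quick illustration):

from sqlalchemy import Column, Integer, MetaData, Table, Unicode, UnicodeText
from sqlalchemy.dialects import postgresql
from sqlalchemy.schema import CreateTable

m = MetaData()
t = Table(
    "demo", m,
    Column("id", Integer, primary_key=True),
    Column("name", Unicode(255)),    # renders VARCHAR(255) on PG
    Column("notes", UnicodeText()),  # renders TEXT on PG
)

# both columns are handled identically at query time; only the DDL differs
print(CreateTable(t).compile(dialect=postgresql.dialect()))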


>
> It sounds like your recommendation is to disable unicode decoding at the
> engine level with native_unicode=False, and instead explicitly call out
> only those columns that contain unicode for sqlalchemy to handle the
> decoding of only those columns, using a mixture of (in the postgres
> case) Text() and UnicodeText() columns.

If you're really looking to save 15 ms per 100K rows, it seems to have
that effect for now.


>
> Although, reading the docs, I got the feeling that you were discouraging
> people from using sqlalchemy's built-in decoding facilities in favor of
> the native facilities provided by the dbapi.

Well, originally we did it all ourselves, back when we didn't have C
extensions and the DBAPIs barely did it. Then the DBAPIs started supplying
it natively, and especially with the coming of Python 3 they all had
to; compared to SQLAlchemy doing it all in pure Python, there was no
contest. But then our C extensions came along and sped things up, and
then we started doing things like caching the codec object, which is
probably what the DBAPIs aren't doing yet, and gained even more speed.
So it seems we've surpassed the DBAPIs in just this one area, which is
ridiculous because the pure C DBAPIs are so much faster in every other
way; it's quite annoying that this weird incongruity seems to be present.

Under Python 3 we generally don't have an option here; the DBAPIs now
all handle unicode automatically. So the architectures have been
pushed to assume that's the default, but in the case of cx_Oracle and
now apparently psycopg2, we're seeing that on Python 2 their unicode
facilities still seem to perform worse than SQLAlchemy's.
There's no clear answer. I'd prefer that the DBAPIs that are written
in pure C anyway, like psycopg2, just made their performance faster
here; I may report it to them.

Gaëtan de Menten

Oct 22, 2015, 2:34:45 AM
to sqlal...@googlegroups.com
On Wed, Oct 21, 2015 at 11:43 PM, Mike Bayer <mik...@zzzcomputing.com> wrote:

> If we OTOH use native_unicode=False and use the Unicode type for the
> columns we care about, that seems a lot faster, e.g. changing the
> mapping to:

[...]

> When we use the Unicode type and set use_native_unicode=False,
> SQLAlchemy's own use of Python unicode codecs takes place; this takes
> place within the C extensions and has been highly optimized for speed in
> the way that the codec is invoked. We've already observed that
> cx_Oracle's native unicode methodologies turn out to be much slower than
> SQLAlchemy's, which was unexpected since cx_Oracle is pure C code, and
> in this case psycopg2 is also pure C code so it is again surprising
> SQLAlchemy's approach is a little faster.

Back when I created the C extensions, I found that out as well:
virtually all backends were slower at decoding unicode than our own
unicode handling. At that time, there was no easy way (that I knew of,
at least) to turn off the DBAPI's native decoding via SQLAlchemy.
I am glad to see that it now exists!

Congratulations to Mike for his dedication and endurance on the
project. You never cease to impress.

Gaëtan