Django Oracle backend vs. numbers

337 views
Skip to first unread message

Shai Berger

unread,
Aug 29, 2012, 8:14:30 AM8/29/12
to Django developers
Hi Django developers,

I've been working with Oracle on a project with some performance requirements,
and we've noticed long ago that the Django Oracle backend performs
significantly worse than other backends; we've been able to profile it and see
that, indeed, a lot of time is spent in processing the returned rows, while
the Oracle DBMS itself is not breaking a sweat.

This week, I found what is, I think, the main culprit: The Oracle backend
prefers accuracy to performance. Since Oracle numbers can have more significant
digits than C doubles (=CPython floats), and the backend wishes to preserve
this accuracy, it makes cxOracle (the Python Oracle client library) return all
numbers as strings, and then converts the strings back to numbers -- ints,
floats or decimals -- according to the Oracle data type descriptions.

To make things even slower, the conversion decisions are made anew for each
row; this is done in the function _rowfactory (currently at
https://github.com/django/django/blob/master/django/db/backends/oracle/base.py#L790).
The function itself is called less efficiently than possible (on rows returned
from cxOracle, instead of within cxOracle before the rows are returned).

For our application -- as for many others, I guess -- precision of 38 digits,
on real numbers or integers, is not important. I made a modified backend which
returns numbers as numbers and the improvement in performance is very
significant. A row-factory function is still required to make sure strings are
returned as unicode, and creating such a function per query to only go over
the string columns may improve things even more, but I didn't even have to go
there yet.

Can we put this into Django core? Perhaps with a Settings knob for people who
really need all the precision?

Thanks,
Shai.

Jacob Kaplan-Moss

unread,
Aug 29, 2012, 12:34:29 PM8/29/12
to django-d...@googlegroups.com
Hey Shai --

Generally sounds like a good idea, I don't see why this shouldn't go
in. You might want to wait until someone with more Oracle experience
chimes in, but I'm having a hard time seeing why a number -> str ->
number dance would be a good idea.

I'm generally adverse to introducing more settings -- settings creep
is a real problem -- so a couple of alternate ideas you might consider
in producing a patch:

1. Just pick one correct behavior, probably for each type. My guess
would be that ints and floats should get the fast behavior, and
decimals the slow-but-correct one. If you're using a decimal field
it's because you care about precision, so correctness makes more sense
there IMO.

2. If you must introduce a setting, do it in DATABASES[OPTIONS]
instead of a top-level ORACLE_FAST_NUMBERS or whatever.

Jacob
> --
> You received this message because you are subscribed to the Google Groups "Django developers" group.
> To post to this group, send email to django-d...@googlegroups.com.
> To unsubscribe from this group, send email to django-develop...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/django-developers?hl=en.
>

Ian Kelly

unread,
Aug 29, 2012, 3:17:22 PM8/29/12
to django-d...@googlegroups.com
On Wed, Aug 29, 2012 at 10:34 AM, Jacob Kaplan-Moss <ja...@jacobian.org> wrote:
> Hey Shai --
>
> Generally sounds like a good idea, I don't see why this shouldn't go
> in. You might want to wait until someone with more Oracle experience
> chimes in, but I'm having a hard time seeing why a number -> str ->
> number dance would be a good idea.

The reason for it is because cx_Oracle doesn't create Decimal objects
when fetching from the database. It only creates ints and floats. As
a result, if you try to use a DecimalField(38), you're not going to be
able to effectively store a 38-digit decimal number, even though
Oracle supports that. You basically end up being limited to the
precision of a float with some rounding error, since the value gets
converted from NUMBER to float to Decimal. Using strings as the
intermediate instead of floats allows the full precision to be used.

In theory we could do this only for DecimalFields and leave the other
numeric field types alone, by wrapping each individual DecimalField
expression in the select list in a TO_CHAR() call. In practice this
work has never been done because of the mess it would make of query
generation, so we instead use a cx_Oracle API feature to transmit all
numbers as strings and then sort it out later.

There's another option that also hasn't been implemented, because it
would mean dropping support for cx_Oracle versions older than 5.0.1.
Perhaps the time has come to do that (I note that the oldest binary
still listed at the cx_Oracle website is 5.1). This would involve
removing the _rowhandler function and the numbers_as_strings setting
and replacing it with an _outputtypehandler that would only be called
once per column instead of on every single returned value. I've
written a POC patch for this and so far, performance seems to be much
improved, on par with using cx_Oracle without Django.

https://github.com/ikelly/django/commit/086f502a1c2acb9db27531f7df78cb9df9ab83bd

Shai, if you would be willing to try this patch, I'd like to hear your
results. Please bear in mind that this is currently only a POC, so it
could be buggy -- I've only run it through the basic tests.

Cheers,
Ian

Anssi Kääriäinen

unread,
Aug 30, 2012, 4:12:05 AM8/30/12
to Django developers
On 29 elo, 22:18, Ian Kelly <ian.g.ke...@gmail.com> wrote:
> There's another option that also hasn't been implemented, because it
> would mean dropping support for cx_Oracle versions older than 5.0.1.
> Perhaps the time has come to do that (I note that the oldest binary
> still listed at the cx_Oracle website is 5.1).  This would involve
> removing the _rowhandler function and the numbers_as_strings setting
> and replacing it with an _outputtypehandler that would only be called
> once per column instead of on every single returned value.  I've
> written a POC patch for this and so far, performance seems to be much
> improved, on par with using cx_Oracle without Django.

5.0.1 seems to be released somewhere around 03/2009. [http://
sourceforge.net/projects/cx-oracle/files/]. +1 for requiring 5.0.1.

We could probably detect the cx_oracle version and based on that use
either the _rowhandler approach or the _outputtypehandler approach.
This seems like overkill to me. We do not have backwards compatibility
requirements for old library versions, right?

- Anssi

Shai Berger

unread,
Aug 30, 2012, 4:23:44 PM8/30/12
to django-d...@googlegroups.com
Hi Ian, Jacob, Anssi and all,

Thanks for the quick and positive responses.

On Wednesday 29 August 2012, Ian Kelly wrote:
>https://github.com/ikelly/django/commit/086f502a1c2acb9db27531f7df78cb9df9ab83bd
>
> Shai, if you would be willing to try this patch, I'd like to hear your
> results. Please bear in mind that this is currently only a POC, so it
> could be buggy -- I've only run it through the basic tests.
>

This looks very promising, I will try it next week. On a first look, there is
just one thing about it that I don't like, and it is related to Jacob's
comments --

On Wed, Aug 29, 2012 at 10:34 AM, Jacob Kaplan-Moss <ja...@jacobian.org>
wrote:
> I'm generally adverse to introducing more settings -- settings creep
> is a real problem -- so a couple of alternate ideas you might consider
> in producing a patch:
>
> 1. Just pick one correct behavior, probably for each type. My guess
> would be that ints and floats should get the fast behavior, and
> decimals the slow-but-correct one. If you're using a decimal field
> it's because you care about precision, so correctness makes more sense
> there IMO.
>
This would definitely be preferrable to a setting, but it doesn't cover all
cases -- we have some raw queries where expressions are being selected. The
default number type in this case -- in current code as well as Ian's patch --
is to return the data as a string and convert it to int or Decimal. Our code,
which was written against other database systems, tries to use the result in
arithmetic with floats, and this fails (you can't add Decimals to floats).

I could, of course, prevent the exception by converting the value to float in
my code, but that feels like the wrong solution -- I know, in my case, that
float is enough, and I'd much rather tell the backend "give me floats" than have
it go through float->str->Decimal->float.

> 2. If you must introduce a setting, do it in DATABASES[OPTIONS]
> instead of a top-level ORACLE_FAST_NUMBERS or whatever.

That is indeed one good option. The alternative I'm considering -- less
convenient for me, but perhaps more flexible generally -- is, since we're
dealing with raw SQL anyway, to put some API for this on the cursor; perhaps a
property, so that setting it would be a no-op on other backends.

As per Anssi's comment, I have no backwards-compatibility requirements with
respect to cx_Oracle, but we are still running Django 1.3. Trying to call the
performance improvement a "bugfix" so it can be backported, while changing
cx_Oracle requirements, would be pushing my luck, right?

Thanks again,

Shai.

Anssi Kääriäinen

unread,
Aug 31, 2012, 3:43:21 AM8/31/12
to Django developers
On 30 elo, 23:24, Shai Berger <s...@platonix.com> wrote:
> Hi Ian, Jacob, Anssi and all,
>
> Thanks for the quick and positive responses.
>
> On Wednesday 29 August 2012, Ian Kelly wrote:
>
> >https://github.com/ikelly/django/commit/086f502a1c2acb9db27531f7df78c...
1.3 is out of the question as only security fixes go into that
release. Even 1.4 is out because this isn't a bugfix.

Even if this was backported, requiring library upgrades for Django's
minor version updates should be avoided if at all possible. For 1.4 ->
1.5 upgrade it is in my opinion OK.

It should be possible to subclass Django's Oracle backend, override
parts of the backend with the changes that go into 1.5, and use that
in 1.3.

- Anssi

Shai Berger

unread,
Sep 3, 2012, 8:14:31 AM9/3/12
to django-d...@googlegroups.com, Ian Kelly
On Wednesday 29 August 2012 22:17:22 Ian Kelly wrote:
>
> https://github.com/ikelly/django/commit/086f502a1c2acb9db27531f7df78cb9df9a
> b83bd
>
I've run a version based on this -- basically, porting the patch to Django 1.3
which we use (so, no tzinfo), using "float" as the default for expressions, and
putting the whole thing as a derived class from the original adapter rather
than a patch (because that's what we do here, as we also need named parameters
and we had to add them). I found that it leaks something -- the error says
"cursors", but I suspect something more subtle; either way, this is what I got
from a page which makes several hundred queries (don't ask):


Traceback (most recent call last):

File ".../lib/python2.7/site-packages/django/core/servers/basehttp.py", line 283, in run
self.result = application(self.environ, self.start_response)

File ".../lib/python2.7/site-packages/staticfiles/handlers.py", line 66, in __call__
return self.application(environ, start_response)

File ".../lib/python2.7/site-packages/django/core/handlers/wsgi.py", line 274, in __call__
signals.request_finished.send(sender=self.__class__)

File ".../lib/python2.7/site-packages/django/dispatch/dispatcher.py", line 172, in send
response = receiver(signal=self, sender=sender, **named)

File ".../lib/python2.7/site-packages/django/db/__init__.py", line 85, in close_connection
conn.close()

File ".../lib/python2.7/site-packages/django/db/backends/__init__.py", line 244, in close
self.connection.close()

DatabaseError: ORA-02002: error while writing to audit trail
ORA-00604: error occurred at recursive SQL level 1
ORA-01000: maximum open cursors exceeded

Thanks,
Shai

Ian Kelly

unread,
Sep 4, 2012, 3:02:14 PM9/4/12
to django-d...@googlegroups.com

Resending to the list:

On Sep 4, 2012 11:44 AM, "Ian Kelly" <ian.g...@gmail.com> wrote:


>
> On Mon, Sep 3, 2012 at 6:14 AM, Shai Berger <sh...@platonix.com> wrote:
> > On Wednesday 29 August 2012 22:17:22 Ian Kelly wrote:
> >>
> >> https://github.com/ikelly/django/commit/086f502a1c2acb9db27531f7df78cb9df9a
> >> b83bd
> >>
> > I've run a version based on this -- basically, porting the patch to Django 1.3
> > which we use (so, no tzinfo), using "float" as the default for expressions, and
> > putting the whole thing as a derived class from the original adapter rather
> > than a patch (because that's what we do here, as we also need named parameters
> > and we had to add them). I found that it leaks something -- the error says
> > "cursors", but I suspect something more subtle; either way, this is what I got
> > from a page which makes several hundred queries (don't ask):
>

> Are these ORM queries or custom queries?  If the latter, are you
> explicitly closing the cursors?  That should help prevent the issue,
> although unfortunately the ORM doesn't do this.  The cursors should be
> closed anyway when they're finalized, but maybe we've got a reference
> cycle preventing them from getting cleaned up right away, and too many
> are accumulating between garbage collection passes.
>
> When I've seen this error in the past, the immediate solution has been
> to increase the OPEN_CURSORS parameter in the database.  See:
>
> http://www.orafaq.com/wiki/ORA-01000
>
> Cheers,
> Ian

Shai Berger

unread,
Sep 4, 2012, 4:17:20 PM9/4/12
to django-d...@googlegroups.com, Ian Kelly
On Tuesday 04 September 2012, Ian Kelly wrote:
> On Mon, Sep 3, 2012 at 6:14 AM, Shai Berger <sh...@platonix.com> wrote:
> > On Wednesday 29 August 2012 22:17:22 Ian Kelly wrote:
> >> https://github.com/ikelly/django/commit/086f502a1c2acb9db27531f7df78cb9d
> >> f9a b83bd
> >
> > more subtle; either way, this is what I got from a page which makes
> > several hundred queries (don't ask):
>
> Are these ORM queries or custom queries?

Mostly ORM.

> If the latter, are you
> explicitly closing the cursors? That should help prevent the issue,
> although unfortunately the ORM doesn't do this. The cursors should be
> closed anyway when they're finalized, but maybe we've got a reference
> cycle preventing them from getting cleaned up right away, and too many
> are accumulating between garbage collection passes.
>
That would make sense.

> When I've seen this error in the past, the immediate solution has been
> to increase the OPEN_CURSORS parameter in the database. See:
>
> http://www.orafaq.com/wiki/ORA-01000
>
I'll try, but this shouldn't be the long-term solution -- especially not when
an alternative exists, which does not have this problem.

BTW, I'm not at all sure that using outputtypehandler (function called per
returned value) is more efficient than using rowfactory (function called per
row) when both are used via the cx_oracle API. The current code calls its
_rowfactory outside this API, which does seem inferior. If the rowfactory
approach behaves better in terms of GC, I don't see why we should insist on
outputtypehandler.

I'll try to prepare a patch for consideration, hopefully this week.

Thanks,
Shai.


Anssi Kääriäinen

unread,
Sep 4, 2012, 5:28:17 PM9/4/12
to Django developers
I did some digging, and the reason for the error is this line in the
patch:
self.cursor.outputtypehandler = self._outputtypehandler

What is happening is that self has a reference to self.cursor, which
has reference to bound method self._outputtypehandler, which has a
reference to self. Thus, there is a reference loop, and garbage
collection is delayed. Changing the _outputtypehandler to @classmethod
seems to solve this problem.

The test case used was basically this: for i in range(10000):
SomeModel.objects.create(...)

While inspecting this issue I found out that if one iterates a
queryset partially, then the cursor will not be closed until gc. The
reason for this is that the queryset converts the rows from the db
into objects in batches, and if all of the rows aren't converted, then
the cursor is kept open. I have been toying with the idea of removing
the chunked object conversion from the queryset __iter__ (#18702)
which would remove the possibility of leaking cursors that way.

- Anssi

On 4 syys, 22:02, Ian Kelly <ian.g.ke...@gmail.com> wrote:
> Resending to the list:
>
> On Sep 4, 2012 11:44 AM, "Ian Kelly" <ian.g.ke...@gmail.com> wrote:
>
> > On Mon, Sep 3, 2012 at 6:14 AM, Shai Berger <s...@platonix.com> wrote:
> > > On Wednesday 29 August 2012 22:17:22 Ian Kelly wrote:
>
> https://github.com/ikelly/django/commit/086f502a1c2acb9db27531f7df78c...

Ian Kelly

unread,
Sep 4, 2012, 6:18:41 PM9/4/12
to django-d...@googlegroups.com
On Tue, Sep 4, 2012 at 3:28 PM, Anssi Kääriäinen
<anssi.ka...@thl.fi> wrote:
> I did some digging, and the reason for the error is this line in the
> patch:
> self.cursor.outputtypehandler = self._outputtypehandler
>
> What is happening is that self has a reference to self.cursor, which
> has reference to bound method self._outputtypehandler, which has a
> reference to self. Thus, there is a reference loop, and garbage
> collection is delayed. Changing the _outputtypehandler to @classmethod
> seems to solve this problem.

Thanks for tracking that down. _rowfactory was a module-level
function rather than a method, and I should have left it that way.

Shai Berger

unread,
Sep 22, 2012, 4:34:21 PM9/22/12
to django-d...@googlegroups.com
On Wednesday 05 September 2012, Ian Kelly wrote:
>
> Thanks for tracking that down. _rowfactory was a module-level
> function rather than a method, and I should have left it that way.

So -- Ian, Anssi, are you going to include the fix in Django 1.5? I can help by
either reviewing it for you or providing the patch for you to review.

If you let me write the patch, I will also include the extra setting in the
"options" dictionary (choosing if expressions should be treated as decimals or
floats). I think this is then an added feature, and needs to get in before the
feature freeze. Otherwise, just fixing the performance can be treated as a bug
in be included later. Is this guess correct?

Thanks,
Shai.

Anssi Kääriäinen

unread,
Sep 23, 2012, 9:05:21 AM9/23/12
to Django developers
Yes, the optimization part can get in after feature freeze. Applying
it even at relatively late stage to 1.5 sounds OK to me.

There isn't much time to do feature additions. It is possible to get
the feature into 1.5, but you should not expect it to happen. As for
the feature itself, I am not yet sure if it is a good idea or not. The
basic problem is granularity: wouldn't you want to do this per query,
or even per expression basis?

Of the wording "if you let me write the patch" - of course you can
write a patch. Even if somebody else claims to be working on this, you
can still write a competing patch. It is not advisable as work is
likely going to be wasted, but you can do it. Doing final polish for
Ian's patch and providing benchmark results for it will get this patch
closer to commit.

If you decide to work on this, please split the patch into two: one
for the ._rowfactory change, one for the feature addition. My belief
is that the ._rowfactory change is going to be something we can very
easily justify committing into 1.5, but the proposed feature addition
sounds like something which isn't as obvious.

I am ready to review patches related to this issue, but otherwise I
don't have any plans to work on this. I hope Ian has some time
available for this feature, as he knows a lot more about Oracle than I
do.

- Anssi

Shai Berger

unread,
Sep 24, 2012, 8:41:29 PM9/24/12
to django-d...@googlegroups.com
On Sunday 23 September 2012 15:05:21 Anssi Kääriäinen wrote:

>
> If you decide to work on this, please split the patch into two: one
> for the ._rowfactory change, one for the feature addition. My belief
> is that the ._rowfactory change is going to be something we can very
> easily justify committing into 1.5, but the proposed feature addition
> sounds like something which isn't as obvious.

While preparing the patch, I ran into the inspectdb regression test suite;
this suite includes, among other things, a model with a database column named
'prc(%) x', which breaks test database creation on Oracle on current trunk. I
intend to include a fix for this (in a separate commit, of course) in the first
part as well (doubling percent signs in quote_name()).

Shai Berger

unread,
Sep 25, 2012, 3:31:40 AM9/25/12
to django-d...@googlegroups.com
On Sunday 23 September 2012, Anssi Kääriäinen wrote:
> Doing final polish for Ian's patch and providing benchmark results for it
> will get this patch closer to commit.
>
I had hoped the Django test suite alone will provide a convincing benchmark,
but apparently that is not the case -- on my laptop, with Oracle in a virtual
machine, the basic suite takes 63 minutes with or without the patch. This
makes sense, because the effect is likely to be evident mostly in queries
returning many rows, and those are not very common in the test suite.

I do not have a working Django-trunk project to test on, and this evening and
tomorrow is a very sacred holiday here (Yom Kippur), so it might take me until
the end of the week to produce benchmark results. If anyone wants to chime in,
the pull request is https://github.com/django/django/pull/393. There is, so
far, anecdotal evidence of very significant improvements, from Ian Kelly
(original author of this patch) and myself (on a somewhat different
implementation and older Django versions, but same general idea). In my case,
page load times were cut by dozens of percents.

Thanks,
Shai.

Anssi Kääriäinen

unread,
Sep 25, 2012, 4:03:11 AM9/25/12
to Django developers
On 25 syys, 10:31, Shai Berger <s...@platonix.com> wrote:
> On Sunday 23 September 2012, Anssi Kääriäinen wrote:> Doing final polish for Ian's patch and providing benchmark results for it
> > will get this patch closer to commit.
>
> I had hoped the Django test suite alone will provide a convincing benchmark,
> but apparently that is not the case -- on my laptop, with Oracle in a virtual
> machine, the basic suite takes 63 minutes with or without the patch. This
> makes sense, because the effect is likely to be evident mostly in queries
> returning many rows, and those are not very common in the test suite.
>
> I do not have a working Django-trunk project to test on, and this evening and
> tomorrow is a very sacred holiday here (Yom Kippur), so it might take me until
> the end of the week to produce benchmark results. If anyone wants to chime in,
> the pull request ishttps://github.com/django/django/pull/393. There is, so
> far, anecdotal evidence of very significant improvements, from Ian Kelly
> (original author of this patch) and myself (on a somewhat different
> implementation and older Django versions, but same general idea). In my case,
> page load times were cut by dozens of percents.
>
> Thanks,
>         Shai.

Am I correct the benchmark is basically:

cursor = connection.cursor()
cursor.execute("some sql fetching many rows with NUMBER() columns")
timer.start()
for row in cursor.fetchall():
pass
print timer.used()

If so, I should be able to produce a benchmark of this fairly easily.

- Anssi

Shai Berger

unread,
Sep 25, 2012, 4:11:45 AM9/25/12
to django-d...@googlegroups.com
On Tuesday 25 September 2012 10:03:11 Anssi Kääriäinen wrote:
>
> Am I correct the benchmark is basically:
>
> cursor = connection.cursor()
> cursor.execute("some sql fetching many rows with NUMBER() columns")
> timer.start()
> for row in cursor.fetchall():
> pass
> print timer.used()
>
Yes, I think so. I find timeit, from the python stdlib, very helpful for
benchmarks.

Ian Kelly

unread,
Sep 25, 2012, 1:47:01 PM9/25/12
to django-d...@googlegroups.com
Thanks for putting together a pull request. Please note the patch
will also need to update the documentation to state the new cx_Oracle
version requirement (was 4.3.1, now 5.0.1).

Cheers,
Ian

Ian Kelly

unread,
Sep 25, 2012, 1:49:26 PM9/25/12
to django-d...@googlegroups.com
On Tue, Sep 25, 2012 at 11:47 AM, Ian Kelly <ian.g...@gmail.com> wrote:
> Thanks for putting together a pull request. Please note the patch
> will also need to update the documentation to state the new cx_Oracle
> version requirement (was 4.3.1, now 5.0.1).

Also, there should probably be something about this in the 1.5 release notes.

Andre Terra

unread,
Sep 26, 2012, 2:26:11 PM9/26/12
to django-d...@googlegroups.com
On Mon, Sep 24, 2012 at 9:41 PM, Shai Berger <sh...@platonix.com> wrote:
While preparing the patch, I ran into the inspectdb regression test suite;
this suite includes, among other things, a model with a database column named
'prc(%) x', which breaks test database creation on Oracle on current trunk. I
intend to include a fix for this (in a separate commit, of course) in the first
part as well (doubling percent signs in quote_name()).

For the record, I believe the relevant ticket and discussion are located at:



Cheers,
AT (airstrike)

Shai Berger

unread,
Sep 26, 2012, 3:23:24 PM9/26/12
to django-d...@googlegroups.com
No and maybe: I think it's the "other side" of the same issue. The ticket is
about problems in inspection; the problem I ran into is in creating a table,
where the column name is given in the model. My code does nothing to fix the
inspetion (and indeed, there are failures in the test suite with it); but
without some fix to the problem I solved, the test suite cannot be run at all.

It may be the problem Anssi refers to in his comment there, I'm not sure.

I can open a ticket for this if core devs like, but it feels like unnecessary
procedure when it is included with a fix that has already been discussed.

Thanks,
Shai.

Shai Berger

unread,
Sep 26, 2012, 8:01:12 PM9/26/12
to django-d...@googlegroups.com
Yep, forgot about documentation, thanks. PR updated.

Shai Berger

unread,
Sep 27, 2012, 3:15:00 PM9/27/12
to django-d...@googlegroups.com
Hi all,

On Sunday 23 September 2012 15:05:21 Anssi Kääriäinen wrote:
>
> Doing final polish for Ian's patch and providing benchmark results for it
> will get this patch closer to commit.
>

I made some initial benchmarks, and they do not reflect the change I've seen in
the actual application. I don't yet know what to make of this. With this
model:

class AllSorts(models.Model):

start = models.IntegerField()
square = models.IntegerField()
sqrt = models.FloatField()
cubicroot = models.DecimalField(decimal_places=20, max_digits=25)
name = models.CharField(max_length=100)
slug = models.SlugField()

Calling this function for fetching:

@transaction.commit_on_success
def getAllSorts():
errors = 0
print "Getting %d objects" % AllSorts.objects.all().count()
for alls in AllSorts.objects.all():
if alls.cubicroot.is_nan():
errors+=1
print "Got %d errors in the decimal" % errors


I made 10,000 records. Getting them all (and making sure the decimal did not
return a NaN) takes ~1.95 seconds on the old code, ~1.75 seconds on the new
code, and ~1.6 seconds on the new-new code (where we use a new float_is_enough
option, which tells the backend to only use Decimals for DecimalFields).

Adding a calculated expression:

@transaction.commit_on_success
def getAllSorts():
errors = 0
print "Getting %d objects" % AllSorts.objects.all().count()
for alls in \
AllSorts.objects.all().extra(select={'e':'"START"+(2*"CUBICROOT")'}):
if alls.cubicroot.is_nan():
errors+=1
print "Got %d errors in the decimal" % errors

Did not change the results significantly, despite my expectations.

I'm not sure yet what we did in the application that made this make a huge
difference. I see a significant difference here, but not amazing (though ~20% is
nothing to scoff at). Perhaps making more request of fewer rows, counter-
intuitively makes more of a difference. Perhaps more strings. But I know I saw
a much bigger difference than this.

The new-new code (sans documentation, at this point) is available for your
review and benchmarking as Pull Request 402, or the oracle-float-exprs branch
in my fork. The benchmarking project is attached to this message.

Thanks for your attention,
Shai.

bench.tgz

Shai Berger

unread,
Sep 29, 2012, 2:51:33 PM9/29/12
to django-d...@googlegroups.com
Hi, Sorry about the delay in writing this -- it was in my mind, but I forgot
to let it out...

On Sunday 23 September 2012, Anssi Kääriäinen wrote:
> On 22 syys, 23:34, Shai Berger <s...@platonix.com> wrote:
> >
> > [...] I will also include the extra setting in
> > the "options" dictionary (choosing if expressions should be treated as
> > decimals or floats). I think this is then an added feature, and needs to
> > get in before the feature freeze. Otherwise, just fixing the performance
> > can be treated as a bug in be included later. Is this guess correct?
>
> There isn't much time to do feature additions. It is possible to get
> the feature into 1.5, but you should not expect it to happen. As for
> the feature itself, I am not yet sure if it is a good idea or not. The
> basic problem is granularity: wouldn't you want to do this per query,
> or even per expression basis?
>
No, I don't; the issue is cross-database compatibility. We have raw SQL
statements calculating some expressions, and then use the returned values in
calculations in Python -- adding and muliplying them with floats. This worked
fine with MySql and the 3rd-party SQL Server backends, but blew up on Oracle
because Decimals and floats don't mix very well:

>>> from decimal import Decimal as D
>>> D(1) + 1.0
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'Decimal' and 'float'
>>> 1.0 + D(1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'float' and 'Decimal'

I could, of course, solve it on a query-by-query basis, adding a cast to float
that is a nop on other backends. That seems a little backward to me, and an
unfair requirements of pluggable apps.

Similarly, any per-query or per-expression API would be added only when using
Oracle, and at best be a NO-OP on other engines -- which means it would only
be added by Oracle users, with the same consequences.

It seems much better to me to make it possible to run Oracle in a mode that
makes expressions in raw sql have semantics that is compatible with other
backends.

(would filing a bug with the above arguments make it more likely to include the
feature in 1.5?)

Thanks,
Shai.

Anssi Kääriäinen

unread,
Sep 29, 2012, 4:25:35 PM9/29/12
to Django developers
On 29 syys, 21:51, Shai Berger <s...@platonix.com> wrote:
> On Sunday 23 September 2012, Anssi Kääriäinen wrote:> On 22 syys, 23:34, Shai Berger <s...@platonix.com> wrote:
> > There isn't much time to do feature additions. It is possible to get
> > the feature into 1.5, but you should not expect it to happen. As for
> > the feature itself, I am not yet sure if it is a good idea or not. The
> > basic problem is granularity: wouldn't you want to do this per query,
> > or even per expression basis?
>
> No, I don't; the issue is cross-database compatibility. We have raw SQL
> statements calculating some expressions, and then use the returned values in
> calculations in Python -- adding and muliplying them with floats. This worked
> fine with MySql and the 3rd-party SQL Server backends, but blew up on Oracle
> because Decimals and floats don't mix very well:

The problem seems to be that on Oracle, 1.1 is NUMERIC, and that is
something which _is_ a decimal in Python. We should not change that.

To me it seems the answer is to use to_binary_float/to_binary_double
in raw SQL queries. For example:

cur = connection.cursor()
cur.execute("select count(*) from tests_place")
print cur.fetchone()[0]
vals = None
def do_query():
global vals
cur.execute("select 1.1 from tests_place")
vals = [val[0] for val in cur.fetchall()]
t = Timer('do_query()', 'from __main__ import do_query')
print t.timeit(number=20)
print type(vals[0])

OUT:
2000
0.719314813614
<class 'decimal.Decimal'>

Replace the query with "select to_binary_double(1.1) from
tests_place":
OUT:
2000
0.337587118149
<type 'float'>

Based on the above I am -1 on having a flag that interprets NUMBER
fields as floats. There is already an Oracle data type to use for
this.

I tested the pull request 393 (the speedup to ._rowfactory). Above
test case, and the decimal case runs in around one second, the float
case in 0.1 seconds. So, there is a regressions for decimals. On the
other hand floats are way faster. Will investigate...

As for new features for 1.5: there is roughly a day before feature
freeze. Getting any new ticket into 1.5 is really unlikely at this
point, existing pull requests have priority.

- Anssi

Shai Berger

unread,
Sep 29, 2012, 6:23:52 PM9/29/12
to django-d...@googlegroups.com
Hi, and thanks for your attention. I know you must be extremely busy with
other issues of the feature freeze. I just want to make some things clear:

> Based on the above I am -1 on having a flag that interprets NUMBER
> fields as floats. There is already an Oracle data type to use for
> this.
>

The flag does not affect the interpretation of NUMBER columns with scale and
precision. True, in its current form, it turns bare NUMBER columns (decimal-
precision floating point) to floats. But these are columns that are not
generated by Django's built-in fields; if you have them, you are doing
something Oracle-specific. The flag is mostly intended to improve cross-database
compatibility; other than that, it offers a very clear trade-off and it is off by
default. It really shouldn't hit anybody by surprise.

> > > I am not yet sure if it is a good idea or not. The
> > > basic problem is granularity: wouldn't you want to do this per query,
> > > or even per expression basis?
> >
> > No, I don't; the issue is cross-database compatibility.
> >
> The problem seems to be that on Oracle, 1.1 is NUMERIC, and that is
> something which _is_ a decimal in Python. We should not change that.
>
> To me it seems the answer is to use to_binary_float/to_binary_double
> in raw SQL queries. For example:
>

But this is an Oracle-specific solution. Exactly the thing I'm trying to avoid.

>
> I tested the pull request 393 (the speedup to ._rowfactory). Above
> test case, and the decimal case runs in around one second, the float
> case in 0.1 seconds. So, there is a regressions for decimals. On the
> other hand floats are way faster. Will investigate...
>
I confirm your findings: Running with different parameters (10,000 rows) I found
Decimals are ~1.8x slower, floats are ~3 times faster, and significantly, ints
are ~2 times faster. I suspect in the typical Django application, most numbers
are ints.

I investigated a little myself, and found the reason to be the redundant
lookup in _decimal_or_int(): It was referring to Decimal as "decimal.Decimal".
I moved that to a default argument, and got performance to be ~40% better than
trunk for decimals too. The pull request is already updated.

> As for new features for 1.5: there is roughly a day before feature
> freeze. Getting any new ticket into 1.5 is really unlikely at this
> point, existing pull requests have priority.
>
FWIW, the pull request exists (402).

Thanks for your help,

Shai.

Shai.

Anssi Kääriäinen

unread,
Sep 29, 2012, 7:03:53 PM9/29/12
to Django developers
On 29 syys, 23:25, Anssi Kääriäinen <anssi.kaariai...@thl.fi> wrote:

> I tested the pull request 393 (the speedup to ._rowfactory). Above
> test case, and the decimal case runs in around one second, the float
> case in 0.1 seconds. So, there is a regressions for decimals. On the
> other hand floats are way faster. Will investigate...

So, while investigating this I found out the reason for the
performance regression is calling _decimal_or_int for each row. As far
as I can tell there is no alternative to doing a call for each row.
The next idea was to cache the type of the column so that conversions
would be faster. This leads to an interesting test case:

cur.execute("select case when dbms_random.random > 0.5 then 0.1
else 0 end from testtable")

Guess what is the type of the column? It is both Decimal and int,
alternating based on the random result.

This seems like surprising behaviour to me. The problem is in
cx_Oracle, not Django code. In raw cx_Oracle queries you get ints and
floats mixed. I don't believe it is OK to return different types for
the same column in one resultset. I will report to cx_Oracle mailing
list and see if they have anything to say about this. DBAPI2 doesn't
offer any clear answer to this, all numeric types are put under same
type, NUMBER, no matter if they are ints, floats or Decimals.

To me it seems we can't do anything to this in Django. We simply don't
have enough information available in the outputtypehandler to guess
the type correctly.

- Anssi

Anssi Kääriäinen

unread,
Sep 29, 2012, 7:30:00 PM9/29/12
to Django developers
On 30 syys, 01:24, Shai Berger <s...@platonix.com> wrote:
> > The problem seems to be that on Oracle, 1.1 is NUMERIC, and that is
> > something which _is_ a decimal in Python. We should not change that.
>
> > To me it seems the answer is to use to_binary_float/to_binary_double
> > in raw SQL queries. For example:
>
> But this is an Oracle-specific solution. Exactly the thing I'm trying to avoid.

I just don't like a global flag. You could easily then have some code
that requires numbers as decimals, and some other code that requires
numbers as floats. A global flag is a no-go in this situation.

> > I tested the pull request 393 (the speedup to ._rowfactory). Above
> > test case, and the decimal case runs in around one second, the float
> > case in 0.1 seconds. So, there is a regressions for decimals. On the
> > other hand floats are way faster. Will investigate...
>
> I confirm your findings: Running with different parameters (10,000 rows) I found
> Decimals are ~1.8x slower, floats are ~3 times faster, and significantly, ints
> are ~2 times faster. I suspect in the typical Django application, most numbers
> are ints.
>
> I investigated a little myself, and found the reason to be the redundant
> lookup in _decimal_or_int(): It was referring to Decimal as "decimal.Decimal".
> I moved that to a default argument, and got performance to be ~40% better than
> trunk for decimals too. The pull request is already updated.

This looks very good.

> > As for new features for 1.5: there is roughly a day before feature
> > freeze. Getting any new ticket into 1.5 is really unlikely at this
> > point, existing pull requests have priority.
>
> FWIW, the pull request exists (402).

I walked into this one... But, sorry, I will not have time for this
pull request for 1.5.

- Anssi

Shai Berger

unread,
Sep 29, 2012, 7:42:32 PM9/29/12
to django-d...@googlegroups.com
On Sunday 30 September 2012, Anssi Kääriäinen wrote:
>
> I walked into this one... But, sorry, I will not have time for this
> pull request for 1.5.
>
Ok, I can't say you didn't warn me this would be the case. Thanks a lot for
the time you did spend on it,

Shai.

Shai Berger

unread,
Sep 29, 2012, 7:51:13 PM9/29/12
to django-d...@googlegroups.com
On Sunday 30 September 2012, Anssi Kääriäinen wrote:
>
> cur.execute("select case when dbms_random.random > 0.5 then 0.1
> else 0 end from testtable")
>
> Guess what is the type of the column? It is both Decimal and int,
> alternating based on the random result.
>
> This seems like surprising behaviour to me. The problem is in
> cx_Oracle, not Django code. In raw cx_Oracle queries you get ints and
> floats mixed.

I tried this query in sqlplus, and the response looks like mixed ints and
floats there too. I don't know a way to get a value's datatype inside Oracle,
so it's hard to be sure, but it could be that cx_Oracle is only reflecting
Oracle with this.

> I don't believe it is OK to return different types for
> the same column in one resultset. I will report to cx_Oracle mailing
> list and see if they have anything to say about this.

I'm curious as well.

Shai.
Reply all
Reply to author
Forward
0 new messages