django select extremely slow on large table

3,169 views
Skip to first unread message

Joris Benschop

unread,
Jan 22, 2015, 10:32:36 AM1/22/15
to django...@googlegroups.com
Dear List,

I'm trying to run a simple select on a very large Oracle table (1500 million records).

I create a standard django model:
------------------
class Marker(models.Model):
    marker_id = SYSGUID16Field(primary_key=True)  # This field type is a guess.
    markid = models.CharField(unique=True, max_length=20, blank=True)
    crop = models.ForeignKey("variantdb_admin.Crop", blank=True, null=True)
    insid = models.CharField(max_length=5, blank=True)
    insdate = models.DateTimeField(blank=True, null=True)

    class Meta:
        managed = False
        db_table = "prod_schema"."marker"
------------------

then simply run:
>>> x = Marker.objects.filter(markid = 'TO11')
>>> print x

in debugger, this basically creates the following simple query:
DEBUG (44.120) QUERY = u'SELECT "PROD_SCHEMA"."MARKER"."MARKER_ID", "PROD_SCHEMA"."MARKER"."MARKID", "PROD_SCHEMA"."MARKER"."CROP_ID", "PROD_SCHEMA"."MARKER"."INSID", "PROD_SCHEMA"."MARKER"."INSDATE" FROM "PROD_SCHEMA"."MARKER" WHERE "PROD_SCHEMA"."MARKER"."MARKID" = :arg0' - PARAMS = (u'TO11',); args=('TO11',)

As you see, in Django this takes 44 seconds to run.
If I run this exact same query with cx_oracle directly the response time = < 0.1sec.

For small tables the performance of django is quite good, so it almost seems like django tries to do a count on the table before runnign the query or something. Any suggestions on how I can make this lookup faster?

thanks
joris



Message has been deleted

Joris Benschop

unread,
Jan 22, 2015, 11:15:04 AM1/22/15
to django...@googlegroups.com
Thanks you for your response
Unfortunately, yes, the 1.5b records is not something I have control over.

I agree the SYSGUID field needs working. However, if I change this field to any other field type, the query still takes over 40 seconds.

class Marker(models.Model):
    marker_id = models.BinaryField(primary_key=True)  # This field type is a guess.

    markid = models.CharField(unique=True, max_length=20, blank=True)
    crop = models.ForeignKey("variantdb_admin.Crop", blank=True, null=True)
    insid = models.CharField(max_length=5, blank=True)
    insdate = models.DateTimeField(blank=True, null=True)

    class Meta:
        managed = False
        db_table = "prod_schema"."marker"

DEBUG (58.566) QUERY = u'SELECT * FROM (SELECT "_SUB".*, ROWNUM AS "_RN" FROM (SELECT "PROD_SCHEMA"."MARKER"."MARKER_ID", PROD_SCHEMA"."MARKER"."MARKID", "PROD_SCHEMA"."MARKER"."CROP_ID", "PROD_SCHEMA"."MARKER"."INSID", "PROD_SCHEMA"."MARKER"."INSDATE" FROM "PROD_SCHEMA"."MARKER" WHERE "PROD_SCHEMA"."MARKER"."MARKID" = :arg0) "_SUB" WHERE ROWNUM <= 1) WHERE "_RN" > 0' - PARAMS = (u'TO1',); args=('TO1',)


thanks again for your help
joris


On Thu, Jan 22, 2015 at 4:51 PM, SK <korot...@gmail.com> wrote:
To many calculations in the from_db_value function. Need optimizations. And what about query limitations: 1.5b is really needed?



 
class SYSGUID16Field(models.Field):
 default_error_messages
= {
 
'invalid': "'%(value)s' is not a valid SYS_GUID."
 
}

 description
= "A connector to the SYS_GUID() fields for Oracle
 Backends"


 
def __init__(self, *args, **kwargs):
 kwargs
['max_length'] = 16
 
super(SYSGUID16Field, self).__init__(*args,**kwargs)

 
def from_db_value(self, value, connection):
 
#print 'call from_db_value %s' % value
 
if value is None:
 
return value
 
return str(b2a_hex(value)).upper()





четверг, 22 января 2015 г., 18:32:36 UTC+3 пользователь Joris Benschop написал:

--
You received this message because you are subscribed to a topic in the Google Groups "Django users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/django-users/CYIxM8nZl1A/unsubscribe.
To unsubscribe from this group and all its topics, send an email to django-users...@googlegroups.com.
To post to this group, send email to django...@googlegroups.com.
Visit this group at http://groups.google.com/group/django-users.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-users/29682a24-fdb6-4883-882e-58fb53d0a98c%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Tim Chase

unread,
Jan 22, 2015, 11:09:14 PM1/22/15
to django...@googlegroups.com
On 2015-01-22 17:14, Joris Benschop wrote:
> Thanks you for your response
> Unfortunately, yes, the 1.5b records is not something I have
> control over.
[snip]
> WHERE "PROD_SCHEMA"."MARKER"."MARKID" = :arg0

Just out of curiosity, do you have an index on MARKER.MARKID ? And
if so, is it combined with any other fields?

I don't know enough about Oracle to be sure, but it also looks like it
doesn't have an OFFSET/LIMIT, so it's doing some strange ROWNUM
chicanery to produce similar results.

-tkc



Jani Tiainen

unread,
Jan 23, 2015, 1:24:00 AM1/23/15
to django...@googlegroups.com
Hi,

1.5b rows is rather big data.

Could you test what .values() or .values_list() produces in terms of speed.

also print qs might actually pull in all selected records in Python memory so it may take a while to constuct bunch of objects.
> --
> You received this message because you are subscribed to the Google Groups "Django users" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to django-users...@googlegroups.com.
> To post to this group, send email to django...@googlegroups.com.
> Visit this group at http://groups.google.com/group/django-users.
> To view this discussion on the web visit https://groups.google.com/d/msgid/django-users/7db20206-9af7-491f-a0ed-b84f43cd4096%40googlegroups.com.

Anssi Kääriäinen

unread,
Jan 23, 2015, 1:54:27 AM1/23/15
to django...@googlegroups.com
Do you fetch all the results when running the query directly against Oracle? Many Oracle SQL clients do not fetch all the results, instead they fetch just the top 100 or so rows.

 - Anssi

Anssi Kääriäinen

unread,
Jan 23, 2015, 1:56:30 AM1/23/15
to django...@googlegroups.com


Do you fetch all the results when running the query directly against Oracle? Many Oracle SQL clients do not fetch all the results, instead they fetch just the top 100 or so rows.    markid = models.CharField(unique=True, max_length=20, blank=True)
I missed the part where you said you are using cx_oracle directly. Still, make sure you fetch all the results (use list(cursor.fetchall() after execution).

If you don't want to return all the results, you can use the .iterator() method in Django.

 - Anssi

James Schneider

unread,
Jan 23, 2015, 2:20:35 AM1/23/15
to django...@googlegroups.com

How many results do you typically get back from that query?

It sounds like the DB is highly optimized if you can gather results from tables that large that quickly running the query outside of Django. If you are returning a large dataset, then it is more likely that Django is furiously coercing every row in the result set to the appropriate model, which takes time. Anything above a couple hundred rows will probably be noticeable, maybe even less.

If a large result set is desired, I would recommend you filter your query a bit more to reduce the number of rows returned and look at using values() which may help slightly.

Installing the Django debug toolbar will also help, as you will be able to see how long each query takes, which will probably lead you to the culprit. It has a sqlshell management command that also may be of use, printing out information about the queries, etc.

You can also slice the query to only retrieve the first 100 rows, etc.

As quick validation, I'm guessing that if you look at your system process manager while the query is running, you'll see a single python process pegged at 100% for the duration of the query run. If the DB is at fault, python should be sitting idle.

TL;DR; I don't think the query itself is at fault, probably too much data being returned to process.

-James

--
You received this message because you are subscribed to the Google Groups "Django users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-users...@googlegroups.com.
To post to this group, send email to django...@googlegroups.com.
Visit this group at http://groups.google.com/group/django-users.

Erik Cederstrand

unread,
Jan 23, 2015, 3:06:14 AM1/23/15
to Django Users

> Den 23/01/2015 kl. 08.19 skrev James Schneider <jrschn...@gmail.com>:
>
> How many results do you typically get back from that query?

There should be only one result, right? Since the "markid" field is defined as unique and OP is filtering on that with a single value.

First thing to check is that the actual column in Oracle is also unique, and that there is an index on that column alone. AFAIK Django creates an index automatically on unique fields.

The cx_oracle client may be cheating by returning when the query completed, but before results are actually fetched.

Erik

James Schneider

unread,
Jan 23, 2015, 3:48:01 AM1/23/15
to django...@googlegroups.com
Granted, I glazed over that fact. What I did see, though, is the blank=True on a CharField(), which would indicate to me that multiple '' (empty string) entries could potentially be inserted into the database, meaning that it can't be unique (with implied null=False). 

*confused*

I'm guessing the model definition doesn't quite match the database, and given the amount of data, wasn't created using Django (hinted at by managed=False), although I have no experience with Oracle DB's.

The operation needs to be profiled to determine where the bottleneck is. Obviously pulling a single record and coercing it into a small model shouldn't take that long.

The OP also posted a second SQL query with two sub-SELECTS in it, which is probably less efficient, and I have no idea where it came from.

Also, a return on a search against 1.5b rows, even with an index, would probably not take < 0.1sec (100ms if that number is accurate). It's possible/likely that the DB was caching the (now repeated) query, and skewing the timing results. 

It'd be interesting to determine whether or not the DB is caching the query, and then re-run the operation with the same Django model call several times to see if the timing drops down into the sub-second range (it should). 

-James


--
You received this message because you are subscribed to the Google Groups "Django users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-users...@googlegroups.com.
To post to this group, send email to django...@googlegroups.com.
Visit this group at http://groups.google.com/group/django-users.

Joris Benschop

unread,
Jan 23, 2015, 3:58:22 AM1/23/15
to django...@googlegroups.com
THank you all for your help

there's a few things here:
* This Oracle table is highly partitioned and optimized.There's indexes everywhere. There's two full time senior Oracle DBAs working on tuning this DB.So a return of 0.1 sec is not strange for this setup.
(btw the DBAs are really impressed with the query setup by django, using bind vars and all)

* The query only returns a single record (as also seen from the Unique=True in the model definition). There is of course a unique index on the MARKID column.

* If I run the exact query (included the nested _ROWNUM selection wrapper that Django does ) in cx_oracle or sqldev the query is very very fast. This is not a basic DB problem

* If I monitor CPU, then executing the query does not result in any specific CPU load on the python process. Still, the query takes 50 seconds. And again, running this exact query in cx_Oracle with the same credentials is <0.1 sec to respond

* connection from django to small tables is very fast!

* there is no caching, as asking the result from django twice is also slow

@james: two two subselects are generated by django


The main problem seems to be that Django Oracle backend connects different than vanilla cx_Oracle.
This is noticable from the fact that I cannot set the pooled connect as NAME, but I need to specify a server. in cx_Oracle, I need to specify the pool instead, or I get an error.
Now the real weirdness is that this results in a different execution plan of the query. For cx_Oracle, the indexes are used properly. for Django it uses a full table scan. That of course takes a lot of time
This is obviously partly oracles fault and we're looking into that. Howevr I still find it strange why I cannot use the pooled connection in django, or why django is connecting differently to the oracle db than cx_oracle (which I think is embedded in the oracle backend of django)





Stephen J. Butler

unread,
Jan 23, 2015, 4:43:52 AM1/23/15
to django...@googlegroups.com
Could it be a character set issue? That is, perhaps the database
NLS_CHARACTERSET is not unicode friendly, your column is VARCHAR
instead of NVARCHAR2, and Django is sending unicode instead of the
native database charset (which cx_Oracle directly might use).

I'm just brainstorming on what might cause Oracle to ignore an index.
That it is in a different character set from the connection might be a
possibility.

I use Oracle all the time for Django projects. I took a look at some
of my CharField columns generated by South migrations and they're all
NVARCHAR2 (AL16UTF16 in my case).
> --
> You received this message because you are subscribed to the Google Groups
> "Django users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to django-users...@googlegroups.com.
> To post to this group, send email to django...@googlegroups.com.
> Visit this group at http://groups.google.com/group/django-users.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/django-users/6f829cee-afd2-4514-baab-1d7a05d46159%40googlegroups.com.

Joris Benschop

unread,
Jan 23, 2015, 5:06:58 AM1/23/15
to django...@googlegroups.com
Very interesting point.
In the backend the MARKID column is of type VARCHAR2(20 BYTE).  This is not something I can change. But it does seem that the bindvar method forces a unicode lookup onto an ascii table.

Is there any way to make django get this field in ascii to test this theory?


Jani Tiainen

unread,
Jan 23, 2015, 5:24:47 AM1/23/15
to django...@googlegroups.com
Hi, I remember having lot of slowness in my 21M row table when I
discovered this issue: https://code.djangoproject.com/ticket/11017

Not sure if that is any help. But I remember that when oracle drops to
implicit data type conversions things tend to get sideways.


--
Jani Tiainen

Joris Benschop

unread,
Jan 23, 2015, 5:28:18 AM1/23/15
to django...@googlegroups.com
This seems to be specific for partial text searches (LIKE, STARTSWITH etc). Still, interesting addition.

Joris Benschop

unread,
Jan 23, 2015, 5:42:10 AM1/23/15
to django...@googlegroups.com
As an addition:
>>>x=Marker.objects.raw("SELECT * from VARIANT_CLONE.MARKER WHERE MARKID='TO1'")
This is also slow

so it indeed seems to be a locale issue
the database is AL32UTF8
and django uses national character set id "2000"

is this a locale issue?


Joris Benschop

unread,
Jan 23, 2015, 6:11:19 AM1/23/15
to django...@googlegroups.com
Skip that,
>>>x=Marker.objects.raw("SELECT * from VARIANT_CLONE.MARKER WHERE MARKID='TO1'")
is fast

DEBUG (0.001) QUERY = u"SELECT * from SCHEMA_PROD.MARKER WHERE MARKID='mTO1'" - PARAMS = (); args=()



Joris Benschop

unread,
Jan 23, 2015, 6:38:19 AM1/23/15
to django...@googlegroups.com
And to keep replying to myself:

This one is slow:
x=Marker.objects.raw("SELECT * from PROD_SCHEMA.MARKER WHERE MARKID= %s",[u'TO1'])
print x[0]

This one is fast:
y=Marker.objects.raw("SELECT * from PROD_SCHEMA.MARKER WHERE MARKID= 'TO1'")
print y[0]

If I copy the table and make the  MARKID field in NVARCHAR2(20), then it uses the index properly on both counts

So... so do I make django use ascii? is there a field type?

Jani Tiainen

unread,
Jan 23, 2015, 6:55:41 AM1/23/15
to django...@googlegroups.com
> --
> You received this message because you are subscribed to the Google Groups "Django users" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to django-users...@googlegroups.com.
> To post to this group, send email to django...@googlegroups.com.
> Visit this group at http://groups.google.com/group/django-users.
> To view this discussion on the web visit https://groups.google.com/d/msgid/django-users/50ce32f1-90ac-4290-a950-ed70c7e9692b%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

It's pretty much impossible without writing custom backend or magical
monkey patching. In upcoming 1.8 or 1.9 is possible to write custom
translators that could help.

If I were you I most probably would resort with fast version of raw
query for now and just document that it's insecure and will be replaced
something better as soon as such a mechanism is available.

--
Jani Tiainen

Jani Tiainen

unread,
Jan 23, 2015, 7:00:13 AM1/23/15
to django...@googlegroups.com
You could try explicit cast to VARCHAR2, not sure does that help at all:

x=Marker.objects.raw("SELECT * from PROD_SCHEMA.MARKER WHERE MARKID=CAST(%s AS VARCHAR2)",[u'TO1'])


--
Jani Tiainen

Edgar Gabaldi

unread,
Jan 23, 2015, 7:16:16 AM1/23/15
to django...@googlegroups.com
You can try to put a db_index=True in markid field.


--
You received this message because you are subscribed to the Google Groups "Django users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-users...@googlegroups.com.
To post to this group, send email to django...@googlegroups.com.
Visit this group at http://groups.google.com/group/django-users.

Joris Benschop

unread,
Jan 23, 2015, 8:04:05 AM1/23/15
to django...@googlegroups.com
That sounds only relevant if Django creates the tables tight? You cannot do that with partitioned Oracle of this size. As much as I like to have django create the schemas, legacy db is all I have. THe entire DB is over 2TB of data, you cannot just change a column field type or add an index without waiting a day to reanalyze.
 

Stephen J. Butler

unread,
Jan 23, 2015, 12:57:14 PM1/23/15
to django...@googlegroups.com
2000 is AL16UTF16. But since there should be a non-lossy conversion
between AL32UTF8 and AL16UTF16 you'd think that Oracle would perform
it and use the index. Puzzling.

Furthermore, when you look in the Oracle db driver you can see in
base.py where it sets NLS_LANG to ".UTF8". This is an old setting for
Oracle and I think it should probably be ".AL32UTF8" for modern
installs. But that doesn't explain where the AL16UTF16 is coming
from... is there a tsnames.ora file involved somewhere?

IDK. I've reached the end of what I know about character sets, Django,
and Oracle :)
> --
> You received this message because you are subscribed to the Google Groups
> "Django users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to django-users...@googlegroups.com.
> To post to this group, send email to django...@googlegroups.com.
> Visit this group at http://groups.google.com/group/django-users.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/django-users/c33b7fc9-e06f-45b0-a677-fc950247dd25%40googlegroups.com.
Reply all
Reply to author
Forward
0 new messages