Using RawSQL-Models

18 views
Skip to first unread message

Adrian R.

unread,
May 14, 2008, 10:10:59 AM5/14/08
to Django developers
Hello Django developers,

I'm new to this mailing list and I hope that I didn't fail completely
by searching the django groups while I was considering to discuss this
on the list:

At the moment (like since 1.5 years) I'm working on a web interface
which is used to control some options for a software testing backend
using Django 0.96.1 on an Apache 2 with Python 2.5.1 and MySQL 5.0.24.

For that I've created a database view for one (the biggest and most
used) page containing seven "LEFT JOIN" and a "GROUP BY" and some
SUM()- and IF()-fields which takes about 5 seconds for more than
30,000 entries in the database. The problem is, that this time is also
taken if I use LIMIT or WHERE conditions (e.g. SELECT * from MyView
where State='FAILED') - which is as I think a MySQL limitation - but
if I enter the SQL statement of the view directly using a where
condition the performance is much better (taking 0.15 seconds).

So my plan was to create a "SQL-Model" which consists of a MySQL-Query
to get the data but which supports the normal methods like filter()
and extra() if possible. First I didn't want to touch the django
sources due to the fact that I'd have to maintain it on every Django
update if you'd not want to take it into the Django repos, but I
couldn't find a good solution how to get that model without doing it,
so I want to do so..

And now I'm sitting here and ask you if there is a possibility to use
JOIN and GRUOP BY and some SUM()- and IF()-fields right now using
QuerySets and custom models and/or a custom manager or if it's planned
for Django. If not, I'd like to know if it would be possible to
integrate it into the normal Django branch if I'd start developing it.
I thought about something like a Model.RawSQL-class or something like
that and some modifications to the QuerySet() but I'm not so
experienced with the Django source (most notably not with the dev-
branch) and so you'll probably have better ideas how to implement
something like that. Comments, corrections, improvements and
additional requirements are very welcome;)

Best regards from Karlsruhe, Germany
- Adrian

Scott Moonen

unread,
May 14, 2008, 10:18:38 AM5/14/08
to django-d...@googlegroups.com
Adrian, are you displaying all 30,000 entries on the same page?  Or are you using some sort of pagination?

  -- Scott
--
http://scott.andstuff.org/ | http://truthadorned.org/

Adrian R.

unread,
May 14, 2008, 10:26:38 AM5/14/08
to Django developers
On 14 Mai, 16:18, "Scott Moonen" <smoo...@andstuff.org> wrote:
> Adrian, are you displaying all 30,000 entries on the same page? Or are you
> using some sort of pagination?

No, they aren't on the same page and I'm using pagination (and the
result can be filtered by different fields), but the problem is as far
as I understood it, that the view is first processed completely in the
database before it returns the result-slice. So it takes as long as it
takes without a LIMIT or WHERE. That's the reason why I try to use raw
SQL because then the execution of the statement is stopped when the
LIMIT is reached as far as I'm understanding MySQL;)

- Adrian

Jeremy Dunck

unread,
May 14, 2008, 10:36:21 AM5/14/08
to django-d...@googlegroups.com, Django developers

On May 14, 2008, at 9:26, "Adrian R." <ilikeg...@googlemail.com>
wrote:

> the problem is as faras I understood it, that the view is first

> processed completely in the
> database before it returns the result-slice. So it takes as long as it
> takes without a LIMIT or WHERE. That's the reason why I try to use raw
> SQL because then the execution of the statement is stopped when the
> LIMIT is reached as far as I'm understanding

In this 'raw SQL', are you still calling the view? If so, your theory
of limiting being the cause doesn't make sense. If not, you could
just do raw SQL through django.db.connection.cursor().

See if that's still slow.

Scott Moonen

unread,
May 14, 2008, 10:53:01 AM5/14/08
to django-d...@googlegroups.com

Adrian, that's interesting.  Are you using a QuerySetPaginator and not a Paginator?  The QuerySetPaginator should be smart enough to ensure that only the given slice is retrieved from the database.

  -- Scott

Adrian R.

unread,
May 14, 2008, 11:36:16 AM5/14/08
to Django developers
On 14 Mai, 16:36, Jeremy Dunck <jdu...@gmail.com> wrote:
> On May 14, 2008, at 9:26, "Adrian R." <ilikegoodn...@googlemail.com>
> wrote:
>
> > the problem is as faras I understood it, that the view is first
> > processed completely in the
> > database before it returns the result-slice. So it takes as long as it
> > takes without a LIMIT or WHERE. That's the reason why I try to use raw
> > SQL because then the execution of the statement is stopped when the
> > LIMIT is reached as far as I'm understanding
>
> In this 'raw SQL', are you still calling the view? If so, your theory
> of limiting being the cause doesn't make sense. If not, you could
> just do raw SQL through django.db.connection.cursor().

Okay, I'm sorry, I think I didn't specify the problem enough:
In this context I'm talking about a database view which was created on
the database server (MySQL). To show the performance problem I'll show
you some of the result times as the (approximately) appear when they
are executed in the MySQL Query Browser.

"SELECT * FROM view" ~4 sec.
"SELECT * FROM view WHERE State='FAILED'" ~4 sec.
"SELECT * FROM view LIMIT 50" ~4 sec.

So I come to the result that MySQL first generates a TEMPTABLE (MERGE
not possible due to utilization of GROUP BY) and then filters/limits
the TEMPTABLE.

When I use the SQL which created the view and append WHERE or LIMIT
the times are much better:

"SELECT * FROM table1 LEFT JOIN table2 on [...] LEFT JOIN [...] GROUP
BY column" ~4 sec. - logical because its the complete listing (like
above)
"SELECT * FROM table1 LEFT JOIN table2 on [...] LEFT JOIN [...] WHERE
State='FAILED' GROUP BY column" ~ 0.13 s.
"SELECT * FROM table1 LEFT JOIN table2 on [...] LEFT JOIN [...] GROUP
BY column LIMIT 50" ~ 0.007 s.

The last two cases are the important ones because the display is
always limited/filteres.

On 14 Mai, 16:53, "Scott Moonen" <smoo...@andstuff.org> wrote:
> On Wed, May 14, 2008 at 10:26 AM, Adrian R. <ilikegoodn...@googlemail.com>
> wrote:
>
> > On 14 Mai, 16:18, "Scott Moonen" <smoo...@andstuff.org> wrote:
> > > Adrian, are you displaying all 30,000 entries on the same page? Or are
> > you
> > > using some sort of pagination?
>
> > No, they aren't on the same page and I'm using pagination (and the
> > result can be filtered by different fields), but the problem is as far
> > as I understood it, that the view is first processed completely in the
> > database before it returns the result-slice. So it takes as long as it
> > takes without a LIMIT or WHERE. That's the reason why I try to use raw
> > SQL because then the execution of the statement is stopped when the
> > LIMIT is reached as far as I'm understanding MySQL;)
>
> Adrian, that's interesting. Are you using a QuerySetPaginator and not a
> Paginator? The QuerySetPaginator should be smart enough to ensure that only
> the given slice is retrieved from the database.

The object are received by calling

MyViewModel.objects.select_related().extra(
select=additionalSelects,
where=filterOptions
)pagination.GetOffset():pagination.GetLimit()]

where MyViewModel is a model to access the database-view (does not
create tables) and map them to django-objects. The pagination-class
was written by me and the GetOffset() and GetLimit() methods are
returning some integer values (which are 10, 25, 50 or 100 apart from
each other).

So that leads me to my decision to put this SQL-code into a model so
that the query is performed faster than by the view. Okay, it's
somehow like converting a MySQL-problem to a Django-problem, but this
is the best solution as far as I know.

Okay, now to the reason why I don't want to use the
django.db.connection.cursor():
Parts of my code use (as seen above) the extra() and filter()-methods
and I originally didn't want to rewrite the whole QuerySet-Class and
put it into my model;)

- Adrian

Scott Moonen

unread,
May 14, 2008, 11:44:45 AM5/14/08
to django-d...@googlegroups.com
Adrian, disregard my previous question (you're using your own paginator).

Does your paginator use len(set) or does it use set.count() to determine the number of items?  You should find that the latter has much better performance.

  -- Scott

Adrian R.

unread,
May 14, 2008, 11:56:12 AM5/14/08
to Django developers
On 14 Mai, 17:44, "Scott Moonen" <smoo...@andstuff.org> wrote:
> Adrian, disregard my previous question (you're using your own paginator).
>
> Does your paginator use len(set) or does it use set.count() to determine the
> number of items? You should find that the latter has much better
> performance.

The total number of items returned by the query is determined by
calling

MyViewModel.objects.extra(where=filterOptions).count()

which has the problem, that it takes as long as the normal query which
results in a total time of approximately 8-10 seconds to open the page:
(
So if I could speed up the request by using the SQL-Model mentioned
above the whole page would probably gain 8-9 seconds.

- Adrian

Russell Keith-Magee

unread,
May 14, 2008, 9:12:27 PM5/14/08
to django-d...@googlegroups.com
On Wed, May 14, 2008 at 10:10 PM, Adrian R.
<ilikeg...@googlemail.com> wrote:
>
> So my plan was to create a "SQL-Model" which consists of a MySQL-Query
> to get the data but which supports the normal methods like filter()
> and extra() if possible. First I didn't want to touch the django
> sources due to the fact that I'd have to maintain it on every Django
> update if you'd not want to take it into the Django repos, but I
> couldn't find a good solution how to get that model without doing it,
> so I want to do so..
>
> And now I'm sitting here and ask you if there is a possibility to use
> JOIN and GRUOP BY and some SUM()- and IF()-fields right now using
> QuerySets and custom models and/or a custom manager or if it's planned
> for Django. If not, I'd like to know if it would be possible to
> integrate it into the normal Django branch if I'd start developing it.

First off: support for GROUP BY, SUM and other aggregation functions
is a work in progress. The new QS-RF code contains placeholders to
hold GROUP BY statements, but there isn't a public API to get a GROUP
BY into your queries at present. There is a GSOC code aiming to
implement aggregates in the Django ORM, which will obviously involve
introducing GROUP BY statements into the generated SQL.

However, other than that point, I'm afraid I don't understand what
you're proposing. You seem to be suggesting a new kind of model/query
base class that will have all the capabilities of QuerySet/Model, but
will be "raw SQL". You are evidently having some sort of performance
problem, but I'm not sure how this new "raw SQL Model" will fix this
problem.

At the end of the day, a QuerySet is just a mechanism for producing
SQL statements. If that SQL is inefficient for some reason, we're open
to suggestions. However, I'd be highly surprised if reimplementing
QuerySet turned out to be the right solution to your problem.

Can you give a concrete example of something that doesn't work the way
you expect? What exactly are you proposing to implement? How would
this address the problem you are having? We're not opposed to
considering any idea - but first, we need to understand what the idea
is :-)

Yours,
Russ Magee %-)

Michael Elsdörfer

unread,
May 15, 2008, 5:25:47 AM5/15/08
to Django developers
> So that leads me to my decision to put this SQL-code into a model so
> that the query is performed faster than by the view.

I think what you might be looking for is a custom (default) manager
for your model that adds your base query to each QuerySet interaction.
I suppose depending on what that query looks like and whether you are
using qsrf you might have to fall back to connection.cursor(), or not.

Michael

Adrian R.

unread,
May 16, 2008, 5:24:14 AM5/16/08
to Django developers
On May 15, 3:12 am, "Russell Keith-Magee" <freakboy3...@gmail.com>
wrote:
> First off: support for GROUP BY, SUM and other aggregation functions
> is a work in progress. The new QS-RF code contains placeholders to
> hold GROUP BY statements, but there isn't a public API to get a GROUP
> BY into your queries at present. There is a GSOC code aiming to
> implement aggregates in the Django ORM, which will obviously involve
> introducing GROUP BY statements into the generated SQL.
>
> However, other than that point, I'm afraid I don't understand what
> you're proposing. You seem to be suggesting a new kind of model/query
> base class that will have all the capabilities of QuerySet/Model, but
> will be "raw SQL". You are evidently having some sort of performance
> problem, but I'm not sure how this new "raw SQL Model" will fix this
> problem.
>
> At the end of the day, a QuerySet is just a mechanism for producing
> SQL statements. If that SQL is inefficient for some reason, we're open
> to suggestions. However, I'd be highly surprised if reimplementing
> QuerySet turned out to be the right solution to your problem.
>
> Can you give a concrete example of something that doesn't work the way
> you expect? What exactly are you proposing to implement? How would
> this address the problem you are having? We're not opposed to
> considering any idea - but first, we need to understand what the idea
> is :-)

Hello Russ,

I'll try to explain my problem in-depth;)
As described above the SQL-query which is used to return the data uses
aggregated fields, some fields using IF and other SQL-Commands and
some JOINs (also two times accessing the same tables but aliasing them
which does AFAIK not work with extra(tables=[...])).
The database view performs really bad, so I'm trying to relocate the
SQL into the Django ORM. The problem is now, that I can't use the
default QuerySet-functions to build the query I need (to use it in a
custom manager) and by falling back to raw SQL I can't use the extra-
and filter-commands which are supplied by Django.

- Adrian

Adrian R.

unread,
May 16, 2008, 9:27:17 AM5/16/08
to Django developers
Okay, so I've got some additional information for you which are
hopefully useful for you. I've tried to switch to the dev version but
I don't think that I'm on the right way right now.

So, here the SQL-code which generates the database view:

SELECT tr.*, tp.TestPlanName,
(SELECT IF(COUNT(ProductVersionId) > 0, 1, 0) FROM Masters WHERE
TestCaseId=tr.TestCaseId GROUP BY TestCaseId) AS UsesVM,
civ.SvnRevision CanSvnRevision,
civ.BuildDate CanBuildDate,
concat(cpv.BuildVersion, '-', civ.MinorVersion) CanVersion,
civ.ProductVersionId CanProductVersionId,
civ.State CanState,
civ.EnvironmentId CanEnvironmentId,
miv.SvnRevision MasSvnRevision,
miv.BuildDate MasBuildDate,
concat(mpv.BuildVersion, '-', miv.MinorVersion) MasVersion,
miv.ProductVersionId MasProductVersionId,
COUNT(IF(er.TestRunId = tr.TestRunId, 1, NULL)) numEvalResults,
COUNT(IF(er.TestRunId = tr.TestRunId AND er.DeviationGrade = 'A',1,
NULL)) DeviationA,
COUNT(IF(er.TestRunId = tr.TestRunId AND er.DeviationGrade = 'B',1,
NULL)) DeviationB,
COUNT(IF(er.TestRunId = tr.TestRunId AND er.DeviationGrade = 'C',1,
NULL)) DeviationC,
COUNT(IF(er.TestRunId = tr.TestRunId AND er.DeviationGrade = 'D',1,
NULL)) DeviationD,
COUNT(IF(er.TestRunId = tr.TestRunId AND er.DeviationGrade > 'D',1,
NULL)) DeviationZ,
MAX(er.DeviationGrade) MaxDeviation
FROM testruns tr
LEFT JOIN InstalledVersions civ ON
(tr.InstalledVersionId=civ.InstalledVersionId)
LEFT JOIN ProductVersions cpv ON
(civ.ProductVersionId=cpv.ProductVersionId)
LEFT JOIN InstalledVersions miv ON (tr.UsedAsMaster =
miv.InstalledVersionId)
LEFT JOIN ProductVersions mpv ON (miv.ProductVersionId =
mpv.ProductVersionId)
LEFT JOIN TestPlans tp ON (tr.TestPlanId = tp.TestPlanId)
LEFT JOIN EvaluationResults er ON (tr.TestRunId = er.TestRunId)
GROUP BY TestRunId

Tried to optimize it a bit to run faster. It also is pretty fast - if
not used in a view;)
So.. I started with this model:

#==============================================================================
class vTestRunOverview(models.Model):
#==============================================================================
"""
This model should just be used for read only-access to the
vTestRunOverview-View in the MySQL-DB.
"""
objects = TestRunOverviewManager()
# From table TestRuns
testRunId = models.ForeignKey(TestRun, blank=True,
primary_key=True, db_column='TestRunId')
testCaseId = models.ForeignKey(TestCase, blank=True,
db_column='TestCaseId')
installedVersionId = models.ForeignKey(InstalledVersion,
blank=True, db_column='InstalledVersionId')
state = models.CharField(max_length=20, choices = TESTRUN_STATE,
db_column='State')
updateTime = models.DateTimeField(db_column='UpdateTime',
editable=False)
usedAsMaster = models.ForeignKey(InstalledVersion, null=True,
db_column='UsedAsMaster', related_name='UsedAsMaster')
startDuration = models.DecimalField(max_digits=10,
decimal_places=2, db_column='StartDuration')
loadDuration = models.DecimalField(max_digits=10,
decimal_places=2, db_column='LoadDuration')
runDuration = models.DecimalField(max_digits=10, decimal_places=2,
db_column='RunDuration')
exitDuration = models.DecimalField(max_digits=10,
decimal_places=2, db_column='ExitDuration')
testDuration = models.DecimalField(max_digits=10,
decimal_places=2, db_column='TestDuration')

# From table TestPlans
testPlanName = models.CharField(max_length=20,
db_column='TestPlanName')

# From Table Masters
usesVM = models.IntegerField(db_column='UsesVM')

# From table InstalledVersions
cSvnRevision = models.CharField(max_length=10,
db_column='CanSvnRevision')
cBuildDate = models.DateField(null=True, blank=True,
db_column='CanBuildDate')
cProductVersionId = models.IntegerField(null=True, blank=True,
db_column='CanProductVersionId')
cState = models.CharField(max_length=10, db_column='CanState')
cEnvironmentId = models.IntegerField(db_column='CanEnvironmentId')
mSvnRevision = models.CharField(max_length=10,
db_column='MasSvnRevision')
mBuildDate = models.DateField(null=True, blank=True,
db_column='MasBuildDate')
mProductVersionId = models.DateField(null=True, blank=True,
db_column='MasProductVersionId')

# Derived fields
cVersion = models.CharField(max_length=41, db_column='CanVersion')
mVersion = models.CharField(max_length=41, db_column='MasVersion')
numEvalResults = models.IntegerField(db_column='NumEvalResults')
deviationACnt = models.IntegerField(db_column='DeviationA')
deviationBCnt = models.IntegerField(db_column='DeviationB')
deviationCCnt = models.IntegerField(db_column='DeviationC')
deviationDCnt = models.IntegerField(db_column='DeviationD')
deviationZCnt = models.IntegerField(db_column='DeviationZ')
maxDeviation = models.CharField(max_length=10,
db_column='MaxDeviation')


#--------------------------------------------------------------------------
class Meta:

#--------------------------------------------------------------------------
db_table = 'TestRuns'

As far as I understand it, this is also required if I use a custom
Manager because django uses it to determine related fields (I need
select_related()).
Now I tried to create a custom manager which should just consist of a
get_query_set() with all fields mentioned above added to the
extra(select=[...]). To get the joins, I trued the extra(tables=[...])
but without the possibility of using aliases for tables that wasn't
such a good idea (okay, as mentioned in the documentation). I got the
error that the tables aren't uniqie ("Not unique table/alias:
'InstalledVersions'").

Here is my Manager:

#==============================================================================
class TestRunOverviewManager(models.Manager):
#==============================================================================
def get_query_set(self):
return super(TestRunOverviewManager,
self).get_query_set().extra(
select={
'UsesVM': '(SELECT IF(COUNT(ProductVersionId) > 0, 1,
0) FROM Masters WHERE TestCaseId=tr.TestCaseId GROUP BY TestCaseId)',
'CanSvnRevision': 'civ.SvnRevision',
'CanBuildDate': 'civ.BuildDate',
'CanVersion': 'concat(cpv.BuildVersion, \'-\',
civ.MinorVersion)',
'CanProductVersionId': 'civ.ProductVersionId',
'CanState': 'civ.State',
'CanEnvironmentId': 'civ.EnvironmentId',
'MasSvnRevision': 'miv.SvnRevision',
'MasBuildDate': 'miv.BuildDate',
'MasVersion': 'concat(mpv.BuildVersion, \'-\',
miv.MinorVersion)',
'MasProductVersionId': 'miv.ProductVersionId',
'numEvalResults': 'COUNT(IF(er.TestRunId =
tr.TestRunId, 1, NULL))',
'DeviationA': 'COUNT(IF(er.TestRunId = tr.TestRunId
AND er.DeviationGrade = \'A\',1, NULL))',
'DeviationB': 'COUNT(IF(er.TestRunId = tr.TestRunId
AND er.DeviationGrade = \'B\',1, NULL))',
'DeviationC': 'COUNT(IF(er.TestRunId = tr.TestRunId
AND er.DeviationGrade = \'C\',1, NULL))',
'DeviationD': 'COUNT(IF(er.TestRunId = tr.TestRunId
AND er.DeviationGrade = \'D\',1, NULL))',
'DeviationZ': 'COUNT(IF(er.TestRunId = tr.TestRunId
AND er.DeviationGrade > \'D\',1, NULL))',
'MaxDeviation': 'MAX(er.DeviationGrade)'
},
tables=['InstalledVersions', 'ProductVersions',
'InstalledVersions', 'TestPlans', 'EvaluationResults']
).select_related()

I hope you can help my with this;)

Best Regards
- Adrian

Russell Keith-Magee

unread,
May 21, 2008, 9:50:08 AM5/21/08
to django-d...@googlegroups.com
On Fri, May 16, 2008 at 9:27 PM, Adrian R. <ilikeg...@googlemail.com> wrote:
>
> Okay, so I've got some additional information for you which are
> hopefully useful for you. I've tried to switch to the dev version but
> I don't think that I'm on the right way right now.

I don't mean to be rude, but this isn't particularly helpful. A
pageful of SQL doesn't really help me determine the problem you are
having, or how a 'Raw SQL model' will solve the problem.

Can you reduce your problem to a simple (or, at least, simplified) case?

Yours,
Russ Magee %-)

Adrian R.

unread,
May 21, 2008, 1:17:38 PM5/21/08
to Django developers
On 21 Mai, 15:50, "Russell Keith-Magee" <freakboy3...@gmail.com>
wrote:
> On Fri, May 16, 2008 at 9:27 PM, Adrian R. <ilikegoodn...@googlemail.com> wrote:
>
> > Okay, so I've got some additional information for you which are
> > hopefully useful for you. I've tried to switch to the dev version but
> > I don't think that I'm on the right way right now.
>
> I don't mean to be rude, but this isn't particularly helpful. A
> pageful of SQL doesn't really help me determine the problem you are
> having, or how a 'Raw SQL model' will solve the problem.
>
> Can you reduce your problem to a simple (or, at least, simplified) case?

I think that there isn't much more to explain, because today we
decided to switch to PostgreSQL which solved the problem in first
tests and hopefully keeps this good performance in other tests.

But if you still want to know what I tried to explain in my last six
messages:
Accessing the database-view using a WHERE-condition is as slow as
without and it takes 350 times more time to generate results instead
of using the SQL which generated the view and put a WHERE-condition
directly into it. Optimizing the view didn't solve the problem. The
problem isn't the query (because with WHERE-conditions or a LIMIT it
performed very well - and it was the usual case to use conditions or
limits) but the view which was with and without conditions equally
slow.
So I decided to use Django: I couldn't generate the query using
Django's ORM. So tried to put the SQL directly into Django without
using the db.cursor() which wouldn't solve my problem because i
couldn't use extra() anymore. It didn't work because of table-aliases
I think, but I didn't do any further testing.
But I'd recommend to extend the extra(tables=...)-method to accept own
aliases for tables.

Oh, I think at least this all wasn't really my problem. The problem is
the way how the MySQL-guys have implemented views (http://
bugs.mysql.com/bug.php?id=23136). My attempt was just a workaround to
solve this...

So the idea behind my question was to implement some sort of database-
view in django, so that you just take the sql which would create the
view and put it into a model. Why? Because MySQL doesn't handle some
views very well and maybe other DBMS also don't and you could make new
models which could just join and aggregate data from other tables
or...

I don't know. hope that this isn't my problem anymore..
Reply all
Reply to author
Forward
0 new messages