DAL performance vs executesql


pbreit

unread,
Jun 22, 2012, 2:54:05 AM6/22/12
to web...@googlegroups.com
I have an app where I pull down 5,000 records. I started with the DAL and db.item.ALL, and it was taking around 4 seconds. Selecting just the 5 fields I need in the select() got it down to around 2 seconds. Then I implemented the same thing with executesql and got it down to 500 ms.

So, selecting only the fields you need is a good optimization. But going with raw SQL is much better.
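
For reference, roughly what the three versions looked like (the field names below are just placeholders, not my actual schema):

# 1) DAL, all fields (~4 seconds for 5,000 rows)
rows = db().select(db.item.ALL)

# 2) DAL, only the fields actually needed (~2 seconds)
rows = db().select(db.item.id, db.item.name, db.item.price,
                   db.item.status, db.item.created_on)

# 3) raw SQL via executesql (~500 ms)
rows = db.executesql('SELECT id, name, price, status, created_on FROM item;')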

Alec Taylor

unread,
Jun 22, 2012, 3:08:00 AM6/22/12
to web...@googlegroups.com
From what I understand, people use DAL or ORM layers for abstraction purposes.

So a struct with pointers to functions in C will have less overhead
than a class with member functions in C++. However, working at this
higher level of abstraction allows for some fancy time-saving
mechanisms such as multiple inheritance, templates, encapsulation and
subtype polymorphism.

In the database world: queries will of course run much faster when
written by hand, and stored procedures will beat the equivalent
function built in Python with a layer in between. Does this mean we
should just write pure SQL and stored procedures?

Yes and no. It depends on what you are setting out to do. If
performance is your goal and you're willing to sacrifice abstraction to
achieve it, then do as much as you can on the database and query side.

An additional advantage of using DAL or ORM layers is the ability to
move to a different database engine (e.g. a NoSQL one, or from SQLite
to PostgreSQL, using one for development and one for production).

(Apologies if I went too much into philosophy here; if you have actual
ideas on how to speed up queries without sacrificing abstraction, I'm
sure your patches will be accepted :])

Michele Comitini

unread,
Jun 22, 2012, 4:21:47 AM6/22/12
to web...@googlegroups.com
The DAL is a pure Python implementation, so it is slower than a C implementation.
With large datasets the overhead is high, so it is better to use executesql.
Often there's no other way than pulling all the records at once, but
sometimes a lazy approach to data extraction using limitby
can give the best result: speed plus the DAL's ease of use.
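
For example, something like this (db.item and process() are just placeholders):

chunk = 1000
start = 0
while True:
    # pull the records a page at a time instead of all at once
    rows = db(db.item.id > 0).select(orderby=db.item.id,
                                     limitby=(start, start + chunk))
    if not rows:
        break
    for row in rows:
        process(row)  # whatever you need to do with each record
    start += chunk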

mic


Johann Spies

unread,
Jun 22, 2012, 6:26:28 AM6/22/12
to web...@googlegroups.com
On 22 June 2012 10:21, Michele Comitini <michele....@gmail.com> wrote:
The DAL is a pure python implementation so it is slower than a C implementation.
With large datasets the overhead is high so it is better to use executesql.
Often there's no other way than pulling all the records at once, but
sometimes a lazy approach to data extraction with the use of limitby
can give the best results: speed and DAL ease of use.


I wish it were possible to use other time-saving tools like SQLFORM.grid with db.executesql().

At the moment most of the representation tools, like SQLTABLE and plugin_wiki's jqgrid, assume DAL usage.

Regards
Johann
--
Because experiencing your loyal love is better than life itself,
my lips will praise you.  (Psalm 63:3)

Anthony

unread,
Jun 22, 2012, 11:37:10 AM6/22/12
to web...@googlegroups.com
I wish it were possible to use other time-saving tools like SQLFORM.grid with db.executesql().

At the moment most of the representation tools, like SQLTABLE and plugin_wiki's jqgrid, assume DAL usage.

Maybe it would be possible to convert the results of executesql() to a pseudo-Rows object (without all the processing of each individual field value) so it could be used with the grid, etc.

Anthony 

pbreit

unread,
Jun 22, 2012, 5:17:39 PM6/22/12
to web...@googlegroups.com
I'd be interested to understand what processing goes on. It's definitely very pleasant working with DAL rows, but in this case it's not viable.

Niphlod

unread,
Jun 22, 2012, 6:14:02 PM6/22/12
to web...@googlegroups.com

@pbreit: if I'm not mistaken, the "parse()" function in the DAL handles a lot ... basically it reads every value returned from the database and converts it to the correct type (blob, integer, double, datetime, ..., list:*, just to name a few), handles virtualfields and prepares the Rows object, making the Storage(). The whole resultset is fetched, "parsed()" line by line, column by column, and returned as a complete list. Even if operations were kept to a minimum, building a simple list of 1 million records takes some time (and memory).
Obviously the first shortcut is to use limitby. The second is selecting only the columns you need (best practice anyway, to limit the amount of wire transfer). No other shortcuts.

@all: given that a million records are hardly ever displayed, SQLTABLEd, SQLFORM.gridded, pickled, jsoned, etc., I think the "need" here is having a Storage() object instead of a tuple, so values can be accessed as row.id instead of row[0] (which is what the current executesql gives you).
I see two roads here (to be considered as 1 AND 2 and also as 1 OR 2), but I don't know the implications of having to deal with SQLFORM.grid, for instance.
1) skip the parse() and return a quasi-Rows object "a la executesql", just turning the list of tuples into a list of Row objects, skipping virtual fields, references, etc.
2) treat the returned values as a generator, and optionally as a forward-only iterator "a la web.py's iterbetter()" (https://github.com/webpy/webpy/blob/master/web/db.py#L646), i.e. you can fetch the results and cycle through them only once (see the sketch at the end of this message)

Downsides for method 1):
- some types could not be converted at all, references would be unavailable (but for a million records you should have done a join anyway), no virtualfields
- building a list of a million records still requires time
- for large datasets, having to work on them only "after" the whole list is prepared can take some time
Downsides for method 2):
- may work as expected only if the underlying db driver supports it
- if you need to cycle through the resultset a second time, another query is required, or you have to come up with your own memoization (i.e. prepare an empty list and append the results you're interested in cycling through again)

I used web.py a lot, and even for millions of records it's very fast at returning results.
NB: the "time cut" here is the time from "the db is ready to return results" to "the resultset is done and prepared for me to work on it".
The assumption in 2) is that you rarely need random access into a million records - i.e. myrows[31289], myrows[129459] - and what you can do in 2 loops over the same dataset is usually "accomplishable" (does that word even exist? :D) in a single one.
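
A rough sketch of what 2) could look like, wrapping the raw driver cursor (itersql(), db.item and do_something() are made-up names, and it assumes the underlying driver cursor can stream with fetchone()):

from gluon.storage import Storage

def itersql(db, sql, colnames):
    # forward-only iteration "a la iterbetter": no parse(), one pass only
    db._adapter.execute(sql)
    cursor = db._adapter.cursor
    while True:
        rec = cursor.fetchone()
        if rec is None:
            break
        # colnames must match the order of the columns in the SELECT
        yield Storage(zip(colnames, rec))

for row in itersql(db, "SELECT id, name FROM item", ["id", "name"]):
    do_something(row.id, row.name)  # one pass only; re-run the query for a second pass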

Niphlod

unread,
Jun 22, 2012, 6:24:39 PM6/22/12
to web...@googlegroups.com
BTW, 1) is quite achievable in a breeze if you're not selecting from multiple tables (i.e. nothing like db.dogs.name, db.person.name coming from a join):
from gluon.storage import Storage
# as_dict=True makes executesql return a list of dicts keyed by column name
raw_rowset = db.executesql("thequery", as_dict=True)
# wrapping each dict in Storage gives row.id-style attribute access
myrowset = [Storage(row) for row in raw_rowset]

No time to patch the DAL's executesql and test the timing improvement right now, but if needed I could provide it.

pbreit

unread,
Jun 22, 2012, 7:47:21 PM6/22/12
to web...@googlegroups.com
Interesting, I'll try that out.

I can see a million rows being slow, but it seems 5,000 should move pretty quickly. With the increasing usage of JavaScript frameworks, more data is being retrieved for manipulation on the front end. But in that case you just need JSON, I think, with minimal processing by the server.

pbreit

unread,
Jun 26, 2012, 1:05:00 PM6/26/12
to web...@googlegroups.com
Massimo, any ideas on DAL performance? I wonder if it would make sense to have some flags to skip some of the non-essential processing?

Massimo Di Pierro

unread,
Jun 26, 2012, 8:23:53 PM6/26/12
to web...@googlegroups.com
The only "non-essential" processing that has been mentioned is parsing and normalization of the records returned by the DB.

One can already skip this by specifying an alternate processor function:

    db(...).select(processor=lambda rows,fields,colnames,blob_decode: ....)

where the processor acts like the parse function defined in the DAL's BaseAdapter.
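
For example, a trivial processor that skips parsing entirely and just returns the raw driver records (a sketch only; db.item is just an example table, and the *args/**kwargs are there in case the adapter passes extra arguments):

def raw_processor(rows, fields, colnames, *args, **kwargs):
    # skip parse(): return the driver's raw tuples untouched (no type conversion)
    return rows

records = db(db.item.id > 0).select(processor=raw_processor)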


Anyway, delaying the parsing and making it lazy has no performance advantage. If you retrieve N records, it is because you plan to do something with all of them. If you parse them at retrieve time, you are guaranteed to do it once. If you do it lazily, you make the select faster but you risk parsing them more than once later, thus incurring an overall performance penalty.

The only other thing we can do to speed up the DAL is to avoid re-executing the define_table calls. We are working on this. One option would be to have the table definitions in a module (not a model) which is executed only once, and then use the new recorrect method to recycle an existing db object with all its table definitions. This will break migrations and it is still very experimental.

Is there something else we can, in your opinion, improve?

Massimo

Anthony

unread,
Jun 26, 2012, 8:51:14 PM6/26/12
to web...@googlegroups.com
One can already skip this by specifying an alternate processor function:

    db(...).select(processor=lambda rows,fields,colnames,blob_decode: ....)

We should probably document that. Note, it doesn't appear to pass a blob_decode argument to the processor (though the default parse() method does take that argument).

Anthony

Anthony

unread,
Jun 27, 2012, 12:32:38 AM6/27/12
to web...@googlegroups.com
I wish it were possible to use other time-saving tools like SQLFORM.grid with db.executesql().

At the moment most of the representation tools, like SQLTABLE and plugin_wiki's jqgrid, assume DAL usage.

This wouldn't help with SQLFORM.grid, which does its own query, but for other cases where you need a Rows object with executesql, this might work:

raw_rows = db.executesql('[SQL code]')
rows = db._adapter.parse(raw_rows,
                         fields=[field for field in db.mytable],
                         colnames=db.mytable.fields)

Anthony

Johann Spies

unread,
Jun 28, 2012, 6:18:49 AM6/28/12
to web...@googlegroups.com
On 27 June 2012 03:34, Anthony <abas...@gmail.com> wrote:
I wish it were possible to use other time-saving tools like SQLFORM.grid with db.executesql().

At the moment most of the representation tools, like SQLTABLE and plugin_wiki's jqgrid, assume DAL usage.

This wouldn't help with SQLFORM.grid, which does its own query, but for other cases where you need a Rows object with executesql, this might work:

raw_rows = db.executesql('[SQL code]')
rows = db._adapter.parse(raw_rows,
                         fields=[db.mytable[f] for f in db.mytable.fields],
                         colnames=db.mytable.fields)


Thanks. This helps a little bit.

Regards
Johann

nick name

unread,
Jul 5, 2012, 1:58:12 PM7/5/12
to web...@googlegroups.com
On Friday, June 22, 2012 11:37:10 AM UTC-4, Anthony wrote:
Maybe it would be possible to convert the results of executesql() to a pseudo-Rows object (without all the processing of each individual field value) so it could be used with the grid, etc.

The original ticket that prompted adding processors <http://code.google.com/p/web2py/issues/detail?id=701> has an example that creates namedtuples, which are a kind of pseudo-Row - but I don't know whether that's good enough for the grid or not.
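
Something along these lines, I think (not the exact code from the ticket, just the idea; db.item is an example table):

from collections import namedtuple

def namedtuple_processor(rows, fields, colnames, *args, **kwargs):
    # colnames arrive as 'table.field', so keep only the field part
    Row = namedtuple('Row', [c.split('.')[-1] for c in colnames])
    # namedtuples give row.id-style attribute access without the full parse()
    return [Row(*r) for r in rows]

records = db(db.item.id > 0).select(processor=namedtuple_processor)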

Michael Hall

unread,
Jun 24, 2013, 8:58:52 AM6/24/13
to web...@googlegroups.com
Was there ever a fully working solution proposed or discovered for using executesql with grid and smartgrid?

Johann Spies

unread,
Jul 8, 2013, 9:06:38 AM7/8/13
to web...@googlegroups.com
On 24 June 2013 14:58, Michael Hall <pix...@gmail.com> wrote:
Was there ever a fully working solution proposed or discovered for using executesql with grid and smartgrid?


I don't know of any.  What I sometimes do is create views or tables containing the results of complicated SQL queries and use them from the DAL to view the results in grids.
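
For example (a sketch only; the view name and fields are made up, and it assumes the view exposes an integer id column):

# the view is created by hand in the database, e.g.
#   CREATE VIEW v_report AS SELECT ... ;
db.define_table('v_report',
    Field('customer', 'string'),
    Field('total', 'double'),
    migrate=False)  # don't let the DAL try to create or alter the view

grid = SQLFORM.grid(db.v_report, create=False, editable=False, deletable=False)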

Regards
Johann

Anthony

unread,
Jul 8, 2013, 10:44:07 AM7/8/13
to web...@googlegroups.com
On 24 June 2013 14:58, Michael Hall <pix...@gmail.com> wrote:
Was there ever a fully working solution proposed or discovered for using executesql with grid and smartgrid?


I don't know of any.  What I sometimes do is create views or tables containing the results of complicated SQL queries and use them from the DAL to view the results in grids.

Probably not the most efficient, but you might also be able to do something like:

mem_db = DAL('sqlite:memory')

Then put the contents of a complex .executesql() query into a mem_db table and use that table in the grid.
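
A rough sketch of that idea (the report table, its fields, and the column order of the query are made up; '[SQL code]' stands for the actual query):

mem_db.define_table('report',
    Field('name'),
    Field('total', 'double'))

# copy the results of the complex query into the in-memory table...
for name, total in db.executesql('[SQL code]'):
    mem_db.report.insert(name=name, total=total)

# ...then point the grid at the in-memory table
grid = SQLFORM.grid(mem_db.report, create=False, editable=False, deletable=False)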

Anthony