DAL speed - an idea


nick name

Feb 9, 2012, 1:51:47 PM
to web...@googlegroups.com
One of my controllers needs to go through a lot of records to provide a meaningful answer -- as in, 60k records.

Just loading them from the database takes about 100ms (db.executesql("select * from table order by id;")); doing the same through the DAL takes over 6 seconds. I realize that the DAL does do a lot of additional work, which in general is helpful -- but I can do without all the parsing / Rows() generation for this.

What do people here think about adding a db.rawselect(...), which is a slim wrapper for db.executesql(db._select(...)) that wraps each row in a named tuple? It solves most of the speed problem when it is needed, but still maintains a lot of the features of the DAL's SQL processing.

Bruno Rocha

Feb 9, 2012, 1:57:35 PM
to web...@googlegroups.com
You mean using the generated SQL command to fire executesql and return the results as named tuples?

>>> db(db.query).raw_select()

it is the same as doing

command = db(db.query)._select()
result = db.executesql(command)
--

nick name

Feb 9, 2012, 3:04:41 PM
to web...@googlegroups.com
Yes, that is the basis of what I am suggesting.

There is not currently such a thing; there is something called 'select_raw' implemented in the GoogleDataStore adapter, but not in anything else, and it isn't exactly what I am proposing.

To elaborate:

Assume the table is defined as follows:
  
    reftable = db.define_table('reftable', Field('a', 'string'))
    table = db.define_table('table', Field('b', 'reference reftable'))

In my case, I need to pull all the records (60,000) from the database to compute some aggregation which I cannot compute using sql. There are two alternatives here:

    r1 = db().select(table.ALL) # takes > 6 seconds

    r2 = db.executesql(db._select(table.ALL)) # takes ~0.1sec

The records returned in the first instance are much richer; they have record chasing (e.g. I can do r1[0].b.a to select through the foreign key), they have methods like r1[0].update_record() and r1[0].delete_record(), and other nice stuff.

However, for this use, I don't need those extra features, and I do need the speed, so I would rather use r2. But r2 is not a direct replacement -- it doesn't have the column names. If I use

    r3 = db.executesql(db._select(table.ALL), as_dict=True) # still takes ~0.1sec

I can do r3[0]['b'] but I cannot do r3[0].b; and it takes a lot more memory than r2.

A suggestion: add another parameter, processor=..., which, if given, will be called with the cursor description and must return a function through which each row is passed. For example:

    def named_tuple_process(name, description):
        from collections import namedtuple
        # item [0] of each entry in the cursor description is the column name
        fields = ' '.join([x[0] for x in description])
        return namedtuple(name, fields)

    r4 = db.executesql(db._select(table.ALL), processor=lambda x: named_tuple_process('tablerec', x))

r4[0].b # will now work; not a full replacement, but good enough for many uses.

In fact, you can do that externally:

    r4 = db.executesql(db._select(table.ALL))
    f = named_tuple_process('tablerec', db._adapter.cursor.description)
    r4 = [f._make(x) for x in r4]  # namedtuple._make builds a record from a plain row tuple

But this requires reaching into the internals of the db adapter.
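Outside web2py, the same trick works against any DB-API adapter. A minimal runnable sketch, using the stdlib sqlite3 module in place of the DAL adapter (the table and column names here are invented for the example):

```python
import sqlite3
from collections import namedtuple

def named_tuple_process(name, description):
    # cursor.description is a sequence of 7-tuples; item [0] is the column name
    fields = ' '.join(col[0] for col in description)
    return namedtuple(name, fields)

conn = sqlite3.connect(':memory:')
conn.execute('create table t (id integer, b text)')
conn.executemany('insert into t values (?, ?)', [(1, 'x'), (2, 'y')])

cur = conn.execute('select id, b from t order by id')
f = named_tuple_process('tablerec', cur.description)
rows = [f._make(r) for r in cur.fetchall()]
print(rows[0].b)  # attribute access, like a Rows record, but on a plain namedtuple
```

The per-row cost is just one namedtuple construction, which is why this stays close to raw executesql speed.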

Finally, I propose to define x.raw_select(*args) to do: db.executesql(x._select(*args))

which would make this a relatively clean replacement.

Willoughby

Feb 9, 2012, 3:41:45 PM
to web...@googlegroups.com
"In my case, I need to pull all the records (60,000) from the database to compute some aggregation which I cannot compute using sql"

Are you familiar with window functions in SQL?  I've never met an aggregation need that couldn't be met with clever use of windows...
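For readers who haven't used them: a window function computes an aggregate over a partition of rows without collapsing them, so each row keeps its own values alongside the aggregate. A minimal sketch with the stdlib sqlite3 module and an invented table (note that window functions require SQLite 3.25+, released well after this thread):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('create table items (grp text, cost real)')
conn.executemany('insert into items values (?, ?)',
                 [('a', 10), ('a', 20), ('b', 5)])

# each row keeps its own cost, plus its group's total -- no GROUP BY collapse
rows = conn.execute("""
    select grp, cost,
           sum(cost) over (partition by grp) as grp_total
    from items
    order by grp, cost
""").fetchall()
print(rows)
```

Whether a window expression can replace the app-side aggregation depends on the computation, as the follow-up below notes.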


Anthony

Feb 9, 2012, 4:14:36 PM
to web...@googlegroups.com
I've been thinking about something like this as well. Instead of a separate select_raw() method, maybe we can just add a raw=True|False argument to the existing select() method. I like the namedtuple idea as well (I think some adapters already provide that as an option -- e.g., psycopg2).

Anthony

Massimo Di Pierro

Feb 10, 2012, 12:04:59 AM
to web2py-users
open a ticket, this can be done. I like the idea of passing a
processor.

Simon Lukell

Feb 10, 2012, 3:16:12 AM
to web2py-users
+1
Having this option would make it really simple to change between the
full-blown DAL result set and a faster stripped down one (which could
then be adapted with the processor to keep the rest of the code
working.)

nick name

Feb 10, 2012, 6:06:09 PM
to web...@googlegroups.com
There's a tree structure among the records, upon which the aggregation is computed.

Some dbs (e.g. Oracle) have extensions for tree-like structures (CONNECT BY etc.), but those are not standard, and I need to support both sqlite and postgres in this app.
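For reference, the standard SQL:1999 alternative to CONNECT BY is a recursive common table expression (WITH RECURSIVE); PostgreSQL has supported it since 8.4, and SQLite gained it in 3.8.3 (2014, after this thread). A tree-walk sketch with the stdlib sqlite3 module, using a made-up parent/child table:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('create table node (id integer primary key, parent integer)')
# a small tree: 1 -> (2, 3), 2 -> (4)
conn.executemany('insert into node values (?, ?)',
                 [(1, None), (2, 1), (3, 1), (4, 2)])

# collect all descendants of node 1, including the root itself
rows = conn.execute("""
    with recursive subtree(id) as (
        select id from node where id = 1
        union all
        select node.id from node join subtree on node.parent = subtree.id
    )
    select id from subtree order by id
""").fetchall()
print([r[0] for r in rows])  # every node reachable from the root
```

With the databases available in 2012, though, pulling the rows into the app was the portable option.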

nick name

Mar 9, 2012, 12:41:26 AM
to web...@googlegroups.com


On Friday, February 10, 2012 12:04:59 AM UTC-5, Massimo Di Pierro wrote:
open a ticket, this can be done. I like the idea of passing a
processor.

Opened as http://code.google.com/p/web2py/issues/detail?id=701 with discussion and a much improved suggestion of how to handle this.

Anthony

Mar 9, 2012, 1:12:33 AM
to web...@googlegroups.com
Nice. I added a comment.

spiffytech

Mar 9, 2012, 1:48:06 PM
to web...@googlegroups.com
Before we add special features to the DAL, has anyone profiled the DAL with queries and data similar to nick name's to see if there's any low-hanging fruit we can tackle to speed up the normal select()?

nick name, can you please provide your table schema so we can test performance on data sets similar to yours?

Anthony

Mar 9, 2012, 2:12:18 PM
to web...@googlegroups.com
mcm made some initial attempts and indicated about a 25% improvement, with the expectation that some further improvement could be made with more effort: https://groups.google.com/forum/#!topic/web2py-developers/Tp6HXsSf7lk/discussion. However, it sounded like we wouldn't be likely to get enough improvement to substantially help with large selects (i.e., tens of thousands of rows) -- for that case, we probably need something like what nick name is proposing (which will actually be easier to implement, anyway).

Anthony

Vinicius Assef

Mar 9, 2012, 2:18:58 PM
to web...@googlegroups.com
Anthony, I'm not supposed to retrieve tens of thousands of records from a
single query in a web app.

I would address this problem in the app structure, instead of in the DAL.
Anyway, we can execute raw SQL statements via the DAL, right?

--
Vinicius Assef

Anthony

Mar 9, 2012, 2:28:27 PM
to web...@googlegroups.com
Anthony, I'm not supposed to retrieve tens of thousands of records from a
single query in a web app.


Of course, if you've got a busy site with lots of queries, it can add up even for relatively small result sets.
 

Anyway, we can execute raw SQL statements via DAL, right?

Yes, it's possible to create your own manual workarounds -- this would just be an added convenience. At least a basic version of this shouldn't be hard to implement -- just return the raw db response without any further processing.

Anthony

nick name

Mar 9, 2012, 4:02:23 PM
to web...@googlegroups.com

On Friday, March 9, 2012 1:48:06 PM UTC-5, spiffytech wrote:
Before we add special features to the DAL, has anyone profiled the DAL with queries and data similar to nick name's to see if there's any low-hanging fruit we can tackle to speed up the normal select()?

The suggested change is very small (~5 lines), backward compatible, and useful for other stuff as well (e.g. saving memory and speeding up Rows.as_list()  as long as only one table is involved, which is a very common use case).

nick name, can you please provide your table schema so we can test performance on data sets similar to yours?

No time now, but I'll try to set up a test case for future evaluations over the weekend.

nick name

Mar 9, 2012, 4:13:43 PM
to web...@googlegroups.com
On Friday, March 9, 2012 2:18:58 PM UTC-5, viniciusban wrote:
Anthony, I'm not supposed to retrieve tens of thousands of records from a
single query in a web app

It's all a question of what each layer is doing. Let's say you have a database with 100,000 records. You need a summary:

    select sum(cost) from items where (units*unit_cost - :discount >= cost);

where "discount" is an input you received from the web page. You got one line of response, but unless you have an index on (units*unit_cost - cost) AND a smart optimizer, your database will do a full scan.
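The query above can be reproduced with the stdlib sqlite3 module (table contents invented for the example); one row comes back, but the engine still has to evaluate the computed predicate against every row:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('create table items (units integer, unit_cost real, cost real)')
conn.executemany('insert into items values (?, ?, ?)',
                 [(2, 5.0, 8.0),    # 2*5 - 8 = 2
                  (1, 3.0, 4.0),    # 1*3 - 4 = -1
                  (4, 2.0, 3.0)])   # 4*2 - 3 = 5

# :discount is the user-supplied input from the web page
(total,) = conn.execute(
    'select sum(cost) from items where (units*unit_cost - :discount >= cost)',
    {'discount': 0},
).fetchone()
print(total)  # sum of cost over the rows where units*unit_cost - 0 >= cost
```

The full scan happens either way; the question is only whether the rows cross into the app first.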

Now, what if your criterion cannot be expressed in SQL at all? You have to do the full scan in the app. It might not even take long. However, you need all the records in your app to do that.
 

I would address this problem to app structure, instead of to DAL.

For my app, a <10-line change in the call to select (entirely in app code, though relying on DAL internals) produced a 60x speedup. Restructuring the app would have been a much bigger and riskier undertaking. The suggestion breaks that 10-line change into two 5-line changes, one inside the DAL and one outside -- that makes it much more general, and not dependent on DAL internals.
 

Anyway, we can execute raw SQL statements via DAL, right?

Yes, that's how my speedup is implemented right now (Issue 701 has all the details).