Using SQLFORM.grid with large datasets

Johann Spies

unread,

Apr 17, 2013, 6:58:27 AM4/17/13

to web...@googlegroups.com

By large datasets I mean a database of which one of the tables contain more than 42 million records.

Using SQLFORM.grid on such a dataset is very slow because the process wants to count the total number of records.

I am using Postgresql. I can get an estimated size of the table by replacing count(*) with

# SELECT reltuples::integer FROM pg_class WHERE oid = 'isi.ritem'::regclass;
reltuples
-----------
42183232
(1 row)

And I get the result in a fraction of a second. When working on such a scale the exact number of rows (count is also not always accurate on an active database) is not a necessity.

Is there a way I can replace the count(*) in the grid with the query illustrated above?

Regards

Johann

--
Because experiencing your loyal love is better than life itself,
my lips will praise you. (Psalm 63:3)

Niphlod

unread,

Apr 17, 2013, 8:25:05 AM4/17/13

to web...@googlegroups.com

currently not but it is planned to be configurable (at the very minimum, cacheable).

Your method BTW works only if all the table is shown, doesn't take into account a possible condition passed to the grid (such as db.table.user_id == auth.user_id) or keyword queries.

I'll start working on that as soon as possible, shouldn't be hard.....
Talking about "extensibility", what you'd like to pass as a parameter ?
I'm thinking to allow:
- a callable (to which query and keywords are passed) --> you can pass the result of your custom query in this case, just watch out for the shortcomings explained earlier
- an integer (which will trigger the "cache this count" for n seconds)

Would that be enough?

Niphlod

unread,

Apr 17, 2013, 5:28:36 PM4/17/13

to web...@googlegroups.com

check trunk as soon as the Pull Request gets merged


def test3(dbset, request_vars):
    ##you can retrieve the current query with
    ##dbset._select(), i.e. dbset is a "db(query)" object
    
    ##request_vars are the current request.vars
    ###so you can check for keywods etc etc etc
    ###checking for groupby, distinct, etc is up to you!
    
    ##given that is a callable you may cache it "externally"
    ## with a return dbset.count(cache=(cache.ram, 60))
    ## and do all your crazy things
    
    ## method that works on postgresql only for a full-table-count
    result = db.executesql("SELECT reltuples::integer FROM pg_class WHERE oid = 'public.awesome'::regclass;")
    return result[0][0]

def test2():
    #mode1
    cache_count = 127
    #mode2
    cache_count = (cache.ram, 60)
    #mode3
    cache_count = test3
    grid = SQLFORM.grid(db.awesome, cache_count=cache_count)
    return dict(grid=grid)

Can I leave documentation up to someone willing to test it and report back ? :P

PS: <self_confidence mode="on">fixed 8 hours later.....web2py is awesome</self_confidence>

PS2: on a test table with 4M rows, two 'string' fields, postgresql 9.2 takes 1.9sec to do a "standard" count(*).
Things definitely improved (although it remains a heavy operation for MVCC databases) on the count(*) since previous versions.

Johann Spies

unread,

Apr 18, 2013, 5:26:20 AM4/18/13

to web...@googlegroups.com

Thanks.

I will test your suggestions.

I am currently working with Postgresql 9.1 and it takes nearly 5 minutes to count a table > 42 000 000 records.

It will if we can have the option in the grid to not do a count of the result as it is part of the problem in my case. If for instance the query is db.table.id > 0 I would prefer to use the other way to get an approximate of the total records in the table.

If a query would request more records than the total in the table, postgresql would not have a problem.

e.g. in a table with 15 entries I have tested a query with 'limit 15 offset 10' and there was no problem. Only last 5 records were shown.

Regards
Johann

--

---
You received this message because you are subscribed to the Google Groups "web2py-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to web2py+un...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Niphlod

unread,

Apr 18, 2013, 6:15:54 AM4/18/13

to web...@googlegroups.com

the count affects only the grid for "pagination" purposes. There's no other reason to do a count to display a grid.
This means that if you pass cache_count=10, you'll end up not having any "1,2,3" links at the bottom to fetch the next "page" of results.
Putting that aside, you can pass whatever "thingy" you fit appropriate for the job (fixed int, cache tuple, custom code).

Reply all

Reply to author

Forward