Cache performance with computed columns


Paolo valleri

Mar 25, 2013, 4:48:14 AM
to web...@googlegroups.com
Dear all, I want to share some results with you. I implemented the same query in two different ways. Given the following common code:
start = db.record.with_alias('start_point')
end = db.record.with_alias('end_point')
elapsed_time = end.gathered_on.epoch() - start.gathered_on.epoch()

The first query is (the constraint is in the query):
rows = db(query & (elapsed_time < 86400)).select(
    start.ALL,
    end.ALL,
    start.gathered_on.epoch(),
    end.gathered_on.epoch(),
    elapsed_time,
    orderby=start.gathered_on.epoch(),
    left=start.on((start.mac == end.mac) & (start.gathered_on < end.gathered_on)),
    cache=(cache.memcache, 3600),
    cacheable=True,
)
The second one is (the constraint is tested explicitly later):
rows = db(query).select(
    start.ALL,
    end.ALL,
    start.gathered_on.epoch(),
    end.gathered_on.epoch(),
    elapsed_time,
    orderby=start.gathered_on.epoch(),
    left=start.on((start.mac == end.mac) & (start.gathered_on < end.gathered_on)),
    cache=(cache.memcache, 3600),
    cacheable=True,
)
rows2 = [r for r in rows
         if r.end_point.gathered_on - r.start_point.gathered_on < datetime.timedelta(days=1)]

From the timing results, the second query is always faster, with or without cache:
Q_1: 0.273243904114 
Q_1 with cache: 0.0182011127472
Q_2: 0.250607967377
Q_2 with cache: 0.0158171653748

Aside from the fact that the difference is only a few milliseconds and that all the rows satisfy the constraint, what is not clear to me is why the first query takes longer even when the cache is enabled. The question that came to my mind is about computed columns: are they cached?

Paolo

Niphlod

Mar 25, 2013, 5:19:12 AM
to web...@googlegroups.com
I didn't get the point...
Q1 is slower (obviously) because of the epoch() function and the additional filtering.
Q1 with cache shows more or less the same duration as Q2 with cache.
The key point is "even when the cache is enabled the first query is taking longer" ... longer than what?
Q1 with cache is faster than:
- Q1 without cache, and that's ok
- Q2 without cache, and that's ok
but it's slower than:
- Q2 with cache
However, ~3 ms (assuming those timings are in seconds), with all the moving parts of a memcache instance, seems unfair to blame entirely on web2py.

paolo....@gmail.com

Mar 25, 2013, 5:49:16 AM
to web...@googlegroups.com


 Paolo


paolo....@gmail.com

Mar 25, 2013, 6:12:50 AM
to web...@googlegroups.com
Hi Niphlod, thanks for the answer. I am not blaming web2py, I am just sharing these results with you all.
In both cases I have the time constraint; in one case it is in the query (Q1), while in the other it is explicit (Q2). With cache enabled, Q1 no longer checks the constraint, while Q2 still checks the constraint for all rows. Moreover, computing .epoch() should take a well-defined amount of time for queries without cache; with cache, the computed result should be stored.

To sum up, when cache is enabled, with the same number of rows (nearly 300):
- Q2 takes less time than Q1 even though it has to check the time constraint explicitly
- Q1 takes more time than Q2, even though it has to retrieve only one more field than Q2 from the cache, among a dozen fields

Given that, retrieving more data from memcached can be slower than re-computing it on each request.

Niphlod

Mar 25, 2013, 6:54:07 AM
to web...@googlegroups.com
Well, when you plan to use a cache you do it because:
a) the additional roundtrip is negligible
b) using the cache means your database is not hit
I'll explain further...
A query that takes 0.18ms to get "computed" on the database side can very well end up at 18sec as soon as your database gets "hit" by several concurrent requests.
A cached query that takes 0.18ms to store and 0.18ms to retrieve (unpack, decompress, unpickle, choose the term that suits the job) will take, if the cache backend is "proper", 0.36ms the first time and 0.18ms from the 2nd to the nth request.
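This store-once/retrieve-many model can be sketched with a toy pickling cache (PickleCache and slow_query are illustrative names for this sketch, not web2py's API):

```python
import pickle

class PickleCache:
    """Toy stand-in for a memcache-style backend: values are stored
    as pickled bytes and every retrieval deserializes a fresh copy."""
    def __init__(self):
        self.store = {}
        self.misses = 0

    def __call__(self, key, compute):
        if key not in self.store:
            self.misses += 1                      # first call: compute + store
            self.store[key] = pickle.dumps(compute())
        return pickle.loads(self.store[key])      # every call pays deserialization

cache = PickleCache()
slow_query = lambda: [{"id": i} for i in range(300)]  # pretend this hits the db

first = cache("q1", slow_query)   # compute + store + retrieve
second = cache("q1", slow_query)  # retrieve only; slow_query is not re-run
```

The first request pays both legs; from the second request on, only the retrieval leg remains and the database is never touched.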
Even so, some complicated queries (uncomplicated for you, but complicated for the db backend) can be sped up just by requesting fewer fields or by computing in raw Python something that would usually be done on the backend.
This kind of optimization (i.e. relying on Python's datetime operations vs an extract(epoch from datefield) on the db) becomes less and less useful as the result set's cardinality grows (i.e. the more rows you have, the better the chance that the db gets the results faster than a basic query plus processing the results in Python).
Assuming your "python filter" cuts out 5% of the total rows, as the number of rows grows you'll just waste time unpacking that additional 5%, and the time difference will become more pronounced.
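To make the "wasted unpacking" point concrete, here is a small sketch (row counts and the shape of the fake rows are made up to mirror the thread): caching the unfiltered result means every request deserializes rows that the Python filter immediately discards.

```python
import datetime
import pickle

DAY = datetime.timedelta(days=1)
base = datetime.datetime(2013, 3, 25)

# Fake unfiltered result set: 300 rows, a small fraction of which
# violate the one-day constraint (those with elapsed hours >= 24).
rows = [{"start": base, "end": base + datetime.timedelta(hours=h % 26)}
        for h in range(300)]
blob = pickle.dumps(rows)           # what the Q2-style approach caches

unpacked = pickle.loads(blob)       # every request unpickles ALL rows...
kept = [r for r in unpacked if r["end"] - r["start"] < DAY]
discarded = len(unpacked) - len(kept)   # ...including those filtered away
```

The Q1-style approach would have cached only the `kept` rows, so the per-request deserialization cost of the `discarded` ones is the price of filtering in Python.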
On a totally unrelated note, remember that Python is fast, but dbs are fast too, especially when all their features are leveraged (e.g. you could set an index on the "computed" epoch and that query would probably gain a lot of speed).
This is meant to explain why Q2 is faster than Q1.
As for why Q1 with cache isn't faster than Q2 with cache: since the time difference is negligible, it's possible that the difference gets "spent" on calculating the automatic cache key for the query (the first implementation's key has to include the "elapsed time" part). Just to be sure, try timing

def q1():
    # your first implementation, without cache
    ...

def q2():
    # your second implementation, without cache and without the Python check
    ...

with

rows1 = cache('q1', lambda: q1())

... and, separately

rows2 = cache('q2', lambda: q2())
rows2 = [r for r in rows2 if ...]

Paolo valleri

Mar 25, 2013, 7:21:31 AM
to web...@googlegroups.com
Thanks for the very clear explanation.
I ran the following:
t0 = time.time()
rows = cache.memcache('q1', lambda: q1(query))
t1 = time.time()
rows2 = cache.memcache('q2', lambda: q2(query))
rows2 = [r for r in rows2
         if r.end_point.gathered_on - r.start_point.gathered_on < datetime.timedelta(days=1)]
t2 = time.time()
print 'q1:', t1 - t0
print 'q2:', t2 - t1
and I got from three different requests:
q1: 0.0104899406433
q2: 0.0106620788574
q1: 0.011244058609
q2: 0.0156059265137
q1: 0.010370016098
q2: 0.0133299827576
Besides memcache, I tried cache.ram and got:
q1: 9.77516174316e-06
q2: 0.00217413902283
q1: 1.62124633789e-05
q2: 0.00206780433655
q1: 8.10623168945e-06
q2: 0.00120091438293
Given that, in a local environment cache.ram is faster than memcache.

Paolo

Niphlod

Mar 25, 2013, 7:38:58 AM
to web...@googlegroups.com
So, q1 is now always faster than q2 ... it seems the bits you were missing are indeed the ones that auto-set the cache key based on the query string...

cache.ram is always faster than anything else, because it stores a pointer to the result, doesn't have to pickle/unpickle it, and can't (by design) have any network latency.
The problem is that it's a locking cache and single-process, so YMMV.
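The pointer-vs-copy difference can be shown with a minimal sketch (a plain dict stands in for cache.ram, a pickle round-trip stands in for memcache's serialization; both names here are just analogues, not the real backends):

```python
import pickle

ram_store = {}   # cache.ram analogue: keeps a reference to the object itself
net_store = {}   # memcache analogue: keeps a serialized byte string

rows = [{"mac": "aa:bb", "elapsed": 120}]

ram_store["q1"] = rows                    # store: just a pointer, no copying
net_store["q1"] = pickle.dumps(rows)      # store: pickle (plus network, in real life)

from_ram = ram_store["q1"]                # retrieve: the very same object
from_net = pickle.loads(net_store["q1"])  # retrieve: rebuilt from bytes each time
```

A side effect of storing a pointer is that mutating a result obtained from a ram-style cache mutates the cached copy seen by later requests, while a pickling backend hands back an independent copy on every retrieval.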