Help understanding memory use and selects


Ian W. Scott

Sep 2, 2019, 3:44:18 PM
to web2py-users
I'm trying to lower the memory use of an app and have some general questions about how memory is used in DAL selects:

  1. Am I right that the memory used while performing the select isn't released right away, even if the select isn't assigned to a variable? 
  2. I'm aware of iterselect. Am I right that with iterselect the memory used is just enough to store one row of data (instead of the whole selected set)?
  3. Does this mean that, generally, you want to perform as few separate selects as possible, unless you can use iterselect?
  4. Selects seem to occupy a significant amount of memory, even when the result set is only one row. Is the memory use for the select determined by the table size or the result size?
Here's an example of the kind of situation I'm working with. I'm using a list comprehension to loop through a list, performing a select in each loop:

    p_here = [p for p in cpaths if loc_id in db.steps[int(p['steps'][0])].locations]


When I run a memory profiler, this line results in over 1MB of memory being occupied, and that memory isn't released for at least several minutes. The table "steps" has about 3000 rows, so it's not enormous. The result for each select is a single row and doesn't include a huge amount of data (a few strings, ints, etc.). The "cpaths" list might have 50 or so items. So is the memory issue emerging because (a) the memory use for each select is determined by the table size, and (b) memory is being occupied (and not released) separately for each iteration of the loop? Is there a way to rewrite this so that it uses less memory?
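(To pin this kind of thing down without a third-party profiler, the standard library's tracemalloc can report the peak memory a piece of code allocates. This is just an illustrative sketch of the select-versus-iterselect difference, not my actual app code: it contrasts materializing ~3000 small dicts at once, roughly what a full select does, with consuming them one at a time from a generator, roughly what iterselect does.)

```python
import tracemalloc

def measure(fn):
    """Run fn and return (result, peak traced memory in bytes)."""
    tracemalloc.start()
    result = fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, peak

# Materialize ~3000 small dicts at once (roughly what a full select does)...
rows, peak_list = measure(lambda: [dict(id=i, name='x' * 50) for i in range(3000)])

# ...versus consuming them one at a time from a generator (roughly iterselect):
# only one dict is alive at any moment, so the peak stays low.
count, peak_gen = measure(
    lambda: sum(1 for _ in (dict(id=i, name='x' * 50) for i in range(3000))))

print(len(rows), count, peak_list > peak_gen)
```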

Dave S

Sep 3, 2019, 3:29:02 AM
to web2py-users
Naive db abuser responding:

You've already mentioned iterselect(), but I'll put the link here for other readers:

And the general advice oft repeated hereabouts is to not do in Python what you can do in the DB, so it may be worth trying to recast the comprehension into elements of the select.

Also, this may depend on the combination of database and driver. In my case, pg8000 was apparently costing me memory growth with the scheduler. When I finally figured out how to install psycopg2 where web2py could find it, this issue diminished.

Good luck!

/dps

Ian W. Scott

Sep 4, 2019, 1:20:44 PM
to web2py-users
Thanks Dave. That was helpful.

Ian W. Scott

Sep 4, 2019, 1:53:31 PM
to web2py-users
After some experimenting and refactoring, I'll offer some preliminary answers to my own question here. First, I was able to refactor that list comprehension so that it uses negligible memory (too low for memory_profiler to register). The original version looked like this and consumed over 1MiB of memory each time it ran:

p_here = [p for p in cpaths if loc_id in db.steps[int(p['steps'][0])].locations]

The refactored version looks like this:

    pid_here = [p['path2steps']['path_id'] for p
                in db((db.path2steps.step_id == db.steps.id) &
                      (db.steps.locations.contains(loc_id))
                      ).iterselect(db.path2steps.path_id, db.steps.locations)
                if loc_id in p['steps']['locations']]
    p_here = [p for p in cpaths if p['id'] in pid_here]

It looks less elegant, but it's *much* lighter on memory. Let me break down the changes I made. 
  1. I removed the db access (the select) from the "if" condition, which ran on every iteration of the loop. Instead I access the db once and iterate over the result.
  2. I use iterselect instead of select.
  3. In the iterselect I specified just the fields I'm actually going to use, so that useless data doesn't go into memory.
In order to make these changes I had to reorganize the logic significantly. Rather than trying to pinpoint my desired data set in one pass through the list comprehension, I first gather a larger (!) data set and then refine it in a second step that doesn't require db access. The details aren't important here. What surprised me, though, is that it was far more memory efficient to iterate over a single, stripped down iterselect than to make multiple selects. This is true even though the resulting list is larger and has to be pared down in a second stage.
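The same two-stage pattern can be sketched with the standard library's sqlite3, independent of web2py (table layout, the pipe-delimited locations encoding, and the LIKE-based containment test here are all illustrative stand-ins, not the DAL's actual storage format): one query per loop iteration versus a single query whose result is then refined in memory.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE steps (id INTEGER PRIMARY KEY, locations TEXT)')
conn.executemany('INSERT INTO steps (id, locations) VALUES (?, ?)',
                 [(i, '|1|2|' if i % 2 else '|3|') for i in range(1, 11)])

cpaths = [{'id': n, 'steps': [str(n)]} for n in range(1, 6)]
loc_id = 1

# Pattern 1: one SELECT per item in cpaths (what the original comprehension did).
slow = [p for p in cpaths
        if '|%d|' % loc_id in conn.execute(
            'SELECT locations FROM steps WHERE id = ?',
            (int(p['steps'][0]),)).fetchone()[0]]

# Pattern 2: a single SELECT up front, then refine the larger result in memory.
matching = {row[0] for row in conn.execute(
    'SELECT id FROM steps WHERE locations LIKE ?', ('%%|%d|%%' % loc_id,))}
fast = [p for p in cpaths if int(p['steps'][0]) in matching]

print(slow == fast)  # same answer, one round-trip instead of len(cpaths)
```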

The larger takeaway for me is that db access is generally very expensive in terms of memory. It's worth it for me to organize my logic around minimizing db calls, even if the result is less elegant code.

Ian W. Scott

Sep 4, 2019, 2:01:26 PM
to web2py-users
In case someone is looking for answers to my numbered questions, from what I can tell:

  1. Memory used performing a select isn't released for quite a while afterwards, even if your code doesn't use that data again. So multiple selects will quickly make your memory usage grow.
  2. iterselect uses more memory than just enough to store one row of data. There's significant memory overhead, it seems, in just accessing the db. But because it only stores one row at a time, it's much lighter on memory with large data sets.
  3. Yes, if you're worried about memory use you want to perform as few separate selects as possible, even if you're using iterselect.
  4. If you're using iterselect, memory use isn't affected by table size or even (much) by the size of your result set. It's determined instead by (a) the basic db access overhead, and (b) the amount of data you're retrieving in a given row. 
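Points 1 and 4 can be demonstrated outside web2py with sqlite3 and tracemalloc (the table and payload sizes here are made up for illustration): fetching a whole result set at once, as select does, peaks far higher than iterating the cursor row by row, as iterselect does, even though both visit the same rows.

```python
import sqlite3
import tracemalloc

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE steps (id INTEGER PRIMARY KEY, payload TEXT)')
conn.executemany('INSERT INTO steps VALUES (?, ?)',
                 [(i, 'x' * 200) for i in range(3000)])

# select()-style: fetchall() builds every row object before you touch the first.
tracemalloc.start()
rows = conn.execute('SELECT * FROM steps').fetchall()
_, peak_all = tracemalloc.get_traced_memory()
tracemalloc.stop()
n_all = len(rows)
del rows

# iterselect()-style: the cursor yields rows one at a time, so only the
# current row lives in Python memory at any moment.
tracemalloc.start()
n_iter = sum(1 for _ in conn.execute('SELECT * FROM steps'))
_, peak_iter = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(n_all, n_iter, peak_all > peak_iter)
```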
If I'm off on any of these points, I'd be glad to hear more.

Massimo Di Pierro

Sep 15, 2019, 8:30:00 PM
to web2py-users

I would rewrite

p_here = [p for p in cpaths if loc_id in db.steps[int(p['steps'][0])].locations]


as

ids = [int(p['steps'][0]) for p in cpaths]
rpaths = db(db.steps.id.belongs(ids) &
            db.steps.locations.contains(loc_id)).select(db.steps.id).as_dict()
p_here = [p for p in cpaths if int(p['steps'][0]) in rpaths]

(if locations is not searchable in the db, filter in Python instead:)

ids = [int(p['steps'][0]) for p in cpaths]
rpaths = db(db.steps.id.belongs(ids)).select(db.steps.id, db.steps.locations).as_dict()
p_here = [p for p in cpaths if int(p['steps'][0]) in rpaths
          and loc_id in rpaths[int(p['steps'][0])]['locations']]

In both cases you do a single select. In the second case it returns more records because the filtering is done in Python, not in the DB.
