efficiency of table row access by coordinates

Skip to first unread message

Ken Walker

Oct 4, 2018, 5:07:24 PM10/4/18
to pytables-users
Looking for guidance to optimize a process that modifies some elements of a 2D dataset (table).
This is about efficient order of operations, not code details.
Is there an efficiency trade-off of get_where_list(), read_coordinates(), and modify_coordinates() based on the number of coordinate values? (search time vs data access time) Is there a point when the number of coordinate values becomes inefficient?

I have to modify SOME columns/fields data on SOME rows.....so will do multiple slices on both rows and columns.
The extracted numpy dtypes look something like this (simplified, there are more fields):
dtype([('ID', '<i8'), ('X', '<f8'), ('Y', '<f8'), ('Z', '<f8'), ('FREQ', '<f8')])
'X','Y','Z' are the values to be modified (by a dot product w/ another vector)

It's not clear (to me) which is better/faster:
1) Extract a small number of rows 1 ID value at a time (say 100 out of 10e7 rows), modify the data, update the rows, then repeat this 1000 times. (ID can appear multiple times with different FREQ.)
This method is simpler b/c I can easily get the row coordinates w/ get_where_list(), extract the row data w/ read_coordinates(), modify the array data, then update row data w/ modify_coordinates().
(I have coded this process as a working prototype.)

2) Extract a large number of rows (for 1000 IDs, say 100,000 out of 10e7 rows), modify the data and update the rows, but only do it once.
This method follows the same process, but first iterates thru get_where_list() to get all 100,000 coordinates for all 1000 ID values, does read_coordinates(), modifies the data and finally updates w/ modify_coordinates().

So...is there any performance benefit to one method over the other?

Ken Walker

Oct 8, 2018, 2:47:59 PM10/8/18
to pytables-users
After writing that post, I realized I outlined changes required to measure the fastest method.
I wrote a second version that extracts a large number of rows (Method 2 above).
(Frankly, it took longer to create a large prototype dataset for testing than it did to rework the loop logic.)

I then ran both versions and compared timing data. What did I learn?
For both methods, the majority of time is spent on get_where_list().
In a 10 minute procedure, about 9:30 is spent on 3 get_where_list() calls.

Also, it's more efficient to call get_where_list() each time to find all rows than it is to repeatedly call get_where_list().
Reply all
Reply to author
0 new messages