Looking for guidance to optimize a process that modifies some elements of a 2D dataset (table).
This is about efficient order of operations, not code details.
Is there an efficiency trade-off of get_where_list(), read_coordinates(), and modify_coordinates() based on the number of coordinate values? (search time vs data access time) Is there a point when the number of coordinate values becomes inefficient?
Background:
I have to modify SOME columns/fields data on SOME rows.....so will do multiple slices on both rows and columns.
The extracted numpy dtypes look something like this (simplified, there are more fields):
dtype([('ID', '<i8'), ('X', '<f8'), ('Y', '<f8'), ('Z', '<f8'), ('FREQ', '<f8')])
'X','Y','Z' are the values to be modified (by a dot product w/ another vector)
It's not clear (to me) which is better/faster:
1) Extract a small number of rows 1 ID value at a time (say 100 out of 10e7 rows), modify the data, update the rows, then repeat this 1000 times. (ID can appear multiple times with different FREQ.)
This method is simpler b/c I can easily get the row coordinates w/ get_where_list(), extract the row data w/ read_coordinates(), modify the array data, then update row data w/ modify_coordinates().
(I have coded this process as a working prototype.)
2) Extract a large number of rows (for 1000 IDs, say 100,000 out of 10e7 rows), modify the data and update the rows, but only do it once.
This method follows the same process, but first iterates thru get_where_list() to get all 100,000 coordinates for all 1000 ID values, does read_coordinates(), modifies the data and finally updates w/ modify_coordinates().
So...is there any performance benefit to one method over the other?