For the gw1000 dataset I had been using top and iotop to confirm that CPU, memory, and I/O usage were all extremely low, and `ps -efl` showed the process spending its time waiting on an interrupt. Normally I would just conclude it was a slow disk and the process was waiting on I/O completion, *except* this happens only for the smaller gw1000 dataset, not the larger vp2 dataset. It must be something to do with the different nature of the data (perhaps something as simple as different missing fields being calculated).
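For anyone wanting to reproduce the diagnosis, this is roughly how I checked the process state (a sketch assuming Linux procps `ps`; substitute the PID of your own `--calc-missing` process for `$$`):

```shell
# Show scheduler state and wait channel for a running process.
# STAT "S" plus a non-empty WCHAN means the process is sleeping in the
# kernel (waiting on an interrupt/event) rather than burning CPU.
pid=$$   # substitute the PID of the calc-missing process here
ps -o pid,stat,wchan:25,%cpu,%mem -p "$pid"
```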
To give an idea of the magnitude of the difference, timing each run with the shell built-in `time`:
| dataset | command | recs | real (sec) | user (sec) | sys (sec) | idle (%) |
|---------|---------|------|-----------|-----------|----------|----------|
| vp2 | `--rebuild-daily` | 505,336 | 165 | 148 | 2 | 9 |
| vp2 | `--calc-missing` | 505,336 | 571 | 525 | 18 | 5 |
| gw1000 | `--rebuild-daily` | 162,882 | 86 | 81 | 1 | 5 |
| gw1000 | `--calc-missing` | 162,882 | 23,758 | 301 | 13 | 99 |
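The real-vs-user gap is the tell: when real time dwarfs user+sys, the process is sleeping, not computing. A minimal Python sketch of that measurement (the `workload` callable is a stand-in for the actual record loop):

```python
import time

def timed(workload):
    """Run workload() and report wall-clock vs CPU time, like shell `time`."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    workload()
    wall = time.perf_counter() - wall0
    cpu = time.process_time() - cpu0
    # High idle% means the process spent its time waiting, not working.
    idle_pct = 100.0 * (1.0 - cpu / wall) if wall > 0 else 0.0
    return wall, cpu, idle_pct

# A sleeping workload shows idle near 100%, which is the gw1000
# --calc-missing signature in the table above.
wall, cpu, idle = timed(lambda: time.sleep(0.2))
print(f"real={wall:.2f}s cpu={cpu:.2f}s idle={idle:.0f}%")
```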
As it stands right now, to migrate production to the split environment I will have to:
* take a database snapshot and build the equivalent temp gw1000.sdb before migrating
* run `--calc-missing` offline on the temp gw1000.sdb (7 hours!!)
* dump the were-missing values in the temp gw1000.sdb into a file
* when the dumped data is available, stop the production system, split the databases, and load the dumped were-missing values into gw1000.sdb
* run `--calc-missing` on only the interval between the dump and now ← hopefully not long, since gw1000 data is being lost in the meantime!
* start the new production system on the split databases
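For the "dump the were-missing values" step, the plan is essentially a NULL-diff between the pre- and post-calc copies of gw1000.sdb. A sketch of that diff (the table, key, and column names here are illustrative, not necessarily the actual weewx schema):

```python
import sqlite3

def dump_were_missing(before_db, after_db, table, key, columns):
    """Return (key, column, value) rows where a column was NULL before
    --calc-missing but has a value afterwards, so only those values
    need to be reloaded into the production database."""
    con = sqlite3.connect(after_db)
    con.execute("ATTACH DATABASE ? AS before", (before_db,))
    filled = []
    for col in columns:
        rows = con.execute(
            f"SELECT a.{key}, '{col}', a.{col} "
            f"FROM main.{table} a JOIN before.{table} b ON a.{key} = b.{key} "
            f"WHERE b.{col} IS NULL AND a.{col} IS NOT NULL"
        ).fetchall()
        filled.extend(rows)
    con.close()
    return filled
```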
Does anyone have insight into the origin of the wait-for-interrupt that is plaguing my gw1000 dataset migration? Perhaps some wxxtypes in do_calculations() have a realtime delay built in? Perhaps the yield in genBatchRecords() is not context-switching to another thread effectively (an internal Python issue)? Has anyone seen this behaviour elsewhere?
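One way I might test the built-in-delay hypothesis: wrap the record loop so each record logs its own wall vs CPU time; any calculation that sleeps would show up as records with large wall time but near-zero CPU. A sketch with stand-ins (`records` and `process` are placeholders for genBatchRecords() and do_calculations(), not the real APIs):

```python
import time

def find_stalls(records, process, threshold=0.05):
    """Return (record, wall_sec, cpu_sec) for records whose processing
    wall time exceeds threshold -- candidates for a hidden sleep/delay."""
    stalls = []
    for rec in records:
        wall0, cpu0 = time.perf_counter(), time.process_time()
        process(rec)
        wall = time.perf_counter() - wall0
        cpu = time.process_time() - cpu0
        if wall > threshold:
            stalls.append((rec, wall, cpu))
    return stalls

# Example: record 2 "sleeps" and gets flagged; the others do not.
def fake_process(rec):
    if rec == 2:
        time.sleep(0.1)   # stand-in for a suspected realtime delay

print(find_stalls([1, 2, 3], fake_process))
```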