time per epoch increases and contains oscillations during k-fold training

Ali Durmaz

Nov 19, 2020, 9:57:44 AM
to ray-dev

Dear ray community,

 

In our recent training runs of deep learning networks, we applied a grid search for k-fold cross validation (each trial representing a single fold of a k-fold analysis).
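
To make this concrete, here is a stripped-down sketch of what such a grid search over fold indices looks like (function and helper names like train_one_fold are placeholders, not our actual code):

    # Minimal sketch: one Tune trial per fold of a 5-fold cross validation.
    from ray import tune

    def train_one_fold(config):
        fold = config["fold"]  # index of the fold this trial trains on
        # train_loader, val_loader = build_fold_loaders(fold)  # placeholder data setup
        for epoch in range(config["epochs"]):
            # ... one epoch of training and validation on this fold ...
            val_loss = 0.0  # placeholder metric
            # Tune attaches time_this_iter_s to every reported result automatically.
            tune.report(epoch=epoch, val_loss=val_loss)

    analysis = tune.run(
        train_one_fold,
        config={
            "fold": tune.grid_search(list(range(5))),  # k = 5 -> five trials
            "epochs": 100,
        },
    )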

Here we observed that the time per epoch (= time_this_iter_s) oscillates within a single trial and also has an offset from one trial to the next. Later trials (folds) seem to be more time-consuming. This can be observed in the attached images. Furthermore, we see a saturation in the time per epoch for the last fold (k = 5) in cases where we use few CPU cores for data loading and augmentation; see the purple curve in “LOM_num_workers_7_BS_12_smoothing_factor_0.6.PNG”.

 

Therefore, we tried increasing the number of workers in the PyTorch DataLoader from the initial 7 to 24 (see both attached files). This removed the saturation (see “LOM_num_workers_24_BS_12_smoothing_factor_0.6.PNG”) and accelerated training overall. This hints at us running into a CPU bottleneck, which makes sense as we apply heavy online data augmentations during training.
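
For reference, the change boils down to the DataLoader arguments below (the train_dataset object is a placeholder; pin_memory and persistent_workers are optional extras, not necessarily what we use):

    # DataLoader with more worker processes for the CPU-heavy online augmentations.
    from torch.utils.data import DataLoader

    train_loader = DataLoader(
        train_dataset,            # placeholder: dataset that applies the online augmentations
        batch_size=12,
        shuffle=True,
        num_workers=24,           # raised from 7 to 24 to relieve the CPU bottleneck
        pin_memory=True,          # optional: can speed up host-to-GPU transfer
        persistent_workers=True,  # optional (PyTorch >= 1.7): keep workers alive across epochs
    )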

 

Our setup is the following: a Ray grid search for k-fold cross validation, training a U-Net architecture with batch normalization but without dropout. The network is implemented in PyTorch, and we keep track of the metrics in TensorBoard. We run the training on a GPU node of a cluster, but from my point of view that should not affect this. We apply a fixed set of online data augmentations.
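
For completeness, the resources each trial requests would be declared roughly like this (illustrative values, not our exact configuration; as far as I understand, the cpu entry is a scheduling reservation for Tune, not a hard limit on the DataLoader worker processes):

    # Resources and logging wired into the tune.run call from the sketch above.
    from ray import tune

    analysis = tune.run(
        train_one_fold,                              # trainable from the sketch above
        resources_per_trial={"cpu": 24, "gpu": 1},   # CPUs for data loading, one GPU per trial
        config={"fold": tune.grid_search(list(range(5))), "epochs": 100},
        local_dir="./ray_results",                   # Tune also writes TensorBoard event files here
    )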

 

Is there some kind of setting in Ray (or PyTorch) to avoid this increase in time per epoch from one trial to the next? Does anybody have an intuition for what causes the oscillations in the time per epoch? Is there a way to avoid the oscillations and remain at a low time per epoch?

 

Best regards

Ali 

LOM_num_workers_24_BS_12_smoothing_factor_0.6.PNG
LOM_num_workers_7_BS_12_smoothing_factor_0.6.PNG