Hi, sorry, I can't reproduce your issue.
I tried running the VGG example on 3 nodes with 28 cores per node. It took about 17 minutes to run 5 epochs and save the model to local disk correctly.
Can you give more details about your problem? Where do you save your model? That may help.
Thanks,
Cherry
--
You received this message because you are subscribed to the Google Groups "BigDL User Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bigdl-user-group+unsubscribe@googlegroups.com.
To post to this group, send email to bigdl-user-group@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bigdl-user-group/fc4e79f0-5f54-419a-963e-429350555caa%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
From: bigdl-us...@googlegroups.com [mailto:bigdl-us...@googlegroups.com] On Behalf Of prana...@gmail.com
Sent: Wednesday, March 8, 2017 1:27 PM
To: BigDL User Group <bigdl-us...@googlegroups.com>
Subject: [bigdl-user-group] BigDL program not exiting the save model stage
I am running VGG on CIFAR-10 in a 3-node Spark cluster. After running for 5 epochs, the program enters the save model stage. From there, it does not seem to exit even after waiting 30 minutes, which is longer than the training time for the entire dataset. The logs on both slave machines show "removing RDDs". I did not find any errors in the logs or the Spark UI.
Any help regarding this problem is appreciated.
Thanks
Please share the command line and Spark cluster configurations, so that we can investigate more. BTW, are there a lot of Java GCs?
Thanks,
-Jason
On Wed, Mar 8, 2017 at 3:59 PM, Cherry Zhang <cherry...@intel.com> wrote:
Hi, sorry, I can't reproduce your issue.
I tried running the VGG example on 3 nodes with 28 cores per node. It took about 17 minutes to run 5 epochs and save the model to local disk correctly.
Can you give more details about your problem? Where do you save your model? That may help.
Thanks,
Cherry