BigDL program not exiting the save model stage


prana...@gmail.com

Mar 8, 2017, 12:26:30 AM
to BigDL User Group
I am running VGG on cifar-10 in a 3 node Spark cluster. After running it for 5 epochs, the program enters the save model stage. From here, it does not seem to exit even after waiting 30 min, which is more than the training time for the entire dataset. The logs on both slave machines show "removing RDDs". I did not find any errors in the logs or the Spark UI.

Any help regarding this problem is appreciated.

Thanks

Cherry Zhang

Mar 8, 2017, 2:59:48 AM
to BigDL User Group

Hi, sorry, I can't reproduce your issue.

I tried running the VGG example on 3 nodes with 28 cores per node. It took about 17 minutes to run 5 epochs, and the model was saved locally without problems.

Can you give more details about your problem? Where do you save your model? That may help.

Thanks,
Cherry


--
You received this message because you are subscribed to the Google Groups "BigDL User Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bigdl-user-group+unsubscribe@googlegroups.com.
To post to this group, send email to bigdl-user-group@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bigdl-user-group/fc4e79f0-5f54-419a-963e-429350555caa%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jason Dai

Mar 8, 2017, 3:36:42 AM
to Cherry Zhang, BigDL User Group
Please share the command line and Spark cluster configurations, so that we can investigate more. BTW, are there a lot of Java GCs?

Thanks,
-Jason

Pranav Nair

Mar 8, 2017, 4:01:44 AM
to BigDL User Group
First, thanks for your efforts. Some more relevant information follows. I ran the program on a variety of machines. Most recently I used a 3-node cluster with 48 cores and 100 GB of memory per node. It takes 37 minutes to reach the "save model" stage after running 5 epochs. Eventually I kill the process, as it doesn't exit even after waiting for 1 hour. The problem persists with different core and node configurations as well as different numbers of epochs.

Moreover, I am saving the model in the ~/model directory on the master node. I can also see the state.xxx and model.xxx files in that location.

My spark-submit command is as follows:

spark-submit --class com.intel.analytics.bigdl.models.vgg.Train --driver-memory 32g \
  --conf "spark.driver.extraJavaOptions=-Dbigdl.check.singleton=false" \
  --conf "spark.shuffle.reduceLocality.enabled=false" \
  --master spark://192.168.1.202:7077 \
  --executor-cores 48 \
  --executor-memory 100G \
  --total-executor-cores 98 \
  /root/BigDL/dist/lib/bigdl-0.1.0-SNAPSHOT-jar-with-dependencies.jar \
  -f /root/cifar-10-batches-bin/ \
  --maxEpoch 5 \
  --node 2 \
  --core 48 \
  --env spark \
  -b 384 \
  --checkpoint ~/model

Regards

Pranav
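
A side observation on the command above, not a diagnosis: --node 2 and --core 48 imply 2 x 48 = 96 cores, while --total-executor-cores is set to 98. The thread does not establish whether this matters for the hang. A minimal shell sketch for checking the arithmetic before submitting (the variable and function names are illustrative and are not part of BigDL or Spark):

```shell
#!/bin/sh
# Values copied from the spark-submit command in this thread.
NODES=2                   # --node
CORES_PER_NODE=48         # --core and --executor-cores
TOTAL_EXECUTOR_CORES=98   # --total-executor-cores

# Cores implied by the --node and --core arguments.
expected_total() {
  echo $(( $1 * $2 ))
}

EXPECTED=$(expected_total "$NODES" "$CORES_PER_NODE")

if [ "$EXPECTED" -ne "$TOTAL_EXECUTOR_CORES" ]; then
  echo "mismatch: --node x --core = $EXPECTED, --total-executor-cores = $TOTAL_EXECUTOR_CORES"
else
  echo "ok: $EXPECTED cores"
fi
```

With the values from the thread the script reports a mismatch (96 versus 98), which may be worth ruling out when comparing against setups where the example exits normally.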


 


Pranav Nair

Mar 8, 2017, 4:10:39 AM
to BigDL User Group
Hi Jason,

I did not find a large number of GCs. I use a 3-node Spark cluster with mostly default configurations. The only explicit configuration changes are in the spark-submit command, which I posted a few minutes ago.

Thanks

Pranav



Jason Dai

Mar 8, 2017, 7:54:14 AM
to Pranav Nair, BigDL User Group
BTW, which Spark and JVM versions are you using?

Thanks,
-Jason


Jason Dai

Mar 8, 2017, 8:32:43 AM
to Pranav Nair, BigDL User Group
And if you saw state.xxx and model.xxx files in the local directory, the models have already been saved multiple times (by default, BigDL saves a copy of the model/state after every epoch), so the long wait (esp. "removing RDD") is most likely happening after the models are saved and training is completed. I wonder if you can share the BigDL logs (in addition to the Spark and JVM versions), so that we can take a look at what's going on.

Thanks,
-Jason
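
Jason's point can be checked directly on the driver machine. The sketch below assumes a POSIX shell, checkpoint files named model.* and state.* as described in the thread, and a driver launched in client mode so that it runs inside the SparkSubmit JVM. jps and jstack are standard JDK tools; the function names are made up for illustration.

```shell
#!/bin/sh
# Count checkpoint files (model.* and state.*) directly inside a directory.
count_checkpoints() {
  dir=$1
  # find prints one line per matching file; wc -l counts the lines.
  find "$dir" -maxdepth 1 -type f \( -name 'model.*' -o -name 'state.*' \) \
    | wc -l | tr -d ' '
}

# Write a thread dump of the driver JVM to a file, if the JDK tools exist.
dump_driver_threads() {
  if command -v jps >/dev/null 2>&1 && command -v jstack >/dev/null 2>&1; then
    # In client mode the driver runs inside the SparkSubmit process.
    pid=$(jps -l | awk '/SparkSubmit/ {print $1; exit}')
    if [ -n "$pid" ]; then
      jstack "$pid" > "driver-threads-$pid.txt"
      echo "thread dump written to driver-threads-$pid.txt"
    else
      echo "no SparkSubmit process found" >&2
    fi
  else
    echo "jps/jstack not found; skipping thread dump" >&2
  fi
}

# Example usage while the job appears stuck:
#   count_checkpoints "$HOME/model"
#   dump_driver_threads
```

A non-zero count while the process is still running is consistent with the hang occurring after the save, and the thread dump shows what the driver threads are waiting on at that moment.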

dingdi...@gmail.com

Mar 8, 2017, 7:08:25 PM
to BigDL User Group
I tried both Spark 1.6.0 and 2.0.0 with your settings but failed to reproduce the issue. The application exits if the model is saved successfully. Could you share the BigDL logs (in addition to the Spark and JVM versions) so that we can take a look at what's going on?

About the "removing RDDs" log: it is expected, as we have unpersist operations, but it should not appear too often since we call unpersist only occasionally.


dingdi...@gmail.com

Mar 8, 2017, 9:40:36 PM
to BigDL User Group
Since there is no error and there is a model.xxx file in the configured location, we suspect the application hung after saving the model. We have checked in a patch to fix this. Could you try the latest code to see if it works? Thx.



Afsaar Shiekh

May 6, 2022, 7:12:07 PM
to User Group for BigDL
I am facing a problem where removing RDDs at the end takes a long time and sometimes gets stuck. I am also using G1GC.