UR - Pio train fails with ExecutorLostFailure


annevan...@gmail.com

May 12, 2016, 8:00:53 AM
to actionml-user
Hello,

I have used the UR template with success for a sample dataset.

Now I would like to use it for a larger dataset, but PredictionIO fails during 'pio train' with an ExecutorLostFailure. I also tried to increase the available memory with 'pio train -- --driver-memory 5g --executor-memory 5g', but it still doesn't work.
I'm running pio on a machine with 8 GB of memory.

Is it possible that the dataset is too large? The JSON files have the following sizes:
- items.json: 23 MB (~50,000 items)
- buys.json: 415 MB (~2,000,000 items)
- views.json: 510 MB

ubuntu@xx$ pio train -- --driver-memory 4g --executor-memory 4g
[INFO] [Console$] Using existing engine manifest JSON at /home/ubuntu/universal/manifest.json
[INFO] [Runner$] Submission command: /home/ubuntu/PredictionIO/vendors/spark-1.6.1/bin/spark-submit --driver-memory 4g --executor-memory 4g --class io.prediction.workflow.CreateWorkflow --jars file:/home/ubuntu/k2go-universal/target/scala-2.10/template-scala-parallel-universal-recommendation_2.10-0.3.0.jar,file:/home/ubuntu/universal/target/scala-2.10/template-scala-parallel-universal-recommendation-assembly-0.3.0-deps.jar --files file:/home/ubuntu/PredictionIO/conf/log4j.properties,file:/home/ubuntu/PredictionIO/vendors/hbase-1.1.3/conf/hbase-site.xml --driver-class-path /home/ubuntu/PredictionIO/conf:/home/ubuntu/PredictionIO/lib/postgresql-9.4-1204.jdbc41.jar:/home/ubuntu/PredictionIO/lib/mysql-connector-java-5.1.37.jar:/home/ubuntu/PredictionIO/vendors/hbase-1.1.3/conf file:/home/ubuntu/PredictionIO/lib/pio-assembly-0.9.6.jar --engine-id ZtN02B5wpGkrjuYeyQp7l3tRQntXItSc --engine-version fdbef99c2e1092f5a4c62c264a68557ca4430754 --engine-variant file:/home/ubuntu/k2go-universal/engine.json --verbosity 0 --json-extractor Both --env PIO_STORAGE_SOURCES_HBASE_TYPE=hbase,PIO_ENV_LOADED=1,PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta,PIO_FS_BASEDIR=/home/ubuntu/.pio_store,PIO_STORAGE_SOURCES_HBASE_HOME=/home/ubuntu/PredictionIO/vendors/hbase-1.1.3,PIO_HOME=/home/ubuntu/PredictionIO,PIO_FS_ENGINESDIR=/home/ubuntu/.pio_store/engines,PIO_STORAGE_SOURCES_LOCALFS_PATH=/home/ubuntu/.pio_store/models,PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch,PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH,PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=LOCALFS,PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event,PIO_STORAGE_SOURCES_ELASTICSEARCH_HOME=/home/ubuntu/PredictionIO/vendors/elasticsearch-1.7.5,PIO_FS_TMPDIR=/home/ubuntu/.pio_store/tmp,PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model,PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=HBASE,PIO_CONF_DIR=/home/ubuntu/PredictionIO/conf,PIO_STORAGE_SOURCES_LOCALFS_TYPE=localfs
[INFO] [Engine] Extracting datasource params...
[INFO] [WorkflowUtils$] No 'name' is found. Default empty String will be used.
[INFO] [Engine] Datasource params: (,DataSourceParams(app,List(purchase, view),None))
[INFO] [Engine] Extracting preparator params...
[INFO] [Engine] Preparator params: (,Empty)
[INFO] [Engine] Extracting serving params...
[INFO] [Engine] Serving params: (,Empty)
[INFO] [Remoting] Starting remoting
[INFO] [Remoting] Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@xxx]
[INFO] [Engine$] EngineWorkflow.train
[INFO] [Engine$] DataSource: org.template.DataSource@482f7af0
[INFO] [Engine$] Preparator: org.template.Preparator@4c4c7d6c
[INFO] [Engine$] AlgorithmList: List(org.template.URAlgorithm@6f3bd37f)
[INFO] [Engine$] Data sanity check is on.
[Stage 4:>                                                          (0 + 2) / 2][WARN] [HeartbeatReceiver] Removing executor driver with no recent heartbeats: 204917 ms exceeds timeout 120000 ms
[ERROR] [Utils] Uncaught exception in thread driver-heartbeater
[ERROR] [Executor] Exception in task 0.0 in stage 4.0 (TID 8)
[WARN] [HConnectionManager$HConnectionImplementation] This client just lost it's session with ZooKeeper, closing it. It will be recreated next time someone needs it
[ERROR] [TaskSchedulerImpl] Lost executor driver on localhost: Executor heartbeat timed out after 204917 ms
[ERROR] [ActorSystemImpl] Uncaught fatal error from thread [sparkDriverActorSystem-akka.remote.default-remote-dispatcher-32] shutting down ActorSystem [sparkDriverActorSystem]
[ERROR] [ActorSystemImpl] exception on LARS’ timer thread
[ERROR] [ActorSystemImpl] Uncaught fatal error from thread [sparkDriverActorSystem-scheduler-1] shutting down ActorSystem [sparkDriverActorSystem]
[WARN] [transport] [Watcher] Transport response handler not found of id [88]
[WARN] [TaskSetManager] Lost task 0.0 in stage 4.0 (TID 8, localhost): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 204917 ms
[ERROR] [SparkUncaughtExceptionHandler] Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
[ERROR] [TaskSetManager] Task 0 in stage 4.0 failed 1 times; aborting job
[WARN] [TaskSetManager] Lost task 1.0 in stage 4.0 (TID 9, localhost): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 204917 ms
[WARN] [SparkContext] Killing executors is only supported in coarse-grained mode

Pat Ferrel

May 12, 2016, 11:23:53 AM
to annevan...@gmail.com, actionml-user
Can you send a screenshot of the timeline GUI for the job? This will show the line of code (actually the last closure) that was last executed.

This seems to indicate something taking waaaay too long. Perhaps a network issue, but the line of code would give us more clues.

executor driver with no recent heartbeats: 204917 ms exceeds timeout 120000 ms
[ERROR] [Utils] Uncaught exception in thread driver-heartbeater

To save old, dead jobs in a form where you can see them in the GUI you have to set:
     "spark.eventLog.enabled": true,
     "spark.eventLog.dir": "hdfs://your-hdfs-master/some-log-dir-you-create-by-hand"

This can go in the sparkConf section of engine.json or in the Spark defaults file.
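For illustration only (the HDFS path is just the placeholder above, which you create yourself), in engine.json these would sit inside the sparkConf block, something like:

  "sparkConf": {
    "spark.eventLog.enabled": true,
    "spark.eventLog.dir": "hdfs://your-hdfs-master/some-log-dir-you-create-by-hand"
  }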



annevan...@gmail.com

May 13, 2016, 2:39:22 AM
to actionml-user, annevan...@gmail.com, p...@occamsmachete.com
Hi Pat, 

Thank you! You can find the log here: http://pastebin.com/DRBqUFmz


Pat Ferrel

May 13, 2016, 10:51:42 AM
to annevan...@gmail.com, actionml-user
Actually I was asking for the GUI job timeline, which will show the execution time of every task and which one ended the job.

However, the pastebin seems to show a memory limit being reached. We made a change in PredictionIO-0.9.7-aml, which will be released today; you can get it from the master branch now here: https://github.com/actionml/PredictionIO. It can drastically reduce memory needs when using the eventWindow.
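For reference, a minimal sketch of what the eventWindow section in the UR's engine.json looks like (field names as described in the 0.9.7-aml docs; the duration value here is only an example):

  "eventWindow": {
    "duration": "365 days",
    "removeDuplicates": true,
    "compressProperties": true
  }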




annevan...@gmail.com

May 13, 2016, 11:47:37 AM
to actionml-user, annevan...@gmail.com, p...@occamsmachete.com
I am already using the 0.9.7-aml version.

Just did a fresh install in a new Vagrant box and I keep hitting the memory limit.

The Vagrant box has 8 GB of memory. To train the model I'm using the following command (and all kinds of variations with different values ;) )
pio train -- --driver-memory 2G --executor-memory 2G

But this exception keeps appearing:
[WARN] [TaskSetManager] Lost task 0.0 in stage 4.0 (TID 8, localhost): java.lang.OutOfMemoryError: Java heap space
at java.lang.String.<init>(String.java:315)
at com.esotericsoftware.kryo.io.Input.readAscii(Input.java:562)
at com.esotericsoftware.kryo.io.Input.readString(Input.java:436)
at com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:132)
at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115)
at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610)
at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:599)
at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:651)
at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:651)
at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:706)
at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:228)
at org.apache.spark.serializer.DeserializationStream.readKey(Serializer.scala:169)
at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:201)
at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:198)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at org.apache.spark.rdd.SubtractedRDD.integrate$1(SubtractedRDD.scala:122)
at org.apache.spark.rdd.SubtractedRDD.compute(SubtractedRDD.scala:127)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)

Is the dataset just too large?

Pat Ferrel

May 13, 2016, 12:22:32 PM
to annevan...@gmail.com, actionml-user
Yes, the dataset is too large. Increase the driver and executor memory together until you either run out of memory or the job runs. For "real" data, more than 4g is the minimum, and 16g for both driver and executor is not uncommon. Since you have everything running in an 8g VM you are probably over-constrained already, because the other services need some memory too. We recommend 16g for a minimum single-machine setup.
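For example (just an illustration of scaling both values in step), on a 16g machine you might start with:

  pio train -- --driver-memory 6g --executor-memory 6g

and raise both together until the job completes, leaving headroom for HBase, Elasticsearch, and the OS.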
