Re: UR template train error - stage 14


Pat Ferrel

May 31, 2016, 4:10:47 PM
to Ádám Krajcs, predictionio-user, actionml-user
You need to use 0.9.7-aml from here: actionml.com/docs/install

Sorry, but the 0.9.6 version has a bug that uses too much memory with the eventWindow. Adding more memory made it run for now, but once you have more data the effect of the bug will show up again.

BTW, remember to use https://groups.google.com/forum/#!forum/actionml-user for better UR support (or for the AML branch of PIO).


On May 31, 2016, at 2:13 AM, Ádám Krajcs <adam....@gmail.com> wrote:

After we increased the CPU count and the memory, the training ran smoothly.

2016-05-29 22:45 GMT+02:00 Ádám Krajcs <adam....@gmail.com>:
`pio version` says 0.9.6. I tried the UR template master branch, but I got the same error messages.

2016-05-26 22:08 GMT+02:00 Pat Ferrel <p...@occamsmachete.com>:
what does `pio version` say?


On May 26, 2016, at 10:04 AM, adam....@gmail.com wrote:

Hey Pat,

Something strange happens in stage 14, and I don't know why. We use UR template 3.0 with the eventWindow property:
"eventWindow": {
  "duration": "28 days",
  "removeDuplicates": true,
  "compressProperties": true
}

Everything worked until now. Training throws an error in stage 14; I've attached the Spark UI.
Maybe we should simply increase the timeout? Where should we set it?
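For anyone else hitting this: the 120000 ms timeout in the log below is Spark's default `spark.network.timeout` (120s), which also governs executor heartbeat expiry. It can be raised through the Spark configuration passed to the training job. A minimal sketch, assuming an engine.json that accepts a `sparkConf` block (the property names are standard Spark settings; the values here are illustrative, not recommendations):

```json
"sparkConf": {
  "spark.network.timeout": "600s",
  "spark.executor.heartbeatInterval": "60s"
}
```

Note that raising the timeout only hides the symptom; the underlying cause here was the executor running out of memory, so the real fix was more memory (and the 0.9.7-aml upgrade Pat mentions above).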


[Stage 14:===================>                                      (1 + 2) / 3][WARN] [HeartbeatReceiver] Removing executor 0 with no recent heartbeats: 131961 ms exceeds timeout 120000 ms
[ERROR] [TaskSchedulerImpl] Lost executor 0 on sparkmaster.profession.hu: Executor heartbeat timed out after 131961 ms
[WARN] [TaskSetManager] Lost task 0.0 in stage 14.0 (TID 82, sparkmaster.profession.hu): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 131961 ms
[WARN] [TaskSetManager] Lost task 2.0 in stage 14.0 (TID 84, sparkmaster.profession.hu): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 131961 ms
[WARN] [TransportChannelHandler] Exception in connection from sparkmaster.profession.hu/172.31.3.141:63250
[ERROR] [TaskSchedulerImpl] Lost executor 0 on sparkmaster.profession.hu: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[WARN] [TaskSetManager] Lost task 2.1 in stage 14.0 (TID 85, sparkmaster.profession.hu): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[WARN] [TaskSetManager] Lost task 2.2 in stage 14.0 (TID 87, sparkslave01.profession.hu): FetchFailed(BlockManagerId(0, sparkmaster.profession.hu, 61901), shuffleId=5, mapId=1, reduceId=2, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect to sparkmaster.profession.hu/172.31.3.141:61901
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323)
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300)
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
        at org.apache.spark.rdd.SubtractedRDD.integrate$1(SubtractedRDD.scala:122)
        at org.apache.spark.rdd.SubtractedRDD.compute(SubtractedRDD.scala:127)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to sparkmaster.profession.hu/172.31.3.141:61901
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
        at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
        at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        ... 3 more
Caused by: java.net.ConnectException: Connection refused: sparkmaster.profession.hu/172.31.3.141:61901
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
        ... 1 more

)
[WARN] [TaskSetManager] Lost task 0.1 in stage 14.0 (TID 86, sparkslave01.profession.hu): FetchFailed(BlockManagerId(0, sparkmaster.profession.hu, 61901), shuffleId=5, mapId=1, reduceId=0, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect to sparkmaster.profession.hu/172.31.3.141:61901
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323)
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300)
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
        at org.apache.spark.rdd.SubtractedRDD.integrate$1(SubtractedRDD.scala:122)
        at org.apache.spark.rdd.SubtractedRDD.compute(SubtractedRDD.scala:127)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

Regards,
Adam Krajcs
<PredictionIO Training  org.template.RecommendationEngine - Details for Stage 14 (Attempt 0).htm>



