Error on train


Federico Reggiani

Jun 7, 2016, 1:49:27 PM
to actionml-user
[Stage 4:>                                                          (0 + 2) / 2][WARN] [HeartbeatReceiver] Removing executor 0 with no recent heartbeats: 157912 ms exceeds timeout 120000 ms
[ERROR] [TaskSchedulerImpl] Lost executor 0 on some-master: Executor heartbeat timed out after 157912 ms
[WARN] [TaskSetManager] Lost task 0.0 in stage 4.0 (TID 8, some-master): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 157912 ms
[WARN] [TaskSetManager] Lost task 1.0 in stage 4.0 (TID 9, some-master): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 157912 ms
[ERROR] [TaskSchedulerImpl] Lost executor 0 on some-master: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[Stage 4:>                                                          (0 + 2) / 2][WARN] [TaskSetManager] Lost task 1.1 in stage 4.0 (TID 10, some-master): FetchFailed(null, shuffleId=0, mapId=-1, reduceId=1, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
    at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:542)
    at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:538)
    at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
    at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:538)
    at org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:155)
    at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:47)
    at org.apache.spark.rdd.SubtractedRDD.integrate$1(SubtractedRDD.scala:121)
    at org.apache.spark.rdd.SubtractedRDD.compute(SubtractedRDD.scala:127)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

)


I have 4 million events, 100,000 users, and 50,000 items.
Is that error a memory (32 GB) or CPU (8 cores) problem?

Federico Reggiani

Jun 7, 2016, 1:58:12 PM
to actionml-user
engine.json


{
  "comment":" This config file uses default settings for all but the required values see README.md for docs",
  "id": "default",
  "description": "Default settings",
  "engineFactory": "org.template.RecommendationEngine",
  "datasource": {
    "params" : {
      "name": "some-data",
      "appName": "myapp",
      "eventNames": ["purchase", "view", "addtocart"]
      "eventWindow": {
        "duration": "365 days",
        "removeDuplicates": false,
        "compressProperties": false
      }
    }
  },
  "sparkConf": {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking": "false",
    "spark.kryoserializer.buffer": "300m",
    "spark.executor.memory": "11g",
    "spark.driver.memory": "11g",
    "spark.executor.cores": "8",
    "spark.task.cpus": "4",
    "spark.default.parallelism": "16",
    "es.index.auto.create": "true"
  },
  "algorithms": [
    {
      "comment": "simplest setup where all values are default, popularity based backfill, must add eventsNames",
      "name": "ur",
      "params": {
        "appName": "myapp",
        "indexName": "urindex",
        "typeName": "items",
        "comment": "must have data for the first event or the model will not build, other events are optional",
        "eventNames": ["purchase", "view", "addtocart"],
        "backfillField": {
            "name": "popRank"
            "backfillType": "trending",
            "eventNames": ["purchase", "view"],
            "duration": "3 days",
        },
      }
    }
  ]
}

Pat Ferrel

Jun 7, 2016, 2:13:13 PM
to Federico Reggiani, actionml-user
The driver is launched before the sparkConf is read (setting it in engine.json the way you have only works in an obscure mode of Spark using YARN), so pass --driver-memory on the Spark side of the command line.

You have different eventName lists for the datasource and the algorithm, which could cause problems. Also, you have a lot of tuning params that I would leave out until you get a baseline task to complete, before you try to tune it.

remove:
    "spark.driver.memory": "11g",
    "spark.executor.cores": "8",
    "spark.task.cpus": "4",
    "spark.default.parallelism": "16",

You are likely to need to tune only memory to get it running (if all other things are correct), so I usually add the following to the train CLI:

pio train -- --driver-memory 11g --executor-memory 11g --master spark://some-master:7077

If you have trouble with your cluster, try a smaller subset of the data until you get it running, then try the entire dataset to tune the memory needed.
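
A minimal sketch of that subset workflow, assuming the events have been exported as a JSON file of one event per line; the app name, file names, and app id here are hypothetical placeholders, not from this thread:

# Hypothetical: import a slice of the events into a scratch app and train on that.
head -n 500000 all_events.json > sample_events.json
pio app new urtest                 # prints the new app id
pio import --appid <urtest-app-id> --input sample_events.json
pio train -- --driver-memory 11g --executor-memory 11g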


Federico Reggiani

Jun 7, 2016, 3:02:02 PM
to actionml-user, p...@occamsmachete.com
I don't understand this part:

"You have different eventName lists for datasource and algorithm"

I have ["purchase", "view", "addtocart"] in both cases, no?

Is it in backfillField, where I have only two, that you mean?

Pat Ferrel

Jun 7, 2016, 4:25:56 PM
to Federico Reggiani, actionml-user
Sorry, I misread. It’s ok to have 2 in backfill.

Not sure what is causing the timeout; it's either memory or disk for temp storage. Have you gotten things to work on a smaller subset of the data? You'll have to size your cluster once you have gotten the code/config/data to work.

Are you running on a single machine?

Federico Reggiani

Jun 8, 2016, 9:19:35 AM
to actionml-user, p...@occamsmachete.com
Yes, this is a resources problem.
I have increased everything (disk, CPU, memory) and it works now.
Yes, it's a single machine, an Amazon AWS m4.4xlarge (64 GB memory, 16 cores).

Pat Ferrel

Jun 8, 2016, 8:40:56 PM
to Federico Reggiani, actionml-user
Horizontal scaling makes more sense, since the only use of Spark is during training, and that occurs only periodically. We create a Spark cluster using some scripts and Docker, do the training, then tear it down. Paying for an m4.4xlarge 100% of the time is far more expensive. We will release some tools that do this eventually, but you may want to look into it yourself.

When running Spark on one machine, you need 2x the memory, since the machine runs both the driver and an executor. And you have to pay for that memory even when you aren't running Spark. When you scale horizontally, each machine running Spark can be scaled back to 1x whatever memory you need. Also, you can use ephemeral storage, since Spark only needs temp storage and persists data to HBase.
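
A rough sketch of that duty-cycle pattern; the create/destroy scripts here are hypothetical stand-ins for your own Docker provisioning, not ActionML's unreleased tools:

./create-spark-cluster.sh    # hypothetical: provision a Spark master and executors
pio train -- --master spark://some-master:7077 --driver-memory 11g --executor-memory 11g
./destroy-spark-cluster.sh   # hypothetical: tear down; data persists in HBase, so nothing is lost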
 

Federico Reggiani

Jun 9, 2016, 11:44:34 AM
to actionml-user, p...@occamsmachete.com
Yes, you are right.
I'm just running some tests to see the requirements and the first results.
When I launch it live, I will use some Spark slaves, for sure.

Federico Reggiani

Nov 12, 2017, 3:06:24 PM
to actionml-user
Hello again, Pat.
What do you mean by a separate Spark?
One master on a different machine? Because your docs only show a cluster.
Is it possible to have just one Spark master (not on the Master server, a separate one just for Spark), and turn it on only for training?
Or must there be a Spark master on the Master server, with slaves that you turn on only for training?

Pat Ferrel

Nov 12, 2017, 4:47:17 PM
to Federico Reggiani, actionml-user
The ActionML.com docs show clustered and non-clustered setups, and the AWS AMI is an all-in-one setup for developer experimentation (not recommended for production).

“Standalone” Spark has 3 types of machines: 1) the Driver machine that runs `pio train` for the UR, 2) the Spark Master, and 3) Spark Executors. In our ideal deployment we run one machine for the Driver that runs `pio train` and several for Spark Executors, one of which is also the Master (masters have little hard work to do).

We turn on the Driver and all Executors (including the one with the Master), run `pio train`, then turn them off, since the model is saved in Elasticsearch running on its own separate machines. This means you pay for whatever duty cycle you need for Spark, based on how often you train.
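
A minimal sketch of bringing that topology up by hand with the stock Spark standalone scripts; the host name is a placeholder, and the automation tools mentioned below do the same steps:

# On the Executor machine that doubles as the Master:
$SPARK_HOME/sbin/start-master.sh                            # serves spark://spark-master:7077
$SPARK_HOME/sbin/start-slave.sh spark://spark-master:7077   # an Executor on the same box
# On each additional Executor machine:
$SPARK_HOME/sbin/start-slave.sh spark://spark-master:7077
# On the Driver machine:
pio train -- --master spark://spark-master:7077 --driver-memory 11g --executor-memory 11g
# Then shut the Spark machines down; the model is already saved in Elasticsearch.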

We have automation tools for all this based on Docker, Terraform, and Chef. Many are in our repos but are unsupported, since they are not polished enough to be community supported.