Your memory size and number of cores seem far too low for the data size. Spark gets its speed from the fact that all data, including intermediate results, is held in memory somewhere in the cluster while it is being used. The driver memory should be almost the same size as the executor memory (it can be somewhat less).

I have never seen a case where the physical architecture was not influenced heavily by the data size, and yours is large. We have deployments that receive that much data every day and train in 1.5 hours, so the system must be scaled right.

Also, if you are using YARN, is this a shared cluster? We do not recommend that, since other jobs may be allocated resources that affect your execution time. Sharing an analytics cluster with something that is a business requirement (recommendations) can be problematic. We tend to favor spinning up a dedicated Spark cluster with as many nodes as you need (we have tools to do this on AWS) and, once training is done, stopping the nodes so you pay nothing for them when they are not in use. With this setup, training times become nicely predictable and arbitrarily short at minimal cost.
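As a rough sketch only (the memory figures, executor count, and log path below are placeholders, not a sizing recommendation for your data), a submit line that keeps driver memory close to executor memory might look like:

nohup pio train -- --master yarn --driver-memory 24G --executor-memory 32G --executor-cores 4 --num-executors 16 >/data/logs/flexible-ur/train.log 2>&1 &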
On Jul 7, 2017, at 11:51 AM, namita.s...@gmail.com wrote:
I am running into issues training the Universal Recommender model. The UR model reads 3 kinds of events (purchase, view, and atb) from a TSV file and creates 3 RDDs. The following log lines show the RDD sizes:

2017-07-06 17:02:44,797 INFO com.macys.ur.flexible.DataSource [main] - Partitions after zip end of DataSource: 32
2017-07-06 17:02:44,798 INFO com.macys.ur.flexible.DataSource [main] - Received events List(purchase, view, atb)
2017-07-06 17:03:52,773 INFO com.macys.ur.flexible.DataSource [main] - Number of events List(68032180, 7551743, 196947013)

Before dumping into Elasticsearch the job seems stuck and doesn't even finish, taking 23 hours or more. I have been playing with the driver memory and the number of executors, but nothing is helping.

Command used to submit the job:

nohup pio train -- --master yarn --conf="spark.driver.memory=32G" --executor-memory 12G --executor-cores 2 --num-executors 16 >/data/logs/flexible-ur/train2017-07-07-small.log 2>&1 &

Please provide some guidance on the issue.
Update: I set the spark.default.parallelism parameter to 144 and resubmitted with the following command:
nohup pio train -- --master yarn --driver-memory 20G --executor-memory 24G --executor-cores 7 --num-executors 14 >/data/logs/flexible-ur/train2017-07-12-small.log 2>&1 &
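The Spark tuning guide recommends roughly 2-3 tasks per CPU core; with 14 executors x 7 cores = 98 cores, that would put spark.default.parallelism somewhere in the 200-300 range rather than 144. A variant worth trying might be (the exact value and log path are illustrative):

nohup pio train -- --master yarn --driver-memory 20G --executor-memory 24G --executor-cores 7 --num-executors 14 --conf="spark.default.parallelism=288" >/data/logs/flexible-ur/train.log 2>&1 &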
I also noticed that the following steps and their tasks are taking a very long time. Can you please tell me if these steps can somehow be improved?
[screenshot of the slow Spark stages]
Also, the job is creating a huge DAG.
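One thing I am considering, in case the huge DAG itself is the problem: Spark can truncate a long RDD lineage by checkpointing. A minimal sketch of what I mean, in Scala, assuming access to the code that builds the RDDs; eventsRDD, the app name, and the HDFS paths are placeholders, not the actual UR source:

import org.apache.spark.{SparkConf, SparkContext}

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("checkpoint-sketch"))
    // Checkpoint files must live on reliable shared storage (e.g. HDFS) on a cluster.
    sc.setCheckpointDir("hdfs:///tmp/ur-checkpoints")

    // Stand-in for one of the event RDDs built by the DataSource.
    val eventsRDD = sc.textFile("hdfs:///data/events.tsv").map(_.split("\t"))

    eventsRDD.cache()      // keep it in memory so the checkpoint does not recompute the lineage
    eventsRDD.checkpoint() // mark for checkpointing; written out on the next action
    eventsRDD.count()      // force the checkpoint; the DAG above this point is now truncated

    sc.stop()
  }
}

After the count, downstream stages would start from the checkpointed data instead of replaying the full lineage.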

Thanks