PIO UR model issue in training


namita.s...@gmail.com

unread,
Jul 7, 2017, 2:51:09 PM7/7/17
to actionml-user
I am having issues training a Universal Recommender model.

The UR model reads 3 kinds of events (purchase, view, and atb) from a TSV file.
The model creates 3 RDDs. The following are the RDD sizes:

2017-07-06 17:02:44,797 INFO  com.macys.ur.flexible.DataSource [main] - Partitions after zip end of DataSource: 32
2017-07-06 17:02:44,798 INFO  com.macys.ur.flexible.DataSource [main] - Received events List(purchase, view, atb)
2017-07-06 17:03:52,773 INFO  com.macys.ur.flexible.DataSource [main] - Number of events List(68032180, 7551743, 196947013)

Before dumping into Elasticsearch the job seems stuck and doesn't finish, taking 23 hours or more.

I have been playing with the driver memory and the number of executors, but nothing helps.

Command used to submit the job:
nohup pio train -- --master yarn --conf="spark.driver.memory=32G" --executor-memory 12G --executor-cores 2 --num-executors 16 >/data/logs/flexible-ur/train2017-07-07-small.log 2>&1 &

Please provide some guidance on the issue.

Pat Ferrel

unread,
Jul 7, 2017, 3:35:49 PM7/7/17
to namita.s...@gmail.com, actionml-user
Your memory size and cores seem far too low for the data size. Spark gets its speed from the fact that all data, including intermediate results, is kept in memory somewhere in the cluster while it is being used. The driver memory should be almost the same size as the executor memory (it can be a bit less).
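To illustrate (these numbers are hypothetical and have to fit what your machines actually have), a submit with driver and executor memory roughly in balance would look something like:

nohup pio train -- --master yarn --driver-memory 28G --executor-memory 32G --executor-cores 4 --num-executors 16 >/data/logs/flexible-ur/train2017-07-07-small.log 2>&1 &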

I have never seen a case where the physical architecture was not influenced heavily by the data size, and yours is large. We have deployments that get that much data every day and training takes 1.5 hours, so the system must be scaled right.

Also, if you are using YARN, is this a shared cluster? We do not recommend this, since other jobs may be allocated resources that affect your execution time. Sharing an analytics cluster with something that is a business requirement (recommendations) can be problematic. We tend to favor spinning up a dedicated Spark cluster with as many nodes as you need (we have tools to do this on AWS) and, after training is done, stopping the nodes so you don’t pay anything for them when not in use. With this setup training times become quite predictable and arbitrarily short at minimal cost.



namita Sharma

unread,
Jul 7, 2017, 4:48:31 PM7/7/17
to Pat Ferrel, actionml-user
Hi Pat,

Thanks for the quick reply. The cluster is not shared; we use this cluster for model training only.

We are now using 16 GB of executor memory. Can you please suggest the ideal numbers we should be using? Eventually we need to train on more data: this data covers only 20 days, and we need to train on at least 60 days of data.
nohup pio train -- --master yarn --conf="spark.driver.memory=32G" --executor-memory 16G --executor-cores 2 --num-executors 16 >/data/logs/flexible-ur/train2017-07-07-small.log 2>&1 &

Thanks 

Pat Ferrel

unread,
Jul 7, 2017, 5:17:05 PM7/7/17
to namita Sharma, actionml-user
Why not use all cores and all memory? Otherwise you are letting them go to waste.
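As a hypothetical sketch (the hardware figures here are assumptions, not your actual cluster): if each worker node had 64 GB of RAM and 16 cores, using nearly all of it with one executor per node across 8 nodes would look something like:

nohup pio train -- --master yarn --driver-memory 32G --executor-memory 56G --executor-cores 15 --num-executors 8 >/data/logs/flexible-ur/train2017-07-07-small.log 2>&1 &

leaving a little memory and a core on each node for the OS and YARN overhead.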

namita Sharma

unread,
Jul 10, 2017, 1:38:21 PM7/10/17
to Pat Ferrel, actionml-user
Hi Pat,

I am running with the following config now:
nohup pio train -- --master yarn --conf="spark.driver.memory=32G" --executor-memory 20G --executor-cores 8 --num-executors 16 >/data/logs/flexible-ur/train2017-07-07-small.log 2>&1 &


I noticed that some of the tasks use only 1 CPU and are stuck on collect at URModel.scala:98.

For example, the stage "filter at package.scala:126" (submitted 2017/07/06 19:39:00) shows a duration of 11.4 h.

Thanks
Namita

Pat Ferrel

unread,
Jul 10, 2017, 3:54:56 PM7/10/17
to namita Sharma, actionml-user
It’s pretty hard to debug this without access to the system. What we often do using AWS is keep increasing the size of the machines until your data is processed, each time increasing the driver and executor memory. This is why we favor a temporary Spark cluster. You are in big-data land now.  

Why would you give the driver 32g and the executors only 20g? Is the executor machine smaller than the driver? Do the machines only have 32g? That is probably not enough for your data. Also, 2 cores? This seems undersized, so you will see bottlenecks until you get this scaled out.

Collect uses only one machine, since it creates an in-memory collection of all or some part of the data. Set Spark’s max parallelism (spark.default.parallelism) and this may help repartition things and use more threads and cores for the other tasks.
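For example (the value shown is only a placeholder and should be tuned to the cluster), the setting can be passed through pio train with --conf, the same way spark.driver.memory is set above:

nohup pio train -- --master yarn --conf="spark.driver.memory=32G" --conf="spark.default.parallelism=512" --executor-memory 20G --executor-cores 8 --num-executors 16 >/data/logs/flexible-ur/train2017-07-07-small.log 2>&1 &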



namita Sharma

unread,
Jul 13, 2017, 1:52:37 PM7/13/17
to Pat Ferrel, actionml-user
Hi Pat,

Our model now finally trains in 10.5 hours with 3 times more data. We are using the following configuration:

spark.default.parallelism = 144

nohup pio train -- --master yarn --driver-memory 20G --executor-memory 24G --executor-cores 7 --num-executors 14 >/data/logs/flexible-ur/train2017-07-12-small.log 2>&1 &

I also noticed that the following steps and their tasks take a lot of time. Can you please tell me if we can somehow improve them?

Inline image 1

The job is also creating a huge DAG.

Inline image 2


Thanks




Pat Ferrel

unread,
Jul 13, 2017, 1:58:40 PM7/13/17
to namita Sharma, actionml-user
To do this faster you need a larger Spark cluster with more memory. This is why we always deploy temporary Spark to big-data clients so you can train in an hour and not pay for the time Spark isn’t running.

BTW the rule of thumb for max parallelism is 4x the total cores in the cluster.
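For example, the configuration above gives 14 executors × 7 cores = 98 cores, so that rule of thumb suggests roughly 4 × 98 ≈ 400 for spark.default.parallelism rather than the 144 currently set.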

Are you using UR 0.6.0 unmodified?



Pat Ferrel

unread,
Jul 13, 2017, 2:33:20 PM7/13/17
to actionml-user


Begin forwarded message:

From: Pat Ferrel <p...@occamsmachete.com>
Subject: Re: PIO UR model issue in training
Date: July 13, 2017 at 11:32:19 AM PDT
To: namita Sharma <namita.s...@gmail.com>

That is 4g per core, 64g per executor? When you set the driver and executor memory they can’t be more than you physically have. So set the executor memory to 62g to leave a bit for Spark overhead and the Master. Remember that the driver runs on the `pio train` machine and needs almost as much memory as the executors. Ideally that machine has only pio train and the Spark driver running on it.
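A sketch of what that might look like, assuming one executor per 64 GB / 16-core node across the 8 nodes and a driver machine that also has 64 GB (the log path is just a placeholder):

nohup pio train -- --master yarn --driver-memory 60G --executor-memory 62G --executor-cores 16 --num-executors 8 >/data/logs/flexible-ur/train.log 2>&1 &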

You definitely should upgrade, there are bugs fixed in newer versions.


On Jul 13, 2017, at 11:19 AM, namita Sharma <namita.s...@gmail.com> wrote:

Thanks Pat.
The cluster size is 8 nodes with 64 GB and 16 cores each, so I have increased the parallelism to 512.
Currently using UR 0.5.0.
