Yarn Scheduling Issues


Anahita Shayesteh

Dec 22, 2016, 2:20:00 PM
to Big Data Benchmark for BigBench
Hi all, 

I am running TPCx-BB on CDH 5.8 with a scale factor of 1000, and I am able to run on both the Hive and Spark SQL engines (separate installation of Spark 2.1). However, when running on Hive, I see several failed ApplicationMaster attempts, which seem to be due to YARN not providing resources. Usually, if I run the benchmark back to back, my second MapReduce application (dataGen-refresh) fails. If I reboot my machines before starting the benchmark, it runs without errors. I see a similar issue during query execution, when the first ApplicationMaster attempt fails but the second one runs successfully, creating unnecessary delay and latency during the query run.


I also see similar behavior when I run on Spark SQL. I get this warning:

WARN cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

 

But after ten minutes the resources are assigned and the Spark SQL job proceeds to run successfully. Usually I see this only for the first query. I am not running any other application on these machines/this cluster, so the delay in scheduling does not make sense. Has anyone seen similar behavior? When I search for this, there is usually a real resource issue, but in my case resources are available, just not getting allocated properly, which could be a bug in YARN. I am attaching my failed MapReduce job log.


I appreciate your comments.


Thank you,

Anahita

application_0167.log

Anahita Shayesteh

Dec 29, 2016, 5:04:02 PM
to Big Data Benchmark for BigBench

I have asked the question on the Cloudera forum, and it seems the problem is not with the ResourceManager.

Here are some logs that might help in root-causing the issue.

Thank you,
Anahita 
application_1482947375640_0166.yarn.log
application_1482947375640_0166_container_1482947375640_0166_01_000001.yarn.log

Anahita Shayesteh

Dec 29, 2016, 5:24:46 PM
to Big Data Benchmark for BigBench


On Thursday, December 22, 2016 at 11:20:00 AM UTC-8, Anahita Shayesteh wrote:
hadoop-cmf-yarn-NODEMANAGER-msl-dpe-perf45p10.log.out

Anahita Shayesteh

Dec 29, 2016, 8:06:30 PM
to Big Data Benchmark for BigBench


On Thursday, December 22, 2016 at 11:20:00 AM UTC-8, Anahita Shayesteh wrote:
syslog

Michael Frank

Dec 30, 2016, 6:41:30 AM
to Big Data Benchmark for BigBench
Hi Anahita,

Please have a look at the following link. The error you encountered is discussed there.
http://www.datastax.com/dev/blog/common-spark-troubleshooting

But referring to the information posted on Cloudera's forum (answered below):

    Cloudera Enterprise Data Hub Edition Trial 5.9.0 (had the same issue with 5.8.0)
    Java Version 1.7.0.
    I have a four node cluster with one dedicated to server roles (HDFS NameNode, HiveServer, ClouderaManagement Server, Yarn Server, etc..) and the other three to worker roles (DataNode, NodeManager).
    Each node has a dual socket Xeon E5-2670 with total 24 physical cores (48 with hyper-threading) and 256 GB memory and one PCIe SSD with 3TB of storage.
    Yarn configuration options:
    FairScheduler
    Container virtual cpu cores: 44
    Container memory : 160GB
    Map(Reduce) task memory: 4GB
    Map(Reduce) heap size: 3GB  
    Resource manager shows 480GB memory and 132 vcores as expected.


I suspect you misconfigured your cluster. If you really assigned 44 vcores and 160 GB of memory per container, you can only start 4 containers in your cluster!
Reasoning: many applications (including Hive) can only use a single vcore per container. You can't fully utilize your cluster if you don't have a container per usable vcore.
The error you experienced occurs because Hive/Spark try to allocate resources on a per-container basis. But since you allocate this many resources per container, you can't start many containers. So the ApplicationMaster of Hive/Spark waits either for more containers to become available, or for the (huge) amount of resources to become available in each container, which will not happen.
Then, after some timeout, YARN detects the live-lock and starts shutting down containers to allow the remaining containers to proceed.

In short: you need more containers. Since you have plenty of RAM, the vcores are the scarce resource. You want to configure YARN to have a container for each vcore.

I suggest you set:
Reserved resources per node (not used for containers, but for the host OS, Cloudera Manager, NodeManager, HistoryServer, and so on): 4 vcores and 36 GB.
This leaves each node with the following resources for YARN containers: 220 GB and 44 vcores.
I would go for a minimum of 4 containers per node and a maximum of 44 containers (one per vcore) per node.

This will yield the following min/max settings:
Resources available for containers on each node:
yarn.nodemanager.resource.cpu-vcores=44
yarn.nodemanager.resource.memory-mb=220GB

We want 4-44 containers per node:
yarn.scheduler.minimum-allocation-mb=5GB (220 GB / 44 vcores)
yarn.scheduler.maximum-allocation-mb=55GB (220 GB / 4 containers)
yarn.scheduler.minimum-allocation-vcores=1
yarn.scheduler.maximum-allocation-vcores=11 (44 vcores / 4 containers)
yarn.scheduler.increment-allocation-vcores=1
yarn.scheduler.increment-allocation-mb=1GB
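As a quick back-of-the-envelope check (my own sketch, not part of the original post), the min/max allocation values above follow directly from the assumed 220 GB / 44 vcores per node and the 4-44 containers-per-node target:

```python
# Sanity check for the YARN min/max allocation math above.
# Assumed per-node resources left for containers after reserving
# 36 GB / 4 vcores for the OS and cluster services:
node_mem_gb = 220
node_vcores = 44

max_containers_per_node = 44  # one container per vcore
min_containers_per_node = 4

min_alloc_mb = node_mem_gb * 1024 // max_containers_per_node  # smallest container
max_alloc_mb = node_mem_gb * 1024 // min_containers_per_node  # largest container
max_alloc_vcores = node_vcores // min_containers_per_node

print(min_alloc_mb)      # 5120  (= 5 GB,  yarn.scheduler.minimum-allocation-mb)
print(max_alloc_mb)      # 56320 (= 55 GB, yarn.scheduler.maximum-allocation-mb)
print(max_alloc_vcores)  # 11    (yarn.scheduler.maximum-allocation-vcores)
```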

Mapper/Reducer java opts settings (guideline for the various java opts settings: *.java.opts = -Xmx<*.memory.mb * 0.8>m):
e.g. mapreduce.map.memory.mb=5120 gives mapreduce.map.java.opts=-Xmx<5120 * 0.8>m=-Xmx4096m

mapreduce.map.memory.mb=5120
mapreduce.map.java.opts=-Xmx4096m
mapreduce.reduce.memory.mb=5120
mapreduce.reduce.java.opts=-Xmx4096m
yarn.app.mapreduce.am.resource.mb=6144
yarn.app.mapreduce.am.command-opts=-Xmx4096m
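The 0.8 heap-sizing guideline above can be expressed as a tiny helper (illustrative only; the function name is mine, not part of any Hadoop API):

```python
def heap_opts(container_mb: int, fraction: float = 0.8) -> str:
    """Return a -Xmx java opt sized to a fraction of the container memory,
    leaving the rest of the container for JVM/native overhead."""
    return f"-Xmx{int(container_mb * fraction)}m"

# mapreduce.map.memory.mb=5120 -> mapreduce.map.java.opts
print(heap_opts(5120))  # -Xmx4096m
```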
...
(Don't forget the various other memory settings: Hive, Tez, Spark, ... Spark can use more than one vcore per container, so Spark's settings may differ greatly from Hive's.)

This should give you a cluster (@ 3 worker nodes) with a total of:
132 vcores
660 GB RAM
Capable of running:
max 132 containers (@ 1 vcore & 5 GB per container)
min 12 containers (@ 11 vcores & 55 GB per container)
These settings go in yarn-site.xml; see https://www.cloudera.com/documentation/enterprise/5-5-x/topics/cdh_ig_yarn_tuning1.html

Now you have to tune your Hive settings to match your cluster AND the chosen scale factor of BigBench (= data volume).
For 132 containers and sf 10000 (1 TB raw file size), I suggest setting the following benchmark-, scale-factor-, and cluster-specific parameters in /engines/hive/conf/engineSettings.sql:
set mapreduce.input.fileinputformat.split.minsize=4194304;
set mapreduce.input.fileinputformat.split.maxsize=134217728;
set hive.exec.reducers.bytes.per.reducer=67108864;
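For readability, the byte values above correspond to round MiB sizes (a quick check of mine, not part of the original settings):

```python
# Convert the raw byte values from engineSettings.sql to MiB.
MiB = 1024 * 1024
print(4194304 // MiB)    # 4   -> mapreduce.input.fileinputformat.split.minsize = 4 MiB
print(134217728 // MiB)  # 128 -> mapreduce.input.fileinputformat.split.maxsize = 128 MiB
print(67108864 // MiB)   # 64  -> hive.exec.reducers.bytes.per.reducer = 64 MiB
```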

This should result in fairly good utilization of your cluster.

NOTE FOR OTHER READERS: these settings are highly specific to the benchmark, scale factor, and cluster configuration!
What you aim for: most queries spawning enough map/reduce tasks to fully saturate your cluster's containers. You can use q10 as a good yardstick to start tuning. If q10 allocates fewer mappers than there are containers in your cluster, decrease mapreduce.input.fileinputformat.split.minsize; if it starts far too many, increase it. The same goes for hive.exec.reducers.bytes.per.reducer.
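The tuning loop just described could be sketched like this (a hypothetical helper, purely illustrative — the function name and the "far too many" threshold are my own, not from the benchmark kit):

```python
def adjust_split_minsize(minsize: int, mappers: int, containers: int) -> int:
    """One step of the tuning heuristic: smaller splits spawn more mappers,
    larger splits spawn fewer. 'Far too many' is taken as 4x the container
    count here, an arbitrary illustrative threshold."""
    if mappers < containers:
        return minsize // 2   # under-subscribed: shrink splits -> more mappers
    if mappers > 4 * containers:
        return minsize * 2    # far too many tasks: grow splits -> fewer mappers
    return minsize            # cluster saturated: leave as-is

# Starting from the suggested 4 MiB minsize on a 132-container cluster:
print(adjust_split_minsize(4194304, 100, 132))  # 2097152: too few mappers
print(adjust_split_minsize(4194304, 700, 132))  # 8388608: far too many
print(adjust_split_minsize(4194304, 200, 132))  # 4194304: leave as-is
```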
The FAQ in the README.md of https://github.com/intel-hadoop/Big-Data-Benchmark-for-Big-Bench also covers this topic.


Cheers
Michael

Anahita Shayesteh

Dec 30, 2016, 1:30:26 PM
to Big Data Benchmark for BigBench

Hi Michael,

Thank you for your response. I appreciate your detailed explanation and tuning guidance. I wanted to clarify a few configurations in my cluster, specifically the min/max settings. Currently I have:

 

yarn.scheduler.minimum-allocation-mb=4GB

yarn.scheduler.maximum-allocation-mb=160GB

yarn.scheduler.increment-allocation-mb=1GB

yarn.scheduler.minimum-allocation-vcores=1

yarn.scheduler.maximum-allocation-vcores=44

yarn.scheduler.increment-allocation-vcores=1

 

In this case I still have a maximum of 44 containers per node, but my minimum is 1 instead of the 4 you suggested. I see why you would recommend tuning with a minimum of 4 containers, but I wonder whether the issue I am seeing is related to this.

 

What confuses me most is that I am able to fully utilize the cores. When jobs run, I see all 132 vcores in use and my cluster resources fully utilized. However, at times the first ApplicationMaster attempt does not make any progress and is killed, but the second attempt is successful.

Specifically, if I run TPCx-BB multiple times without rebooting my machines, I see that the second MR application (dataGen-refresh) fails most of the time.

 

As for Spark, there is a huge delay before resources get allocated, but only for the first query. Since I am not running any other application on my cluster, I don't think it is a real lack of resources like what others experience (e.g. http://www.datastax.com/dev/blog/common-spark-troubleshooting).


Thank you,

Anahita
