Is there something wrong with how I am trying to start Alluxio in HDInsight?


slybyu...@gmail.com

Nov 14, 2016, 11:51:22 AM
to Alluxio Users
Hello, everyone.

I am trying to get Alluxio working on an MS HDInsight cluster, and I've hit a roadblock.

I succeeded in getting Alluxio working on the headnode, and was trying to make it work on YARN.

What happens is that the job registers and runs in YARN, but the web UI doesn't start (which I assume it should, shouldn't it?).

environment:
HDInsight Linux cluster for Spark - 3.5
Alluxio 1.3.0
Hadoop / HDFS / YARN - 2.7.1



 integration/bin/alluxio-yarn.sh 2 wasbs://machine-name@stoagegroup/tmp localhost

That wasbs address is the same address I use to access files in HDFS, so I don't think anything is wrong there.

and the output I see is:

Using $HADOOP_HOME set to '/usr/hdp/2.5.1.0-56/hadoop/'
Uploading files to HDFS to distribute alluxio runtime
Starting YARN client to launch Alluxio on YARN
Initializing Client
Starting Client
2016-11-14 16:08:46,478 INFO  TimelineClientImpl (TimelineClientImpl.java:serviceInit) - Timeline service address: http://headnodehost:8188/ws/v1/timeline/
2016-11-14 16:08:46,736 INFO  AHSProxy (AHSProxy.java:createAHSProxy) - Connecting to Application History server at headnodehost/10.0.0.13:10200
2016-11-14 16:08:47,010 INFO  MetricsConfig (MetricsConfig.java:loadFirst) - loaded properties from hadoop-metrics2-azure-file-system.properties
2016-11-14 16:08:47,015 INFO  WasbAzureIaasSink (?:init) - Init starting.
2016-11-14 16:08:47,015 INFO  AzureIaasSink (?:init) - Init starting. Initializing MdsLogger.
2016-11-14 16:08:47,017 INFO  AzureIaasSink (?:init) - Init completed.
2016-11-14 16:08:47,017 INFO  WasbAzureIaasSink (?:init) - Init completed.
2016-11-14 16:08:47,022 INFO  MetricsSinkAdapter (MetricsSinkAdapter.java:start) - Sink azurefs2 started
2016-11-14 16:08:47,077 INFO  MetricsSystemImpl (MetricsSystemImpl.java:startTimer) - Scheduled snapshot period at 60 second(s).
2016-11-14 16:08:47,083 INFO  MetricsSystemImpl (MetricsSystemImpl.java:start) - azure-file-system metrics system started
ApplicationMaster command: ./alluxio-yarn-setup.sh application-master -num_workers 2 -master_address localhost -resource_path wasbs address 1><LOG_DIR>/stdout 2><LOG_DIR>/stderr
Submitting application of id application_1476888496011_0177 to ResourceManager
2016-11-14 16:08:48,006 INFO  YarnClientImpl (YarnClientImpl.java:submitApplication) - Submitted application application_1476888496011_0177
Application is in state ACCEPTED. Waiting.
2016-11-14 16:08:58,029 INFO  MetricsSystemImpl (MetricsSystemImpl.java:stop) - Stopping azure-file-system metrics system...
2016-11-14 16:08:58,037 INFO  MetricsSinkAdapter (MetricsSinkAdapter.java:publishMetricsFromQueue) - azurefs2 thread interrupted.
2016-11-14 16:08:58,037 INFO  MetricsSystemImpl (MetricsSystemImpl.java:stop) - azure-file-system metrics system stopped.
2016-11-14 16:08:58,038 INFO  MetricsSystemImpl (MetricsSystemImpl.java:shutdown) - azure-file-system metrics system shutdown complete.

When I look at the YARN application list, it shows the application is running, but the web UI is not up.
I also can't find any logs in the logs folder; it just shows my local attempts from last week (which eventually succeeded) and nothing about the cluster.

Some pointers on where I should look and any suggestions will be appreciated.

Thanks in advance!

Byungjoon Yoon

and...@alluxio.com

Nov 14, 2016, 4:41:19 PM
to Alluxio Users
Hi Byungjoon,

I think the issue is that you're using "localhost" for the master address. This will tell YARN that the master must be launched on the node named "localhost", but YARN probably doesn't have any node with this name. I've created a ticket for improving the error message in this case: https://alluxio.atlassian.net/browse/ALLUXIO-2425

Can you try replacing localhost with the name of a YARN node? Alternatively, if you omit the master hostname, an arbitrary host will be used for the master.
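For example, the relaunch would look something like this (the node name below is a placeholder; substitute a real NodeManager hostname, e.g. one reported by `yarn node -list`):

```shell
# Placeholder hostname: replace wn0-example.internal.cloudapp.net with
# a real NodeManager host from `yarn node -list`. The wasbs path is
# the one used above.
integration/bin/alluxio-yarn.sh 2 wasbs://machine-name@stoagegroup/tmp wn0-example.internal.cloudapp.net

# Or omit the hostname entirely and let YARN pick an arbitrary host:
integration/bin/alluxio-yarn.sh 2 wasbs://machine-name@stoagegroup/tmp
```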

Hope that helps,
Andrew

slybyu...@gmail.com

Nov 15, 2016, 9:43:59 AM
to Alluxio Users
Tried both, but still the same results: the UI doesn't work while the job keeps running. I checked the job logs to see if there was anything unusual, but with no success.

slybyu...@gmail.com

Nov 15, 2016, 11:28:32 AM
to Alluxio Users
Looked into the YARN logs a bit more, and it seems that whether I use the FQDN, the short name, or localhost doesn't matter; it all resolves to the default rack.

Below is the logs from yarn application.

export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/usr/hdp/current/hadoop-client/conf"}
export MAX_APP_ATTEMPTS="5"
export JAVA_HOME=${JAVA_HOME:-"/usr/lib/jvm/java-8-openjdk-amd64"}
export APP_SUBMIT_TIME_ENV="1479139727769"
export NM_HOST="10.0.0.6"
export LOGNAME="sshadmin"
export JVM_PID="$$"
export PWD="/mnt/resource/hadoop/yarn/local/usercache/sshadmin/appcache/application_1476888496011_0177/container_1476888496011_0177_01_000001"
export LOCAL_DIRS="/mnt/resource/hadoop/yarn/local/usercache/sshadmin/appcache/application_1476888496011_0177"
export APPLICATION_WEB_PROXY_BASE="/proxy/application_1476888496011_0177"
export NM_HTTP_PORT="30060"
export LOG_DIRS="/mnt/resource/hadoop/yarn/log/application_1476888496011_0177/container_1476888496011_0177_01_000001"
export NM_AUX_SERVICE_mapreduce_shuffle="AAA0+gAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
"
export NM_PORT="30050"
export USER="sshadmin"
export HADOOP_YARN_HOME=${HADOOP_YARN_HOME:-"/usr/hdp/current/hadoop-yarn-nodemanager"}
export CLASSPATH="$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/*"
export ALLUXIO_HOME="$PWD"
export HADOOP_TOKEN_FILE_LOCATION="/mnt/resource/hadoop/yarn/local/usercache/sshadmin/appcache/application_1476888496011_0177/container_1476888496011_0177_01_000001/container_tokens"
export NM_AUX_SERVICE_spark_shuffle=""
export LOCAL_USER_DIRS="/mnt/resource/hadoop/yarn/local/usercache/sshadmin/"
export HOME="/home/"
export NM_AUX_SERVICE_spark2_shuffle=""
export CONTAINER_ID="container_1476888496011_0177_01_000001"
export MALLOC_ARENA_MAX="4"
ln -sf "/mnt/resource/hadoop/yarn/local/filecache/20/alluxio.jar" "alluxio.jar"
hadoop_shell_errorcode=$?
if [ $hadoop_shell_errorcode -ne 0 ]
then
  exit $hadoop_shell_errorcode
fi
ln -sf "/mnt/resource/hadoop/yarn/local/filecache/21/alluxio.tar.gz" "alluxio.tar.gz"
hadoop_shell_errorcode=$?
if [ $hadoop_shell_errorcode -ne 0 ]
then
  exit $hadoop_shell_errorcode
fi
ln -sf "/mnt/resource/hadoop/yarn/local/filecache/22/alluxio-yarn-setup.sh" "alluxio-yarn-setup.sh"
hadoop_shell_errorcode=$?
if [ $hadoop_shell_errorcode -ne 0 ]
then
  exit $hadoop_shell_errorcode
fi
# Creating copy of launch script
cp "launch_container.sh" "/mnt/resource/hadoop/yarn/log/application_1476888496011_0177/container_1476888496011_0177_01_000001/launch_container.sh"
chmod 640 "/mnt/resource/hadoop/yarn/log/application_1476888496011_0177/container_1476888496011_0177_01_000001/launch_container.sh"
# Determining directory contents
echo "ls -l:" 1>"/mnt/resource/hadoop/yarn/log/application_1476888496011_0177/container_1476888496011_0177_01_000001/directory.info"
ls -l 1>>"/mnt/resource/hadoop/yarn/log/application_1476888496011_0177/container_1476888496011_0177_01_000001/directory.info"
echo "find -L . -maxdepth 5 -ls:" 1>>"/mnt/resource/hadoop/yarn/log/application_1476888496011_0177/container_1476888496011_0177_01_000001/directory.info"
find -L . -maxdepth 5 -ls 1>>"/mnt/resource/hadoop/yarn/log/application_1476888496011_0177/container_1476888496011_0177_01_000001/directory.info"
echo "broken symlinks(find -L . -maxdepth 5 -type l -ls):" 1>>"/mnt/resource/hadoop/yarn/log/application_1476888496011_0177/container_1476888496011_0177_01_000001/directory.info"
find -L . -maxdepth 5 -type l -ls 1>>"/mnt/resource/hadoop/yarn/log/application_1476888496011_0177/container_1476888496011_0177_01_000001/directory.info"
exec /bin/bash -c "./alluxio-yarn-setup.sh application-master -num_workers 2 -master_address localhost -resource_path wasbs address 1>/mnt/resource/hadoop/yarn/log/application_1476888496011_0177/container_1476888496011_0177_01_000001/stdout 2>/mnt/resource/hadoop/yarn/log/application_1476888496011_0177/container_1476888496011_0177_01_000001/stderr "
hadoop_shell_errorcode=$?
if [ $hadoop_shell_errorcode -ne 0 ]
then
  exit $hadoop_shell_errorcode
fi

End of LogType:launch_container.sh

LogType:stderr
Log Upload Time:Tue Nov 15 14:08:22 +0000 2016
LogLength:740
Log Contents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/resource/hadoop/yarn/local/usercache/sshadmin/appcache/application_1476888496011_0177/container_1476888496011_0177_01_000001/assembly/target/alluxio-assemblies-1.3.0-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.5.1.0-56/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/mnt/resource/hadoop/yarn/local/filecache/20/alluxio.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

End of LogType:stderr

LogType:stdout
Log Upload Time:Tue Nov 15 14:08:22 +0000 2016
LogLength:1075
Log Contents:
Launching Application Master
2016-11-14 16:08:53,706 INFO  type (ApplicationMaster.java:main) - Starting Application Master with args [-num_workers, 2, -master_address, localhost, -resource_path, wasbs address]
2016-11-14 16:08:54,291 INFO  ContainerManagementProtocolProxy (ContainerManagementProtocolProxy.java:<init>) - yarn.client.max-cached-nodemanagers-proxies : 0
2016-11-14 16:08:54,842 INFO  TimelineClientImpl (TimelineClientImpl.java:serviceInit) - Timeline service address: http://headnodehost:8188/ws/v1/timeline/
2016-11-14 16:08:55,180 INFO  type (ApplicationMaster.java:start) - ApplicationMaster registered
2016-11-14 16:08:55,183 INFO  type (ContainerAllocator.java:requestContainers) - Requesting 1 master containers
2016-11-14 16:08:55,188 INFO  type (ContainerAllocator.java:requestContainers) - Making 1 resource request(s) for Alluxio masters with cpu 1 memory 1024MB on hosts [localhost]
2016-11-14 16:08:55,229 INFO  RackResolver (RackResolver.java:coreResolve) - Resolved localhost to /default-rack

End of LogType:stdout

The others were pretty similar; everything ended up resolving to the default rack.

Still gives me no clue as to where the service hangs, though.

Andrew Audibert

Nov 15, 2016, 3:20:17 PM
to slybyu...@gmail.com, Alluxio Users

From the logs it looks like the application master is definitely making the request to YARN for 1 cpu and 1024MB on host localhost. Are there any later logs along the lines of "Launching container {} for Alluxio master on {} with master command: {}"? It should print that when YARN satisfies the request. The application master will hang while waiting for YARN to satisfy the request - are you sure you have 1 cpu and 1024MB available to YARN? 
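For concreteness, both can usually be checked from the headnode with the stock YARN CLI (application id taken from the submission output earlier in this thread; the node id is a placeholder):

```shell
# Fetch the aggregated container logs for the application, including
# the application master's stdout/stderr:
yarn logs -applicationId application_1476888496011_0177

# List the NodeManagers, then inspect one to see its memory and
# vCore usage and capacity:
yarn node -list -all
yarn node -status <node-id-from-the-list>
```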



slybyu...@gmail.com

Nov 15, 2016, 3:55:33 PM
to Alluxio Users, slybyu...@gmail.com
The cluster definitely has the resources - 50 GB of RAM and 16 vCores. It is also allocating containers; this is from the running application attempt:

Application Attempt State: RUNNING
Started: Tue Nov 15 20:15:32 +0000 2016
Elapsed: 38mins, 47sec
AM Container: container_1476888496011_0190_01_000001
Node: 10.0.0.7:0
Tracking URL: ApplicationMaster
Diagnostics Info:
Blacklisted Nodes: -

Application Attempt Metrics
Application Attempt Headroom: <memory:25088, vCores:1>

Total Allocated Containers: 1

Each table cell represents the number of NodeLocal/RackLocal/OffSwitch containers satisfied by NodeLocal/RackLocal/OffSwitch resource requests.

                                   Node Local  Rack Local  Off Switch
Num Node Local Containers              0
Num Rack Local Containers              0           0
Num Off Switch Containers              0           0           1

Total Outstanding Resource Requests: <memory:1024, vCores:1>

Priority  ResourceName                                                    Capability               NumContainers  RelaxLocality
100       hn1-sl-spa.st5eymmsxyqebl3eyj5dlvlu3e.bx.internal.cloudapp.net  <memory:1024, vCores:1>  1              true
100       /default-rack                                                   <memory:1024, vCores:1>  1              false
100       *                                                               <memory:1024, vCores:1>  1

Andrew Audibert

Nov 15, 2016, 4:11:27 PM
to slybyu...@gmail.com, Alluxio Users
The Application Attempt Headroom says there is only <memory:25088, vCores:1> available. The Application master itself uses 1 vCore, so it looks like there aren't enough resources available to start the Alluxio Master, which also requires 1 vCore.

slybyu...@gmail.com

Nov 15, 2016, 4:18:37 PM
to Alluxio Users, slybyu...@gmail.com

I think that number refers only to that one container... These are the overall cluster metrics, which I think show there should be no trouble running it.

Cluster Metrics

Apps Submitted: 187    Apps Pending: 0       Apps Running: 3    Apps Completed: 184
Containers Running: 3
Memory Used: 3.50 GB   Memory Total: 50 GB   Memory Reserved: 0 B
VCores Used: 3         VCores Total: 14      VCores Reserved: 0
Active Nodes: 2        Decommissioned: 0     Lost: 0    Unhealthy: 0    Rebooted: 0
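Spelling out the arithmetic from those metrics (numbers copied from the table as best I can read it; this is just a sanity check, nothing Alluxio-specific):

```shell
# Free resources implied by the cluster metrics:
# 50 GB total / 3.50 GB used memory, 14 total / 3 used vCores.
memory_total_mb=$((50 * 1024))
memory_used_mb=3584   # 3.50 GB
vcores_total=14
vcores_used=3

echo "free memory: $((memory_total_mb - memory_used_mb)) MB"   # 47616 MB
echo "free vcores: $((vcores_total - vcores_used))"            # 11
```

So a 1024 MB / 1 vCore request easily fits cluster-wide; the question is whether it fits on the specific host the request is pinned to.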

slybyu...@gmail.com

Nov 16, 2016, 11:26:23 AM
to Alluxio Users, slybyu...@gmail.com
So apparently there are only 2 nodes in the cluster - both worker nodes. After setting up on a worker node and running from there, I found this error:

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.security.ProviderUtils.excludeIncompatibleCredentialProviders(Lorg/apache/hadoop/conf/Configuration;Ljava/lang/Class;)Lorg/apache/hadoop/conf/Configuration;
        at org.apache.hadoop.fs.azure.SimpleKeyProvider.getStorageAccountKey(SimpleKeyProvider.java:45)
        at org.apache.hadoop.fs.azure.ShellDecryptionKeyProvider.getStorageAccountKey(ShellDecryptionKeyProvider.java:40)
        at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.getAccountKeyFromConfiguration(AzureNativeFileSystemStore.java:852)
        at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:932)
        at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.initialize(AzureNativeFileSystemStore.java:450)
        at org.apache.hadoop.fs.azure.NativeAzureFileSystem.initialize(NativeAzureFileSystem.java:1209)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
        at alluxio.yarn.YarnUtils.createLocalResourceOfFile(YarnUtils.java:91)
        at alluxio.yarn.ApplicationMaster.setupLocalResources(ApplicationMaster.java:404)
        at alluxio.yarn.ApplicationMaster.launchMasterContainer(ApplicationMaster.java:363)
        at alluxio.yarn.ApplicationMaster.requestAndLaunchContainers(ApplicationMaster.java:321)
        at alluxio.yarn.ApplicationMaster.runApplicationMaster(ApplicationMaster.java:228)
        at alluxio.yarn.ApplicationMaster.main(ApplicationMaster.java:206)

Because the file system is Azure Blob Storage, it needs a key for access. I would assume Alluxio would need a way to provide keys, much like it does with Swift. Looking at the code, it looks like whatever configuration is given gets passed along without being checked, so it might just work if I add the key... let me try.
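For anyone following along, the hadoop-azure (wasb/wasbs) client resolves the account key from a core-site property of the form `fs.azure.account.key.<account>.blob.core.windows.net`. A sketch of the fragment that would have to be visible to the Alluxio processes (the account name and key below are placeholders, and note that this alone would not fix the NoSuchMethodError above, which looks like a Hadoop version mismatch on the classpath):

```shell
# Sketch only: write the core-site fragment that the wasb client
# expects. "stoagegroup" and the key value are placeholders.
cat > wasb-key-fragment.xml <<'EOF'
<property>
  <name>fs.azure.account.key.stoagegroup.blob.core.windows.net</name>
  <value>YOUR_STORAGE_ACCOUNT_KEY</value>
</property>
EOF

# Show what was written; in practice this would be merged into the
# core-site.xml on the Alluxio master and worker classpaths.
cat wasb-key-fragment.xml
```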

slybyu...@gmail.com

Nov 16, 2016, 1:28:27 PM
to Alluxio Users, slybyu...@gmail.com
Never mind; upon closer look, it will need more changes than that... I see https://alluxio.atlassian.net/browse/ALLUXIO-1203 addresses this. Any updates on that JIRA?

jan.he...@ultratendency.com

Nov 17, 2016, 6:30:02 AM
to Alluxio Users, slybyu...@gmail.com
I'm the one assigned to this issue. There are currently no updates, and I don't know when I'll have time to work on it. If you want, you can pick up the ticket.