Namenode connection conflict


Pitt Fagan

Apr 9, 2015, 10:10:18 AM
to geotrel...@googlegroups.com
Hi guys,

OK, so I have the Mesos-leader and one Mesos-follower up on AWS. Running the example of parallelizing a list of numbers and collecting a filtered list back to the driver (in the README file of the GitHub repo) works fine. When running the attached ingestion script, however, the rasters fail to be ingested into Accumulo. From the command line, if I run something like hadoop fs -ls /accumulo, I get back a directory listing, and I was able to create directories and place files in HDFS manually. I believe the issue is with the value of the CATALOG variable on L22 of the attached file. The current CATALOG value is 'hdfs://namenode.service.geotrellis-spark.internal/accumulo/data/catalog'. This directory exists in HDFS and is empty.
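
For example (the exact listing output is omitted here):

ubuntu@ip-10-0-1-42:~$ hadoop fs -ls /accumulo
(directory listing comes back as expected)
ubuntu@ip-10-0-1-42:~$ hadoop fs -ls /accumulo/data/catalog
(no output; the directory exists but is empty)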

Any assistance would be appreciated.

Thanks,
Pitt

Below is the entire output from the script.

ubuntu@ip-10-0-1-42:~$ python ./scripts/raster_processing.py
Input file size is 2591, 2502
0...10...20...30...40...50...60...70...80...90...100 - done.
Spark assembly has been built with Hive, including Datanucleus jars on classpath
13:55:55 Slf4jLogger: Slf4jLogger started
13:55:55 Remoting: Starting remoting
13:55:55 Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@zookeeper.service.geotrellis-spark.internal:42507]
13:55:55 NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
I0409 13:55:56.597615 16781 sched.cpp:137] Version: 0.21.1
2015-04-09 13:55:56,597:16658(0x7f0658fd4700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
2015-04-09 13:55:56,597:16658(0x7f0658fd4700):ZOO_INFO@log_env@716: Client environment:host.name=ip-10-0-1-42
2015-04-09 13:55:56,597:16658(0x7f0658fd4700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
2015-04-09 13:55:56,597:16658(0x7f0658fd4700):ZOO_INFO@log_env@724: Client environment:os.arch=3.13.0-48-generic
2015-04-09 13:55:56,597:16658(0x7f0658fd4700):ZOO_INFO@log_env@725: Client environment:os.version=#80-Ubuntu SMP Thu Mar 12 11:16:15 UTC 2015
2015-04-09 13:55:56,597:16658(0x7f0658fd4700):ZOO_INFO@log_env@733: Client environment:user.name=ubuntu
2015-04-09 13:55:56,597:16658(0x7f0658fd4700):ZOO_INFO@log_env@741: Client environment:user.home=/home/ubuntu
2015-04-09 13:55:56,598:16658(0x7f0658fd4700):ZOO_INFO@log_env@753: Client environment:user.dir=/home/ubuntu
2015-04-09 13:55:56,598:16658(0x7f0658fd4700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=zookeeper.service.geotrellis-spark.internal:2181 sessionTimeout=10000 watcher=0x7f065ae8e6a0 sessionId=0 sessionPasswd=<null> context=0x7f0654010cb0 flags=0
2015-04-09 13:55:56,600:16658(0x7f0650ff9700):ZOO_INFO@check_events@1703: initiated connection to server [10.0.1.42:2181]
2015-04-09 13:55:56,601:16658(0x7f0650ff9700):ZOO_INFO@check_events@1750: session establishment complete on server [10.0.1.42:2181], sessionId=0x14c7724138c0061, negotiated timeout=10000
I0409 13:55:56.602052 16782 group.cpp:313] Group process (group(1)@10.0.1.42:34543) connected to ZooKeeper
I0409 13:55:56.602093 16782 group.cpp:790] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0409 13:55:56.602123 16782 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
I0409 13:55:56.602905 16782 detector.cpp:138] Detected a new leader: (id='2')
I0409 13:55:56.603024 16782 group.cpp:659] Trying to get '/mesos/info_0000000002' in ZooKeeper
I0409 13:55:56.603466 16786 detector.cpp:433] A new leading master (UPID=mas...@10.0.1.42:5050) is detected
I0409 13:55:56.603582 16782 sched.cpp:234] New master detected at mas...@10.0.1.42:5050
I0409 13:55:56.603708 16782 sched.cpp:242] No credentials provided. Attempting to register without authentication
I0409 13:55:56.604648 16783 sched.cpp:408] Framework registered with 20150401-224001-704708618-5050-1958-0086
Exception in thread "main" java.io.IOException: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Message missing required fields: callId, status; Host Details : local host is: "ip-10-0-1-42/10.0.1.42"; destination host is: "namenode.service.geotrellis-spark.internal":8020;
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:760)
        at org.apache.hadoop.ipc.Client.call(Client.java:1229)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
        at com.sun.proxy.$Proxy15.getFileInfo(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
        at com.sun.proxy.$Proxy15.getFileInfo(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:628)
        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1532)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:803)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1332)
        at geotrellis.spark.io.hadoop.HdfsUtils$.ensurePathExists(HdfsUtils.scala:45)
        at geotrellis.spark.io.hadoop.HadoopCatalog$.apply(HadoopCatalog.scala:229)
        at geotrellis.spark.ingest.HadoopIngestCommand$.main(HadoopIngestCommand.scala:27)
        at geotrellis.spark.ingest.HadoopIngestCommand$.main(HadoopIngestCommand.scala:19)
        at com.quantifind.sumac.ArgMain$class.mainHelper(ArgApp.scala:45)
        at com.quantifind.sumac.ArgMain$class.main(ArgApp.scala:34)
        at geotrellis.spark.ingest.HadoopIngestCommand$.main(HadoopIngestCommand.scala:19)
        at geotrellis.spark.ingest.HadoopIngestCommand.main(HadoopIngestCommand.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: com.google.protobuf.InvalidProtocolBufferException: Message missing required fields: callId, status
        at com.google.protobuf.UninitializedMessageException.asInvalidProtocolBufferException(UninitializedMessageException.java:81)
        at org.apache.hadoop.ipc.protobuf.RpcPayloadHeaderProtos$RpcResponseHeaderProto$Builder.buildParsed(RpcPayloadHeaderProtos.java:1094)
        at org.apache.hadoop.ipc.protobuf.RpcPayloadHeaderProtos$RpcResponseHeaderProto$Builder.access$1300(RpcPayloadHeaderProtos.java:1028)
        at org.apache.hadoop.ipc.protobuf.RpcPayloadHeaderProtos$RpcResponseHeaderProto.parseDelimitedFrom(RpcPayloadHeaderProtos.java:986)
        at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:938)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:836)
[Attachment: raster_processing.py]

Pitt Fagan

Apr 9, 2015, 10:21:20 AM
to geotrel...@googlegroups.com
Forgot to mention that I am running the Spark job with Spark version 1.2.0. Looking at some posts about this error online, it could be a Hadoop version mismatch. Eugene recommended that I export the following variables and recreate the jar file, which I did prior to receiving the error.

export SPARK_HADOOP_VERSION="2.5.0-cdh5.3.3"

export SPARK_VERSION="1.2.0-cdh5.3.3"

I then recreated the uber jar file:  ./sbt "project spark" assembly 



Eugene Cheipesh

Apr 9, 2015, 10:59:45 AM
to geotrel...@googlegroups.com
Hi Pitt,

Looking at your script, I completely failed to catch the first time that you are using your own distro of Spark.

The Ansible roles that are part of the AMI creation install the Cloudera Ubuntu packages for Spark and HDFS on all nodes.

The trick then is to make the GeoTrellis assembly depend on the Cloudera-distributed Maven artifacts by setting the environment variables you mentioned. This ensures that all the transitive dependency versions match when you build the assembly.

Those versions float, so it’s good to check what is actually installed by using: 
apt-cache show spark-core | grep Version
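
Putting that together, the rebuild looks something like this (the version strings below are examples; match them to whatever apt-cache reports on your cluster):

export SPARK_HADOOP_VERSION="2.5.0-cdh5.3.2"
export SPARK_VERSION="1.2.0-cdh5.3.2"
./sbt "project spark" assembly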

Make sure "MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so” is in your environment

Then you should be able to use the spark-submit that is already on the machine, like so:

spark-submit \
--class geotrellis.spark.ingest.HadoopIngestCommand \
--master mesos://zk://zookeeper.service.geotrellis-spark.internal:2181/mesos \
--conf spark.mesos.coarse=true \
--conf spark.executor.memory=20g \
--conf spark.executorEnv.SPARK_LOCAL_DIRS="/media/ephemeral0,/media/ephemeral1" \
--driver-library-path /usr/local/lib spark/target/scala-2.10/geotrellis-spark-assembly-0.10.0-SNAPSHOT.jar \
--input s3n://$AWS_ID:$AWS_KEY@bucket/key \
--layerName myLayer --crs EPSG:3857 --clobber true \
--catalog hdfs://namenode.service.geotrellis-spark.internal/gt-catalog

Adjust the values such as the executor memory and the SPARK_LOCAL_DIRS mount list to match the machine types you're using, e.g. m3.large instances only have one ephemeral mount point.
Note: "--driver-library-path" is given so spark job can find the GDAL JNI bindings which are installed across the cluster.

-- 
Eugene Cheipesh

Pitt Fagan

Apr 9, 2015, 11:30:00 AM
to geotrel...@googlegroups.com
OK, thanks Eugene.

The apt-cache command results in the following output:

ubuntu@ip-10-0-1-42:~/geotrellis$ apt-cache show spark-core | grep Version
Version: 1.2.0+cdh5.3.2+369-1.cdh5.3.2.p0.17~trusty-cdh5.3.2

This matches the value of the SPARK_VERSION variable I specified.

I already had $MESOS_NATIVE_LIBRARY set before starting this process, but when I try to list out the variable I get this odd result:

ubuntu@ip-10-0-1-42:~/geotrellis$ export MESOS_NATIVE_LIBRARY="/usr/local/lib/libmesos.so"
ubuntu@ip-10-0-1-42:~/geotrellis$ $MESOS_NATIVE_LIBRARY
Segmentation fault (core dumped)

I do not need to roll my own Spark. I would be happy using the Spark distribution that comes with GeoTrellis if it would smooth things out.

At any rate, I will get to work making these changes that you specify and I'll let you know how it goes!

Pitt




Hector Castro

Apr 9, 2015, 11:48:59 AM
to geotrel...@googlegroups.com
On Thu, Apr 9, 2015 at 11:30 AM, Pitt Fagan <pitt...@gmail.com> wrote:
> OK thanks Eugene.
>
> The apt-cache command results in the following output:
>
> ubuntu@ip-10-0-1-42:~/geotrellis$ apt-cache show spark-core | grep Version
> Version: 1.2.0+cdh5.3.2+369-1.cdh5.3.2.p0.17~trusty-cdh5.3.2
>
> This matches the value of the SPARK_VERSION variable I specified.

Catching up on this thread, it looks like you may have specified
SPARK_VERSION with a trailing `cdh5.3.3` vs. `cdh5.3.2`. I'm not 100%
sure on the difference that makes in this context, but one thing I
know makes a difference is that you were executing a prebuilt Spark
distribution for cdh4 from your Python script.

When Eugene said "should be able to use spark-submit that is already
on the machine", that means the Spark version installed via APT
automatically places the `spark-submit` and `spark-shell` binaries in
a location that is part of the default PATH. Your subprocess for
`spark-submit` in your Python script should end up being something
like:

spark-submit ....

Versus the current:

/home/ubuntu/spark-1.2.0-bin-cdh4/bin/spark-submit ...
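
In Python terms, a minimal sketch of that change (the rest of your
script's flags and arguments are assumed unchanged):

import subprocess

# Call the APT-installed spark-submit resolved via the default PATH,
# not the bundled cdh4 build under /home/ubuntu/spark-1.2.0-bin-cdh4/bin/.
subprocess.check_call([
    "spark-submit",
    "--class", "geotrellis.spark.ingest.HadoopIngestCommand",
    # ... remaining flags, assembly jar, and ingest arguments as before ...
])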

> I already had the $MESOS_NATIVE_LIBRARY set before starting this process,
> but oddly enough when I try to list out the variable I get this result,
> which is odd.
>
> ubuntu@ip-10-0-1-42:~/geotrellis$ export
> MESOS_NATIVE_LIBRARY="/usr/local/lib/libmesos.so"
> ubuntu@ip-10-0-1-42:~/geotrellis$ $MESOS_NATIVE_LIBRARY
> Segmentation fault (core dumped)

This one is going to require that you prefix the environment variable
name with `echo`.
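
For example:

ubuntu@ip-10-0-1-42:~$ echo $MESOS_NATIVE_LIBRARY
/usr/local/lib/libmesos.so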

Pitt Fagan

Apr 9, 2015, 12:07:29 PM
to geotrel...@googlegroups.com
Hi Hector,

Yes, after I first posted the message I caught the 5.3.3 vs. 5.3.2, so I had already changed the SPARK_VERSION variable to reflect this.

Also, thanks for the tip about using echo! The SPARK_VERSION variable listed out without any need for it, but the MESOS_NATIVE_LIBRARY variable needs it.

Anyway, I'm almost done making Eugene's recommended changes, so I will hopefully post something very soon.

Pitt

Pitt Fagan

Apr 9, 2015, 12:32:16 PM
to geotrel...@googlegroups.com
Hi guys,

OK, so the Mesos-leader is an r3.large and the one Mesos-follower is an m3.large. For the --input argument below, there is one GeoTIFF file in this directory.

Here is the command I am running from the command line (I put the backslashes here for readability):

ubuntu@ip-10-0-1-42:~/geotrellis$ spark-submit \
--class geotrellis.spark.ingest.HadoopIngestCommand \
--master mesos://zk://zookeeper.service.geotrellis-spark.internal:2181/mesos \
--conf spark.mesos.coarse=true \
--conf spark.executor.memory=5g \
--conf spark.executorEnv.SPARK_LOCAL_DIRS="/media/ephemeral0" \
--driver-library-path /usr/local/lib /home/ubuntu/geotrellis/spark/target/scala-2.10/geotrellis-spark-assembly-0.10.0-SNAPSHOT.jar \
--crs EPSG:3857 \
--pyramid false \
--clobber true \
--input file:/home/ubuntu/datasets/s3/backups/2015/04/09/16/tiles/ls8r/LC80340322013292LGN00/1295534/calibration/ \
--catalog hdfs://namenode.service.geotrellis-spark.internal:8020/accumulo/data/catalog \
--layerName s7


The good news is that I am past the previous issue, so thanks for that! Here is the current output.

15/04/09 16:29:55 INFO spark.SecurityManager: Changing view acls to: ubuntu
15/04/09 16:29:55 INFO spark.SecurityManager: Changing modify acls to: ubuntu
15/04/09 16:29:55 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu); users with modify permissions: Set(ubuntu)
15/04/09 16:29:56 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/04/09 16:29:56 INFO Remoting: Starting remoting
15/04/09 16:29:56 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@zookeeper.service.geotrellis-spark.internal:38369]
15/04/09 16:29:56 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@zookeeper.service.geotrellis-spark.internal:38369]
15/04/09 16:29:56 INFO util.Utils: Successfully started service 'sparkDriver' on port 38369.
15/04/09 16:29:56 INFO spark.SparkEnv: Registering MapOutputTracker
15/04/09 16:29:56 INFO spark.SparkEnv: Registering BlockManagerMaster
15/04/09 16:29:56 INFO storage.DiskBlockManager: Created local directory at /tmp/spark-local-20150409162956-0c09
15/04/09 16:29:56 INFO storage.MemoryStore: MemoryStore started with capacity 265.4 MB
15/04/09 16:29:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/04/09 16:29:57 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-f23b0dde-e84c-4cb9-a692-0e12c7e1ccda
15/04/09 16:29:57 INFO spark.HttpServer: Starting HTTP Server
15/04/09 16:29:57 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/04/09 16:29:57 INFO server.AbstractConnector: Started SocketC...@0.0.0.0:40998
15/04/09 16:29:57 INFO util.Utils: Successfully started service 'HTTP file server' on port 40998.
15/04/09 16:29:57 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/04/09 16:29:57 INFO server.AbstractConnector: Started SelectChann...@0.0.0.0:4040
15/04/09 16:29:57 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
15/04/09 16:29:57 INFO ui.SparkUI: Started SparkUI at http://zookeeper.service.geotrellis-spark.internal:4040
15/04/09 16:29:57 INFO spark.SparkContext: Added JAR file:/home/ubuntu/geotrellis/spark/target/scala-2.10/geotrellis-spark-assembly-0.10.0-SNAPSHOT.jar at http://10.0.1.42:40998/jars/geotrellis-spark-assembly-0.10.0-SNAPSHOT.jar with timestamp 1428596997663
I0409 16:29:57.810159  2127 sched.cpp:137] Version: 0.21.1
2015-04-09 16:29:57,814:1890(0x7f52b0cf4700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
2015-04-09 16:29:57,814:1890(0x7f52b0cf4700):ZOO_INFO@log_env@716: Client environment:host.name=ip-10-0-1-42
2015-04-09 16:29:57,814:1890(0x7f52b0cf4700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
2015-04-09 16:29:57,814:1890(0x7f52b0cf4700):ZOO_INFO@log_env@724: Client environment:os.arch=3.13.0-48-generic
2015-04-09 16:29:57,814:1890(0x7f52b0cf4700):ZOO_INFO@log_env@725: Client environment:os.version=#80-Ubuntu SMP Thu Mar 12 11:16:15 UTC 2015
2015-04-09 16:29:57,814:1890(0x7f52b0cf4700):ZOO_INFO@log_env@733: Client environment:user.name=ubuntu
2015-04-09 16:29:57,814:1890(0x7f52b0cf4700):ZOO_INFO@log_env@741: Client environment:user.home=/home/ubuntu
2015-04-09 16:29:57,814:1890(0x7f52b0cf4700):ZOO_INFO@log_env@753: Client environment:user.dir=/home/ubuntu/geotrellis
2015-04-09 16:29:57,814:1890(0x7f52b0cf4700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=zookeeper.service.geotrellis-spark.internal:2181 sessionTimeout=10000 watcher=0x7f52b2c8c6a0 sessionId=0 sessionPasswd=<null> context=0x7f52f9519ab0 flags=0
2015-04-09 16:29:57,817:1890(0x7f52ac4eb700):ZOO_INFO@check_events@1703: initiated connection to server [10.0.1.42:2181]
2015-04-09 16:29:57,819:1890(0x7f52ac4eb700):ZOO_INFO@check_events@1750: session establishment complete on server [10.0.1.42:2181], sessionId=0x14c7724138c006f, negotiated timeout=10000
I0409 16:29:57.819597  2131 group.cpp:313] Group process (group(1)@10.0.1.42:34064) connected to ZooKeeper
I0409 16:29:57.819675  2131 group.cpp:790] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0409 16:29:57.819743  2131 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
I0409 16:29:57.820658  2128 detector.cpp:138] Detected a new leader: (id='2')
I0409 16:29:57.820783  2128 group.cpp:659] Trying to get '/mesos/info_0000000002' in ZooKeeper
I0409 16:29:57.829401  2128 detector.cpp:433] A new leading master (UPID=mas...@10.0.1.42:5050) is detected
I0409 16:29:57.829483  2128 sched.cpp:234] New master detected at mas...@10.0.1.42:5050
I0409 16:29:57.829601  2128 sched.cpp:242] No credentials provided. Attempting to register without authentication
I0409 16:29:57.830821  2132 sched.cpp:408] Framework registered with 20150401-224001-704708618-5050-1958-0100
15/04/09 16:29:57 INFO mesos.CoarseMesosSchedulerBackend: Registered as framework ID 20150401-224001-704708618-5050-1958-0100
15/04/09 16:29:58 INFO netty.NettyBlockTransferService: Server created on 56196
15/04/09 16:29:58 INFO storage.BlockManagerMaster: Trying to register BlockManager
15/04/09 16:29:58 INFO storage.BlockManagerMasterActor: Registering block manager zookeeper.service.geotrellis-spark.internal:56196 with 265.4 MB RAM, BlockManagerId(<driver>, zookeeper.service.geotrellis-spark.internal, 56196)
15/04/09 16:29:58 INFO storage.BlockManagerMaster: Registered BlockManager
15/04/09 16:29:58 INFO mesos.CoarseMesosSchedulerBackend: Mesos task 0 is now TASK_RUNNING
15/04/09 16:29:58 INFO mesos.CoarseMesosSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
Exception in thread "main" java.lang.IllegalArgumentException: Can not create a Path from an empty string
        at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127)
        at org.apache.hadoop.fs.Path.<init>(Path.java:135)
        at org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:244)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:467)
        at geotrellis.spark.io.hadoop.HdfsUtils$.putFilesInConf(HdfsUtils.scala:58)
        at geotrellis.spark.io.hadoop.package$HadoopConfigurationWrapper.withInputDirectory(package.scala:62)
        at geotrellis.spark.io.hadoop.HadoopSparkContextMethods$class.hadoopGeoTiffRDD(HadoopSparkContextMethods.scala:29)
        at geotrellis.spark.io.hadoop.package$HadoopSparkContextMethodsWrapper.hadoopGeoTiffRDD(package.scala:50)
        at geotrellis.spark.ingest.HadoopIngestCommand$.main(HadoopIngestCommand.scala:28)
        at geotrellis.spark.ingest.HadoopIngestCommand$.main(HadoopIngestCommand.scala:19)
        at com.quantifind.sumac.ArgMain$class.mainHelper(ArgApp.scala:45)
        at com.quantifind.sumac.ArgMain$class.main(ArgApp.scala:34)
        at geotrellis.spark.ingest.HadoopIngestCommand$.main(HadoopIngestCommand.scala:19)
        at geotrellis.spark.ingest.HadoopIngestCommand.main(HadoopIngestCommand.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


Rob Emanuele

Apr 9, 2015, 1:12:23 PM
to geotrel...@googlegroups.com
Hey Pitt,

Are you trying to do an Accumulo ingest? The deploy should have set up Accumulo, and I'd recommend using it. It seems like you're trying to write to Accumulo's HDFS directory with the Hadoop ingest ("--catalog hdfs://namenode.service.geotrellis-spark.internal:8020/accumulo/data/catalog"). Instead, you should use the AccumuloIngestCommand.

Here is a gist of a script that should help you do that:


Want to try that out?

Thanks,
Rob




--
Rob Emanuele, Tech Lead, GeoTrellis

Azavea |  340 N 12th St, Ste 402, Philadelphia, PA
rema...@azavea.com  | T 215.701.7692  | F 215.925.2663
Web azavea.com  |  Blog azavea.com/blogs  | Twitter @azavea

Pitt Fagan

Apr 9, 2015, 1:42:57 PM
to geotrel...@googlegroups.com
Hi Rob,

Yes, I am trying to ingest the raster into Accumulo, but what you write below is probably the issue. When I was working on this locally, I was ingesting the rasters into HDFS. I remember you saying that Accumulo was preferable, and the AWS machines are my first trial with Accumulo. Let me give your gist a try and see what's what.

Thanks,
Pitt

Pitt Fagan

Apr 9, 2015, 2:44:50 PM
to geotrel...@googlegroups.com
Howdy Rob,

OK, here is the command I am running, based on your gist. I did not know what to put in for the user and password values, so I left your default values in.

spark-submit \
--class geotrellis.spark.ingest.AccumuloIngestCommand \
/home/ubuntu/geotrellis/spark/target/scala-2.10/geotrellis-spark-assembly-0.10.0-SNAPSHOT.jar \
--instance geotrellis-accumulo-cluster \
--user root --password secret \
--zookeeper zookeeper.service.geotrellis-spark.internal \
--crs EPSG:3857 --pyramid false --clobber true \
--input file:/home/ubuntu/datasets/s3/backups/2015/04/09/16/tiles/ls8r/LC80340322013292LGN00/1295534/calibration \
--layerName s7 --table 1295534

Here is part of the output, including the error. What precedes this is a huge list of jar files, which I did not include.

15/04/09 18:36:18 INFO zookeeper.ZooKeeper: Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
15/04/09 18:36:18 INFO zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp
15/04/09 18:36:18 INFO zookeeper.ZooKeeper: Client environment:java.compiler=<NA>
15/04/09 18:36:18 INFO zookeeper.ZooKeeper: Client environment:os.name=Linux
15/04/09 18:36:18 INFO zookeeper.ZooKeeper: Client environment:os.arch=amd64
15/04/09 18:36:18 INFO zookeeper.ZooKeeper: Client environment:os.version=3.13.0-48-generic
15/04/09 18:36:18 INFO zookeeper.ZooKeeper: Client environment:user.name=ubuntu
15/04/09 18:36:18 INFO zookeeper.ZooKeeper: Client environment:user.home=/home/ubuntu
15/04/09 18:36:18 INFO zookeeper.ZooKeeper: Client environment:user.dir=/home/ubuntu
15/04/09 18:36:18 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=zookeeper.service.geotrellis-spark.internal sessionTimeout=30000 watcher=org.apache.accumulo.fate.zookeeper.ZooSession$ZooWatcher@462cc1e9
15/04/09 18:36:18 INFO zookeeper.ClientCnxn: Opening socket connection to server zookeeper.service.geotrellis-spark.internal/10.0.1.42:2181. Will not attempt to authenticate using SASL (unknown error)
15/04/09 18:36:19 INFO zookeeper.ClientCnxn: Socket connection established to zookeeper.service.geotrellis-spark.internal/10.0.1.42:2181, initiating session
15/04/09 18:36:19 INFO zookeeper.ClientCnxn: Session establishment complete on server zookeeper.service.geotrellis-spark.internal/10.0.1.42:2181, sessionid = 0x14c7724138c007f, negotiated timeout = 30000
Exception in thread "main" java.lang.IllegalArgumentException: Can not create a Path from an empty string
        at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127)
        at org.apache.hadoop.fs.Path.<init>(Path.java:135)
        at org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:244)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:467)
        at geotrellis.spark.io.hadoop.HdfsUtils$.putFilesInConf(HdfsUtils.scala:58)
        at geotrellis.spark.io.hadoop.package$HadoopConfigurationWrapper.withInputDirectory(package.scala:62)
        at geotrellis.spark.io.hadoop.HadoopSparkContextMethods$class.hadoopGeoTiffRDD(HadoopSparkContextMethods.scala:29)
        at geotrellis.spark.io.hadoop.package$HadoopSparkContextMethodsWrapper.hadoopGeoTiffRDD(package.scala:50)
        at geotrellis.spark.ingest.AccumuloIngestCommand$.main(AccumuloIngestCommand.scala:35)
        at geotrellis.spark.ingest.AccumuloIngestCommand$.main(AccumuloIngestCommand.scala:26)
        at com.quantifind.sumac.ArgMain$class.mainHelper(ArgApp.scala:45)
        at com.quantifind.sumac.ArgMain$class.main(ArgApp.scala:34)
        at geotrellis.spark.ingest.AccumuloIngestCommand$.main(AccumuloIngestCommand.scala:26)
        at geotrellis.spark.ingest.AccumuloIngestCommand.main(AccumuloIngestCommand.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



Pitt Fagan

Apr 10, 2015, 12:24:18 PM
to geotrel...@googlegroups.com
This might be a red herring, but I was playing around with variations of the --input argument. I am running the following command from the directory /home/ubuntu/geotrellis. The .tif file I am loading is also in this directory.


ubuntu@ip-10-0-1-42:~/geotrellis$ spark-submit \
--class geotrellis.spark.ingest.AccumuloIngestCommand \
/home/ubuntu/geotrellis/spark/target/scala-2.10/geotrellis-spark-assembly-0.10.0-SNAPSHOT.jar \
--instance geotrellis-accumulo-cluster \
--user root --password secret \
--zookeeper zookeeper.service.geotrellis-spark.internal \
--crs EPSG:3857 --pyramid false --clobber true \
--input file:/ \
--layerName s7 --table 1295534

Here is the exception when running this. Not sure if the flip-flopping of the file extension and the directory structure is a clue about what is happening, but I figured it was worth reporting.

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: file://*.tif/home/ubuntu/geotrellis, expected: file:///
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
        at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:80)
        at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:519)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:737)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:514)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:398)
        at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57)
        at org.apache.hadoop.fs.Globber.glob(Globber.java:252)
        at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1625)
        at geotrellis.spark.io.hadoop.HdfsUtils$.listFiles(HdfsUtils.scala:85)
        at geotrellis.spark.io.hadoop.package$HadoopConfigurationWrapper.withInputDirectory(package.scala:61)
        at geotrellis.spark.io.hadoop.HadoopSparkContextMethods$class.hadoopGeoTiffRDD(HadoopSparkContextMethods.scala:29)
        at geotrellis.spark.io.hadoop.package$HadoopSparkContextMethodsWrapper.hadoopGeoTiffRDD(package.scala:50)
        at geotrellis.spark.ingest.AccumuloIngestCommand$.main(AccumuloIngestCommand.scala:35)
        at geotrellis.spark.ingest.AccumuloIngestCommand$.main(AccumuloIngestCommand.scala:26)
        at com.quantifind.sumac.ArgMain$class.mainHelper(ArgApp.scala:45)
        at com.quantifind.sumac.ArgMain$class.main(ArgApp.scala:34)
        at geotrellis.spark.ingest.AccumuloIngestCommand$.main(AccumuloIngestCommand.scala:26)
        at geotrellis.spark.ingest.AccumuloIngestCommand.main(AccumuloIngestCommand.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

