Questions on Ingesting rasters and using RasterRDDs


Pitt Fagan

Jan 8, 2015, 11:47:12 AM
to geotrel...@googlegroups.com
Hello GeoTrellis community,

I am seeking some assistance with running GeoTrellis interactively with Spark. I used the method described in the 30 Sept. post to create an sbt-assembly JAR file and start Spark interactively, like so:
./spark-shell --master spark://localhost.localdomain:7077 --jars ~/workspace/geotrellis/spark/target/scala-2.10/geotrellis-spark-assembly-0.10.0-SNAPSHOT.jar

Once in the Scala shell I can import geotrellis to get access to the functionality contained in the package.

I am trying to do a proof of concept by:
a) taking a couple of GeoTiffs stored on the local filesystem,
b) ingesting them into HDFS,
c) creating RasterRDDs from the GeoTiffs,
d) doing some simple calculations on the rasters (basically subtracting one from the other as they both have the same extent),
e) exporting the resulting GeoTiff to HDFS. 

I am hoping to get some assistance with commands/examples to accomplish steps b and c outlined above. A previous post mentioned using the HadoopIngestCommand. Any assistance would be appreciated.

Cheers,
Pitt

Rob Emanuele

Jan 8, 2015, 12:37:57 PM
to geotrel...@googlegroups.com
Hey Pitt,

Let me sketch out how this process is approached.

The target command you'll run is HadoopIngestCommand (https://github.com/geotrellis/geotrellis/blob/master/spark/src/main/scala/geotrellis/spark/ingest/HadoopIngestCommand.scala). You'll run it using spark-submit. I haven't run an ingest interactively through a Spark shell, but if the jars are in the appropriate places I'm sure it's possible.

 - Create the geotrellis-spark assembly 
  This can be done by running 
      > ./sbt "project spark" assembly
  at the command line

 - See if this script provides some guidance: https://gist.github.com/lossyrob/59f8116b07d37f7f45c5

This should ingest the GeoTIFFs into your catalog folder on HDFS. You should be able to navigate to the layer in HDFS, e.g.

> hadoop fs -ls hdfs://localhost/catalog/nlcd
Jan 08, 2015 12:33:15 PM org.apache.hadoop.util.NativeCodeLoader <clinit>
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 12 items
drwxr-xr-x   - rob supergroup          0 2015-01-08 12:31 hdfs://localhost/catalog/nlcd/1
drwxr-xr-x   - rob supergroup          0 2015-01-08 12:31 hdfs://localhost/catalog/nlcd/10
drwxr-xr-x   - rob supergroup          0 2015-01-08 12:31 hdfs://localhost/catalog/nlcd/11
drwxr-xr-x   - rob supergroup          0 2015-01-08 12:31 hdfs://localhost/catalog/nlcd/12
drwxr-xr-x   - rob supergroup          0 2015-01-08 12:31 hdfs://localhost/catalog/nlcd/2
drwxr-xr-x   - rob supergroup          0 2015-01-08 12:31 hdfs://localhost/catalog/nlcd/3
drwxr-xr-x   - rob supergroup          0 2015-01-08 12:31 hdfs://localhost/catalog/nlcd/4
drwxr-xr-x   - rob supergroup          0 2015-01-08 12:31 hdfs://localhost/catalog/nlcd/5
drwxr-xr-x   - rob supergroup          0 2015-01-08 12:31 hdfs://localhost/catalog/nlcd/6
drwxr-xr-x   - rob supergroup          0 2015-01-08 12:31 hdfs://localhost/catalog/nlcd/7
drwxr-xr-x   - rob supergroup          0 2015-01-08 12:31 hdfs://localhost/catalog/nlcd/8
drwxr-xr-x   - rob supergroup          0 2015-01-08 12:31 hdfs://localhost/catalog/nlcd/9


These are the ingested zoom levels for the layer.

At this point you should be able to load up the raster through the HDFS catalog. Try this out and see if this process works for you. When you get the raster ingested and want to try out doing some processing of it, let me know and we can go through that process.

Some notes: The tiled GeoTIFFs shouldn't be extremely large, so that the ingest can take advantage of parallelism. If you're ingesting one large raster, you might run into out-of-memory issues. If that's the case, you can use GDAL to cut the raster into tiles to make the ingest more manageable:
gdal_retile.py -ps 512 512 -targetDir tiles-512 nlcd_2011.tif 
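Conceptually, gdal_retile just carves the raster into fixed-size pixel windows, with the last row and column of tiles clipped to the raster's edge. As a rough plain-Scala sketch of that window arithmetic (this is an illustration of the idea, not gdal's actual implementation):

```scala
object RetileSketch {
  // One tile window: pixel offset into the raster plus its actual size.
  case class Window(xOff: Int, yOff: Int, width: Int, height: Int)

  // Enumerate the windows covering a rasterWidth x rasterHeight image
  // using tileSize x tileSize tiles; edge tiles are clipped to the raster.
  def windows(rasterWidth: Int, rasterHeight: Int, tileSize: Int): Seq[Window] =
    for {
      yOff <- 0 until rasterHeight by tileSize
      xOff <- 0 until rasterWidth by tileSize
    } yield Window(
      xOff, yOff,
      math.min(tileSize, rasterWidth - xOff),
      math.min(tileSize, rasterHeight - yOff)
    )

  def main(args: Array[String]): Unit = {
    // A 1100x600 raster cut into 512x512 tiles -> 3 columns x 2 rows = 6 windows.
    val ws = windows(1100, 600, 512)
    println(ws.size)   // 6
    println(ws.last)   // Window(1024,512,76,88): the clipped bottom-right tile
  }
}
```

Each window becomes one small GeoTIFF, which is what lets Spark spread the ingest work across tasks.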

Let me know how it goes.

 - Rob

--
You received this message because you are subscribed to the Google Groups "geotrellis-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to geotrellis-us...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Rob Emanuele, Tech Lead, GeoTrellis

Azavea |  340 N 12th St, Ste 402, Philadelphia, PA
rema...@azavea.com  | T 215.701.7692  | F 215.925.2663
Web azavea.com  |  Blog azavea.com/blogs  | Twitter @azavea

Pitt Fagan

Jan 8, 2015, 3:44:08 PM
to geotrel...@googlegroups.com
Hi Rob,

Thanks for the reply. I modified the script you provided and ran it. I am just trying to load a single GeoTIFF, which is perhaps the cause of this error; I am not familiar with a GeoKey directory. I have gzipped and attached the GeoTIFF I am trying this on, in case that helps. Is there a specific format you require for the GeoTIFFs?

Thanks,
Pitt

15/01/08 12:35:43 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
geotrellis.raster.io.geotiff.reader.MalformedGeoTiffException: no geokey directory
    at geotrellis.raster.io.geotiff.reader.GeoTiffReader.readImageDirectory(GeoTiffReader.scala:82)
    at geotrellis.raster.io.geotiff.reader.GeoTiffReader.readImageDirectories(GeoTiffReader.scala:71)
    at geotrellis.raster.io.geotiff.reader.GeoTiffReader.read(GeoTiffReader.scala:50)
    at geotrellis.spark.io.hadoop.formats.GeotiffRecordReader.initialize(GeotiffInputFormat.scala:57)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:135)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
631245_GSY.TIF.gz

Rob Emanuele

Jan 8, 2015, 3:56:29 PM
to geotrel...@googlegroups.com
Hi Pitt, it looks like this GeoTIFF can't be read by our GeoTIFF reader. I'm not sure why this is; I've contacted the author of the GeoTIFF reader to work it out. In the meantime, here's a workaround:

gdal_translate -a_srs EPSG:3857 631245_GSY.tif 631245_GSY-2.tif

This just pushes the raster through gdal_translate and adds geotags so that our reader can read it (the EPSG code should match the projection the raster is actually in).



Pitt Fagan

Jan 8, 2015, 4:22:23 PM
to geotrel...@googlegroups.com
OK, thanks Rob. I will give this a whirl. I was just about to run gdalinfo to inspect the file's header and see what's up. I will let you know how your suggestions turn out. Thanks again!

Pitt Fagan

Jan 8, 2015, 5:25:44 PM
to geotrel...@googlegroups.com
Hi Rob,

OK, I translated the input file and replaced the old file. That helped to get past the previous error. Here is the next error. Please let me know if you need any more info (the ingest script or the latest GeoTiff).

Thanks,
Pitt

Exception in thread "main" java.util.NoSuchElementException: None.get
    at scala.None$.get(Option.scala:313)
    at scala.None$.get(Option.scala:311)
    at geotrellis.spark.RasterMetaData$.fromRdd(RasterMetaData.scala:73)
    at geotrellis.spark.ingest.Ingest$.apply(Ingest.scala:59)
    at geotrellis.spark.ingest.HadoopIngestCommand$.main(HadoopIngestCommand.scala:29)
    at geotrellis.spark.ingest.HadoopIngestCommand$.main(HadoopIngestCommand.scala:18)
    at com.quantifind.sumac.ArgMain$class.mainHelper(ArgApp.scala:39)
    at com.quantifind.sumac.ArgMain$class.main(ArgApp.scala:34)
    at geotrellis.spark.ingest.HadoopIngestCommand$.main(HadoopIngestCommand.scala:18)
    at geotrellis.spark.ingest.HadoopIngestCommand.main(HadoopIngestCommand.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Rob Emanuele

Jan 8, 2015, 5:41:28 PM
to geotrel...@googlegroups.com
I was able to ingest that TIF when I ran gdal_translate -a_srs EPSG:3857.

What was the EPSG code you used for gdal_translate?

What is the EPSG code you are projecting to (CRS in the shell script)?

The exception is being thrown by https://github.com/geotrellis/geotrellis/blob/master/spark/src/main/scala/geotrellis/spark/RasterMetaData.scala#L73, which points to something weird going on with the extent.


Pitt Fagan

Jan 8, 2015, 5:54:22 PM
to geotrel...@googlegroups.com
I was specifying EPSG:4326 in both the GDAL translation and in the shell script you provided. EPSG:3857 might be more appropriate for this image. I will try it now and let you know the results.

Pitt Fagan

Jan 8, 2015, 6:25:53 PM
to geotrel...@googlegroups.com
Hi Rob,

OK, I was able to ingest the data with EPSG:3857 myself, so this was a success. Thanks for your help with this. The script was very helpful.

You had indicated in an earlier post that you would be amenable to providing some insight into doing processing with the geotiffs now that they have been ingested. Please let me know the next step(s) to creating the RasterRDD.

Pitt

Rob Emanuele

Jan 8, 2015, 6:34:01 PM
to geotrel...@googlegroups.com
Sure...what are you looking to do with the RasterRDD? You might be able to load the RDD from the interactive shell now if it can access the appropriate binaries. I'm signing off for now, I can help later tonight, but take a look at:

The Hadoop catalog: new one of these up and you can call .load[SpatialKey](LayerId("layerName", 9)), for example:

This bit of code is a TMS service that uses an Accumulo catalog to load a filtered RasterRDD:

Once you have the raster RDD, if you have geotrellis.spark.op.local._ imported, you'll have access to all the local operations, so you could do things like

rdd + 5 // Add 5 to each cell

or if you had two RDD's,

rdd1 + rdd2 // Add the values of each of the corresponding cells

so something like:

val rdd1 = hadoopCatalog.load[SpatialKey](LayerId("one", 9))
val rdd2 = hadoopCatalog.load[SpatialKey](LayerId("two", 9))

val added = rdd1 + rdd2
val (min, max) = added.minMax

println(s"MIN: $min, MAX: $max")
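Under the hood, a local op like + or - just combines corresponding cells of the two layers. A minimal plain-Scala sketch of those semantics (a flat Array[Int] standing in for a tile; this illustrates the idea, it is not the GeoTrellis API):

```scala
object LocalOpSketch {
  // A "tile" here is just a flat array of cell values plus its dimensions.
  case class Tile(cols: Int, rows: Int, cells: Array[Int]) {
    require(cells.length == cols * rows, "cell count must match dimensions")

    // Combine two tiles of the same extent, cell by cell.
    def combine(other: Tile)(f: (Int, Int) => Int): Tile = {
      require(cols == other.cols && rows == other.rows, "extents must match")
      Tile(cols, rows, cells.zip(other.cells).map { case (a, b) => f(a, b) })
    }

    def +(other: Tile): Tile = combine(other)(_ + _)
    def -(other: Tile): Tile = combine(other)(_ - _)
    def minMax: (Int, Int) = (cells.min, cells.max)
  }

  def main(args: Array[String]): Unit = {
    val t1 = Tile(2, 2, Array(10, 20, 30, 40))
    val t2 = Tile(2, 2, Array(1, 2, 3, 4))
    val diff = t1 - t2                  // cell-wise subtraction
    println(diff.cells.mkString(","))   // 9,18,27,36
    val (min, max) = diff.minMax
    println(s"MIN: $min, MAX: $max")    // MIN: 9, MAX: 36
  }
}
```

In the real RasterRDD case, the same cell-wise combine is applied per tile, in parallel across the RDD's partitions.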


Pitt Fagan

Jan 8, 2015, 7:52:06 PM
to geotrel...@googlegroups.com
I am just trying to do some basic map algebra with the rasters. I took a look at the links you provided below. I started the spark-shell with the --jars argument for the sbt assembly JAR. I can import geotrellis._ to get access to the functionality in the shell. When trying to create the Hadoop catalog (through which I can load the layer that has been ingested) I am running into an issue. Basically I am unsure how to specify the fs path, which is the second argument to the apply method that creates the HadoopCatalog. Any advice on this? Thx.

val hc = geotrellis.spark.io.hadoop.HadoopCatalog(sc,"hdfs://localhost.localdomain/catalog","/user/cloudera" )

error: type mismatch;
 found   : String("hdfs://localhost.localdomain/catalog")
 required: org.apache.hadoop.fs.Path

Rob Emanuele

Jan 8, 2015, 11:15:51 PM
to geotrel...@googlegroups.com
The hadoop catalog currently requires that you wrap the path string in an org.apache.hadoop.fs.Path. We should maybe add an overload that just takes a String.


val hc = geotrellis.spark.io.hadoop.HadoopCatalog(sc, new org.apache.hadoop.fs.Path("hdfs://localhost.localdomain/catalog"))
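The String overload suggested above is a common companion-object pattern in Scala. As a plain-Scala sketch (hypothetical Path and Catalog stand-ins, not the actual Hadoop or GeoTrellis classes):

```scala
object OverloadSketch {
  // Stand-in for org.apache.hadoop.fs.Path: just wraps a URI string.
  class Path(val uri: String) {
    override def toString: String = uri
  }

  class Catalog(val rootPath: Path)

  object Catalog {
    // Primary apply taking the typed Path...
    def apply(rootPath: Path): Catalog = new Catalog(rootPath)
    // ...plus a convenience overload that wraps a raw String for the caller.
    def apply(rootPath: String): Catalog = apply(new Path(rootPath))
  }

  def main(args: Array[String]): Unit = {
    // With the overload, no explicit `new Path(...)` is needed at the call site.
    val hc = Catalog("hdfs://localhost.localdomain/catalog")
    println(hc.rootPath)   // hdfs://localhost.localdomain/catalog
  }
}
```

The overload is purely a convenience; both apply methods construct the same catalog.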

Pitt Fagan

Jan 9, 2015, 11:38:41 AM
to geotrel...@googlegroups.com
Hi Rob,

OK, I have the catalog defined, per your previous post. Here is the command that I used.

scala> val hc = geotrellis.spark.io.hadoop.HadoopCatalog(sc, new org.apache.hadoop.fs.Path("hdfs://localhost.localdomain/user/cloudera/catalog") )
hc: geotrellis.spark.io.hadoop.HadoopCatalog = geotrellis.spark.io.hadoop.HadoopCatalog@46e04c7f

From your second to last post yesterday, you referenced this command to create a Raster RDD.


val rdd1 = hadoopCatalog.load[SpatialKey](LayerId("one", 9))

When I try a couple of variants of this, I get the following message.

scala> val ras9 = hc.load[SpatialKey](LayerId("631245_GSY", 9) )
<console>:20: error: not found: type SpatialKey

Looking at the code in the HadoopCatalog.scala file, I am guessing I don't understand what value should be supplied for the SpatialKey type parameter. Or the HadoopCatalog is not set up correctly. Here is what HDFS looks like for the relevant directories.

[cloudera@localhost geotrellis]$ hadoop fs -ls /user/cloudera/catalog/631245_GSY
Found 14 items
drwxr-xr-x   - cloudera cloudera          0 2015-01-08 15:10 /user/cloudera/catalog/631245_GSY/1
drwxr-xr-x   - cloudera cloudera          0 2015-01-08 15:10 /user/cloudera/catalog/631245_GSY/10
drwxr-xr-x   - cloudera cloudera          0 2015-01-08 15:10 /user/cloudera/catalog/631245_GSY/11
drwxr-xr-x   - cloudera cloudera          0 2015-01-08 15:10 /user/cloudera/catalog/631245_GSY/12
drwxr-xr-x   - cloudera cloudera          0 2015-01-08 15:10 /user/cloudera/catalog/631245_GSY/13
drwxr-xr-x   - cloudera cloudera          0 2015-01-08 15:10 /user/cloudera/catalog/631245_GSY/14
drwxr-xr-x   - cloudera cloudera          0 2015-01-08 15:10 /user/cloudera/catalog/631245_GSY/2
drwxr-xr-x   - cloudera cloudera          0 2015-01-08 15:10 /user/cloudera/catalog/631245_GSY/3
drwxr-xr-x   - cloudera cloudera          0 2015-01-08 15:10 /user/cloudera/catalog/631245_GSY/4
drwxr-xr-x   - cloudera cloudera          0 2015-01-08 15:10 /user/cloudera/catalog/631245_GSY/5
drwxr-xr-x   - cloudera cloudera          0 2015-01-08 15:10 /user/cloudera/catalog/631245_GSY/6
drwxr-xr-x   - cloudera cloudera          0 2015-01-08 15:10 /user/cloudera/catalog/631245_GSY/7
drwxr-xr-x   - cloudera cloudera          0 2015-01-08 15:10 /user/cloudera/catalog/631245_GSY/8
drwxr-xr-x   - cloudera cloudera          0 2015-01-08 15:10 /user/cloudera/catalog/631245_GSY/9
[cloudera@localhost geotrellis]$ hadoop fs -ls /user/cloudera/catalog/631245_GSY/9
Found 4 items
-rw-r--r--   3 cloudera cloudera          0 2015-01-08 15:10 /user/cloudera/catalog/631245_GSY/9/_SUCCESS
-rw-r--r--   3 cloudera cloudera       9472 2015-01-08 15:10 /user/cloudera/catalog/631245_GSY/9/metadata.json
drwxr-xr-x   - cloudera cloudera          0 2015-01-08 15:10 /user/cloudera/catalog/631245_GSY/9/part-r-00000
-rw-r--r--   3 cloudera cloudera         18 2015-01-08 15:10 /user/cloudera/catalog/631245_GSY/9/splits
 
Thanks in advance for your assistance.

Rob Emanuele

Jan 9, 2015, 12:06:09 PM
to geotrel...@googlegroups.com
If there is a "not found: type T" error in Scala, it means that type is not visible to the compiler: it is a compile error, not a runtime error. This is one way the REPL can be a disadvantage compared to writing a program and submitting it through spark-submit; in the REPL it's hard to discern compile errors from runtime errors.

Either way, if the compiler can't find the type, that means it's either misspelled or not in scope. Make sure you import the package for that type. In this case, import geotrellis.spark._

The "SpatialKey" type is what tells the catalog the type of key you need the raster to be returned with. We currently support SpatialKey and SpaceTimeKey, which holds both spatial and temporal information. If you ingest a raster with a SpatialKey (which the HadoopIngestCommand does), you need to get that layer out and specify the SpatialKey type. This allows the returned RDD[T] to be typed against the key, so what you get back is RDD[SpatialKey].
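The key-type parameter is ordinary Scala generics: the caller names the key type so the returned collection comes back typed. A plain-Scala sketch of that pattern with a hypothetical mini-catalog (not the GeoTrellis API; SpatialKey/SpaceTimeKey/LayerId here are toy stand-ins):

```scala
object TypedKeySketch {
  case class LayerId(name: String, zoom: Int)
  case class SpatialKey(col: Int, row: Int)
  case class SpaceTimeKey(col: Int, row: Int, time: Long)

  // A toy catalog: stores records as (key, value) pairs under a LayerId.
  // The caller supplies the key type K, so the result comes back typed.
  class Catalog(store: Map[LayerId, Seq[(Any, Int)]]) {
    def load[K](id: LayerId): Seq[(K, Int)] =
      store(id).map { case (k, v) => (k.asInstanceOf[K], v) }
  }

  def main(args: Array[String]): Unit = {
    val catalog = new Catalog(Map(
      LayerId("one", 9) -> Seq((SpatialKey(0, 0), 42), (SpatialKey(1, 0), 7))
    ))
    // Without the [SpatialKey] annotation, the compiler has no way to know
    // which key type this layer was ingested with.
    val layer: Seq[(SpatialKey, Int)] = catalog.load[SpatialKey](LayerId("one", 9))
    println(layer.map(_._2).sum)   // 49
  }
}
```

This is why load[SpatialKey](...) fails when SpatialKey isn't imported: the compiler needs the type itself in scope before it can parameterize the call.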

It might be tough to find which types live in which packages, to know what to import. The lack of documentation for the development code means you'll have to dive a bit into the GeoTrellis source. You can go to the GitHub repo, press 't', and then type the type name; if there is a file with the same name (which there often is, and there is for SpatialKey.scala), you can see where that code lives.

Alternatively, until you get comfortable navigating around, you could do a sort of blanket import of the types:

import geotrellis.spark._
import geotrellis.spark.op.local._
import geotrellis.spark.io._
import geotrellis.spark.io.hadoop._
import geotrellis.raster._
import geotrellis.vector._

I think this should cover a lot of the types. If you hit a type that cannot be found and can't track down its package, let me know and I'll tell you the appropriate import.

Pitt Fagan

Jan 9, 2015, 1:36:26 PM
to geotrel...@googlegroups.com
Hi Rob,

After I wrote you (and before you replied), I was poking around the codebase and found the SpatialKey.scala file. I had gotten to this point:

scala> val ras9 = hc.load[geotrellis.spark.SpatialKey](LayerId("631245_GSY", 9) )
<console>:20: error: not found: value LayerId

After I ran the import statements in your previous post, I ran this version and everything worked.

Lots to learn!

Thanks so much for your help with this Rob. I appreciate it.

Pitt


scala> val ras9 = hc.load[SpatialKey](LayerId("631245_GSY", 9) )
ras9: geotrellis.spark.RasterRDD[geotrellis.spark.SpatialKey] = RasterRDD[2] at RDD at RasterRDD.scala:28

杨雨浩

Jul 16, 2015, 3:03:20 AM
to geotrel...@googlegroups.com

Hey Pitt,
I am a postgraduate student, and I am looking for guidance on something very similar to your work. I found that you solved this problem with Rob's help, so would you mind telling me the steps that made it succeed for you?
I have now configured my cluster (Spark on YARN), and I used the command hadoop fs -put /xx.tif /xx to put the file into HDFS. Would you help me with the next step?

-Albert

On Friday, January 9, 2015 at 12:47:12 AM UTC+8, Pitt Fagan wrote:

Albert

Jul 16, 2015, 3:19:49 AM
to geotrel...@googlegroups.com
Hey Rob,
I'm looking for solutions for my research, and I am now trying to use geotrellis-spark to do some operations on GeoTIFF files.
I downloaded geotrellis-master.tar.gz, and I want to compile it to get geotrellis-spark-assembly-0.10.0-SNAPSHOT.jar. I used the command ./sbt "project spark" assembly in the geotrellis directory, but I got an error. Is something wrong with my steps? Could you give me some advice?

-Albert


On Friday, January 9, 2015 at 1:37:57 AM UTC+8, Rob Emanuele wrote:

Rob Emanuele

Jul 16, 2015, 3:25:52 AM
to geotrel...@googlegroups.com
Hi Albert,

What's the error?

Albert

Jul 16, 2015, 8:59:48 AM
to geotrel...@googlegroups.com
Hey Rob,

i use the command to achieve the goal: ./spark-shell  --master spark://master:7077 --jars ~/geotrellis/spark/target/scala-2.10/geotrellis-spark-assembly-0.10.0-SNAPSHOT.jar --class geotrellis.spark.ingest.HadoopIngestCommand  --driver-memory 1g  --layerName ncld --input file:/input --catalog hdfs://output --pyramid true --clobber true

but I got a problem, shown in the attached picture:

I don't know how to solve it. Please give me some guidance, thank you!

On Thursday, July 16, 2015 at 3:25:52 PM UTC+8, Rob Emanuele wrote:

Albert

Jul 16, 2015, 9:36:57 AM
to geotrel...@googlegroups.com

Hey Rob,
I have just hit the same problem, and I want to know how to translate the GeoTIFF. What command do I need to solve this?

-Albert
On Friday, January 9, 2015 at 4:56:29 AM UTC+8, Rob Emanuele wrote:

Harsh Mehta

Jan 9, 2017, 6:14:50 AM
to geotrellis-user
Hi Albert,
   I am working with GeoTrellis, but I want to know how to ingest a GeoTIFF image using the Etl.ingest() method. I also want to know about the arguments of the ingest method, like projectedExtent.

Rob Emanuele

Jan 12, 2017, 10:52:18 AM
to geotrellis-user
Hi Harsh,

Thanks for reaching out. An ETL tutorial can be found here: http://geotrellis.readthedocs.io/en/latest/tutorials/etl-tutorial/

If you have more questions, please use our LocationTech mailing list: https://locationtech.org/mailman/listinfo/geotrellis-user

Thanks,
Rob