Load ingested tiff as RDD from HDFS


Jorge Peña

Feb 10, 2016, 11:06:40 AM
to geotrellis-user
Hello,

I managed to ingest a TIFF raster into HDFS, thanks to the help of pomadchin on Gitter, using the following script:

SPARK_SUBMIT="spark-1.5.2/bin/spark-submit"

# To generate the assembly run:
# git clone https://github.com/pomadchin/geotrellis-chatta-demo.git
# cd geotrellis-chatta-demo
# git checkout spark-version
# ./sbt assembly
# Note: it may be necessary to run ./publish-local.sh in the geotrellis repository first
JAR="geotrellis-chatta-demo/geotrellis/target/scala-2.10/GeoTrellis-Tutorial-Project-assembly-0.1-SNAPSHOT.jar"

# Amount of memory for the driver
DRIVER_MEMORY=3G

# Amount of memory per executor. If in local mode, change the DRIVER_MEMORY instead.
EXECUTOR_MEMORY=4G

# MASTER
# For local ingest, options are "local" or "local[K]", where K is the number of worker threads, e.g. "local[8]"
MASTER=local[*]

# Name of the layer. This will be used in conjunction with the zoom level to reference the layer (see LayerId)
LAYER_NAME=madrid

# This defines the destination spatial reference system we want to use
CRS=EPSG:4326

LAYOUT_SCHEME="tms" # Not very sure about how to define the layout scheme

# Path to the input GeoTIFF(s)
INPUT=file:///catalog/madrid/madrid.tif

# Catalog directory on HDFS
OUTPUT=hdfs://localhost:8020/catalog

# Remove some bad signatures from the assembled JAR
zip -d $JAR META-INF/ECLIPSEF.RSA > /dev/null
zip -d $JAR META-INF/ECLIPSEF.SF > /dev/null

$SPARK_SUBMIT \
--class geotrellis.chatta.ChattaIngest \
--master $MASTER \
--driver-memory $DRIVER_MEMORY \
--executor-memory $EXECUTOR_MEMORY \
$JAR \
--input hadoop --format geotiff --cache NONE -I path=$INPUT \
--output hadoop -O path=$OUTPUT \
--layer $LAYER_NAME --crs $CRS --layoutScheme $LAYOUT_SCHEME


I'm trying to use geotrellis-spark-etl from a java project using the following version:

    <repositories>
        <repository>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
            <id>bintray-azavea-geotrellis</id>
            <name>bintray</name>
            <url>http://dl.bintray.com/azavea/geotrellis</url>
        </repository>
    </repositories>

    <dependencies>
        <dependency>
            <groupId>com.azavea.geotrellis</groupId>
            <artifactId>geotrellis-spark-etl_2.10</artifactId>
            <version>0.10.0-b3b859d</version>
        </dependency>
    </dependencies>

I've read several posts and looked around different repositories (geotrellis, gt-admin, chatta) to find out how to load my TIFF as an RDD, but I haven't managed to do it.

The most straightforward example I've found so far is the one from this thread; unfortunately, the API it shows (HadoopCatalog and HadoopRasterCatalogSpec) is no longer available in the version I'm trying to use.

I also tried to port the ChattaIngest code to Java in order to use the spark-etl load, but I don't know how to instantiate the Etl object.

Could you please post or point me to some updated information on how to accomplish this with the version I'm using?

Thank you very much.

P.S. I'm using that version instead of the latest one because there was no GDAL build for Scala 2.11, and it failed when I tried to load the TIFF locally using GeoTiffReader.
P.S. Sorry if my post was a bit verbose, but I tried to summarize the answer I got on Gitter and how I managed to ingest the TIFF, so that someone in the same situation can find it.

D S PRASANNA

Mar 8, 2016, 3:27:06 AM
to geotrellis-user
Hi Jorge,

I am stuck with the same problem you had. Did you find any workaround for it?
Can you please help me if you have made any progress?

regards
Prasanna Sudhindrakumar

Jorge Peña

Mar 9, 2016, 3:43:26 AM
to geotrellis-user
Hi Prasanna,

I finally moved to Scala because, at the moment, some of the GeoTrellis API is not usable from Java.

I ended up ingesting rasters from my own project as follows:

Create a RasterRDD using the following methods:


// Imports needed to compile the snippets below. Note the hadoop import:
// it adds hadoopGeoTiffRDD to SparkContext via an implicit class.
import geotrellis.raster.Tile
import geotrellis.spark._
import geotrellis.spark.io.hadoop._
import geotrellis.spark.tiling._
import geotrellis.vector.ProjectedExtent
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def readTiles(sc: SparkContext, source: String, scheme: LayoutScheme = DEFAULT_SCHEME): RasterRDD[SpatialKey] = {
  val tiff = readGeoTiff(sc, source)

  // Derive the layer metadata (cell type, layout, extent, CRS) from the RDD.
  val (_, sourceMetadata) = RasterMetaData.fromRdd(tiff, scheme)

  // Snap the extent outward so it covers whole tiles of the layout.
  val metadata = snappedMetaData(sourceMetadata)

  ContextRDD(tiff.tileToLayout(sourceMetadata), metadata)
}

def snappedMetaData(metadata: RasterMetaData): RasterMetaData = {
  // Round-tripping the extent through grid bounds snaps it to tile boundaries.
  val gridBounds = metadata.mapTransform(metadata.extent)
  val snapExtent = metadata.mapTransform(gridBounds)

  val RasterMetaData(_, LayoutDefinition(_, tileLayout), _, _) = metadata
  val layout = LayoutDefinition(snapExtent, tileLayout)

  RasterMetaData(metadata.cellType, layout, snapExtent, metadata.crs)
}

def readGeoTiff(sc: SparkContext, source: String): RDD[(ProjectedExtent, Tile)] = {
  // Read the GeoTIFF in as a single-image RDD, using a method implicitly
  // added to SparkContext by the "import geotrellis.spark.io.hadoop._"
  // statement above.
  sc.hadoopGeoTiffRDD(source)
}


Note: I'm using a FloatingLayoutScheme (it just defines the pixel size of each tile in the RDD).
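
For reference, DEFAULT_SCHEME in the snippet above is just a floating layout scheme with my tile size:

// A FloatingLayoutScheme only fixes the pixel dimensions of each tile
// (here 250x250, the size I mention below).
// Uses the same geotrellis.spark.tiling._ import as the snippet above.
val DEFAULT_SCHEME: LayoutScheme = FloatingLayoutScheme(250)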

Then I store it tiled in HDFS using:


HadoopLayerWriter.spatial(catalogPath, HilbertKeyIndexMethod)(sc).write(LayerId("myraster", 1), tiles)


And read it back using:


HadoopLayerReader.spatial(catalogPath)(sc).read(LayerId("myraster", 1))
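
(catalogPath is the catalog location on HDFS; assuming you build it as a hadoop filesystem Path, it would be something like this, matching the OUTPUT of the ingest script above:)

import org.apache.hadoop.fs.Path

// The same catalog directory the ingest script wrote to.
val catalogPath = new Path("hdfs://localhost:8020/catalog")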


I haven't tried it, but I guess you could also read back script-ingested rasters using the previous method.

I hope it helps you. Please note that in Scala you sometimes need to import certain packages in order to bring implicit classes/methods into scope (otherwise the compiler won't find methods such as sc.hadoopGeoTiffRDD; see the imports at the top of the snippet above).

Cheers

D S PRASANNA

Mar 10, 2016, 12:35:27 AM
to geotrellis-user
Hi Jorge,

Thanks a lot for your help. I was able to load the TIFF using their HadoopIngestCommand.scala.
Their APIs have changed and are now in Scala.
I will try the way you have described here to write the TIFF as a RasterRDD and read it back.
Can you please share the build.sbt file you used, so that I can download the correct versions of the dependencies?
I used Maven and ended up with older versions.

Thanks beforehand!

Regards
Prasanna Sudhindrakumar

Jorge Peña

Mar 10, 2016, 3:29:40 AM
to geotrellis-user
Hi,

You can find a GeoTrellis sbt template here:


Currently I'm using revision df9500b
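
In the meantime, a minimal build.sbt mirroring the Maven coordinates from my first post would look something like this (a sketch; the template is the canonical version):

scalaVersion := "2.10.6"  // any 2.10.x, to match the _2.10 artifact

resolvers += "bintray-azavea-geotrellis" at "http://dl.bintray.com/azavea/geotrellis"

libraryDependencies ++= Seq(
  "com.azavea.geotrellis" %% "geotrellis-spark-etl" % "0.10.0-b3b859d",
  "org.apache.spark"      %% "spark-core"           % "1.5.2" % "provided"
)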

Cheers

D S PRASANNA

Mar 10, 2016, 8:34:38 AM
to geotrellis-user
Hi Jorge,

Thanks a lot.
I am a bit confused about the code snippet you have sent.
If I am right:
with readGeoTiff we are reading the TIFF file from a local directory;
with snappedMetaData we are loading the TIFF into HDFS as a RasterRDD, based on the layout scheme and layout definition.

Please correct me if I am wrong.

regards
Prasanna Sudhindrakumar

Jorge Peña

Mar 11, 2016, 3:11:48 AM
to geotrellis-user
Yes, I use readGeoTiff to read a single TIFF from a local (or remote) directory and split it into several tiles according to a layout definition of 250x250-pixel tiles from its origin (it will create extra cells to snap it to the layout definition).

Then I write the tiles into HDFS and read them back when needed. Note that if you don't use tileToLayout after hadoopGeoTiffRDD, you get an RDD with only one tile, which is not very useful for parallel processing.
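
To make that concrete, here is the difference, using the same calls (and imports) as my earlier snippet:

// hadoopGeoTiffRDD alone yields a single (ProjectedExtent, Tile) record
// covering the whole image; tileToLayout cuts it into layout-sized tiles
// keyed by SpatialKey, so Spark can distribute the work.
val raw: RDD[(ProjectedExtent, Tile)] = sc.hadoopGeoTiffRDD(source)
val (_, metadata) = RasterMetaData.fromRdd(raw, FloatingLayoutScheme(250))
val tiled: RDD[(SpatialKey, Tile)] = raw.tileToLayout(metadata)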

D S PRASANNA

Mar 11, 2016, 6:07:17 AM
to geotrellis-user
Thank you! About the KeyBounds: my understanding is that the values should be between 0 (min) and 250 (max) if my tiles are of 250x250 size. Right?

D S PRASANNA

Mar 11, 2016, 6:36:33 AM
to geotrellis-user
Also, I am using the line below to write the tiles into HDFS:

HadoopLayerWriter.spatial(WritePath, paramkeyindex)(sc).write(LayerId("nlcd", 1), RasterMetaData)

where paramkeyindex has to be a geotrellis.spark.io.index.KeyIndexMethod[geotrellis.spark.SpatialKey].
But the createIndex method in the abstract KeyIndexMethod has return type geotrellis.spark.io.index.KeyIndex[geotrellis.spark.SpatialKey], which I find confusing.

Can you please help me?

Rob Emanuele

Mar 13, 2016, 2:06:31 PM
to geotrel...@googlegroups.com
Hi Prasanna,

In the current version you're working with, the HadoopLayerWriter expects a KeyIndexMethod, not a KeyIndex. It ends up calling the createIndex method that you mentioned internally, as it discovers the KeyBounds of the layer.
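
For example (a sketch against that same API; writePath and tiles stand in for your own path and layer):

import geotrellis.spark.LayerId
import geotrellis.spark.io.hadoop.HadoopLayerWriter
import geotrellis.spark.io.index.HilbertKeyIndexMethod

// Pass the KeyIndexMethod itself, not a KeyIndex; the writer invokes
// createIndex internally once it knows the layer's KeyBounds.
HadoopLayerWriter.spatial(writePath, HilbertKeyIndexMethod)(sc)
  .write(LayerId("nlcd", 1), tiles)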

We're about to release 0.10, so some of the API you're using is going to be a bit different, though. We're currently writing documentation for everything, so letting us know of the questions you come up with, either here or on our Gitter channel, https://gitter.im/geotrellis/geotrellis, is a big help. We tend to be more responsive on the Gitter channel, though that's mainly because we have our heads down trying to get this release out :)

Thanks,
Rob

--
Robert Emanuele, Tech Lead
Azavea |  990 Spring Garden Street, 5th Floor, Philadelphia, PA
remanuele@azavea.com  | T 215.701.7502  | Web azavea.com  |  @azavea

Jorge Peña

Mar 14, 2016, 5:51:42 AM
to geotrellis-user
To expand on the writer shown earlier: the full call from my project is the following (regionPath(region) is just how I build the catalog path for each region):


HadoopLayerWriter.spatial(regionPath(region), HilbertKeyIndexMethod)(sc).write(LayerId("myraster", 1), tiles)
