Time series


Thanasis Karmas

May 25, 2016, 1:30:26 PM
to geotrellis-user
Hi all,

I was wondering about time series possibilities in GeoTrellis.
Is it possible to construct a "cube" from multiband tiles on different dates?
If so, could someone provide some code showcasing this feature?

I would be interested in the entire procedure: from reading a multiband GeoTiff,
creating a SpaceTimeKey, and inserting the resulting tiles into a suitable RDD, to
updating the newly created RDD with another multiband GeoTiff.

From the documentation I have found the classes

SpaceTimeHadoopOutput() and TemporalMultibandGeoTiffHadoopInput()

but I am confused about how to use them to create multitemporal tiles!


If it is not possible to create such a cube, how are temporal queries handled in GeoTrellis?

(meaning, for example, how to use the TemporalWindow class on the stored RDDs?)


I am really sorry about the rather abstract question, but I was not able to find relevant
code examples.


Thank you very much in advance,

Thanasis



Eugene Cheipesh

May 26, 2016, 5:11:38 PM
to geotrel...@googlegroups.com
Hello Thanasis,

No need to apologize for abstract questions; it’s on us to create docs that cover these topics, and this is going to be a start for that.

To answer your question broadly before giving some details:

MultibandTile is a stack of tiles, intended mostly for multi-spectral imagery. SpaceTimeKey has a time instant as a member, so it is a point in time. What is most readily represented, then, is a multi-band stack rather than a cube (since we say nothing about temporal resolution). We implement all storage/query based on this key. One may argue that the idea of a cube is very important as an optimization for always keeping a set of images together. This may be true, but we have not explored those use cases yet.

You have a couple of options:

1. Treat your data as a “stack” and express operations on keys in the manner that we do.

2. Place your own meaning on the band indices and treat each index i as an increment of d*i time units from the instant of the key (a sketch follows below). This would express a time cube.

3. If you find that kind of approach kludgy, it would be possible to create a new value type MultibandCube or TimeCube: something that wraps Seq[MultibandTile] and makes it behave as a cube. In order to read/write such RDDs you would only need to provide an AvroRecordCodec type class instance, which could be built from the already implemented types with relative ease.

This is a use case we envisioned but have not fleshed out yet. If you end up going down this road I think this would be an excellent contribution that could be rolled into GeoTrellis master as you work on it.
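
For a rough illustration of option 2, here is a minimal sketch, assuming a fixed step between bands (the 16-day value is just the Landsat revisit period used as an example):

```scala
import geotrellis.raster.{MultibandTile, Tile}
import geotrellis.spark.SpaceTimeKey

// Hypothetical convention: band i of the MultibandTile is the slice at
// key.instant + i * stepMillis, where stepMillis is a resolution you choose
val stepMillis: Long = 16L * 24 * 60 * 60 * 1000

def sliceInstant(key: SpaceTimeKey, bandIndex: Int): Long =
  key.instant + bandIndex * stepMillis

def slice(cube: MultibandTile, bandIndex: Int): Tile =
  cube.band(bandIndex)
```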


Overall we attempt to express as much of the algorithms as possible through type classes, so that new value and key types can be adopted incrementally by implementing the relevant type classes. For instance, for MultibandCube to be mergeable it would need to implement TileMergeMethods: https://github.com/geotrellis/geotrellis/blob/2402607b0c6f5a4e087bc46fe32c5f56e206947e/raster/src/main/scala/geotrellis/raster/merge/TileMergeMethods.scala which, again, in this case could be expressed in terms of existing functionality.

I’ve written this doc to cover some of the things you’re asking about and as a start for documentation on this topic. Please let me
know what parts need to be clarified or elaborated for your needs:

https://gist.github.com/echeipesh/ed7eea567f6d001813ed5d1d22e6949b

What is a MultiBand image

https://github.com/geotrellis/geotrellis/blob/master/raster/src/main/scala/geotrellis/raster/MultibandTile.scala

MultibandTile is an indexed stack of Tiles (each representing a band) which all share the same pixel count, cell type, and, by implication, spatial extent when they're wrapped in Raster[MultibandTile]. There is nothing that imposes meaning on the band index, so the bands can be anything. So far we've used them in the predictable way of reading multi-spectral images.
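
As a minimal sketch of that in code (the tile sizes and values here are arbitrary):

```scala
import geotrellis.raster._

// Three single-band tiles that share dimensions and cell type
val b0 = IntArrayTile.fill(1, 256, 256)
val b1 = IntArrayTile.fill(2, 256, 256)
val b2 = IntArrayTile.fill(3, 256, 256)

// Stack them; the index is just a band number and carries no further meaning
val mb: MultibandTile = ArrayMultibandTile(b0, b1, b2)

mb.bandCount         // 3
mb.band(1)           // the second band as a Tile
mb.subsetBands(0, 2) // a new MultibandTile holding only bands 0 and 2
```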

Sourcing MultiBand images

We can read multiband GeoTiffs like so:

import geotrellis.raster.MultibandTile
import geotrellis.spark.io.hadoop._
import geotrellis.vector.ProjectedExtent
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

val sc: SparkContext = ???
val source: RDD[(ProjectedExtent, MultibandTile)] = sc.hadoopMultibandGeoTiffRDD("hdfs://namenode:8020/dir/")

That uses MultibandGeoTiffInputFormat under the covers and doesn't really require any configuration.

Reading temporal GeoTiffs is a little trickier. There doesn't seem to be a standard for which TIFF tag holds the timestamp. There is an sc.hadoopTemporalMultibandGeoTiffRDD method that assumes the tag is GEOTIFF_TIME_TAG, but that's essentially a default we used in the past. This is an area where more sugar should be added for a nicer API. Here is the verbose way to use it:

import geotrellis.raster.MultibandTile
import geotrellis.spark.TemporalProjectedExtent
import geotrellis.spark.io.hadoop._
import geotrellis.spark.io.hadoop.formats._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

val sc: SparkContext = ???

val path: Path = new Path("hdfs://namenode:8020/dirs")
val conf: Configuration = sc.hadoopConfiguration.withInputDirectory(path, List("*.tiff"))

TemporalGeoTiffInputFormat.setTimeTag(conf, "SCENE_TIME")
TemporalGeoTiffInputFormat.setTimeFormat(conf, "YYYY:MM:dd HH:mm:ss")

val source: RDD[(TemporalProjectedExtent, MultibandTile)] =
    sc.newAPIHadoopRDD(conf,
        classOf[TemporalMultibandGeoTiffInputFormat],
        classOf[TemporalProjectedExtent],
        classOf[MultibandTile])

Tiling Multiband images

Having read some unstructured GeoTiffs isn't much use on its own, since we have no way to join one set to another or save them in a meaningful way. So the thing to do is to tile them to a specific layout. We can do it like so:

val tileLayerMetadata: TileLayerMetadata = source.collectMetadata[SpaceTimeKey](FloatingLayoutScheme(512))
val tiledRdd: RDD[(SpaceTimeKey, MultibandTile)] = source.tileToLayout[SpaceTimeKey](tileLayerMetadata, Bilinear)

There are a number of overloads to collectMetadata that allow you to specify some of the things you know: https://github.com/geotrellis/geotrellis/blob/master/spark/src/main/scala/geotrellis/spark/package.scala#L127-L147

collectMetadata will actually fetch the data and reduce over it to figure these things out. If you know a fair bit about the structure of your input, there is a decent chance you will be able to construct TileLayerMetadata directly and avoid this costly step.
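
A rough sketch of building the metadata by hand follows; the field names and the KeyBounds shape are my reading of the TileLayerMetadata case class, so treat them as assumptions and check against your GeoTrellis version:

```scala
import geotrellis.proj4.LatLng
import geotrellis.raster._
import geotrellis.spark._
import geotrellis.spark.tiling.LayoutDefinition
import geotrellis.vector.Extent

// The full extent of the data and the grid it is cut into (8x8 tiles of 512x512 pixels)
val extent = Extent(20.0, 35.0, 30.0, 42.0)
val layout = LayoutDefinition(extent, TileLayout(8, 8, 512, 512))

// Hand-built metadata for a layer keyed by SpaceTimeKey
val md = TileLayerMetadata(
  cellType = UShortCellType,
  layout   = layout,
  extent   = extent,
  crs      = LatLng,
  bounds   = KeyBounds(
    SpaceTimeKey(0, 0, 0L),
    SpaceTimeKey(7, 7, Long.MaxValue)))
```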

The two choices of layout scheme are FloatingLayoutScheme, which can be at any (floating) resolution, and ZoomedLayoutScheme, which represents a TMS pyramid. A TMS pyramid has a resolution for each tier, implied by the CRS and the tile size used.
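
For example (the CRS and tile sizes below are just illustrative):

```scala
import geotrellis.proj4.WebMercator
import geotrellis.spark.tiling.{FloatingLayoutScheme, ZoomedLayoutScheme}

// Keep roughly the native resolution of the input, cut into 512x512 tiles
val floating = FloatingLayoutScheme(512)

// Snap to the nearest zoom level of a TMS pyramid in WebMercator, 256x256 tiles
val zoomed = ZoomedLayoutScheme(WebMercator, tileSize = 256)
```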

Ingest Process

The process of tiling / reprojecting / pyramiding / saving is called an ingest process. I think it's important to be able to break the process up into these steps so the data can be handled in between. However, if you find yourself in a standard case you can check out this helper class: https://github.com/geotrellis/geotrellis/blob/master/spark/src/main/scala/geotrellis/spark/ingest/MultibandIngest.scala

SpaceTimeKey

case class SpaceTimeKey(col: Int, row: Int, instant: Long)

col and row are from the LayoutDefinition specified in the metadata, a non-overlapping raster grid. instant is the instant of time in milliseconds. Unlike space, time is not effectively gridded here. It will be gridded when writing, but in memory a SpaceTimeKey is essentially a point in time.
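
For example, a key can be built from a timestamp string (the date format and the col/row values here are made up; java.time is used just for the parsing):

```scala
import java.time.format.DateTimeFormatter
import java.time.{LocalDateTime, ZoneOffset}

import geotrellis.spark.SpaceTimeKey

// Parse a date string (e.g. taken from a scene's metadata file) into milliseconds
val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
val instant: Long =
  LocalDateTime.parse("2016-05-25 10:30:00", fmt).toInstant(ZoneOffset.UTC).toEpochMilli

// col/row come from the layout grid; the instant carries the time component
val key = SpaceTimeKey(col = 10, row = 20, instant = instant)
```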

Updating MultiBand images

Once you have tiles and keys, doing joins becomes trivial. Merging one RDD into another is essentially a Spark cogroup operation. There is nice sugar for it:

val rdd1: RDD[(SpaceTimeKey, MultibandTile)] = ???
val rdd2: RDD[(SpaceTimeKey, MultibandTile)] = ???

val updated = rdd1.merge(rdd2)

Eventually this is going to call apply on TileRDDMerge object: https://github.com/geotrellis/geotrellis/blob/master/spark/src/main/scala/geotrellis/spark/merge/TileRDDMerge.scala#L12

There is a detail here: pair RDDs give us no guarantee that keys are unique. So when doing a merge we have to consider the possibility that there will be multiple tiles with the same key on both the left and the right side. The rule we follow is to merge the left and right sides independently and then merge right into left.
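
A conceptual sketch of that rule in plain Spark follows; this is not the actual TileRDDMerge implementation, just the shape of it (it assumes the merge extension methods that come with import geotrellis.raster._):

```scala
import geotrellis.raster._
import geotrellis.spark.SpaceTimeKey
import org.apache.spark.rdd.RDD

def mergeRule(rdd1: RDD[(SpaceTimeKey, MultibandTile)],
              rdd2: RDD[(SpaceTimeKey, MultibandTile)]): RDD[(SpaceTimeKey, MultibandTile)] =
  rdd1.cogroup(rdd2).mapValues { case (lefts, rights) =>
    // Collapse duplicate keys on each side independently...
    val left  = lefts.reduceOption(_ merge _)
    val right = rights.reduceOption(_ merge _)
    // ...then merge right into left
    (left, right) match {
      case (Some(l), Some(r)) => l merge r
      case (Some(l), None)    => l
      case (None, Some(r))    => r
      case (None, None)       => sys.error("cogroup never yields a key with no values")
    }
  }
```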

Saving / Querying Multiband images

To write the images you need a layer writer; here is a snippet for constructing various kinds:

// `output` and `params` are assumed to come from the application's own configuration
def writer(): LayerWriter[LayerId] = output match {
  case "s3" =>
    S3LayerWriter(params("bucket"), params("prefix"))
  case "file" =>
    FileLayerWriter(params("path"))
  case "accumulo" =>
    val zookeeper: String = params.get("zookeeper").getOrElse {
      val conf = new Configuration // if not specified assume zookeeper is same as DFS master
      new URI(conf.get("fs.defaultFS")).getHost
    }
    val instance = AccumuloInstance(params("instance"), zookeeper, params("user"), new PasswordToken(params("password")))
    val strategy = params.get("ingestPath") match {
      case Some(path) => HdfsWriteStrategy(path)
      case None => SocketWriteStrategy()
    }
    AccumuloLayerWriter(instance, params("table"), strategy)
}

The layer writer needs to know the metadata for the RDD it is writing. If we look at the write method: https://github.com/geotrellis/geotrellis/blob/master/spark/src/main/scala/geotrellis/spark/io/LayerWriter.scala#L39 we see that it expects an RDD[(K, V)] with Metadata[M]. (This is the same thing that the layer reader will give back to us.)

We can make an object with that signature by combining the TileLayerMetadata with our RDD in a ContextRDD:

val rdd: RDD[(SpaceTimeKey, MultibandTile)] = ???
val md: TileLayerMetadata = ???

val layer: RDD[(SpaceTimeKey, MultibandTile)] with Metadata[TileLayerMetadata] = ContextRDD(rdd, md)

Indexing

The other thing we need to do is provide a KeyIndex or KeyIndexMethod. Ultimately our query is going to need to use a B-Tree index, and SpaceTimeKey does not imply any way to order our records sequentially. We solve this problem by using a space-filling curve.

In the spatiotemporal case it's time to pay the piper and settle on the granularity of time, since an SFC requires an integer coordinate. You can trace the effects using ZCurveKeyIndexMethod as an example here: https://github.com/geotrellis/geotrellis/blob/master/spark/src/main/scala/geotrellis/spark/io/index/ZCurveKeyIndexMethod.scala

This has a couple of consequences:

  • You get improved locality from SFC properties
  • You can have index collisions depending on your resolution choices

Index collisions aren't fatal; on writing, the records will be sorted by their index and any collisions will be written as a sequence of (K, V) pairs.
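
Putting the pieces together, a write could look roughly like this; the layer name, the zoom of 0, and the byDay temporal resolution are just examples (check the exact helper names on ZCurveKeyIndexMethod for your version):

```scala
import geotrellis.spark.LayerId
import geotrellis.spark.io.index.ZCurveKeyIndexMethod

// Hypothetical layer id; zoom 0 is a common convention for non-pyramided layers
val layerId = LayerId("path_row_layer", 0)

// Grid time by day for the Z-curve and write the layer built above
writer().write(layerId, layer, ZCurveKeyIndexMethod.byDay())
```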

Reading and Querying

Reading happens as you would imagine. Thankfully there is already a section of the docs for it: https://github.com/geotrellis/geotrellis/blob/master/docs/spark/spark-io.md Hopefully that covers the questions you may have.
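
For completeness, here is a rough sketch of reading a layer back and running a time-bounded query; the catalog path and layer name are made up, and the java.time dates assume a newer API (older versions took Joda-Time dates in Between):

```scala
import java.time.{ZoneOffset, ZonedDateTime}

import geotrellis.raster.MultibandTile
import geotrellis.spark._
import geotrellis.spark.io._
import geotrellis.spark.io.hadoop._
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkContext

implicit val sc: SparkContext = ???

val reader = HadoopLayerReader(new Path("hdfs://namenode:8020/catalog"))
val layerId = LayerId("path_row_layer", 0)

// Read the whole layer back as RDD[(K, V)] with Metadata[M]
val layer =
  reader.read[SpaceTimeKey, MultibandTile, TileLayerMetadata[SpaceTimeKey]](layerId)

// Or query just a temporal slice of it
val start = ZonedDateTime.of(2016, 1, 1, 0, 0, 0, 0, ZoneOffset.UTC)
val end   = ZonedDateTime.of(2016, 6, 1, 0, 0, 0, 0, ZoneOffset.UTC)

val slice =
  reader
    .query[SpaceTimeKey, MultibandTile, TileLayerMetadata[SpaceTimeKey]](layerId)
    .where(Between(start, end))
    .result
```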

Temporal Window

You're asking about this class: https://github.com/geotrellis/geotrellis/blob/2402607b0c6f5a4e087bc46fe32c5f56e206947e/spark/src/main/scala/geotrellis/spark/mapalgebra/local/temporal/Implicits.scala#L18

Since we're working with a collection that has a time component, one thing we might want to do is a window operation. That is, we want to define an interval and apply some operation to all values in that interval (e.g. min, max, variance).

The TemporalWindow class is a method extension that creates a TemporalWindowState, which is a builder for such an operation. Ultimately what is going to happen is:

  • map each key to some integer window ID, creating duplicate keys
  • fold over the resulting records, applying some operation to the values
  • return the minimum key for each window with the result of the operation

So, to clarify, the temporal window represents more of an operation and its eventual result than a data structure.
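
A conceptual sketch of those three steps in plain Spark (this is not the TemporalWindow API itself, just the shape of the computation, using a per-pixel minimum as the example operation):

```scala
import geotrellis.raster._
import geotrellis.spark.SpaceTimeKey
import org.apache.spark.rdd.RDD

val rdd: RDD[(SpaceTimeKey, Tile)] = ???
val windowMillis = 24L * 60 * 60 * 1000  // e.g. one-day windows

val windowed: RDD[(SpaceTimeKey, Tile)] =
  rdd
    // 1. map each key to an integer window id (duplicating keys within a window)
    .map { case (key, tile) => ((key.col, key.row, key.instant / windowMillis), (key, tile)) }
    // 2. fold over the records in each window, applying the operation to the values
    .reduceByKey { case ((k1, t1), (k2, t2)) =>
      val k = if (k1.instant <= k2.instant) k1 else k2    // 3. keep the minimum key
      (k, t1.combine(t2) { (a, b) => math.min(a, b) })    // per-pixel minimum
    }
    .values
```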


-- 
Eugene Cheipesh

Thanasis Karmas

Jun 1, 2016, 8:14:05 AM
to geotrellis-user
Hi Eugene,

Thank you very much for this detailed answer! I dedicated a lot of time to trying to fully
comprehend what you posted. The fact is that I am not an experienced GeoTrellis or Scala user,
so I have trouble putting all this knowledge to use.

I mean that, for instance, after reading a newly stored RDD[(SpatialKey, MultibandTile)] with Metadata[TileLayerMetadata[SpatialKey]],
I have trouble calling the necessary functions to extract one band from the RDD, add a number to all cells, and then store the
result back to the RDD.

In order to overcome such issues, which stem from my lack of experience, I created a list of common actions needed when processing
geospatial data. I believe that if you can provide me with the function calls needed to complete these actions I will be able to move a step
forward! Also, I think it could be a good starting point for other new GeoTrellis users.

So, for example, let's assume a use case where we have all the Landsat 8 images for one path/row and we want to process them as MultibandTiles
with a SpaceTimeKey. The course of action is the following:

1) Read a multiband GeoTiff (the date will be a string derived from a metadata file, as there is no date in the GeoTiff itself)
code

val sc: SparkContext = ???
val path: Path = new Path("hdfs://namenode:8020/dirs")
val conf: Configuration = sc.hadoopConfiguration.withInputDirectory(path, List("*.tiff"))
//TemporalGeoTiffInputFormat.setTimeTag(conf, "SCENE_TIME")
//TemporalGeoTiffInputFormat.setTimeFormat(conf, "YYYY:MM:dd HH:mm:ss")

How do I create the SpaceTimeKey from a date string derived from the metadata of an image?

val source: RDD[(TemporalProjectedExtent, MultibandTile)] =
    sc.newAPIHadoopRDD(conf,
        classOf[TemporalMultibandGeoTiffInputFormat],
        classOf[TemporalProjectedExtent],
        classOf[MultibandTile])
val tileLayerMetadata: TileLayerMetadata = source.collectMetadata[SpaceTimeKey](FloatingLayoutScheme(512))
val tiledRdd: RDD[(SpaceTimeKey, MultibandTile)] = source.tileToLayout[SpaceTimeKey](tileLayerMetadata, Bilinear).repartition(100)



2) Store the tiledRdd as a HadoopLayer (path_row_layer)
code
??
We'll use the HadoopAttributeStore and HadoopLayerWriter objects from geotrellis.spark.io.hadoop

3) Then we want to read a new multiband GeoTiff (a new Landsat image), which we will use to update the path_row_layer
code
same as step 1

4) We have to read in the path_row_layer, which will be of the type RDD[(SpaceTimeKey, MultibandTile)] with Metadata[TileLayerMetadata[SpaceTimeKey]]
code
??
We'll use the HadoopLayerReader object from geotrellis.spark.io.hadoop

5) We need to update the path_row_layer with the new Landsat image
code
??

6) We now need to store the multiband layer, which contains two multiband images of different dates, into Hadoop
code
??

7) Then, if we want to process the multiband layer, we have to read it from Hadoop
code
??

8) We want to process all tiles of the multiband layer that belong to a specific band, e.g. add a number to all tiles of band 1 at both dates
code
??


9) Store the result layer
code
same as step 2 for example


I would be extremely grateful if you could provide the missing code.
Sorry for the long post.
Thank you very much!!
Thanasis

Eugene Cheipesh

Jun 2, 2016, 6:07:50 AM
to geotrel...@googlegroups.com
Hello Thanasis,

Since you mention Landsat: I’ve recently written a GeoTrellis Landsat 8 ingest, and I think your questions 1 and 2 will be directly answered by checking it out:

One notable difference in that code is that it is tiled to a `ZoomedLayoutScheme` so it can be displayed through a tile service. In your sample it looks like you’re going to use `FloatingLayoutScheme`. Just pointing that out.

That is a work in progress and will require you to locally publish `scala-landsat-util` 0.2.0 to actually compile/run until the corresponding PR is merged and published:

For questions 3 - 6:
There is actually a `LayerUpdater` hierarchy that will help you to update an already saved layer:

Using it would look something like this:

I just wrote that up and the compiler likes it, but I’ll need to run it on a cluster to make sure there isn’t something I’m forgetting. I’ll be rolling that feature into the main branch, so I’ll get to do that. But at least you can see the shape of things.

Also check out the `update` overload on the updater that allows you to define your own merge method for tiles that overlap. By default we do a pixel merge, the same as is used in the original tiling.
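
In the meantime, a rough sketch of the update call itself, with the backend-specific construction of the updater left as ??? since that depends on your setup:

```scala
import geotrellis.raster.MultibandTile
import geotrellis.spark._
import geotrellis.spark.io._
import org.apache.spark.rdd.RDD

// Construct the backend-specific updater (e.g. the Hadoop one) as appropriate
val updater: LayerUpdater[LayerId] = ???

// The newly tiled scene, keyed and tiled to the same layout as the saved layer
val newScene: RDD[(SpaceTimeKey, MultibandTile)] with Metadata[TileLayerMetadata[SpaceTimeKey]] = ???

// Merge the new scene into the already saved layer
updater.update(LayerId("path_row_layer", 0), newScene)
```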

Questions 7-9:
There isn’t a reader use case in the demo right now, but it really is as simple as “make a reader, call its read method”.
While I think of a way to integrate use of the reader into the demo, I think this doc covers the reader section well:

You can see one way to interact with a MultibandTile in this NDVI calc:

This is probably the most useful reference on things you can do with MultibandTile and what they actually do:

Your specific case will look like this for a single tile:
```scala
val updatedMBTile = mbTile.map(1)(x => x + 1)
```
For the RDD case it should look something like:

```scala
val rdd: RDD[(SpaceTimeKey, MultibandTile)] with Metadata[TileLayerMetadata] = ???
val updatedRDD = rdd.withContext { _.mapValues(_.map(1)( x => x + 1)) }
```

Essentially we don’t have a way to make sure that the transformations you ask for keep the metadata valid, so we make it painfully clear in the API when you’re expected to modify it or it gets lost.

Sorry things got a little more hand-wavy towards the end; I’ll try to integrate those cases into the demo. I hope this is enough to get you started with more concrete examples, though. As an overall comment, I feel that the pipeline for the actual ingest/tile/update is always the thorniest part, where plumbing decisions need to be made, but once that is done the code to use/process the data is almost trivial.

Best!
-- 
Eugene Cheipesh