I have a relatively large GeoTIFF file that I would like to process by
PostGIS (other, much larger GeoTIFFs also exist): 24937 × 38673
pixels
So far I am aware of 3 ways to ingest the file into PostGIS.
However, I have concerns about all of these ways:
1: in-db, no tiling:
raster2pgsql raster.tiff public.new_table | psql -d postgis -U postgis
In fact, we create a single large raster inside PostGIS. Should it be
efficient to process?
2. in-db, with tiling:
raster2pgsql -t 128x128 raster.tiff public.new_table | psql -d
postgis -U postgis
This results in numerous rasters that are percieved by PostGIS as
separate rasters, not as a single raster object. While this seems to
be the most appropriate way in terms of storage, it is very
inconvenient to process such a list of rasters. All queries should
work with a table of rasters, not with a raster.
3. out-db:
raster2pgsql -R raster.tiff public.new_table | psql -d postgis -U postgis
We leave the raster outside the DB. I suppose only internal GeoTIFF
tiling might help to manage the I/O efficiently. However, does PostGIS
take internal tiling into account during the I/O? Is this ingestion
option is less efficient in terms of performance compared to in-db
approaches?
The main questions are:
A. What is the best way to ingest a large GeoTIFF into PostGIS -- in
terms of both storage efficiency and query conveniency?
B. Are there any other ways (efficienct and convenient) to put a large
GeoTIFF into PostGIS?
Thank you!
_______________________________________________
postgis-users mailing list
postgi...@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/postgis-users
What kind of operations are you planning to do with the rasters? That will determine the tile size you should use.
128x128 is most suitable for intensive analysis of small areas. If you are doing less granular stuff like just unioning or clipping rasters, then you'd probably want to go with higher pixel sizes. I tend to use 256x256 and rarely exceed 1000x1000
Out-db operations I haven't done benchmarks recently on it, but some operations are faster with it than in-db and some are faster without-db.
Generally processes like converting to other raster formats, tilling, and clipping as I recall are faster without-db. Operations like reclass / map algebra flavor of things are slower without-db.
Thank you for answering and sharing the experience.
> Storing a large raster in the database as a single raster record is pretty much out of the question. You'd quickly exceed the max rowxcol pixel allowed. I forget exactly what that limit is. In addition the spatial index on the rasters would be useless so any join operation you do or filtering by spatial location will be painfully slow.
I could guess that "rowxcol" probably should be read as "row x
(multiply) col", i.e. the number of pixels per one raster object
>
> What kind of operations are you planning to do with the rasters? That will determine the tile size you should use.
The main workload will be reclassification, map algebra, resampling
and similar operations.
However, if unary operations like interpolation are easy to express
(just run the operation on each table row), others are not.
For example, I would like to subtract one raster of another: A - B.
We import big rasters as tiles that are stored in a table with plenty of rows.
If A and B cover different spatial areas and have different resolutions should I
1) find the common bounding box for A and B, say bbox
2) clip A and B using the bbox (A and B can be already in-db, so this
should be slower than out-db; this is actually a step of a more
complex query)
3) resample A or B to the resolution of B or A (some common resolution)
4) finally subtract A - B
which tiles should I subtract from each other?
how can I be sure that tiles with the same rid from A and B cover
exactly the same spatial extent?
is there any better way to accomplish such a simple operation like the
difference?
will spatial indexing or any other facility help somehow in this situation?
I suppose most difficulties arise from a raster to be dissolved into
separate tiles within a single table leading to complex SQL queries.
Am I right or there are simpler ways to work with large rasters?
Thank you!
Antonio
> postgis-devel mailing list
> postgi...@lists.osgeo.org
> https://lists.osgeo.org/mailman/listinfo/postgis-devel
> Thank you for answering and sharing the experience.
> Storing a large raster in the database as a single raster record is pretty much out of the question. You'd quickly exceed the max rowxcol pixel allowed. I forget exactly what that limit > is. In addition the spatial index on the rasters would be useless so any join operation you do or filtering by spatial location will be painfully slow.
> I could guess that "rowxcol" probably should be read as "row x
> (multiply) col", i.e. the number of pixels per one raster object
Correct sorry forgot that row has two meanings (a db row, vs. rowxcol when talking about rasters).
>
> What kind of operations are you planning to do with the rasters? That will determine the tile size you should use.
> The main workload will be reclassification, map algebra, resampling and similar operations.
> However, if unary operations like interpolation are easy to express (just run the operation on each table row), others are not.
> For example, I would like to subtract one raster of another: A - B.
> We import big rasters as tiles that are stored in a table with plenty of rows.
> If A and B cover different spatial areas and have different resolutions should I
> 1) find the common bounding box for A and B, say bbox
> 2) clip A and B using the bbox (A and B can be already in-db, so this should be slower than out-db; this is actually a step of a more complex query)
> 3) resample A or B to the resolution of B or A (some common resolution)
> 4) finally subtract A - B
> which tiles should I subtract from each other?
> how can I be sure that tiles with the same rid from A and B cover exactly the same spatial extent?
> is there any better way to accomplish such a simple operation like the difference?
> will spatial indexing or any other facility help somehow in this situation?
> I suppose most difficulties arise from a raster to be dissolved into separate tiles within a single table leading to complex SQL queries.
> Am I right or there are simpler ways to work with large rasters?
A lot of questions here. I would suspect you can skip the ST_Clip step, though you might want to verify with timings.
So basic query would look something like below not tested
https://postgis.net/docs/manual-dev/RT_ST_MapAlgebra_expr.html
https://postgis.net/docs/manual-dev/RT_ST_Union.html
https://postgis.net/docs/manual-dev/RT_ST_Resample.html
SELECT A.rid, ST_MapAlgegra( A.rast, 1, ST_Resample(ST_Union(B.rast), A.rast), 1, '[rast1.val] - [rast2.val]') AS new_rast
FROM A INNER JOIN B ON ST_Intersects(A.rast, B.rast)
GROUP BY A.rid;
Ideally you'd try to tile you raster such that each A.rast doesn't have too many B's to union and you'd want you’re a's and B's to be probably like 128x128 so memory footprint is low.
If you get your tiles exactly the same, you could dispense with the ST_Union and also use the much more efficient bbox equality operation ~=
Reducing the query to
SELECT A.rid, ST_MapAlgegra( A.rast, 1, ST_Resample(B.rast, A.rast), 1, '[rast1.val] - [rast2.val]') AS new_rast
FROM A INNER JOIN B ON A.rast ~= B.rast;
Please, consider the following example
Legend:
+ -- + and | are borders of a tile of A (the picture below depicts 1 tile)
+ == + and || are borders of a tile of B (the picture below depicts 4 tiles)
+=====+=====+
|| || ||
|| +---||----+ ||
|| | || | ||
+=====+=====+
|| | || | ||
|| +---||----+ ||
|| || ||
+=====+=====+
What happens if tile sizes of A and B do not coincide and when they
are spatially not aligned:
Does ST_Union(B.rast) become a large raster consisting of 4 tiles?
(borders are === and ||)
Should I also pre-create some indexes to accelerate
ST_Intersects(A.rast, B.rast)? Which indexes may help?
Does bbox operation ~= make sense for the above case (tiles are not
spatially aligned and there is no equal bbox in B for a bbox in A)?
A storage question: does PostGIS guarantees to persist tiles on disk
when I issue SELECT ... INTO (save the result to a new table)? Are
there any related read/write caches for raster data?
Let's clarify things a bit. In context of PostGIS raster, spatially aligned means grid alignment which means
2 tiles have the same pixel size,skew, srid, offset so essentially they are drawn on the same chess board as determined by this function and does not mean they represent the same tile of space.
http://postgis.net/docs/manual-dev/RT_ST_SameAlignment.html
ST_Union as I recall, will fail if the two tiles cannot be drawn on the same chess board (not aligned in raster spatial speak). So that is why you need to align them first using something like ST_Resample or ST_Transform https://postgis.net/docs/manual-dev/RT_ST_Transform.html (feeding ST_Transform a reference raster). All use GDAL WARP algorithms behind the scenes.
1) That said if the alignments are not the same ST_Union will throw an error. Let me know if it doesn't as that sounds suspicious since ST_Union is a pixel by pixel operation so can't work with mismatched pixels.
If the tiles do not completely coincide e.g. B is to the right of A or partially to the right of A, then you get a bigger raster tile as the output of ST_Union.
The default operation of ST_Union when not specified is to use the 'LAST' operation, which means the last raster tile with a pixel value in a given location is the pixel value that is used for that pixel slot.
Note you can override this with FIRST, MIN, MAX, COUNT, SUM, MEAN, RANGE as noted here
https://postgis.net/docs/manual-dev/RT_ST_Union.html
Example say tile A has a pixel in positon pX1Y1 with value 1, 2 in pX2Y2
Tile B has a pixel value in position pX1Y1 with value 2, and no value in pX2Y2, and a value of 8 in pX3Y3
-- your resulting Tile C will have an extent that at least covers (pX1Y1, pX2Y2, pX3Y3) and in those slots would have pX1Y1 = 2, pX2Y2 = 2, PX3Y3 = 8
Which operation is used in most cases, only comes into play, if two pixel values in your set share the same space as in the case of pixel pX1Y1 in the example above. (the second tile B value was used)
In most cases as in Unioning a set of tiles that belong to the same coverage, each pixel is only represented in one tile, so Union(pix A , pix B) -> will use the pixel that has a value
And the resulting is a bigger tile (the union of space of both tiles in pixels that have values)
Which tile is considered LAST/FIRST is something people generally don't bother controlling because they are often dealing with raster coverages where there are no overlapping pixels anyway.
If you did need to control it though as in case where you do have overlapping pixels (like in case where you have observation data and want the last observation for an area to rule), you can do so by applying an order operation
For example.
SELECT ST_Union(A.rast ORDER BY A.date_observation DESC)
FROM A;
2) When you did a load of your raster using raster2pgsql if you used the -I switch
https://postgis.net/docs/manual-dev/using_raster_dataman.html#RT_Raster_Loader
It should have created a spatial index. If you need to do after the fact, you would use the ST_ConvexHull function.
Raster::geometry really is ST_ConvexHull(rast) and what all the raster spatial operators use.
CREATE INDEX idx_your_rast_table_rast_gist ON your_rast_table USING gist( ST_ConvexHull(rast) );
3)
Does bbox operation ~= make sense for the above case (tiles are not spatially aligned and there is no equal bbox in B for a bbox in A)?
No it doesn't ~= only works if your tiles have the same bounding box.
So you should use ST_Intersects or &&. I forget the differences between the ST_Intersects and && . ST_Intersects I think you can pass in a ignore nodata and so if a tile is not completely filled, it would return different from && but a bit slower, ST_Intersects also since it works against the convex hull it can deal with skewed rasters much more efficiently whereas && would only work against bounding box of the skewed raster.
4) A storage question: does PostGIS guarantees to persist tiles on disk when I issue SELECT ... INTO (save the result to a new table)? Are there any related read/write caches for raster data?
No special read/write caches for raster - it uses temp buffers, shared buffers etc just like any other data type.
It will persist the raster data if it is an in-db raster. If it's an out-db raster, you are just copying the meta data.
However if you apply an operation such as ST_Clip or ST_Reclass, your out-db gets copied into the db.
I'm going to try to update the docs to reflect which functions have that behavior and which ones do not.
e.g. if you do ST_Tile, your out-db is still out-db, it's just that each tile ends up referencing a smaller part of the file.
Hope that helps,
Regina
-----Original Message-----
From: postgis-users [mailto:postgis-us...@lists.osgeo.org] On Behalf Of Antonio Rodriges
Sent: Tuesday, May 08, 2018 12:38 PM
To: PostGIS Users Discussion <postgi...@lists.osgeo.org>
Cc: PostGIS Development Discussion <postgi...@lists.osgeo.org>
Subject: Re: [postgis-users] [postgis-devel] Large GeoTIFF ingestion by PostGIS
Thank you, I think I got the main pattern of making a binary operation However, some things need a bit more in-depth understanding
Please, consider the following example
Legend:
+ -- + and | are borders of a tile of A (the picture below depicts 1
+ tile) == + and || are borders of a tile of B (the picture below
+ depicts 4 tiles)
Thank you very much for your time and thorough description of the
PostGIS functionality! This helps a lot!
Sorry for a bit late response.
I now understand the details related to spatial alignment, but what is skew?
It would be great also to clarify the behaviour of ST_Union for the
case of different tile sizes.
I am repeating the query and the illustration for completeness.
SELECT A.rid, ST_MapAlgegra( A.rast, 1,
ST_Resample(ST_Union(B.rast), A.rast), 1, '[rast1.val] - [rast2.val]')
AS new_rast
FROM A INNER JOIN B ON ST_Intersects(A.rast, B.rast)
GROUP BY A.rid;
Legend:
+ -- + and | are borders of a tile of A (the picture below depicts 1 tile)
+ == + and || are borders of a tile of B (the picture below depicts 4 tiles)
+=====+=====+
|| || ||
|| +---||----+ ||
|| | || | ||
+=====+=====+
|| | || | ||
|| +---||----+ ||
|| || ||
+=====+=====+
Now, to compute a single resulting tile, PostGIS needs to UNION 4 tiles of B.
What happens next?
A) does PostGIS leave this big tile in the resulting table as is?
If this is the case, we would have many large overlapping tiles and
much increased, redundant storage.
B) another way is to remove nodatavalues after ST_MapAlgebra and we
will get a tile with a spatial extent not exceeding tile A.
C) or we could re-tile B *before* UNION so that each tile from B has a
respective tile from A (equal bounding boxes)
D) another option is to re-tile the resulting tiles afer ST_MapAlgegra
eliminating overlapping areas and getting smaller tiles
Which case of the above is the default PostGIS behaviour?
Can RT_Retile help for cases (C) or (D)? I did not find an example in
the documentation: https://postgis.net/docs/RT_Retile.html
> However if you apply an operation such as ST_Clip or ST_Reclass, your out-db gets copied into the db.
Why this is the case? Is this done to simplify the development?
Actually we could store out-db a clipped or reclassified raster,
couldn't we?
Thank you!
Antonio