I have a layer of grid cells and a layer of discreet points representing elevation samples. My grid layer has a total of about 430,000 cells, the elevation data contains about 320,000 points, and both datasets have spatial indexes on them. I need to join each grid cell with the nearest elevation point (within at most 30 feet of the cell’s center point). To accomplish this, I created the following query:
SELECT DISTINCT ON (grid_rail_lines.pk_uid)
grid_rail_lines.pk_uid
, elev_rail_combined.pk_uid AS el_id
, elev_rail_combined.elevation
FROM
grid_rail_lines JOIN elev_rail_combined
ON grid_rail_lines.the_geom<->elev_rail_combined.the_geom < 30
ORDER BY
grid_rail_lines.pk_uid ASC
, grid_rail_lines.the_geom<->elev_rail_combined.the_geom ASC
EXPLAIN provides the following information (also available at depesz):
Unique (cost=141973943356.82..142206589733.23 rows=434170 width=624)
-> Sort (cost=141973943356.82..142090266545.03 rows=46529275283 width=624)
Sort Key: grid_rail_lines.pk_uid, ((grid_rail_lines.the_geom <-> elev_rail_combined.the_geom))
-> Nested Loop (cost=0.00..4910712887.16 rows=46529275283 width=624)
Join Filter: ((grid_rail_lines.the_geom <-> elev_rail_combined.the_geom) < 30::double precision)
-> Seq Scan on grid_rail_lines (cost=0.00..13202.70 rows=434170 width=484)
-> Seq Scan on elev_rail_combined (cost=0.00..6220.05 rows=321505 width=140)
I let this query run over the weekend and it took a total of 21 hours. These datasets are rather large so I expect it to take a long time, but I wonder if there is a more efficient way to conduct the join. As far as I can tell, I’ve structured the query as recommended in the documentation. Does anyone have ideas for how to improve performance?
Thanks,
Spencer
Try this:
SELECT grid.pk_uid, point.pk_uid AS el_id, point.elevation
FROM grid_rail_lines grid, elev_rail_combined point
WHERE point.pk_uid = (SELECT point.pk_uid id2
FROM grid_rail_lines point
ORDER BY point.geom <-> grid.geom
LIMIT 1)
ORDER BY grid.pk_uid
Thanks! That did the trick. I had previously tried a slightly different version of your solution but it was far less efficient.