Hello everyone,
This problem may occur for several reasons. I suppose that it is a combination of:
- tile cache rebuild
- tile cache invalidation
- browser cache invalidation
- errors (database connection problems, general rendering errors, ...) when generating PNG tiles
- sync problem between main and spatial database
- scaling / connection issues
- database not responding fast enough
We run a completely custom tile service for both TMS and MVT. Since our dataset receives a lot of updates, we have fairly complex logic in place that handles the near-realtime sync of additions and updates between our main database and our spatial database, including the re-generation of tiles for the specific area that changed. This works very well for the MVT part, since generating the binary tiles, even calculating them "on the fly", does not take much power and is very fast. In contrast, rebuilding a PNG tile cache takes considerably more time and CPU power, especially at the high zoom levels that we have to support.

The service scales its instances depending on the load, so they are designed to be stateless. All instances share a pool of rebuild tasks, and each task is picked up by a single instance. Each task has a specific "try-X-times-then-discard" configuration. It looks like in this case several tile rebuild tasks were dropped because of a recurring problem during the rebuild. I am not sure what the problem was; most likely the database cluster cannot keep up with the number of requests it receives.

Unfortunately there is no easy fix for this. The easiest way (and the last resort) would be to handle it like most other services and rebuild the tile cache on a fixed schedule. But this means that the PNG tiles will "lag" behind the current dataset depending on the rebuild interval, and it will introduce other problems that have to be solved, e.g. making sure that no inconsistencies arise when a change happens during the rebuild process.
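To make the "try-X-times-then-discard" behaviour described above a bit more concrete, here is a minimal Python sketch. It is not our actual service code: `RebuildTask`, `process_pool` and `render_tile` are hypothetical names, and the shared pool is reduced to a simple in-memory queue. It only shows how a tile ends up stale when the same error keeps recurring.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RebuildTask:
    """Hypothetical rebuild task for one PNG tile (zoom, x, y)."""
    tile: tuple
    max_attempts: int = 3   # the "X" in try-X-times-then-discard
    attempts: int = 0


def process_pool(tasks, render_tile):
    """Run every task until it succeeds or runs out of attempts.

    Returns two lists of tile coordinates: (rebuilt, dropped).
    """
    queue = deque(tasks)
    rebuilt, dropped = [], []
    while queue:
        task = queue.popleft()
        task.attempts += 1
        try:
            render_tile(task.tile)
        except Exception:
            if task.attempts < task.max_attempts:
                queue.append(task)          # put it back for another try
            else:
                dropped.append(task.tile)   # discarded: the old PNG stays cached
        else:
            rebuilt.append(task.tile)
    return rebuilt, dropped


def flaky_render(tile):
    """Stand-in renderer: one tile always fails, e.g. a database timeout."""
    if tile == (18, 5, 7):
        raise TimeoutError("database did not respond in time")


rebuilt, dropped = process_pool(
    [RebuildTask((18, 5, 6)), RebuildTask((18, 5, 7))],
    flaky_render,
)
```

In this sketch the second tile fails on every attempt and lands in `dropped`, so its cached PNG is never refreshed until something else invalidates it.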
I'll have a look at it over the next few days and reset the cache, or maybe even lower the cache expiration to get better "self-healing" of the PNG tile cache, but I'm unsure of the outcome. Our focus is currently on the completely new web interface, which will be released shortly. After that, this will get high priority and I can take an in-depth look at it.
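As a rough illustration of what "self-healing" through a shorter expiration means, here is a small Python sketch. Again, this is not our service code: `TileCache`, `ttl_seconds` and the injected `clock` are assumptions made for the example. A cached tile is re-rendered on the next request once it is older than the expiration, so a tile whose rebuild task was dropped can only stay stale for that long.

```python
import time


class TileCache:
    """Hypothetical in-memory PNG tile cache with a time-based expiration."""

    def __init__(self, render_tile, ttl_seconds, clock=time.monotonic):
        self._render = render_tile
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # tile -> (png_bytes, stored_at)

    def get(self, tile):
        now = self._clock()
        entry = self._entries.get(tile)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]                # still fresh: serve from the cache
        png = self._render(tile)           # missing or expired: render again
        self._entries[tile] = (png, now)
        return png
```

The trade-off is that a shorter expiration means more re-renders, and therefore more requests to the database, which is why I'm unsure of the outcome.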
Cheers,
Stephan