Should I store offline calculation results in the cache?

Antonis Christofides

May 27, 2017, 5:27:11 AM

to django...@googlegroups.com

Hello all,

I have an application that calculates and tells you whether a specific crop at a specific piece of land needs to be irrigated, and how much. The calculation lasts for a few seconds, so I'm doing it offline with Celery. Every two hours new meteorological data comes in and all the pieces of land are recalculated.

The question is where to store the results of the calculation. I thought that since they are re-creatable, the cache would be the appropriate place. However, there is a difference with the more common use of the cache: they are re-creatable, but they are also necessary. You can't just go and delete any item in the cache. This will cripple the website, which expects to find the calculation results in the cache. Viewing something on the site will never trigger a recalculation (and if I make it trigger, it will be a safety procedure for edge cases and not the normal way of doing things). The results must also survive reboots, so I chose the file-based cache.

I didn't know about culling, so when the pieces of land grew to 100, and the items in the cache to 400 (4 items need to be stored for each piece of land), I spent a few hours trying to find out what the heck was going on. I solved the problem by tweaking the culling parameters. However, all this has raised a few issues:

The filesystem cache can't grow too much because of issue 11260, which is marked wontfix. According to Russell Keith-Magee,

"the filesystem cache is intended as an easy way to test caching, not as a serious caching strategy. The default cache size and the cull strategy implemented by the file cache should make that obvious. If you need a cache capable of holding 100000 items, I strongly recommend you look at memcache. If you insist on using the filesystem as a cache, it isn't hard to subclass and extend the existing cache."

If these comments are correct, then the documentation needs some fixing, because not only does it not say that the filesystem cache is not for serious use, but it implies the opposite:

"Without a really compelling reason, ... you should stick to the cache backends included with Django. They’ve been well-tested and are easy to use."

Is Russell not entirely correct perhaps, or is the documentation? Or am I missing something?

In the end, is it a bad idea to use the cache for this particular case? I also have a similar use case in an unrelated app: a page that needs about a minute to render. Although I've implemented a quick-and-dirty solution of increasing the web server's timeout and caching the page, I guess the correct way would be to produce that page offline with Celery or so. Where would I store such a page if not in the cache?
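For reference, the culling parameters mentioned earlier in this message live in the OPTIONS of the cache configuration. The fragment below is only a sketch: the directory and the numbers are made-up values, not the ones used in this project.

```python
# settings.py (fragment). LOCATION and the numbers are illustrative assumptions.
CACHES = {
    "default": {
        "BACKEND": "django.core.cache.backends.filebased.FileBasedCache",
        "LOCATION": "/var/cache/myproject",  # hypothetical directory
        "TIMEOUT": None,  # None means entries never expire on their own
        "OPTIONS": {
            # Django's default is 300; culling starts once this is reached.
            "MAX_ENTRIES": 10000,
            # The fraction culled is 1/CULL_FREQUENCY (default 3);
            # 0 would empty the whole cache when the limit is reached.
            "CULL_FREQUENCY": 3,
        },
    }
}
```

Raising MAX_ENTRIES only postpones culling; it does not turn the cache into permanent storage.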

-- 
Antonis Christofides
http://djangodeployment.com

Alexandre Pereira da Silva

May 27, 2017, 10:48:20 AM

to django...@googlegroups.com

Redis caching is a good solution for this. You can have persistence, it's fast, and it's limited only by your available RAM.

Your implementation should be more robust against items missing from the cache. If for any reason the cache is lost, your code should rebuild the values in a background task.
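One way to read this advice is sketched below, under loud assumptions: a plain dict stands in for the cache, a standard-library queue stands in for Celery, and the names get_result, worker_step and the "irrigation:" key prefix are invented for illustration.

```python
import queue

# Stand-ins for the real cache and the Celery queue (both hypothetical here).
cache = {}
recalc_queue = queue.Queue()


def get_result(field_id):
    """Return the cached result, or None after scheduling a recalculation."""
    key = f"irrigation:{field_id}"
    result = cache.get(key)
    if result is None:
        # Cache miss: do not compute inside the request;
        # hand the work to the background worker instead.
        recalc_queue.put(field_id)
    return result


def worker_step(calculate):
    """Process one queued field; in a real deployment this would be a Celery task."""
    field_id = recalc_queue.get_nowait()
    cache[f"irrigation:{field_id}"] = calculate(field_id)
```

The view can show a "results are being recalculated" message when get_result returns None. This sketch does not deduplicate requests, so repeated misses for the same field would queue the same work more than once.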

--
You received this message because you are subscribed to the Google Groups "Django users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-users+unsubscribe@googlegroups.com.
To post to this group, send email to django...@googlegroups.com.
Visit this group at https://groups.google.com/group/django-users.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-users/a5a8d1ab-f4e0-a6b5-b1da-acc9dc2dbf9d%40djangodeployment.com.
For more options, visit https://groups.google.com/d/optout.

Alex Heyden

May 27, 2017, 11:10:43 PM

to django...@googlegroups.com

Seven years ago, that may very well have been a true statement. I wouldn't stand by it today. Disk caching should be perfectly stable, even if it's a pretty poor solution for the average use case.

The part that hasn't changed is memcached. Memcached should be everyone's default caching solution, not just on Django, but everywhere. You use memcached until you've proven you have a use case that makes memcached explicitly the wrong choice, and even then, you have someone else sanity check your assumptions.

Melvyn Sopacua

May 28, 2017, 10:03:42 AM

to django...@googlegroups.com

On Saturday 27 May 2017 12:25:17 Antonis Christofides wrote:

> The question is where to store the results of the calculation. I
> thought that since they are re-creatable, the cache would be the
> appropriate place. However, there is a difference with the more
> common use of the cache: they are re-creatable, but they are also
> necessary. You can't just go and delete any item in the cache. This
> will cripple the website, which expects to find the calculation
> results in the cache.

What you're describing is not a cache, but a key/value store. A cache knows how to obtain content that is not in there.

Since you're offloading the calculation, you should not delete the old contents until the new ones can be written. The swap should be atomic, block reads while it happens, and be fast.
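If the results are kept as plain files, one common way to get such a swap is to write to a temporary file and rename it over the old one. This is a sketch of that different technique, not of what was described above: readers are never blocked, they simply see either the old or the new file. The function names and the JSON format are assumptions, and the rename is atomic on POSIX filesystems.

```python
import json
import os
import tempfile


def write_result_atomically(path, result):
    """Write `result` as JSON to `path` without ever exposing a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target,
    # otherwise the rename cannot be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(result, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # old content stays visible until this call
    except BaseException:
        os.unlink(tmp_path)  # clean up the temporary file on any failure
        raise


def read_result(path):
    """Read back a result written by write_result_atomically."""
    with open(path) as f:
        return json.load(f)
```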

Another strategy is to keep two copies of each entry. When updating the first, lock it; asynchronous readers go to the second. Then unlock the first, lock and write the second, and you're done.

This can be done with one store and 2 keys or - in a larger environment - with replicated stores, where things pretty much work automagically.
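A rough sketch of the "one store and 2 keys" idea follows. It is a variant, not the exact scheme above: instead of locks it uses a third key that points at the slot readers should use, and it assumes a single writer. The dict and all key names are hypothetical stand-ins for a real key/value store.

```python
store = {}  # stand-in for any key/value store (hypothetical)


def write_entry(name, value):
    """Write into the slot readers are not using, then switch readers to it."""
    active = store.get(f"{name}:active", "b")
    inactive = "a" if active == "b" else "b"
    store[f"{name}:{inactive}"] = value  # readers still see the old slot
    store[f"{name}:active"] = inactive   # one small write flips the pointer


def read_entry(name):
    """Return the value in the currently active slot, or None if never written."""
    active = store.get(f"{name}:active")
    if active is None:
        return None
    return store[f"{name}:{active}"]
```

After each update the previous value remains in the other slot until the next write, so a reader that fetched the pointer just before the flip still finds a complete value.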

--

Melvyn Sopacua

Antonis Christofides

May 29, 2017, 7:07:23 AM

to django...@googlegroups.com

Hello,

thanks to everyone who replied. Here are some conclusions of mine:

Today's file-based cache code seems to suffer from the same problems it had 7 years ago. Every time you call .set(), it asks the OS for a list of the cache files, just to count them (for the purpose of culling). This is slow. The culling strategy is to delete a random sample of cache entries. So Russell's comment seems valid today, at least with respect to culling. Of Django's included cache backends, apparently only memcached is suitable for a large cache in production. Redis could be a good idea for adding persistence, but it is non-standard (not included with Django).
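The effect of this culling on the 100 pieces of land can be illustrated with a toy model. This is a simplified simulation, not Django's actual code; the class name is invented, and the limits of 300 entries and a cull frequency of 3 are Django's documented defaults.

```python
import random


class CullingStore:
    """Toy model of a size-limited cache that culls randomly on set()."""

    def __init__(self, max_entries=300, cull_frequency=3):
        self.max_entries = max_entries
        self.cull_frequency = cull_frequency
        self.entries = {}

    def set(self, key, value):
        self._cull()  # runs on every set(), before the write
        self.entries[key] = value

    def _cull(self):
        count = len(self.entries)
        if count < self.max_entries:
            return
        # Delete a random 1/cull_frequency of the entries.
        doomed = random.sample(list(self.entries), count // self.cull_frequency)
        for key in doomed:
            del self.entries[key]


store = CullingStore()
for field in range(100):      # 100 pieces of land
    for item in range(4):     # 4 items stored for each
        store.set((field, item), "result")

missing = 400 - len(store.entries)
print(missing)  # 100 of the 400 required items are gone
```

Which 100 items disappear is random, so the site would fail for a different set of fields after each recalculation.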

Redis is not appropriate for my use case anyway: I don't need the speed, so storing the information in RAM, which costs more than the filesystem, is suboptimal.

The fact that a cache knows how to get the information if it doesn't have it is an interesting observation that I hadn't thought about, but it appears to be true for most uses of "cache" that I can think of (it doesn't apply to write caches). Therefore I'm using the cache for a different purpose than the one for which it was designed, which can create all sorts of problems (such as a new administrator, or even an old one, not knowing or forgetting that they can't just delete the cache). However, I will take the risk and continue using it for a while, since for these two small projects implementing a more complicated solution, or adding another component and thus raising the bar for other people to replace me, isn't worth it.

Antonis Christofides
http://djangodeployment.com
