Regarding locking of periodic tasks

Regarding locking of periodic tasks vineeth sagar 8/28/18 11:51 PM
I tried reading about the locking mechanism in the task cookbook, but I didn't understand it at all. Below is the code for reference, along with my doubts. Could anyone be kind enough to explain them?

1) Say the first worker instance is created. It calls the memcache_lock function, which yields True, and I think (I am not sure) the finally block will also be executed, and the if inside the finally block will run. So doesn't this delete the lock even before the first worker starts execution? Is there something I am missing? I know I must be wrong somewhere; if this were the case it wouldn't be in the docs.

2) The second thing I feel icky about is the lock expiry. Say I put 10 minutes as an expiry: what will happen if the lock expires before the process finishes its execution? The comments say this won't happen, but I just don't see it.

I have a periodic task that calls an API to populate the DB. It's critical that I have only one task running at a time. I think a simple locking scheme like a binary semaphore would be sufficient, but I like this solution better because it doesn't make any inherent assumptions about the architecture; it will work in almost any setting. So I want to use this solution, but I want to understand it before I integrate it into my code.

If my questions are not clear, please ask me, as English is not my first language and I tend to be verbose.

regards 
vineeth
from celery import task
from celery.five import monotonic
from celery.utils.log import get_task_logger
from contextlib import contextmanager
from django.core.cache import cache
from hashlib import md5
from djangofeeds.models import Feed

logger = get_task_logger(__name__)

LOCK_EXPIRE = 60 * 10  # Lock expires in 10 minutes

@contextmanager
def memcache_lock(lock_id, oid):
    timeout_at = monotonic() + LOCK_EXPIRE - 3
    # cache.add fails if the key already exists
    status = cache.add(lock_id, oid, LOCK_EXPIRE)
    try:
        yield status
    finally:
        # memcache delete is very slow, but we have to use it to take
        # advantage of using add() for atomic locking
        if monotonic() < timeout_at and status:
            # don't release the lock if we exceeded the timeout
            # to lessen the chance of releasing an expired lock
            # owned by someone else
            # also don't release the lock if we didn't acquire it
            cache.delete(lock_id)

@task(bind=True)
def import_feed(self, feed_url):
    # The cache key consists of the task name and the MD5 digest
    # of the feed URL.
    feed_url_hexdigest = md5(feed_url.encode()).hexdigest()  # md5() needs bytes on Python 3
    lock_id = '{0}-lock-{1}'.format(self.name, feed_url_hexdigest)
    logger.debug('Importing feed: %s', feed_url)
    with memcache_lock(lock_id, self.app.oid) as acquired:
        if acquired:
            return Feed.objects.import_feed(feed_url).url
    logger.debug(
        'Feed %s is already being imported by another worker', feed_url)
Re: Regarding locking of periodic tasks Josue Coronel 8/29/18 11:59 AM
I think the main issue here is understanding how the context manager enters and exits: https://docs.python.org/3/library/contextlib.html#contextlib.contextmanager

What happens inside the task is:
1. Create string `feed_url_hexdigest`
2. Create `lock_id`
3. Log a debug line
4. Enter `memcache_lock`
  # This is inside `memcache_lock`
  4.1. Initialize the expiry timestamp
  4.2. Try to create the lock with `cache.add`
  4.3. yield the lock `status`
# here we go back to the `import_feed` task
5. IF the lock was acquired
  5.1. Import the feed and return its URL
# when the `with` block exits (whether or not the lock was acquired) the contextmanager resumes right after `yield status`, inside the `finally:` block
  4.4. IF the current timestamp is less than `timeout_at` AND `status` is true (we created the lock)
  4.4.1. Delete the lock.
6. Log the "already being imported" debug line

As you can see, when we enter the `with` statement the context manager function executes everything up to `yield`.
The `yield` statement is wrapped in a try/finally block, so if any issue occurs in the `with` body we still end up in the `finally` block.
The `finally` block also runs when the code exits the `with` statement normally; it is how the context manager cleans up after itself at the end of the `with` statement.
If a task that created the lock runs for more than 10 minutes then the lock will only be deleted from the django cache but the memcache lock will still be there.
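To make the entry/exit order concrete, here is a minimal standalone sketch (the names are made up; no Celery or Django involved) showing that the code after `yield` runs only when the `with` block exits, not at the moment the lock is acquired:

```python
from contextlib import contextmanager

events = []

@contextmanager
def fake_lock():
    events.append("acquired")      # runs when the `with` statement is entered
    try:
        yield True                 # control transfers to the `with` body here
    finally:
        events.append("released")  # runs only once the `with` body has exited

with fake_lock() as acquired:
    events.append("working")       # the "lock" is still held at this point

print(events)  # ['acquired', 'working', 'released']
```

So in the real task, `cache.delete(lock_id)` only runs after `Feed.objects.import_feed(...)` has returned (or raised), which is why the lock is not deleted before the work starts.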

In your case you have a task that populates a database. I think a memcache lock alone will not be sufficient, because this kind of lock only makes sure there aren't multiple tasks running at the same time. I think what you need is to first check whether the database has already been populated, and only populate it if it hasn't; if it has, do nothing.
If you only use the memcache lock from the example, another database-population task might still run after the first one finishes.
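That check-then-populate idea can be sketched like this (the `"db-populated"` flag, the `cache` mapping, and the `populate` callable are all hypothetical names, not part of the original code):

```python
def populate_db_once(cache, populate):
    """Run `populate` only if it has not already completed.

    `cache` is any mapping-like store shared by all workers;
    `populate` is the function that fills the database.
    """
    if cache.get("db-populated"):
        return False              # someone already did the work; do nothing
    populate()
    cache["db-populated"] = True  # record *completion*, not just "in progress"
    return True
```

Combined with the memcache lock, the lock ensures only one worker runs `populate` at a time, and the completion flag prevents a second run after the first one finishes.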
Re: [celery-users] Re: Regarding locking of periodic tasks vineeth sagar 8/30/18 4:11 AM
Thank you very much. After reading the contextmanager link the whole code makes a ton of sense. Also, this lock should be enough for me, because the DB is initialised with Django before the tasks are run.

>If a task that created the lock runs for more than 10 minutes then the lock will only be deleted from the django cache but the memcache lock will still be there.

Sorry, but can you explain this to me? What do you mean it's deleted from the Django cache but exists in memcached? My Django project uses memcached as a cache backend, so if it's not in the Django cache it shouldn't be in memcached, right? I am completely new to the whole memcached/Redis scene, so this is a bit confusing for me. I think the solution would be to have an expiry long enough that the lock won't expire before the task has completed its execution. One more locking strategy I saw was locking with Redis; that seems a bit more intuitive, if that makes sense? Would that be better? Can you share your thoughts?

Also, at my workplace I have been pushing people to use Celery, as it feels like the correct way to do things. What are the things I/we should not assume as beginners, so it doesn't come back to bite us in the ass?

 Again thank you very much man, you are a savior.


Re: [celery-users] Re: Regarding locking of periodic tasks Josue B. C. 8/30/18 10:40 AM
Sorry about that last part, I didn't re-read my answer before submitting and it's confusing.
I didn't mean it'll stay there, I was thinking about another case.
What I meant is that if the lock has already expired by the time the code exits the `with` statement, then the lock is simply left in place. The lock is not released if the timeout has been reached, or if it was not created by the task that's running. The good thing is that the Django cache automatically expires the record once the timeout is reached, so that cleanup is done for you.
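The `cache.add` semantics being relied on here can be sketched with a toy in-memory stand-in (this mirrors the behaviour; it is not Django's implementation): `add` succeeds only when the key is absent or expired, so a "dangling" expired lock does not block anything, it just gets taken over by the next `add`:

```python
import time

class TinyCache:
    """Toy add-if-absent cache with per-key expiry (stand-in for cache.add)."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def add(self, key, value, timeout):
        entry = self._data.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return False  # key exists and has not expired: add fails
        self._data[key] = (value, time.monotonic() + timeout)
        return True

    def delete(self, key):
        self._data.pop(key, None)

cache = TinyCache()
assert cache.add("lock", "worker-1", timeout=0.05) is True   # acquired
assert cache.add("lock", "worker-2", timeout=0.05) is False  # held: add fails
time.sleep(0.06)
assert cache.add("lock", "worker-2", timeout=0.05) is True   # expired: reusable
```

This is also why the comment in the original snippet only worries about deleting a lock owned by someone else: once expired, the key behaves as absent and the next `add` takes it over.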

Now, about using a lock with Redis. I wouldn't say it's more or less intuitive, but what I do like about it is that the lock can be shared across workers. A memcache is only shared among processes that have access to the same memory address space, and that's usually not true for different workers. Note I'm talking about workers, not worker processes (or threads).
If you're using Django, I would suggest configuring your cache store to be Redis. That way you get the best of both worlds and you're not using Redis ONLY to create task locks, but everywhere in your app, for every caching need.
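For reference, pointing Django's default cache at Redis looks roughly like this (a sketch assuming the `django-redis` package; the backend path and URL are illustrative and depend on your setup):

```python
# settings.py (sketch): use Redis for Django's default cache,
# so cache.add()-based locks are shared across all workers.
CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": "redis://127.0.0.1:6379/1",
    }
}
```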

There are a few things to take into consideration when first diving into Celery (I am actually writing an article about this). Here are a few general ideas, in no particular order:
1. Celery is a library. It is very powerful, but it's not meant to solve all your problems out of the box; it only gives you the building blocks.
2. It's better to use a proper AMQP broker than something like Redis.
3. Think in parallel/distributed terms. Meaning: assume you don't know what the order of the tasks will be, and that one task will not know about the other tasks.
4. Autoscaling is hard. Celery does offer an autoscale feature, but chances are you'll have to tailor autoscaling to your needs, so it's better to implement your own autoscale strategy.
5. Make sure you retry tasks correctly and use exponential backoffs.
6. Test your tasks assuming they are stand-alone functions.
7. Keep tasks simple and separate functionality; it's easier to debug.
8. If using Celery Beat, only use it as an intermediary to queue the actual tasks instead of running your code inside a beat task. This way distributed computing and scaling are easier.
9. IF you really need to run tasks in a specific order, use workflows, and keep your workflows as simple as possible.
10. Use Flower or signals to monitor your workers; it's really useful to know when a worker goes offline.
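Point 5 above, sketched as a plain function. Celery itself can compute this for you (via `autoretry_for` plus `retry_backoff=True` on the task decorator), so this only shows the shape of the schedule; the base, cap, and jitter choices here are illustrative:

```python
import random

def backoff_delay(retries, base=2, cap=300, jitter=False):
    """Seconds to wait before retry number `retries` (0-based):
    1, 2, 4, 8, ... doubling each time, capped at `cap` seconds."""
    delay = min(base ** retries, cap)
    if jitter:
        # spread retries out so many failing tasks don't retry in lockstep
        delay = random.uniform(0, delay)
    return delay

print([backoff_delay(n) for n in range(6)])  # [1, 2, 4, 8, 16, 32]
```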

Hope this helps. 
Re: [celery-users] Re: Regarding locking of periodic tasks vineeth sagar 8/30/18 11:15 AM
Wow, thank you very much for this. It was very kind of you to take the time to write such a well-thought-out answer. This has cleared up many things for me. Yes, Redis does everything memcached does; we will probably migrate to Redis. I still have to read more and more about it. Really appreciate this.

Re: [celery-users] Re: Regarding locking of periodic tasks Josue B. C. 8/30/18 12:24 PM
Sure thing, no problem :)