Lost track of future?


Nick Vandewiele

Oct 10, 2014, 2:40:19 PM
to scoop...@googlegroups.com
Hi,

I am using SCOOP for a producer-consumer problem in which a producer process generates data that should be consumed by parallel consumer processes (SCOOP Future objects). A Queue.Queue object is used to connect the producer to the consumers.

A consumer thread (modeled with threading.Thread) contains an infinite loop that queries the queue for more work and submits futures whenever a datum is found in the queue. The consumer thread is set to daemon state, so that when there is no work left and the producer process has terminated, it dies as well:

from threading import Thread
from scoop import futures

class ConsumerProcess(Thread):
    def run(self):
        global buffer_queue, launches
        while True:
            item = buffer_queue.get()                  # blocks until the producer puts a datum
            launch = futures.submit(consume, item)     # dispatch the datum to a SCOOP worker
            launches.append(launch)

The futures' results are then collected in the broker process as follows:
product_data = [completed.result() for completed in launches]
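
Roughly, the main-thread side looks like this (simplified sketch, not the actual callback_q.py; the placeholder consume, the crude wait, and the final print are only for illustration):

# Sketch of the main-thread side; run with: python -m scoop callback_q.py
import time
import Queue          # 'queue' on Python 3

buffer_queue = Queue.Queue()
launches = []

def consume(item):
    return item       # placeholder for the real per-datum work

if __name__ == '__main__':
    consumer = ConsumerProcess()       # the daemon thread defined above
    consumer.daemon = True             # dies together with the main thread
    consumer.start()

    for datum in [0, 1, 2, 3, 4]:      # the producer puts data on the queue
        buffer_queue.put(datum)

    while len(launches) < 5:           # crude wait until every datum has been submitted
        time.sleep(0.1)

    # Gather the results in the broker (root) process.
    product_data = [completed.result() for completed in launches]
    print(product_data)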

After running the code on a single node with 8 CPUs, I analyzed the output (see below) and noticed "Lost track of future" messages produced by the scoopzmq module.
Any idea why this occurs and how it can be prevented?

regards,
Nick

The output (I am using the latest commit of SCOOP: 68872eb3fa8b):

nickv@pharos:~/scoop/test_queue$ python -m scoop.__main__  callback_q.py
[2014-10-10 14:14:01,565] launcher  INFO    SCOOP 0.7.2 dev on linux2 using Python 2.6.5 (r265:79063, Oct  1 2012, 22:04:36) [GCC 4.4.3], API: 1013
[2014-10-10 14:14:01,566] launcher  INFO    Deploying 8 worker(s) over 1 host(s).
[2014-10-10 14:14:01,566] launcher  INFO    Worker distribution:
[2014-10-10 14:14:01,566] launcher  INFO       127.0.0.1:       7 + origin
Launching 8 worker(s) using /bin/bash.
[2014-10-10 14:14:02,153] callback_q (127.0.0.1:53224) INFO    source data: [0, 1, 2, 3, 4]
[2014-10-10 14:14:02,154] callback_q (127.0.0.1:53224) INFO    *** Main thread waiting...
[2014-10-10 14:14:02,154] callback_q (127.0.0.1:53224) INFO    Produced 4
[2014-10-10 14:14:02,154] callback_q (127.0.0.1:53224) INFO    Produced 3
[2014-10-10 14:14:02,155] callback_q (127.0.0.1:53224) INFO    Produced 2
[2014-10-10 14:14:02,156] callback_q (127.0.0.1:53224) INFO    Produced 1
[2014-10-10 14:14:02,156] callback_q (127.0.0.1:53224) INFO    Produced 0
[2014-10-10 14:14:02,156] callback_q (127.0.0.1:53224) INFO    Producer thread is released!
[2014-10-10 14:14:02,157] callback_q (127.0.0.1:53224) INFO    Processing item 0
[2014-10-10 14:14:02,160] callback_q (127.0.0.1:59059) INFO    Processing item 4
[2014-10-10 14:14:02,161] callback_q (127.0.0.1:59175) INFO    Processing item 3
[2014-10-10 14:14:02,169] callback_q (127.0.0.1:49676) INFO    Processing item 2
[2014-10-10 14:14:02,170] callback_q (127.0.0.1:53240) INFO    Processing item 1
[2014-10-10 14:14:05,173] scoopzmq  (127.0.0.1:53224) WARNING Lost track of future ('127.0.0.1:53224', 1):consume(4,){}=None. Resending it...
[2014-10-10 14:14:05,173] scoopzmq  (127.0.0.1:53224) WARNING Lost track of future ('127.0.0.1:53224', 2):consume(3,){}=None. Resending it...
[2014-10-10 14:14:05,174] callback_q (127.0.0.1:50002) INFO    Processing item 4
[2014-10-10 14:14:05,174] callback_q (127.0.0.1:50542) INFO    Processing item 3
[2014-10-10 14:14:07,167] callback_q (127.0.0.1:53224) INFO    product data: [4, 3, 2, 1, 0]
[2014-10-10 14:14:07,167] callback_q (127.0.0.1:53224) INFO    *** Done
[2014-10-10 14:14:07,167] callback_q (127.0.0.1:53224) INFO    Time in broker node: 5.01436805725 seconds
[2014-10-10 14:14:07,206] launcher  (127.0.0.1:59054) INFO    Root process is done.
[2014-10-10 14:14:07,206] launcher  (127.0.0.1:59054) INFO    Finished cleaning spawned subprocesses.

Yannick Hold-Geoffroy

Oct 13, 2014, 4:37:43 PM
to scoop...@googlegroups.com
Hello,

This is a new feature we're still working on. It allows re-launching lost tasks and is useful in situations where network losses or loss of computing resources are possible (such as volunteer computing).

This feature, even though it works, still needs some tweaking. Rarely, Futures are flagged as lost because the heartbeat containing the Futures list is processed before the Future gets deserialized, so the Future isn't found on any resource. This is a known, rare glitch that we're currently working on.

If you really want to prevent this, you can revert to the stable version of SCOOP or wait for the next release. In practice, though, this message is low-level information that should not even be displayed, and it does not change the actual output of the computation. In the worst case, a Future is computed twice and only the first result to arrive is kept, which shouldn't be an issue for most fine-grained parallel programs.
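
If the task has side effects, a simple precaution is to make it idempotent, so that a duplicated run is harmless. Something along these lines (the per-item output file is only an illustration):

# Illustrative only: a consume() that is safe to run twice for the same datum.
import os

def consume(item):
    out_path = 'result_%d.txt' % item        # hypothetical per-datum output file
    if not os.path.exists(out_path):         # a resent duplicate finds the work already done
        with open(out_path, 'w') as handle:
            handle.write(str(item))
    return item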

Have a nice day,
Yannick Hold


Néstor Fabián Aguirre Castiblanco

Jul 26, 2017, 7:21:43 PM
to scoop-users
Hello,

I would like to know if anyone has overcome this problem. I have the same problem with SCOOP version 0.7.2.0 (see below), but I do not observe it with version 0.7.1.1. Unfortunately, I need other improvements that are only available in the latest version. I would greatly appreciate any help.

All the best,
Nestor

[2017-07-26 16:49:16,747] scoopzmq  (192.168.2.23:52308) WARNING Lost track of future ('192.168.2.23:52308', 4):KFoldCrossValidation_runWorker((<Model.Model instance at 0x2ae4d309ecf8>, <Optimizer.Optimizer object at 0x2ae4d2f73cd0>),){}=None. Resending it...
(MainThread) Lost track of future ('192.168.2.23:52308', 4):KFoldCrossValidation_runWorker((<Model.Model instance at 0x2ae4d309ecf8>, <Optimizer.Optimizer object at 0x2ae4d2f73cd0>),){}=None. Resending it...
Traceback (most recent call last):
  File "/usr/projects/hpcsoft/toss2/common/anaconda/4.1.1-python-2.7/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/projects/hpcsoft/toss2/common/anaconda/4.1.1-python-2.7/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "build/bdist.linux-x86_64/egg/scoop/bootstrap/__main__.py", line 298, in <module>
  File "build/bdist.linux-x86_64/egg/scoop/bootstrap/__main__.py", line 92, in main
  File "build/bdist.linux-x86_64/egg/scoop/bootstrap/__main__.py", line 285, in run
  File "build/bdist.linux-x86_64/egg/scoop/bootstrap/__main__.py", line 266, in futures_startup
  File "build/bdist.linux-x86_64/egg/scoop/futures.py", line 65, in _startup
  File "build/bdist.linux-x86_64/egg/scoop/_control.py", line 273, in runController
  File "build/bdist.linux-x86_64/egg/scoop/_types.py", line 359, in pop
  File "build/bdist.linux-x86_64/egg/scoop/_types.py", line 382, in updateQueue
  File "build/bdist.linux-x86_64/egg/scoop/_comm/scoopzmq.py", line 352, in recvFuture
  File "build/bdist.linux-x86_64/egg/scoop/_comm/scoopzmq.py", line 269, in _recv
  File "build/bdist.linux-x86_64/egg/scoop/_comm/scoopzmq.py", line 369, in sendFuture
  File "build/bdist.linux-x86_64/egg/scoop/encapsulation.py", line 164, in pickleFileLike
IOError: File not open for reading
[2017-07-26 16:49:16,816] launcher  (127.0.0.1:42167) INFO    Root process is done.
[2017-07-26 16:49:16,816] workerLaunch (127.0.0.1:42167) DEBUG   Closing workers on wf535 (4 workers).
[2017-07-26 16:49:16,816] brokerLaunch (127.0.0.1:42167) DEBUG   Closing local broker.
[2017-07-26 16:49:16,816] launcher  (127.0.0.1:42167) INFO    Finished cleaning spawned subprocesses.