
Multiprocessing.Queue - I want to end.


Luis Zarrabeitia

Apr 30, 2009, 4:49:10 PM
to pytho...@python.org

Hi. I'm building a script that closely follows a producer-consumer model. In
this case, the producer is disk-bound and the consumer is cpu-bound, so I'm
using the multiprocessing module (python2.5 with the multiprocessing backport
from google.code) to speed up the processing (two consumers, one per core,
and one producer). The consumers are two multiprocessing.Process instances,
the producer is the main script, and the data is sent using a
multiprocessing.Queue instance (with bounded capacity).

The problem: when there is no more data to process, how can I signal the
consumers to consume until the queue is empty and then stop consuming? I need
them to do some clean-up work after they finish (and then I need the main
script to summarize the results).

Currently, the script looks like this:

===
from multiprocessing import Queue, Process

def consumer(filename, queue):
    outfile = open(filename, 'w')
    for data in iter(queue.get, None):
        process_data(data, outfile) # stores the result in the outfile
    outfile.close()
    cleanup_consumer(filename)

if __name__ == "__main__":
    queue = Queue(100)
    p1 = Process(target=consumer, args=("file1.txt", queue))
    p2 = Process(target=consumer, args=("file2.txt", queue))
    p1.start(); p2.start()
    for item in read_from_disk(): # this is the disk-bound operation
        queue.put(item)
    queue.put(None); queue.put(None)
    p1.join() # Wait until both consumers finish their work
    p2.join()
    # Tried to put this one before... but then the 'get' raises
    # an exception, even if there are still items to consume.
    queue.close()
    summarize() # very fast, no need to parallelize this.
===

As you can see, I'm sending one 'None' per consumer, and hoping that no
consumer will read more than one None. While this particular implementation
ensures that, it is very fragile. Is there any way to signal the consumers?
(or better yet, the queue itself, as it is shared by all consumers?)
Should "close" work for this? (raise the exception when the queue is
exhausted, not when it is closed by the producer).

--
Luis Zarrabeitia (aka Kyrie)
Fac. de Matemática y Computación, UH.
http://profesores.matcom.uh.cu/~kyrie

MRAB

Apr 30, 2009, 5:57:48 PM
to pytho...@python.org
The producer could send just one None to indicate that it has finished
producing.

Each consumer could get the data from the queue, but if it's None then
put it back in the queue for the other consumer, then clean up and
finish.

When all the consumers have finished, the queue will contain just the
single None.
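
A minimal sketch of this scheme, reusing the names from the original post
(process_data and cleanup_consumer come from there; the producer sends a
single None after the real data):

def consumer(filename, queue):
    outfile = open(filename, 'w')
    while True:
        data = queue.get()
        if data is None:
            queue.put(None)  # put the sentinel back for the other consumers
            break
        process_data(data, outfile)
    outfile.close()
    cleanup_consumer(filename)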

Cameron Simpson

Apr 30, 2009, 6:37:44 PM
to MRAB, pytho...@python.org
On 30Apr2009 22:57, MRAB <goo...@mrabarnett.plus.com> wrote:

> Luis Zarrabeitia wrote:
>> The problem: when there is no more data to process, how can I signal
>> the consumers to consume until the queue is empty and then stop
>> consuming? I need them to do some clean-up work after they finish (and
>> then I need the main script to summarize the results)
[...]

>> As you can see, I'm sending one 'None' per consumer, and hoping that no
>> consumer will read more than one None. While this particular
>> implementation ensures that, it is very fragile. [...]

>>
> The producer could send just one None to indicate that it has finished
> producing.
>
> Each consumer could get the data from the queue, but if it's None then
> put it back in the queue for the other consumer, then clean up and
> finish.
>
> When all the consumers have finished, the queue will contain just the
> single None.

And as it happens I have an IterableQueue class right here which does
_exactly_ what was just described. You're welcome to it if you like.
Added bonus is that, as the name suggests, you can use the class as
an iterator:

for item in iterq:
    ...

The producer calls iterq.close() when it's done.

I'll clean up the formatting and add a bunch of missing docstrings if
anyone wants it...

Cheers,
--
Cameron Simpson <c...@zip.com.au> DoD#743
http://www.cskk.ezoshosting.com/cs/

Indeed! But do not reject these teachings as false because I am crazy. The
reason that I am crazy is because they are true.
- Malaclypse the Younger

Cameron Simpson

May 1, 2009, 1:22:26 AM
to pytho...@python.org, Jesse Noller
On 01May2009 08:37, I wrote:
| On 30Apr2009 22:57, MRAB <goo...@mrabarnett.plus.com> wrote:
| > The producer could send just one None to indicate that it has finished
| > producing.
| > Each consumer could get the data from the queue, but if it's None then
| > put it back in the queue for the other consumer, then clean up and
| > finish.
| > When all the consumers have finished, the queue will contain just the
| > single None.
|
| And as it happens I have an IterableQueue class right here which does
| _exactly_ what was just described. You're welcome to it if you like.
| Added bonus is that, as the name suggests, you can use the class as
| an iterator:
| for item in iterq:
| ...
| The producer calls iterq.close() when it's done.

Someone asked, so code appended below.

import sys
from Queue import Queue   # Python 2 stdlib queue; swap in multiprocessing's
                           # Queue class to match this thread's use case

class IterableQueue(Queue):
    ''' A Queue obeying the iterator protocol.
        Note: Iteration stops when a None comes off the Queue.
    '''

    def __init__(self, *args, **kw):
        ''' Initialise the queue.
        '''
        Queue.__init__(self, *args, **kw)
        self.__closed = False

    def put(self, item, *args, **kw):
        ''' Put an item on the queue.
        '''
        assert not self.__closed, "put() on closed IterableQueue"
        assert item is not None, "put(None) on IterableQueue"
        return Queue.put(self, item, *args, **kw)

    def _closeAtExit(self):
        if not self.__closed:
            self.close()

    def close(self):
        ##logFnLine("%s.close()" % (self,), frame=sys._getframe(1))
        if self.__closed:
            # this should be a real log message
            print >>sys.stderr, "close() on closed IterableQueue"
        else:
            self.__closed = True
            Queue.put(self, None)

    def __iter__(self):
        ''' Iterable interface for the queue.
        '''
        return self

    def next(self):
        item = self.get()
        if item is None:
            Queue.put(self, None)   # leave the sentinel for other iterators
            raise StopIteration
        return item
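
A hedged usage sketch against the original example (consumer, read_from_disk
and friends come from the first post; running the class on top of
multiprocessing's Queue rather than the threading one is an assumption here,
though the original poster later reports that it works):

queue = IterableQueue(100)
p1 = Process(target=consumer, args=("file1.txt", queue))
p2 = Process(target=consumer, args=("file2.txt", queue))
p1.start(); p2.start()
for item in read_from_disk():
    queue.put(item)
queue.close()            # enqueues the single None sentinel
p1.join(); p2.join()

# and the consumer body shrinks to:
def consumer(filename, queue):
    outfile = open(filename, 'w')
    for data in queue:   # next() re-posts the None, then raises StopIteration
        process_data(data, outfile)
    outfile.close()
    cleanup_consumer(filename)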

Hendrik van Rooyen

May 1, 2009, 4:49:22 AM
to pytho...@python.org
"Luis Zarrabeitia" <akak...@uh.cu> wrote:

8< -------explanation and example of one producer, --------
8< -------more consumers and one queue --------------------

>As you can see, I'm sending one 'None' per consumer, and hoping that no
>consumer will read more than one None. While this particular implementation

You don't have to hope. You can write the consumers that way to guarantee it.

>ensures that, it is very fragile. Is there any way to signal the consumers?

Signalling is not easy - you can signal a process, but I doubt if it is
possible to signal a thread in a process.

>(or better yet, the queue itself, as it is shared by all consumers?)
>Should "close" work for this? (raise the exception when the queue is
>exhausted, not when it is closed by the producer).

I haven't the foggiest if this will work, and it seems to me to be kind
of involved compared to passing a sentinel or sentinels.

And while we are on the subject - Passing None as a sentinel is IMO as
good as or better than passing "XXZulu This is the End uluZXX",
or any other imaginative string that is not likely to occur naturally
in the input.

I have always wondered why people do the one queue many getters thing.

Given that the stuff you pass is homogenous in that it will require a
similar amount of effort to process, is there not a case to be made
to have as many queues as consumers, and to round robin the work?
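
For what it's worth, a minimal sketch of that round-robin arrangement,
borrowing consumer and read_from_disk from the original post:

from itertools import cycle, izip
from multiprocessing import Process, Queue

queues = [Queue(100), Queue(100)]
workers = [Process(target=consumer, args=("file%d.txt" % (i + 1), q))
           for i, q in enumerate(queues)]
for w in workers:
    w.start()

# deal the work units out to the queues in turn
for item, q in izip(read_from_disk(), cycle(queues)):
    q.put(item)

for q in queues:
    q.put(None)   # one sentinel per queue; no consumer can steal another's
for w in workers:
    w.join()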

And if the stuff you pass around needs disparate effort to consume,
it seems to me that you can more easily balance the load by having
specialised consumers, instead of instances of one humungous
"I can eat anything" consumer.

I also think that having a queue per consumer thread makes it easier
to replace the threads with processes and the queues with pipes or
sockets if you need to do serious scaling later.

In fact I happen to believe that anything that does any work needs
one and only one input queue and nothing else, but I am peculiar
that way.

- Hendrik


Aaron Brady

May 1, 2009, 5:51:04 PM
On Apr 30, 3:49 pm, Luis Zarrabeitia <ky...@uh.cu> wrote:
> Hi. I'm building a script that closely follows a producer-consumer model. In
> this case, the producer is disk-bound and the consumer is cpu-bound, so I'm
> using the multiprocessing module (python2.5 with the multiprocessing backport
> from google.code) to speed up the processing (two consumers, one per core,
> and one producer). The consumers are two multiprocessing.Process instances,
> the producer is the main script, and the data is sent using a
> multiprocessing.Queue instance (with bounded capacity).
>
> The problem: when there is no more data to process, how can I signal the
> consumers to consume until the queue is empty and then stop consuming? I need
> them to do some clean-up work after they finish (and then I need the main
> script to summarize the results)
snip

>     for data in iter(queue.get, None):
>         process_data(data, outfile) # stores the result in the outfile
snip
>     queue.put(None); queue.put(None)
snip

> As you can see, I'm sending one 'None' per consumer, and hoping that no
> consumer will read more than one None. While this particular implementation
> ensures that, it is very fragile. Is there any way to signal the consumers?
> (or better yet, the queue itself, as it is shared by all consumers?)
> Should "close" work for this? (raise the exception when the queue is
> exhausted, not when it is closed by the producer).

You may have to write the consumer loop by hand, rather than using
'for'. In the same-process case, you can do this.

producer:
    sentinel = object()

consumer:
    while True:
        item = queue.get()
        if item is sentinel:
            break
        # etc.

Then, each consumer is guaranteed to consume no more than one
sentinel, and thus producing one sentinel per consumer will halt them
all.

However, with multiple processes, the comparison to 'sentinel' will
fail, since each subprocess gets a copy, not the original, of the
sentinel. A sample program which sent the same object multiple times
produced this output:

<object object at 0x00B8A388>
<object object at 0x00B8A3A0>

Theoretically, you could send a shared object, which would satisfy the
identity test in the subprocess. That failed with this exception:

File "c:\programs\python30\lib\multiprocessing\queues.py", line 51,
in __getstate__
assert_spawning(self)
...
RuntimeError: Queue objects should only be shared between processes th
rough inheritance

As a result, your options are more complicated. I think the best
option is to send a tuple with the data. Instead of sending 'item',
send '( True, item )'. Then when the producer is finished, send
'( False, <any> )'. The consumer will break when it encounters a
'False' first value.
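
A minimal sketch of the tagged-tuple protocol (queue, read_from_disk and
process_data as in the original post; re-posting the stop marker borrows
MRAB's trick so a single marker suffices):

# producer
for item in read_from_disk():
    queue.put((True, item))
queue.put((False, None))          # single stop marker

# consumer
while True:
    more, data = queue.get()
    if not more:
        queue.put((False, None))  # pass the marker on, then quit
        break
    process_data(data, outfile)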

An alternative is to spawn a watchman thread in each subprocess, which
merely blocks for a shared Event object, then sets a per-process
variable, then adds a dummy object to the queue. The dummy is
guaranteed to be added after the last of the data. Each process is
guaranteed to consume no more than one dummy, so they will all wake
up.
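
A hedged sketch of the watchman-thread variant (done_event is an assumed
multiprocessing.Event inherited by each consumer process; plain None serves
as the dummy):

import threading

def watchman(done_event, queue):
    # runs inside each consumer process; wakes the consumer loop
    # once the producer signals completion
    done_event.wait()
    queue.put(None)               # lands after the last real item

def consumer(filename, queue, done_event):
    t = threading.Thread(target=watchman, args=(done_event, queue))
    t.setDaemon(True)
    t.start()
    outfile = open(filename, 'w')
    for data in iter(queue.get, None):  # one dummy per process ends each loop
        process_data(data, outfile)
    outfile.close()
    cleanup_consumer(filename)

# producer, after its last queue.put(item):
#     done_event.set()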

If you don't like those, you could just use a time-out: on each get()
time-out, check a shared variable, such as a one-element array, and
check whether the queue is empty. If the shared variable is True and
the queue is empty, there is no more data.
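
And a sketch of the time-out variant, assuming done is a shared
multiprocessing.Value('b', 0) that the producer sets to 1 after its last put:

from Queue import Empty   # multiprocessing's get() raises this on time-out

def consumer(filename, queue, done):
    outfile = open(filename, 'w')
    while True:
        try:
            data = queue.get(timeout=1)
        except Empty:
            if done.value and queue.empty():
                break     # producer finished and the queue is drained
            continue
        process_data(data, outfile)
    outfile.close()
    cleanup_consumer(filename)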

I'm curious how these work and what you decide.

Roel Schroeven

May 1, 2009, 6:04:45 PM
Hendrik van Rooyen wrote:

> I have always wondered why people do the one queue many getters thing.

Because IMO it's the simplest and most elegant solution.


>
> Given that the stuff you pass is homogenous in that it will require a
> similar amount of effort to process, is there not a case to be made
> to have as many queues as consumers, and to round robin the work?

Could work if the processing time for each work unit is exactly the same
(otherwise one or more consumers will be idle part of the time), but in
most cases that is not guaranteed. A simple example is fetching data
over the network: even if the data size is always the same, there will
be differences because of network load variations.

If you use one queue, each consumer fetches a new work unit as soon as it
has consumed the previous one. All consumers will be working as long as
there is work to do, without having to write any code to do the load
balancing.

With one queue for each consumer, you either have to assume that the
average processing time is the same (otherwise some consumers will be
idle at the end, while others are still busy processing work units), or
you need some clever code in the producer(s) or the driving code to
balance the loads. That's extra complexity for little or no benefit.

I like the simplicity of having one queue: the producer(s) put work
units on the queue with no concern which consumer will process them or
how many consumers there even are; likewise the consumer(s) don't know
and don't need to know where their work units come from. And the work
gets automatically distributed to whichever consumer has first finished
its previous work unit.

> And if the stuff you pass around needs disparate effort to consume,
> it seems to me that you can more easily balance the load by having
> specialised consumers, instead of instances of one humungous
> "I can eat anything" consumer.

If there is a semantic difference, maybe yes; but I think it makes no
sense to differentiate purely on the expected execution times.

> I also think that having a queue per consumer thread makes it easier
> to replace the threads with processes and the queues with pipes or
> sockets if you need to do serious scaling later.

Perhaps, but isn't that a case of YAGNI and/or premature optimization?


--
The saddest aspect of life right now is that science gathers knowledge
faster than society gathers wisdom.
-- Isaac Asimov

Roel Schroeven

Hendrik van Rooyen

May 2, 2009, 6:38:01 AM
to pytho...@python.org
: "Roel Schroeven" <rschroev_...@fastmail.fm> wrote:


> Hendrik van Rooyen wrote:
> > I have always wondered why people do the one queue many getters thing.
>
> Because IMO it's the simplest and most elegant solution.

That is fair enough...

> >
> > Given that the stuff you pass is homogenous in that it will require a
> > similar amount of effort to process, is there not a case to be made
> > to have as many queues as consumers, and to round robin the work?
>
> Could work if the processing time for each work unit is exactly the same
> (otherwise one or more consumers will be idle part of the time), but in
> most cases that is not guaranteed. A simple example is fetching data
> over the network: even if the data size is always the same, there will
> be differences because of network load variations.
>
> If you use one queue, each consumer fetches a new work unit as soon as it
> has consumed the previous one. All consumers will be working as long as
> there is work to do, without having to write any code to do the load
> balancing.
>
> With one queue for each consumer, you either have to assume that the
> average processing time is the same (otherwise some consumers will be
> idle at the end, while others are still busy processing work units), or
> you need some clever code in the producer(s) or the driving code to
> balance the loads. That's extra complexity for little or no benefit.
>
> I like the simplicity of having one queue: the producer(s) put work
> units on the queue with no concern which consumer will process them or
> how many consumers there even are; likewise the consumer(s) don't know
> and don't need to know where their work units come from. And the work
> gets automatically distributed to whichever consumer has first finished
> its previous work unit.

This is all true in the case of a job that starts, runs and finishes.
I am not so sure it applies to something that has a long life.

>
> > And if the stuff you pass around needs disparate effort to consume,
> > it seems to me that you can more easily balance the load by having
> > specialised consumers, instead of instances of one humungous
> > "I can eat anything" consumer.
>
> If there is a semantic difference, maybe yes; but I think it makes no
> sense to differentiate purely on the expected execution times.

The idea is basically that you have the code that classifies in one
place only, instead of running in all the instances of the consumer.
Feels better to me, somehow.

>
> > I also think that having a queue per consumer thread makes it easier
> > to replace the threads with processes and the queues with pipes or
> > sockets if you need to do serious scaling later.
>
> Perhaps, but isn't that a case of YAGNI and/or premature optimization?

Yes and no:

Yes - You Are Gonna Need It.
and
No it is never premature to use a decent structure.

:-)

- Hendrik


Roel Schroeven

May 2, 2009, 2:19:29 PM
Hendrik van Rooyen wrote:
> : "Roel Schroeven" <rschroev_...@fastmail.fm> wrote:

>> ...

> This is all true in the case of a job that starts, runs and finishes.
> I am not so sure it applies to something that has a long life.

It's true that I'm talking about work units with relatively short
lifetimes, mostly a few seconds but perhaps maximum about ten minutes. I
assumed that queues are mostly used for that kind of stuff. I've never
really thought about cases where that assumption doesn't hold, so it's
very well possible that all I've said is invalid in other cases.

>>> And if the stuff you pass around needs disparate effort to consume,
>>> it seems to me that you can more easily balance the load by having
>>> specialised consumers, instead of instances of one humungous
>>> "I can eat anything" consumer.
>> If there is a semantic difference, maybe yes; but I think it makes no
>> sense to differentiate purely on the expected execution times.
>
> The idea is basically that you have the code that classifies in one
> place only, instead of running in all the instances of the consumer.
> Feels better to me, somehow.

In most cases that I can imagine (and certainly in all cases I've used),
no classification whatsoever is even needed.

Dave Angel

May 2, 2009, 4:46:16 PM
to pythonlist
Hendrik van Rooyen wrote:

> : "Roel Schroeven" <rschroev_...@fastmail.fm> wrote:
>
>
>
>> Hendrik van Rooyen wrote:
>>
>>> I have always wondered why people do the one queue many getters thing.
>>>
>> Because IMO it's the simplest and most elegant solution.
>>
>
> That is fair enough...

>
>
>>> Given that the stuff you pass is homogenous in that it will require a
>>> similar amount of effort to process, is there not a case to be made
>>> to have as many queues as consumers, and to round robin the work?
>>>
>> Could work if the processing time for each work unit is exactly the same
>> (otherwise one or more consumers will be idle part of the time), but in
>> most cases that is not guaranteed. A simple example is fetching data
>> over the network: even if the data size is always the same, there will
>> be differences because of network load variations.
>>
>> If you use one queue, each consumer fetches a new work unit as soon as it
>> has consumed the previous one. All consumers will be working as long as
>> there is work to do, without having to write any code to do the load
>> balancing.
>>
>> With one queue for each consumer, you either have to assume that the
>> average processing time is the same (otherwise some consumers will be
>> idle at the end, while others are still busy processing work units), or
>> you need some clever code in the producer(s) or the driving code to
>> balance the loads. That's extra complexity for little or no benefit.
>>
>> I like the simplicity of having one queue: the producer(s) put work
>> units on the queue with no concern which consumer will process them or
>> how many consumers there even are; likewise the consumer(s) don't know
>> and don't need to know where their work units come from. And the work
>> gets automatically distributed to whichever consumer has first finished
>> its previous work unit.
>>
>
> This is all true in the case of a job that starts, runs and finishes.
> I am not so sure it applies to something that has a long life.
>
>
>>> And if the stuff you pass around needs disparate effort to consume,
>>> it seems to me that you can more easily balance the load by having
>>> specialised consumers, instead of instances of one humungous "I can
>>> eat anything" consumer.
>>>
>> If there is a semantic difference, maybe yes; but I think it makes no
>> sense to differentiate purely on the expected execution times.
>>
>
> The idea is basically that you have the code that classifies in one
> place only, instead of running in all the instances of the consumer.
> Feels better to me, somehow.
> <snip>
If the classifying you're doing is just based on expected time to
consume the item, then I think your plan to use separate queues is
misguided.

If the consumers are interchangeable in their abilities, then feeding
them from a single queue is more efficient, both on average wait time,
worst-case wait time, and on consumer utilization, in nearly all
non-pathological scenarios. Think of the line at the bank. There's a good
reason they now have a single line for multiple tellers. If you have
five tellers, and one of your transactions is really slow, the rest of
the line is only slowed down by 20%, rather than a few people being
slowed down by a substantial amount because they happen to be behind the
slowpoke. 30 years ago, they'd have individual lines, and I tried in
vain to explain queuing theory to the bank manager.

Having said that, notice that sometimes the consumers in a computer are
not independent. If you're running 20 threads with this model, on a
processor with only two cores, and if the tasks are CPU bound, you're
wasting lots of thread-management time without gaining anything.
Similarly, if there are other shared resources that all the threads trip
over, it may not pay to do as many of them in parallel. But for
homogeneous consumers, do them with a single queue, and do benchmarks to
determine the optimum number of consumers to start.

Message has been deleted

Luis Alberto Zarrabeitia Gomez

May 4, 2009, 12:56:26 AM
to pytho...@python.org

Quoting Hendrik van Rooyen <ma...@microcorp.co.za>:

> "Luis Zarrabeitia" <akak...@uh.cu> wrote:
>
> 8< -------explanation and example of one producer, --------
> 8< -------more consumers and one queue --------------------
>

> >As you can see, I'm sending one 'None' per consumer, and hoping that no
> >consumer will read more than one None. While this particular implementation
>
>

> You don't have to hope. You can write the consumers that way to guarantee
> it.

I did. But that solution is not very reusable (I _must_ remember that
implementation detail every time) and, most importantly, I'll have to remember
it in a few months when I'm updating the code.


> >ensures that, it is very fragile. Is there any way to signal the consumers?
>

> Signalling is not easy - you can signal a process, but I doubt if it is
> possible to signal a thread in a process.
>

> >(or better yet, the queue itself, as it is shared by all consumers?)
> >Should "close" work for this? (raise the exception when the queue is
> >exhausted, not when it is closed by the producer).
>

> I haven't the foggiest if this will work, and it seems to me to be kind
> of involved compared to passing a sentinel or sentinels.

Well, that would be a valid signal. Too bad I have to pass it by hand, instead
of the queue class taking care of it for me.

> I have always wondered why people do the one queue many getters thing.
>

> Given that the stuff you pass is homogenous in that it will require a
> similar amount of effort to process, is there not a case to be made
> to have as many queues as consumers, and to round robin the work?

Abstraction. This problem is modeled nicely as a producer-consumer (in fact, it
would be a classic producer-consumer). I could take care of the scheduling
myself, but there are already good scheduling algorithms written for my OS that
take into account both the available CPU and IO.

A solution may not be a queue (in my case, I don't care about the order in
which the elements are processed, only that they are), but ideally I would just
be 'yielding' results on my producer(s), and receiving them on my consumer(s),
leaving the IPC mechanism to deal with how to move the data from producers to
consumers (and to which consumers).

> And if the stuff you pass around needs disparate effort to consume,
> it seems to me that you can more easily balance the load by having
> specialised consumers, instead of instances of one humungous
> "I can eat anything" consumer.

Not necessarily. The load may depend on the size of the data that was sent. The
consumers are receiving the same kind of data, only the sizes are different
(unpredictably so). Again, I could try to implement some heuristics to guess
which processor has the lower load, but I'd rather delegate that to the OS.

> I also think that having a queue per consumer thread makes it easier
> to replace the threads with processes and the queues with pipes or
> sockets if you need to do serious scaling later.

This is already multiprocess. It could be nice to extend it to multi-computers
later, but the complexity is not worth it right now.

> In fact I happen to believe that anything that does any work needs
> one and only one input queue and nothing else, but I am peculiar
> that way.

Well, I also need some output. In my case, the outputs are files with the result
of the processing, that can be summarized later (hence the need to 'join' the
processes, to know when I can summarize them).

Thank you.

--
Luis Zarrabeitia
Facultad de Matemática y Computación, UH
http://profesores.matcom.uh.cu/~kyrie


Luis Alberto Zarrabeitia Gomez

May 4, 2009, 1:02:36 AM
to Dennis Lee Bieber, pytho...@python.org

Quoting Dennis Lee Bieber <wlf...@ix.netcom.com>:

> I'm not familiar with the multiprocessing module and its queues but,
> presuming it behaves similar to the threading module AND that you have
> design control over the consumers (as you did in the sample) make a
> minor change.
>
> queue.put(None) ONCE in the producer
>
> Then, in the consumer, after it sees the None and begins shutdown
> processing, have the consumer ALSO do
>
> queue.put(None)
>

Thank you. I went with this idea, except that instead of modifying the consumer,
I modified the queue itself... well, Cameron Simpson did :D. It's working nicely now.

ryl...@gmail.com

May 4, 2009, 1:21:31 AM
> You may have to write the consumer loop by hand, rather than using
> 'for'.  In the same-process case, you can do this.
>
> producer:
> sentinel= object( )
>
> consumer:
> while True:
>   item= queue.get( )
>   if item is sentinel:
>     break
>   etc.
>
> Then, each consumer is guaranteed to consume no more than one
> sentinel, and thus producing one sentinel per consumer will halt them
> all.
>
> However, with multiple processes, the comparison to 'sentinel' will
> fail, since each subprocess gets a copy, not the original, of the
> sentinel.

Rather than use object() you can create a type whose instances are
equal.

class Stopper(object):
    def __eq__(self, other):
        return type(other) == type(self)

producer's stop():
    queue.put(Stopper())

consumer's main loop:
    for item in iter(queue.get, Stopper()):
        ...

Hendrik van Rooyen

May 4, 2009, 4:01:23 AM
to Luis Alberto Zarrabeitia Gomez, pytho...@python.org

When you know what you want to do with the output (which
I suspect is your case) then the trick is to put the individual
outputs onto the input queue of the summary process.

This will form a virtual (or real if you have different machines)
systolic array with producers feeding consumers that feed
the summary process, all running concurrently.
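
A hedged sketch of that shape, chaining a second queue into a summary stage
(process_data is assumed here to return its result rather than write a file,
which departs from the original post; accumulate is hypothetical):

from multiprocessing import Process, Queue

work_q = Queue(100)       # producer -> consumers
result_q = Queue(100)     # consumers -> summary

def consumer(work_q, result_q):
    for data in iter(work_q.get, None):
        result_q.put(process_data(data))

def summary(result_q):
    for result in iter(result_q.get, None):
        accumulate(result)   # hypothetical incremental summary step

# The producer feeds work_q, sends one None per consumer, joins the
# consumers, then sends a final None down result_q so the summary
# stage terminates too.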

Ponder that, in the description above, only the word "feed" is used...

And if you don't know where your output must go, then it must be
put onto the input queue of your default router, who must know
what to do with stuff. - this is in a sense what happens when
you pipe stdout to stdin in a *nix environment - you use the OS
to set up a temporary systolic array.

You only need to keep the output of the consumers in files if
you need access to it later for some reason. In your case it sounds
as if you are only interested in the output of the summary.

- Hendrik


Luis Zarrabeitia

May 4, 2009, 9:52:14 AM
to pytho...@python.org
On Monday 04 May 2009 04:01:23 am Hendrik van Rooyen wrote:

> This will form a virtual (or real if you have different machines)
> systolic array with producers feeding consumers that feed
> the summary process, all running concurrently.

Nah, I can't do that. The summary process is expensive, but not nearly as
expensive as the consuming (10 minutes vs. a few hours), and can't be started
anyway before the consumers are done.

> You only need to keep the output of the consumers in files if
> you need access to it later for some reason. In your case it sounds
> as if you are only interested in the output of the summary.

Or if the summarizing process requires that it is stored on files. Or if the
consumers naturally store the data on files. The consumers "produce" several
gigabytes of data, not enough to make it intractable, but enough to not want
to load them into RAM or transmit it back.

In case you are wondering what the job is: I'm indexing a lot of documents
with Xapian. The producer reads the [compressed] documents from the hard
disk, the consumers process them and index them in their own Xapian databases.
When they are finished, I merge the databases (the summarizing) and delete the
partial DBs. I don't need the DBs to be in memory at any time, and xapian
works with files anyway. Even if I were to use different machines (not worth
it for a process that will not run very frequently, except during development),
it would still be cheaper to scp the files.

Now, if I only had a third core available to consume a bit faster ...

Regards,

Luis Alberto Zarrabeitia Gomez

May 4, 2009, 12:38:42 AM
to pytho...@python.org

Quoting Cameron Simpson <c...@zip.com.au>:

> | And as it happens I have an IterableQueue class right here which does
> | _exactly_ what was just described. You're welcome to it if you like.
> | Added bonus is that, as the name suggests, you can use the class as
> | an iterator:
> | for item in iterq:
> | ...
> | The producer calls iterq.close() when it's done.
>
> Someone asked, so code appended below.

> [...]

Thank you!. I tested it, and it seems to work... and having the queue be an
iterable is a plus. Thank you, Cameron & MRAB!
