If there were a way to ensure that a block of instructions always runs right
through uninterrupted, we could just use spin locks. If the thread holding the
lock is always guaranteed to finish what it's doing within a given number of
instructions/cycles, then other threads could just spin while waiting.
What do you think?
Yes, I want:
1. MCAS (for M nonadjacent words)
2. Support for multiple-producer/single-consumer queues
3. Support for multiple-producer/single-consumer stacks
4. Support for work-stealing deques
5. Support for system-wide memory fence
6. Support for reference counting
Dmitriy V'jukov
I think you should not be so afraid of programming with locks. If
you use them properly, you get atomic operations of any size you want.
They have been supported on almost all computer architectures for decades.
Lock-free and spinning mania is very characteristic of those who did not
do any practical concurrent programming before. That is what I think.
Best Regards,
Szabolcs
> Lock-free and spinning mania is very characteristic of those who did not
> do any practical concurrent programming before...
... or have made some performance/scalability tests involving locks on
modern multicore/SMP machines.
Lock-based programming is very characteristic of those who started
programming decades ago on inherently uniprocessor machines and are now
too old to learn again.
That is what I think.
Dmitriy V'jukov
I think that you have reinvented the wheel. This has been called a critical
region for decades, and it can be implemented with mutexes, semaphores,
or spinlocks. Some languages even provide it for you at the language
level. A spinlock is not always the best way to implement it, however.
That is what I think.
Best Regards,
Szabolcs
You can't guarantee that a critical region will run uninterrupted in user
space, whether you use mutexes, semaphores, spinlocks, or anything else.
Dmitriy V'jukov
> Or perhaps a way to set a block of instructions to execute
> atomically, i.e. a dozen or so instructions that can't be interrupted.
If the region is a dozen instructions, it's OK. But it's user code, and the
user can put zillions of instructions into the critical region. So the
OS/hardware must still have some means to preempt the critical region.
Consequently, the user has to handle preemption in some way.
So imho it's better to take Joe Seigh's approach:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/9ae84ddea7d81505
And this can be done with only OS support.
Dmitriy V'jukov
> If there were a way to ensure that a block of instructions always runs right
> through uninterrupted, we could just use spin locks. If the thread holding the
> lock is always guaranteed to finish what it's doing within a given number of
> instructions/cycles, then other threads could just spin while waiting.
>
> What do you think?
I think this is much heavier than what is actually needed, and
consequently will likely perform very badly. All that is needed is to
block *conflicting* accesses to the same data, and there is no reason
to prevent the block from being pre-empted if there is something more
important to do.
The basic problem is this: which other threads should spin? Every
single other thread on every other CPU on the system? If so, this
would be a disaster. If only some, which?
The problem you've created has already been solved with a nearly-
optimal solution. It's called a spinlock.
DS
> I've been following this ng and reading up a fair bit on lock-free
> programming and it seems to me to be a very difficult and subtle thing to
> grasp. And lots of gotchas. So it got me wondering, do we need more help
> from the CPUs?
[...]
Well, luckily for me, existing CPUs have the support I need to implement
my library. So, I personally do not need anything more. Although, I would
not mind if CPU vendors stuck a PLO instruction in there...
> If there was a way to ensure that a block of instructions always runs
> right through uninterupted we could just use spin locks.
[...]
There is the Perform Locked Operation (PLO) instruction described here:
http://publibz.boulder.ibm.com/epubs/pdf/dz9zr002.pdf
(Appendix A/Multi-Processing Examples/Perform Locked Operation)
That is good enough for lock-based STM. Pretty cool.
> Fubar wrote:
> ...
>> what do you think?
>
> I think you should not be so afraid of programming with locks. If you
> use them properly, you get atomic operations of any size you want. They
> have been supported on almost all computer architectures for decades.
Sure.
> Lock-free and spinning mania is very characteristic of those who did not
> do any practical concurrent programming before. That is what I think.
Are you doing a little trolling today?
;^)
> On Feb 1, 5:54 pm, "Fubar" <fu...@google.com> wrote:
>> I've been following this ng and reading up a fair bit on lock-free
>> programming and it seems to me to be a very difficult and subtle thing
>> to grasp. And lots of gotchas. So it got me wondering, do we need more
>> help from the CPUs? Bigger better atomic ops?
[...]
>> what do you think?
>
>
> Yes, I want:
> 1. MCAS (for M nonadjacent words)
[...]
I think I would prefer a PLO instruction over MCAS...
Citation?
A hashtable of locks? Are you suggesting I didn't think about that? I know
how to optimize locks; in fact, I am all about a clever marriage between
lock-based and lock-free programming.
BTW, sorry for feeding the troll.
Just out of pure curiosity... Do you know how to create a "conventional"
monitor from scratch? You will be using an interlocked RMW instruction with
memory barriers. What if I can show you a *single-producer/consumer solution
that has none of those very expensive instructions on the fast-path? The
slow-path being the queue empty/full conditions, which threads wait on. Yes,
I know this must confuse you a bit. See, lock-free algorithms can block on
conditions... Hint: eventcount...
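(To make the shape of such a fast-path concrete, here is a minimal sketch in
today's C11 notation; the names are mine, and this illustrates the general
single-producer/single-consumer technique, not the actual library code. On
x86, the acquire/release operations below compile to plain MOVs; there is no
interlocked RMW anywhere on the fast-path.)
_____________________________________________________
/* Sketch: bounded SPSC ring; push/pop use only plain loads and
   stores plus acquire/release ordering. Hypothetical names. */
#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE 1024          /* power of two, chosen arbitrarily */

typedef struct {
    _Atomic size_t head;        /* advanced by the consumer only */
    _Atomic size_t tail;        /* advanced by the producer only */
    void *slot[RING_SIZE];
} spsc_ring;

/* Producer: returns 0 on success, -1 if the ring is full. */
static int spsc_push(spsc_ring *q, void *item) {
    size_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (t - h == RING_SIZE) return -1;      /* full: slow-path waits */
    q->slot[t % RING_SIZE] = item;
    /* Release: the slot write becomes visible before the new tail. */
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return 0;
}

/* Consumer: returns NULL if the ring is empty. */
static void *spsc_pop(spsc_ring *q) {
    size_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (h == t) return NULL;                /* empty: slow-path waits */
    void *item = q->slot[h % RING_SIZE];
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return item;
}
_____________________________________________________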
IMHO, your monitor will most likely be based on very expensive sync
instructions; no doubt. If not, I am always ready to learn... Well, I
already know about biased locking, try and invent something new... PLEASE!
_____
*Before you flame me on the fact that a single-producer/consumer solution is
limiting, please take into account that I can convert it into
multiple-producer/consumer with no problems whatsoever. I would like to see
a code example showing a queue implementation that uses no synchronization
instructions performing markedly worse than one that does. You are free to
set up the conditions. Now, fire away with both barrels, Sir.
BTW, you did diagnose me with a serious mental disorder right? What's up
Doc?
;^)
> IMHO, your monitor will most likely be based on very expensive sync
> instructions; no doubt. If not, I am always ready to learn... Well, I
> already know about biased locking, try and invent something new... PLEASE!
The relevant point is that many programs which benefit from multiple threads
and SMP spend 99% of their time doing something other than working with the
actual contended data. Optimize all that locking away into perfection, and
you still only have a net 1% win. (As always, profile stuff before worrying
about whether to optimize the locking)
I want my kernels and memory allocator to be as fast as possible, whether
lock-free, wait-free, or just locking. But for the lower-contention spots
(e.g. the central
queue), I don't care whether the program spends 0.5% of its time popping
items, or 0.3%. If I think popping items takes too long relative to the
total processing time, I'll just pop off two at a time.
> multiple-producer/consumer with no problems whatsoever. I would like to
> see a code example showing a queue implementation that uses no
> synchronization instructions performing markedly worse than one that does.
> You
I'd be more interested in an example where a real server as a whole, not
just the producer/consumer patterns, performs remarkably better or worse
depending on the way its locking of the queue is managed.
>> multiple-producer/consumer with no problems whatsoever. I would like to
>> see a code example showing a queue implementation that uses no
>> synchronization instructions performing markedly worse than one that does.
>> You
> I'd be more interested in an example where a real server as a whole, not
> just the producer/consumer patterns, performs remarkably better or worse
> depending on the way its locking of the queue is managed.
There are cases where lock-free programming can be of service to major parts
of an application's functionality; one simple example: a main-memory database.
A database server usually has to perform a whole lot of searching operations,
right? Would you agree that optimizations to the search methods are of
utmost importance because they address a major function of the application as
a whole? Therefore, IMO, if you can make searching go faster, you make an
important aspect of the application go faster; agreed? Lock-free algorithms
can be a perfect fit in this scenario. Especially the ones that deal with
ultra-low-overhead reader patterns...
What about reference counting? If your entire application makes very heavy
use of ref-counting, then optimizations to that counting should have an
overall positive effect; no?
What about shared collections of data? If your application uses a lot of
shared collections, then making enhancements to those collections should be
worthwhile...
Lock-free programming can also be used to simplify overall designs because
it can _eliminate_ the possibility of deadlocking...
Shall I go on?
> I'd be more interested in an example where a real server as a whole, not
> just the producer/consumer patterns, performs remarkably better or worse
> depending on the way its locking of the queue is managed.
There are not only producer/consumer queues. There are memory
allocation/deallocation, object lifetime management/reference
counting, read-only access to global data (settings, caches, routing
tables etc), write access to global data (session list, statistics, db
connection pool etc), centralized/decentralized scheduler/load-
balancer and so on.
So it is not just 0.3% access to shared structures and 99.7% local
processing. Often you basically always access some shared structure. For
example, take a network packet router. The only local processing is packet
parsing, which is very fast. Packet routing involves read-only access to the
routing table. Statistics involve write access to a shared structure. Memory
allocation involves access to a shared structure. Load-balancing involves
access to a shared structure. And so on.
And all these things can be made significantly more scalable and faster
with lock-free/atomic-free algorithms.
Dmitriy V'jukov
You claim those guys are far away from programming practice; I am going to
have to disagree. For instance, how does creating high-performance memory
allocators not fit in with programming practice? What about creating very
low-overhead shared collections? Nah, nobody uses collections in practice,
right?
>> You will be using an interlocked RMW instruction with memory barriers.
>> What if I can show you a *single-producer/consumer solution that has
>> none of those very expensive instructions on the fast-path?
>
> Doesn't matter. If you hand something to a consumer, the consumer almost
> every time consumes this item for a period magnitudes longer than the
> fetch takes. So the performance advantage of your solution vanishes in
> the measurement error.
What about a message-passing language? If you can reduce the overhead of
sending and receiving messages, then you're probably going to reap an overall
performance boost in a key area. For instance, implementations of Erlang
can benefit from a low-overhead queuing mechanism; agreed?
>> The slow-path being queue empty/full conditions which threads
>> wait on. Yes, I know this must confuse you a bit. ...
>
> Doesn't confuse me; I see the sense lock-free programming makes in
> practice.
I am not so sure that you do...
>> IMHO, your monitor will most likely be based on very expensive
>> sync instructions; no doubt. ...
>
> Boy, get a life and see that these algorithms give a performance
> advantage in practice only in very rare cases. So your solution is
> only advantageous in synthetic benchmarks.
I don't think so:
http://groups.google.com/group/comp.programming.threads/msg/097c0fe56dda4582
In normal, run-of-the-mill circumstances I do use locks. However, I am
working with real-time audio, and the consensus from the other programmers I
know is that you shouldn't use locks in the audio thread. When the soundcard
interrupt handler is firing every 2 or 3 ms and you're running at 80 to 90%
CPU, then any blockage for a ms or more can cause audio dropouts. And you're
pretty much at the mercy of the OS scheduler to resolve any contention, which
may or may not get resolved in time for you to fill the next buffer for the
soundcard.
That's pretty much what I mean: a spinlock, just that it would be guaranteed
to finish whatever is being done while the lock is held.
@spin
UI_LOCK lockvar // if acquired, thread becomes uninterruptible
if not acquired goto spin
// do something quick! no more than 10-20 instructions?
UI_UNLOCK lockvar // uninterruptible section finished
That would make it a lock-free spin lock. But it would be much easier to
build larger lock-free concurrency with this than with current lock-free
methods.
If I understand you correctly, IBM already has that in their PLO
instruction:
http://publibz.boulder.ibm.com/epubs/pdf/dz9zr002.pdf
(Appendix A/Multi-Processing Examples/Perform Locked Operation)
Is that somewhere near what you have in mind?
I think he means that the critical-section will not be pre-empted.
> And there's no need to spin, look at Intel's MONITOR and
> MWAIT instructions.
Those are intended for operating system usage. The MWAIT instruction puts
the processor to sleep, not a thread. In a scenario in which you have four
threads on two CPUs, calling MWAIT would cut the processing power in
half and coalesce three threads onto the other processor...
If I understand correctly, it is similar: a combination of a single CAS and
multiple write/read operations, I think? But all in one instruction?
I was meaning something more general than that, tbh, where you could code
your own block of instructions to be done.
For me, 'lock-free' is not about performance; it's about not blocking in
real-time threads.
Now if I'd explained it like that, everyone would have known what I was on
about!
Yeah, but that's exactly what I mean. I was kind of thinking a CAS, or
similar, takes around 140 cycles on x86, so the CPU can obviously delay
interrupt handling long enough to complete that? So why not long enough to
complete a small section of general instructions? It could be an exception
not to unlock within ~12 instructions or so, and any exception would
naturally unlock the non-preemptable thingmebob.
There's no possible way to guarantee you won't get audio dropouts if you're
coding on a general-purpose OS. There's simply too much going on.
What you can do is minimize the chances of it happening, which is where
lock-free methods come in.
If the audio thread (high priority) is waiting for the GUI thread (low
priority) to release a lock, you are at the mercy of the OS to resolve the
priority inversion. That could take a ms or more, maybe half your timeslice
for filling the current audio buffer for the soundcard.
The same thing with lock-free might cost 3 or 4 hundred cycles. Next to
nothing in comparison.
Wait-free algorithms do:
http://groups.google.com/group/comp.arch/msg/5dba6fee05aa07ea
http://groups.google.com/group/comp.sys.super/msg/642b6b608294e426
(I briefly mention lock-free wrt hard real-time near the end of the post)
I would personally go with loop-less (e.g., wait-free) algorithms, _not_
lock-free, for real-time. Wait-free techniques can help you meet the
deadlines demanded by a hard real-time system...
I would welcome an instruction like that with open arms.
Yes. Hmm, take a look at the Solaris schedctl(...) facility. It provides a
hint to the scheduler to extend the time-slice of a thread. Some lock-based
STM implementations use it in order to reduce the chance of preemption within
a critical-section...
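(For reference, a minimal sketch of how that hint is typically used; the
schedctl calls are the real Solaris API, the surrounding code is illustrative
and error handling is omitted:)
_____________________________________________________
/* Sketch: asking the Solaris scheduler not to preempt us inside a
   short critical section. These are hints only; preemption can
   still happen, so correctness must not depend on them. */
#include <schedctl.h>

static schedctl_t *sc;          /* one per thread */

void thread_setup(void) {
    sc = schedctl_init();
}

void short_critical_section(void) {
    schedctl_start(sc);         /* hint: extend my time-slice */
    /* ... a few quick operations on shared data ... */
    schedctl_stop(sc);          /* hint withdrawn */
}
_____________________________________________________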
For instance, a wait-free single-producer/consumer queue is ideal for
communications between the audio and GUI threads. I would prefer it over a
lock-free queue in this scenario.
I bet that pisses Elcaro off real good!
;^)
That's basically it. The instruction would work perfectly with STM
implementations. I don't know why all of those papers on the subject never
mention the PLO instruction...
> I was meaning something more general than that, tbh, where you could code
> your own block of instructions to be done.
Right. Okay. That would be useful.
>
> complete a small section of general instructions? It could be an exception
> not to unlock within ~12 instructions or so, and any exception would
> naturally unlock the non-preemptable thingmebob.
>
Because the execution of 12 instructions may take an almost indefinite
amount of time (think cache misses). Also, what about imminent DMA from
external devices? Etc.
I'm not saying that these problems are unsolvable - just that you need a
better designed solution before someone considers it seriously :)
I'm not sure what you mean, tbh. But I don't think any more 'bus locking' is
required than we already have for CAS. The lock is grabbed from a single
memory location in exactly the same way as you would acquire a spinlock. Once
that happens, all that's needed is that the thread become uninterruptible for
a hundred or so cycles. I assume a CPU is capable of doing this simply
because CAS and its ilk take 140 cycles. But I don't know if this is true.
I don't see how it's much different than a slightly more powerful CAS, tbh.
>>
>> complete a small section of general instructions? It could be an
>> exception not to unlock within ~12 instructions or so, and any exception
>> would naturally unlock the non-preemptable thingmebob.
>>
> Because the execution of 12 instructions may take an almost indefinite
> amount of time (think cache misses). Also, what about imminent DMA from
> external devices? Etc.
How is this handled with CAS? It takes around 140 cycles on my CPU, so what
happens when IRQs fire during one of those?
> I'm not saying that these problems are unsolvable - just that you need a
> better designed solution before someone considers it seriously :)
I don't know enough about CPU architecture to design it any better. ;)
> That's pretty much what I mean: a spinlock, just that it would be
> guaranteed to finish whatever is being done while the lock is held.
What would the advantage of that be? A normal spinlock prevents
conflicting code from running, and why wouldn't you want to let non-
conflicting code run?
> @spin
> UI_LOCK lockvar // if acquired, thread becomes uninterruptible
> if not acquired goto spin
> // do something quick! no more than 10-20 instructions?
> UI_UNLOCK lockvar // uninterruptible section finished
>
> That would make it a lock-free spin lock. But it would be much easier to
> build larger lock-free concurrency with this than with current lock-free
> methods.
How is this different from disabling interrupts? And, again, why would
you want this? Lower-priority work won't interrupt higher-priority
work anyway. And if there's something more important to do, wouldn't
you rather do it?
DS
> If the audio thread (high priority) is waiting for the GUI thread (low
> priority) to release a lock, you are at the mercy of the OS to resolve the
> priority inversion. That could take a ms or more, maybe half your timeslice
> for filling the current audio buffer for the soundcard.
There are two possibilities:
1) The audio thread can't make further progress until it has data the
GUI thread could provide it. In this case, allowing the audio thread
to waste CPU cycles is at best not harmful.
2) The audio thread can make further progress, but for some reason
it's blocked acquiring a lock that doesn't protect any data it needs
to access. It's hard to imagine how a competent programmer could
arrange this to happen if he wanted to.
Why would the GUI thread hold a lock the audio thread needs unless the
two have something to do with each other? And if the two have
something to do with each other, why is it more important that the audio
thread make progress? (Maybe the GUI thread is trying to cancel the
audio. Maybe it's trying to change the audio. In what imaginable
scenario is it better for the audio thread to keep going?)
DS
Either your OS sucks or your audio hardware sucks. Getting an audio
interrupt every 2-3ms is pretty ridiculous to begin with. For realtime
interactive editing you don't need to interrupt a stream more than
once every tenth of a second. When you're double-buffering, unless
your code is massively bloated there's no way you can miss a 100ms
window for filling and pointing to the next buffer. If you're not
double-buffering then you've missed something really obvious...
It's always crucial for the audio thread to keep going, because any
disturbance in its timeliness can cause audible clicks/pops that are
at best annoying, at worst physically damaging to amplifiers and
speakers in the chain.
> It's always crucial for the audio thread to keep going,
Only when it can usefully make forward progress.
> because any
> disturbance in its timeliness can cause audible clicks/pops that are
> at best annoying, at worst physically damaging to amplifiers and
> speakers in the chain.
Right, but assuming the CPU is fast enough to ever keep up with the
audio, there will be some times when you don't want the audio thread
to keep going. When it cannot usefully make forward progress without
getting a resource that is not currently available, it cannot keep
going.
If it doesn't need a resource, it shouldn't be blocking on it. If it
does need the resource, it simply cannot make forward progress while
that resource is not available.
The situation he's talking about is when the thread blocks on a
resource it does not need. It's hard to see how that could happen, but
if it does happen, the problem is not that it blocks such that a non-
blocking algorithm would fix it.
For example, if the audio thread is trying to pop chunks of data off a
queue that is filled by another thread, a lock-free queue won't help
if the data isn't put onto the queue fast enough. A normal lock with a
timed wait is a perfectly suitable option if an underflow is a
disaster.
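(A sketch of that last option, using POSIX calls; pop_head() is an assumed
helper that unlinks the first node, and error handling is trimmed:)
_____________________________________________________
/* Sketch: a blocking pop with a deadline, so an underflow is noticed
   in time to handle it (play silence, log, etc.). */
#include <pthread.h>
#include <errno.h>
#include <time.h>

typedef struct node node;
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
    node           *head;       /* NULL when empty */
} queue;

node *pop_head(queue *q);       /* assumed helper: unlink first node */

node *timed_pop(queue *q, const struct timespec *deadline) {
    node *n = NULL;
    pthread_mutex_lock(&q->lock);
    while (!q->head &&
           pthread_cond_timedwait(&q->nonempty, &q->lock, deadline)
               != ETIMEDOUT)
        ;                       /* woken: re-check the predicate */
    if (q->head)
        n = pop_head(q);        /* data arrived in time */
    pthread_mutex_unlock(&q->lock);
    return n;                   /* NULL signals the underflow */
}
_____________________________________________________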
DS
You forgot:
3) The audio thread needs to do something which *should* happen
when possible but doesn't have to happen every time, as long
as it happens sometimes, but it also *must* do something
related to the audio.
For example, one application is recording audio in a recording
studio. While it used to be that analog tape or dedicated digital
systems were commonly used for this, today it is extremely common
to use a regular computer.
In such an application, you are recording what might be a once-in-
a-lifetime performance, or what your customer might consider to
be one at least. So you must avoid dropouts at ALL COSTS, because
a dropout can ruin your only opportunity to capture that performance.
However, in addition to the hard requirement of capturing every
sample without fail, you also would like to, if you have enough
spare cycles, give some updates about what's going on to the
recording engineer. That means maybe providing some samples of
the recorded audio into a buffer so that the GUI thread can draw
a waveform, or draw some software VU meters (probably with peak
hold), or whatever. It's OK for this activity to miss a few samples
here and there as long as it generally keeps up.
The natural inclination is probably to think about eliminating
this problem by using big buffers to soften the deadlines. But
the problem here is that buffering equals latency, and latency
is a show stopper as well if you are going to be listening to (or
otherwise observing) the audio data that is being recorded through
the computer. You *certainly* cannot play back audio delayed by
too much to the musicians who you're recording, and the recording
engineer doesn't really want to hear delayed audio either.
- Logan
>> > Lock-free algorithms don't make any realtime guarantees.
>>
>> There's no possible way to guarantee you won't get audio dropouts if
>> you're coding on a general-purpose OS. There's simply too much going on.
>
>Either your OS sucks or your audio hardware sucks. Getting an audio
>interrupt every 2-3ms is pretty ridiculous to begin with.
To play virtual instruments live (samplers, synths, etc.) you need latency
*at the least* below 10 ms. If you want to add effects to an audio stream,
say to a live guitar, and monitor the results, you need a round trip under
10 ms, which means 5 ms in, 5 ms out.
Any longer than that and the delay is at best audible, at worst impossible
to work with.
If you've ever used VoIP and heard your own voice back 50 or so ms later,
you know how off-putting that can be; well, it's like that but a lot worse,
as it's a musical performance.
Even 10 ms is less than ideal.
It's what pro audio users expect these days, and it's the reason MS included
a new low-level audio API in Windows Vista.
> > 2) The audio thread can make further progress, but for some reason
> > it's blocked acquiring a lock that doesn't protect any data it needs
> > to access. It's hard to imagine how a competent programmer could
> > arrange this to happen if he wanted to.
> You forgot:
>
> 3) The audio thread needs to do something which *should* happen
> when possible but doesn't have to happen every time, as long
> as it happens sometimes, but it also *must* do something
> related to the audio.
3 is an example of 2. The audio thread can make further progress but
is blocked acquiring a lock that doesn't protect any data it needs to
access.
> For example, one application is recording audio in a recording
> studio. While it used to be that analog tape or dedicated digital
> systems were commonly used for this, today it is extremely common
> to use a regular computer.
>
> In such an application, you are recording what might be a once-in-
> a-lifetime performance, or what your customer might consider to
> be one at least. So you must avoid dropouts at ALL COSTS, because
> a dropout can ruin your only opportunity to capture that performance.
> However, in addition to the hard requirement of capturing every
> sample without fail, you also would like to, if you have enough
> spare cycles, give some updates about what's going on to the
> recording engineer. That means maybe providing some samples of
> the recorded audio into a buffer so that the GUI thread can draw
> a waveform, or draw some software VU meters (probably with peak
> hold), or whatever. It's OK for this activity to miss a few samples
> here and there as long as it generally keeps up.
That's a good application for either a "try to get the lock" function
that never blocks or a "try to get the lock for so long and no longer"
function. The only way you can screw this up is to let the audio
thread block indefinitely, but that would be an obvious rookie
programmer error.
> The natural inclination is probably to think about eliminating
> this problem by using big buffers to soften the deadlines. But
> the problem here is that buffering equals latency, and latency
> is a show stopper as well if you are going to be listening to (or
> otherwise observing) the audio data that is being recorded through
> the computer. You *certainly* cannot play back audio delayed by
> too much to the musicians who you're recording, and the recording
> engineer doesn't really want to hear delayed audio either.
Either the audio thread can make further progress or it can't. If it
can, it shouldn't block. If it can't, it must block.
DS
It can always make forward progress. If it can't, then it's designed wrong.
>> because any
>> disturbance in its timeliness can cause audible clicks/pops that are
>> at best annoying, at worst physically damaging to amplifiers and
>> speakers in the chain.
>
> Right, but assuming the CPU is fast enough to ever keep up with the
> audio, there will be some times when you don't want the audio thread
> to keep going. When it cannot usefully make forward progress without
> getting a resource that is not currently available, it cannot keep
> going.
That's not how it works. The audio thread is basically like a black box: it
processes a constant stream of audio and/or musical events. There is no
'can't make useful forward progress' unless streaming is stopped, i.e. the
user presses stop.
> The situation he's talking about is when the thread blocks on a
> resource it does not need. It's hard to see how that could happen, but
> if it does happen, the problem is not that it blocks such that a non-
> blocking algorithm would fix it.
>
> For example, if the audio thread is trying to pop chunks of data off a
> queue that is filled by another thread, a lock-free queue won't help
> if the data isn't put onto the queue fast enough. A normal lock with a
> timed wait is a perfectly suitable option if an underflow is a
> disaster.
I don't think you understand how audio streaming works. All the audio in and
out happens in the one thread. If your plugin processes audio, it will get
called with a buffer's worth of input data, process that, and return a
buffer's worth of output data.
The inter-thread communication is typically between the GUI and the audio
thread, and perhaps the odd worker thread. What that does is tell the audio
thread how to process the data stream.
Say you adjust a setting via the GUI and it needs to update a bunch of
coefficients, and they need to be updated atomically as a whole so the audio
effect doesn't explode in weird noises.
You can either lock and update all the coefficients from the GUI thread, or
you post the new coefficients via a lock-free queue and do the update in the
audio thread.
The former can completely stall the audio thread while it has the lock. The
latter lock-free method never can. The worst case with the lock-free method
is that the updated coefficients take a few ms longer to get applied.
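(A minimal sketch of that second pattern, with hypothetical names, reusing an
SPSC queue of the kind sketched earlier in the thread; the GUI never touches
the set the audio thread is currently using:)
_____________________________________________________
/* Sketch: the GUI posts freshly built coefficient sets; the audio
   thread adopts the newest one at a buffer boundary. Hypothetical
   names; spsc_push/spsc_pop as sketched earlier. */
typedef struct { double a0, a1, a2, b1, b2; } coeffs;

/* GUI thread: build a complete new set, then publish it whole. */
void gui_update(spsc_ring *to_audio, coeffs *fresh) {
    spsc_push(to_audio, fresh);
}

/* Audio thread, once per buffer: drain pending sets, keep the last,
   and ship superseded ones to a garbage queue for off-thread freeing. */
coeffs *poll_coeffs(spsc_ring *to_audio, spsc_ring *garbage,
                    coeffs *current) {
    coeffs *next;
    while ((next = spsc_pop(to_audio)) != NULL) {
        spsc_push(garbage, current);
        current = next;
    }
    return current;
}
_____________________________________________________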
Exactly!
> However, in addition to the hard requirement of capturing every
> sample without fail, you also would like to, if you have enough
> spare cycles, give some updates about what's going on to the
> recording engineer. That means maybe providing some samples of
> the recorded audio into a buffer so that the GUI thread can draw
> a waveform, or draw some software VU meters (probably with peak
> hold), or whatever. It's OK for this activity to miss a few samples
> here and there as long as it generally keeps up.
Well, for VU meters you basically keep a running maximum in the audio
thread, which is reset every time you read it from the GUI thread, so you get
the maximum since the last reading. This only needs to be done around 20 or
30 times a second, as the human eye can't see any faster anyway.
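(A sketch of that, assuming one writer and one reader; the race between the
audio thread's store and the GUI's reset can occasionally lose or repeat a
peak, which is harmless for a meter redrawn 20-30 times a second:)
_____________________________________________________
/* Sketch: reset-on-read running peak shared between the audio thread
   (writer) and the GUI thread (reader). Plain loads/stores only on
   the audio side; no interlocked RMW in the audio path. */
#include <stdatomic.h>

static _Atomic unsigned g_peak;     /* e.g. sample magnitude 0..32767 */

/* Audio thread: fold each sample into the running maximum. */
void note_sample(unsigned magnitude) {
    if (magnitude > atomic_load_explicit(&g_peak, memory_order_relaxed))
        atomic_store_explicit(&g_peak, magnitude, memory_order_relaxed);
}

/* GUI thread, 20-30 times a second: read and reset in one step. */
unsigned read_and_reset_peak(void) {
    return atomic_exchange_explicit(&g_peak, 0, memory_order_relaxed);
}
_____________________________________________________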
> The natural inclination is probably to think about eliminating
> this problem by using big buffers to soften the deadlines. But
> the problem here is that buffering equals latency, and latency
> is a show stopper as well if you are going to be listening to (or
> otherwise observing) the audio data that is being recorded through
> the computer. You *certainly* cannot play back audio delayed by
> too much to the musicians who you're recording, and the recording
> engineer doesn't really want to hear delayed audio either.
I've seen people say they are running at 1.5ms, 2ms, or thereabouts. It
seems to be getting lower and lower all the time.
> Say you adjust a setting via the GUI and it needs to update a bunch of
> coefficients, and they need to be updated atomically as a whole so the
> audio effect doesn't explode in weird noises.
Okay.
> You can either lock and update all the coefficients from the GUI thread,
> or you post the new coefficients via a lock-free queue and do the update in
> the audio thread.
Or you can do it the sane way, never allow the audio thread to block
on the lock.
> The former can completely stall the audio thread while it has the lock.
> The latter lock-free method never can. The worst case with the lock-free
> method is that the updated coefficients take a few ms longer to get
> applied.
Since the audio thread can always make forward progress, it never
needs to block on the lock. So there is no problem. If you work hard
enough, you can break this such that you need a lock-free queue to
solve it.
One simple algorithm for the audio thread is as follows:
1) Try to lock the lock.
2) If we succeed, and there are new coefficients, use those instead,
and release the lock.
3) Process with the coefficients we have.
4) Go to step 1.
If the audio thread can make forward progress without the new
coefficients, then it should not wait for them. If it cannot, it must.
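(A sketch of those four steps in POSIX terms; the names are mine, and the GUI
side simply takes the lock, fills g_pending, sets the flag, and unlocks:)
_____________________________________________________
/* Sketch: the audio thread never blocks on the lock; if the GUI holds
   it, this buffer is processed with the coefficients we already have. */
#include <pthread.h>

typedef struct { double c[8]; } coeffs;

static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;
static coeffs g_pending;            /* written by the GUI under g_lock */
static int    g_fresh;              /* set by the GUI under g_lock */

void audio_tick(coeffs *mine) {
    if (pthread_mutex_trylock(&g_lock) == 0) {   /* step 1 */
        if (g_fresh) {                           /* step 2 */
            *mine = g_pending;
            g_fresh = 0;
        }
        pthread_mutex_unlock(&g_lock);
    }
    /* step 3: process the buffer with the coefficients we have */
    /* step 4: the caller loops back for the next buffer */
}
_____________________________________________________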
DS
I would not use a lock-free queue for this. I would make use of a wait-free
one. Lock-free can spin, and the audio needs prompt service. Spinning and
prompt service are at odds, IMO at least. I am no audio expert; however, I
do know that they go to great lengths to optimize the audio/GUI inter-thread
communication. IMHO, this is one of those edge cases where wait-free
algorithms can be a life-saver. Even a very minor glitch in the audio is
enough to piss everybody off real good.
>
> One simple algorithm for the audio thread is as follows:
>
> 1) Try to lock the lock.
> 2) If we succeed, and there are new coefficients, use those instead,
> and release the lock.
> 3) Process with the coefficients we have.
> 4) Go to step 1.
>
> If the audio thread can make forward progress without the new
> coefficients, then it should not wait for them. If it cannot, it must.
This is similar to the following algorithm:
http://groups.google.com/group/comp.programming.threads/msg/a3e38e27df971e0d
(algorithm sketch at end of message...)
I do not think this would give better performance than wait-free wrt
precision high-end audio applications; what do you think? Especially when
the communication medium can consist of the following functions:
_____________________________________________________
ac_i686_queue_spsc_push:
  MOVL 4(%ESP), %EAX                # EAX = queue (arg 1)
  MOVL 8(%ESP), %ECX                # ECX = new node (arg 2)
  MOVL 4(%EAX), %EDX                # EDX = queue->tail
  MOVL %ECX, (%EDX)                 # tail->next = node
  MOVL %ECX, 4(%EAX)                # queue->tail = node
  RET

ac_i686_queue_spsc_pop:
  PUSHL %EBX
  MOVL 8(%ESP), %ECX                # ECX = queue (arg 1, after PUSHL)
  MOVL (%ECX), %EAX                 # EAX = queue->head (dummy node)
  CMPL 4(%ECX), %EAX                # head == tail?
  JE ac_i686_queue_spsc_pop_failed  # yes: queue is empty
  MOVL (%EAX), %EDX                 # EDX = head->next
  MOVL 12(%EDX), %EBX               # EBX = next node's value
  MOVL %EDX, (%ECX)                 # queue->head = next
  MOVL %EBX, 12(%EAX)               # old head carries the value out
  POPL %EBX
  RET

ac_i686_queue_spsc_pop_failed:
  XORL %EAX, %EAX                   # return NULL
  POPL %EBX
  RET
_____________________________________________________
IMVHO, if I were _engineering_ a high-end audio application, I would use the
wait-free queue code above over a lock-based one. Perhaps I am missing
_something_; how can a lock-based queue ensure better-quality sound than the
code above? Explain in detail...
Thanks.
I am not totally convinced that your method will produce the quality
professionals are looking for. This just happens to be one of those cases
where optimization is everything. Perhaps you can code something up which
uses locks that processes audio streams better than something that uses
ultra-low-overhead wait-free algorithms. Try and convince the audio experts
they are wrong. Get your proof of concept up and running. I know for a fact
that audio programmers want low-overhead optimizations; try and convince them
that they are going about things ass-backwards. If you can, I will be very
surprised.
> That's a good application for either a "try to get the lock" function
> that never blocks or a "try to get the lock for so long and no longer"
> function. The only way you can screw this up is to let the audio
> thread block indefinitely, but that would be an obvious rookie
> programmer error.
AFAIK, one of the first libraries which included lock/wait-free
algorithms was MidiShare (1989):
http://midishare.sourceforge.net/
MidiShare includes implementations of lock-free FIFO queues and LIFO
stacks. And it's about audio.
I'm no audio expert, but this suggests...
Dmitriy V'jukov
I did have it programmed like that, with a tryLock on the audio end of the
queue. The problem came when I needed to free memory. You can't do that in
the audio thread because the memory manager isn't lock-free, well, the one
I'm using isn't, so I ended up using a garbage queue. Any memory no longer
needed is pushed onto this and freed by another thread.
Say for example a sample is updated by the GUI; this update is posted to the
audio thread, and it has to dispose of the old sample in a lock/wait-free
way. Which means you would need a lock on the garbage queue to push the item
onto it; you can't just tryLock, unfortunately.
I did consider having two copies of everything, kind of double-buffered, and
an update would just flip them. Setting a new update would free the
flipped-out data before setting the new. But it got kinda ugly pretty
quickly.
If I'm filling a 128-sample buffer (about 3 ms at 44.1 kHz) and running at
50% CPU load, that's around 2,500,000 cycles on my CPU, a 1.8 GHz Athlon.
So I'm thinking a few hundred cycles here or there while the lock-free thing
spins won't really bother that.
To add to that, I need a way to dispose of memory from the audio thread. As
I don't have the luxury of a lock-free memory manager, I post it to another
queue so it can be disposed of elsewhere. But I can't use tryLock, because
what do I do with it if I can't push it onto the queue? I have to put it
somewhere. So I need a lock-free queue to send garbage out to be disposed
of.
You can use the atomic-free allocator for FIFO queues that I posted here:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/378a35b21ae2b42e
The free operation on the consumer side is only one plain store!
Dmitriy V'jukov
This is clearly a situation where using an app-specific slab allocator is
better than using a standard memory allocator. Your buffers are a fixed size
and you're going to need them repeatedly. Just preallocate an array of
them and walk through the array.
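(A sketch of that, assuming the pool is touched from one thread or under
external synchronization; the names are mine:)
_____________________________________________________
/* Sketch: an app-specific pool of fixed-size buffers, preallocated up
   front so the hot thread never calls the general-purpose allocator. */
#define POOL_BUFS 64
#define BUF_BYTES 4096

static unsigned char g_store[POOL_BUFS][BUF_BYTES];
static void *g_free[POOL_BUFS];     /* simple LIFO free list */
static int   g_top;

void pool_init(void) {
    for (g_top = 0; g_top < POOL_BUFS; g_top++)
        g_free[g_top] = g_store[g_top];
}

void *pool_get(void)      { return g_top ? g_free[--g_top] : 0; }
void  pool_put(void *buf) { g_free[g_top++] = buf; }
_____________________________________________________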
> I am not totally convinced that your method will produce the quality
> professionals are looking for. This just happens to be one of those cases
> where optimization is everything. Perhaps you can code something up which
> uses locks that processes audio streams better than something that uses
> ultra-low-overhead wait-free algorithms. Try and convince the audio experts
> they are wrong. Get your proof of concept up and running. I know for a fact
> that audio programmers want low-overhead optimizations; try and convince
> them that they are going about things ass-backwards. If you can, I will be
> very surprised.
I think you're misunderstanding my claims and what I'm trying to do. I
am not claiming their approach is wrong, merely that certain specific
arguments presented are wrong.
DS
The wait-free queue code I posted will not ever spin. There is a full-blown
implementation of it here:
http://appcore.home.comcast.net
Check out the ac_queue_spsc API:
http://appcore.home.comcast.net/~appcore/appcore/include/ac_queue_spsc_h.html
If you have any questions on how to use this, please feel free to fire away!
;^)
> To add to that, I need a way to dispose of memory from the audio thread.
> As I don't have the luxury of a lock-free memory manager, I post it to
> another queue so it can be disposed of elsewhere. But I can't use tryLock,
> because what do I do with it if I can't push it onto the queue? I have to
> put it somewhere. So I need a lock-free queue to send garbage out to be
> disposed of.
You can use the queue for a garbage disposal unit as you suggested, or you
can combine it with a very low-overhead wait-free allocator for this. Here
is one I invented:
http://groups.google.com/group/comp.arch/browse_frm/thread/24c40d42a04ee855
I use a tuned version of it in my vZOOM library.
Since the garbage queue consumer is not on the audio thread, I am not sure
how much you are going to save here. Now, if the queue consumer existed on a
performance-critical thread, then yes, the plain store would be of
assistance indeed.
Okay. Well, at least I think it would be interesting to measure the trylock
method against a wait-free algorithm.
Not only will the code never spin, it will never execute a LOCK prefix. It
only uses MOV instructions to update the queue data-structure. You should
try it out, and see if it can give you a noticeable performance boost.
If you transfer user commands or new settings to the audio thread, then the
consumer is exactly on the audio thread. And to free a consumed node, the
consumer on the audio thread has to execute only one plain store.
Dmitriy V'jukov
It's not wait-free. And it's not a very graceful fit for the
producer-consumer pattern...
Dmitriy V'jukov
"Perhaps" even better, there are cases when the consumer (e.g., audio
thread) can reuse blocks of memory containing settings/commands. BTW, your
design prompted me to create a special caching-mode in the vZOOM allocator
in order to get lightweight remote deallocations:
http://groups.google.com/group/comp.programming.threads/msg/8e3412d48c09f608
But, there is a catch; cache-mode is disabled by default. The user has to
explicitly swing a given threads private allocator into the cache-mode. If
they do not, then the allocator uses CAS on remote deallocations...
It has wait-free local allocations and deallocations. It uses CAS for remote
deallocations that are performed on threads which are not in cache-mode. So,
it's not wait-free for remote deallocation in the default setting; you will
hit a CAS. So, you're correct in the sense that a producer-consumer pattern
that has zero application-level caching will be using CAS for the calls to
free performed by the consumer in the default setting. You can set up very
simple app-level caching using vZOOM queues like this:
__________________________________________________________
static vz_spsc_queue g_cacheq, g_workq;

int producer() {
    for (;;) {
        /* Reuse a node the consumer has finished with, if any. */
        node* n = vz_spsc_queue_trypop(&g_cacheq);
        if (! n) {
            /* Cache empty: fall back to the allocator. */
            n = vz_malloc(sizeof(*n));
            if (! n) { break; }
        }
        InitWorkNode(n);
        vz_spsc_queue_push(&g_workq, n);
    }
    return ENOMEM;              /* only reached if vz_malloc failed */
}

void consumer() {
    for (;;) {
        node* n = vz_spsc_queue_waitpop(&g_workq);
        assert(n);
        ProcessWorkNode(n);
        /* Recycle the node back to the producer instead of freeing. */
        vz_spsc_queue_push(&g_cacheq, n);
    }
}
__________________________________________________________
That kind of seems fairly similar to your producer/consumer-optimized
allocator, right? Please correct me if I am wrong...
Yes, right.
But in my allocator the producer puts the node into the deallocation queue,
not the consumer. That's possible because of the FIFO order of consumption.
And the consumer only has to mark the node as consumed.
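(Roughly, as I understand the scheme; the names and layout here are mine:)
_____________________________________________________
/* Sketch: the consumer retires a FIFO node with one plain store; the
   producer, which owns the nodes, reclaims consumed ones in order. */
#include <stdatomic.h>

typedef struct node {
    struct node *next;
    _Atomic int  consumed;      /* 0 = live, 1 = consumer is done */
    void        *payload;
} node;

/* Consumer side: the entire "free" operation. */
void retire(node *n) {
    atomic_store_explicit(&n->consumed, 1, memory_order_release);
}

/* Producer side: nodes are consumed in FIFO order, so reclaim from
   the oldest outstanding node until we hit a live one. */
node *reclaim(node *oldest) {
    while (oldest &&
           atomic_load_explicit(&oldest->consumed, memory_order_acquire)) {
        node *next = oldest->next;
        /* recycle or free(oldest) here */
        oldest = next;
    }
    return oldest;              /* new oldest outstanding node */
}
_____________________________________________________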
Dmitriy V'jukov
I see.
I forgot to mention that there is another catch in the vZOOM allocator...
When it's set up to use memory in a thread's stack for the heaps, it CANNOT
be swung into cache-mode. I can't think of a way to get around that. But the
thread-stack-based allocator is only ever really "needed" in embedded
environments that may, or may not, have any heap at all:
http://groups.google.com/group/comp.programming.threads/msg/7cceeb76aa3301ce
(last paragraph in post...)
That was a special case indeed, and it cannot use the cache-mode thing.
;^(...
> "Perhaps" even better, there are cases when the consumer (e.g., audio
> thread) can reuse blocks of memory containing settings/commands. BTW, your
> design prompted me to create a special caching-mode in the vZOOM allocator
> in order to get lightweight remote deallocations:
>
> http://groups.google.com/group/comp.programming.threads/msg/8e3412d48...
Are you talking about the "hazard_cas/repair" scheme? Did you include it
in vZOOM?
Dmitriy V'jukov
It's a bit different than that.
> Did you include it in vZOOM?
Not the exact algorithm, but a flavor of an analogue, to say the least.
I will respond to your post in detail in a couple of hours; however, for
now... You seriously think my main-memory database example is ridiculous?
Are you trolling? Do you not know that main-memory databases benefit heavily
from low-overhead reader patterns? Citation, please!?
Citation #1---
Read this analogous work:
http://www.vldb.org/conf/1997/P086.PDF
which has RCU based on versions. READ into the "ageing" process they
describe.
Since you do not take anything I create seriously...
Since you don't respect my work... Read here:
http://people.cs.vt.edu/~scschnei/papers/ismm06.pdf
Then realize that my in-house work predated this by about 9 months. That
is trivial, but you should understand that my work is not crap. BTW, I have
had very lengthy discussions with the individuals who are responsible for
StreamFlow. They understand, AND RESPECT, my work.
Tell me about memory allocators.
I actually created it before that, but I cannot prove it in any sense of the
term. I did it by myself, and thought it was VERY promising to say the
least. I can only prove, through multiple witnesses, that it was done by the
time I presented it to them.
Keep an accurate engineering notebook, with numbered, dated pages, with
a signature and date at the bottom of each, and a signature and date
for who you disclosed any IP to upon same. These are quite often used
in corporate R&D groups. I've found that bookfactory.com offers
several models of these (hardbound) for reasonable prices, as an
example. I've also used paperback notebooks designed almost
identically as to layout and signature blocks in the past, but they
don't hold up as well. I've also seen hardbound versions elsewhere
costing 2X as much.
This should help you track conception date, date of reduction to
practice, show diligence in reducing it to practice, provide details
sufficient for an informed observer to understand how to make and use
it, and document the best way to practice the invention.
It should be done in ink, each page should be signed before continuing
to the next, everything legible, nothing marked out (corrections with a
single line through them, initialed and dated), blank areas marked
through, etc. All figures and calculations should be labeled, and
never, ever, remove any pages from it.
That should get you close to having "proof", better than witness memory
alone. If you think something is incredibly valuable, you can have it
notarized as well.
--
Randy Howard (2reply remove FOOBAR)
"The power of accurate observation is called cynicism by those
who have not got it." - George Bernard Shaw