Parfait synchronization

Paul Smith

unread,

Feb 10, 2012, 10:23:40 PM2/10/12

to parfa...@googlegroups.com

Hey, I just wrote a quick and dirty micro-benchmark for parfait which I've done as a sub-module for re-running before/after changes. I wanted to get a baseline of impacts of the synchronisation improvements we've been thinking about for ages. I run X threads, for Y iterations, with Z # counters that each thread increments to simulate what we're using. I then run it twice, once with the MMV writer not started, so all updates are just dropped to the floor, and again with it started. I output the total incrementRate (sum(counter_values)/total time) and report the total amount of blocked count/time from the ThreadInfos. This should give us an indication of the bottleneck improvements if any. I should run this on a few pieces of different hardware/OS/JVM to get a variance too, but for my 2010 MBP:

pcpStarted: false iterations: 10000 numThreads: 8 numCounters: 1000 incrementRate(/sec): 65,146,579.80 blockedCount: 27 blockedTime: 162
pcpStarted: true iterations: 10000 numThreads: 8 numCounters: 1000 incrementRate(/sec): 2,788,136.48 blockedCount: 246664 blockedTime: 124435

Fairly heavy hit there as we'd expect. So, I was thinking about the synchronisation, and I remember you saying that we could just create a lock object per metric, BUT, can't we already use the PcpValueInfo object we're using here to synch?

@SuppressWarnings("unchecked")
protected final void updateValue(PcpValueInfo info, Object value) {
@SuppressWarnings("rawtypes")
TypeHandler rawHandler = info.getTypeHandler();
synchronized (dataFileBuffer) {
dataFileBuffer.position(rawHandler.requiresLargeStorage() ? info.getLargeValue()
.getOffset() : info.getOffset());
rawHandler.putBytes(dataFileBuffer, value);
}
}

I was thinking we can't actually create a lock per metric because right now the synch is needed to protect the dataFileBuffer positioning for the update. I think instead we need to create a ByteBuffer.slice() per PcpValueInfo which allows independent positioning (and indeed, could just be positioned once at init time). hen we don't need any synchronization at all do we?

thoughts?

Paul

Paul Cowan

unread,

Feb 10, 2012, 10:37:29 PM2/10/12

to parfa...@googlegroups.com

On 11 February 2012 14:23, Paul Smith <psm...@aconex.com> wrote:

I was thinking we can't actually create a lock per metric because right now the synch is needed to protect the dataFileBuffer positioning for the update. I think instead we need to create a ByteBuffer.slice() per PcpValueInfo which allows independent positioning (and indeed, could just be positioned once at init time). hen we don't need any synchronization at all do we?

thoughts?

+1, please do. Attaching a slice() to the PcpValueInfo seems the most sensible option - mind you the documentation is not 100% clear what we need to do here (the thread-safety of ByteBuffer seems oddly underdocumented). I would assume that byte-buffer itself is actually threadsafe (in terms of writing to completely different areas, modulo repositioning the offset), but the other reason for the synchronization here is to provide happens-before relationships so two (chronologically) in-order updates don't happen out-of-order (which would cause all sorts of shenanigans with expected-monotonic counters going backwards, etc).

I think the slice() idea is a brilliant one, firstly because we don't actually need to keep offsets for the metric (just always write at position 0) and secondly because it does reinforce that thread-safety, prevent accidental overwrites, etc. That's an awesome approach, I say go for it. I do think we still want to sync on the metric or the slice though to provide happens-before (or do something with volatile, etc).

cow

Paul Smith

unread,

Feb 10, 2012, 11:09:25 PM2/10/12

to parfa...@googlegroups.com

Yes, yes, silly me. ORIGINALLY I was thinking "Hey, we don't need a new lock, the PcpValueInfo is distinct enough here for the lock" and then realised that wouldn't help with the byteBuffer and focused too much on that. If we have a per-metric ByteBuffer slice, and on each update synch on the PcpValueInfo object we should be fine. Let me see what difference this does.

Paul Smith

unread,

Feb 11, 2012, 7:56:25 AM2/11/12

to parfa...@googlegroups.com

I've just pushed to the master my benchmark module, together with a few tweaks to the BasePCPWriter to allow switching behaviour between the old global lock and the new per-metric lock behaviour. Alas, this wasn't TDD, this class is hard to do that, so highly likely my code is completely wrong, I can't speak for it being functionally correct, actually I can, because it isn't, so I've been bad here and 'broken' it, but I need help there.. Somehow the slicing/positioning isn't working as expected, because when I connect pmchart to pcp and plot one of the generated counters, it displays correctly with the global lock, but I get 0's per second when it switches the test to perMetricLock.

Having said that, while the position the buffer is writing too is probably incorrect, I think the results are still valid and interesting. I tested across my MBP, our load test machine, our current production hardware, and the new one about to be commisioned, and all but my MBP show vastly increased performance (snickering from the non-Mac folk).

Taking our new production hardware which is a dual-hexa-core r710, reporting 24 processors (12 processors, with HT). I ran the benchmark with 32 threads just to overload it a bit.

There are 3 tests (without PCP started, with it started but with Global Lock, and started but with perMetricLocks), and then runs these 3 again, but reverse just to prevent any ordering (unfairly or more fairly treating a run after JVM warm up).

Host: app3.syd.acx Java: 1.6.0_26

Thread Contention Supported: true, Enabled by default: false

numThreads: 32, numCounters=1000, iterations=10000

pcpStarted: false perMetricLock: false incrementRate(/sec): 131,958,762.89 blockedCount: 3 blockedTime: 36

pcpStarted: true perMetricLock: false incrementRate(/sec): 1,880,279.72 blockedCount: 0 blockedTime: 0

pcpStarted: true perMetricLock: true incrementRate(/sec): 13,451,595.28 blockedCount: 2435900 blockedTime: 107871

pcpStarted: true perMetricLock: true incrementRate(/sec): 14,482,259.23 blockedCount: 2468246 blockedTime: 110775

pcpStarted: true perMetricLock: false incrementRate(/sec): 1,962,068.48 blockedCount: 0 blockedTime: 0

pcpStarted: false perMetricLock: false incrementRate(/sec): 86,253,369.27 blockedCount: 2 blockedTime: 5

I do not know why the blockedCount & time count out 0 for the global lock run (row 2 & 2nd to last), but it is consistent when I run it. On other hosts it does report, here's our current production hardware, quite a bit older:

app1.syd (current production app server - ALTHOUGH, using Update 26)

numThreads: 8, numCounters=1000, iterations=10000

pcpStarted: false perMetricLock: false incrementRate(/sec): 64,777,327.94 blockedCount: 14 blockedTime: 22

pcpStarted: true perMetricLock: false incrementRate(/sec): 1,593,466.79 blockedCount: 9731521 blockedTime: 185599

pcpStarted: true perMetricLock: true incrementRate(/sec): 6,963,788.30 blockedCount: 239336 blockedTime: 77

pcpStarted: true perMetricLock: true incrementRate(/sec): 7,303,934.99 blockedCount: 266851 blockedTime: 113

pcpStarted: true perMetricLock: false incrementRate(/sec): 1,512,516.07 blockedCount: 10133819 blockedTime: 195594

pcpStarted: false perMetricLock: false incrementRate(/sec): 85,378,868.73 blockedCount: 2 blockedTime: 2

So I'm presuming that the rate increase in counter increments is due to vastly reduced blockedTime which makes total sense.

So, I think this is definitely a better way than before, I just need to get the code to work properly and actually write to the buffers properly. Hopefully whatever change is needed to get this working doesn't materially affect the benchmark data, but I can run it again anyway.

Cowan, can you review the recent commits and pay closer attending to the BasePcpWriter changes? Something in the buffer slicing and positioning contract I'm missing from the docs.

tah,

Paul

Paul Smith

unread,

Feb 11, 2012, 7:57:53 AM2/11/12

to parfa...@googlegroups.com

Oh, and one more thing, here's a PCP screenshot of the CPU of the host during the test run. As you can see with the global lock, CPU is restricted way down, but with the perMetric Lock the CPU utilisation is much much higher (this is for app3, the newer box about to be commissioned).

PCP Charts-3.jpg

Paul Cowan

unread,

Feb 11, 2012, 7:10:02 PM2/11/12

to parfa...@googlegroups.com

Somehow the slicing/positioning isn't working as expected, because when I connect pmchart to pcp and plot one of the generated counters, it displays correctly with the global lock, but I get 0's per second when it switches the test to perMetricLock.

It's this bit:

ByteBuffer metricByteBufferSlice = dataFileBuffer.slice();

metricByteBufferSlice.position(bufferPosition);

From the docs for 'slice':

The content of the new buffer will start at this buffer's current position.

so dataFileBuffer isn't positioned at 0 when you do this (it's just written the header etc I guess, so who knows where it is), so your position(bufferPosition) is being applied relative to the 0 of the slice, which is not actually 0 of the main buffer. You could .position(0) the dataFileBuffer first, but even better would be to .position(bufferPosition) the dataFileBuffer, then .slice(). Then the writing offset for every slice is 0 which is much simpler (don't need to track lastBufferPosition, just .position(0) on each write).

As an added bonus (in fact, for thread-safety we REALLY should do this), also call .limit(int) to limit the writing area of the slice to the correct size for the data type - then all the slices will be non-overlapping (or you'll get an error if you try to overrun the correct data area) which gives me a lot more confidence in this. :)

Having said that, while the position the buffer is writing too is probably incorrect, I think the results are still valid and interesting. I tested across my MBP, our load test machine, our current production hardware, and the new one about to be commisioned, and all but my MBP show vastly increased performance (snickering from the non-Mac folk).

Awesome work, thanks. This is much better. Happy for the per-thread lock option to only be an option temporarily - this should be the default (hell, the only supported one) eventually.

cow

Paul Cowan

unread,

Feb 11, 2012, 7:12:37 PM2/11/12

to parfa...@googlegroups.com

Ooh, forgot the minor scolding: until this is production-ready, should probably be done in a separate branch (not 'default'). Tsk tsk. :)

cow

Paul Smith

unread,

Feb 14, 2012, 8:34:03 PM2/14/12

to parfa...@googlegroups.com

On 12 February 2012 11:10, Paul Cowan <pa...@custardsource.com> wrote:

Somehow the slicing/positioning isn't working as expected, because when I connect pmchart to pcp and plot one of the generated counters, it displays correctly with the global lock, but I get 0's per second when it switches the test to perMetricLock.

It's this bit:

ByteBuffer metricByteBufferSlice = dataFileBuffer.slice();

metricByteBufferSlice.position(bufferPosition);

From the docs for 'slice':

The content of the new buffer will start at this buffer's current position.

Yes, I actually had this change locally in an uncommitted state because I was still working out why the pcp counter values were not exporting properly (more on that hideous tale shortly).

As an added bonus (in fact, for thread-safety we REALLY should do this), also call .limit(int) to limit the writing area of the slice to the correct size for the data type - then all the slices will be non-overlapping (or you'll get an error if you try to overrun the correct data area) which gives me a lot more confidence in this. :)

yes, that's done now too.

Awesome work, thanks. This is much better. Happy for the per-thread lock option to only be an option temporarily - this should be the default (hell, the only supported one) eventually.

I'll probably do an RC1 style release once I've tried installing this into one of our apps locally and then get our load tester to get it out on the open road, not that the load test currently saturates the CPU like we're currently experiencing but will be an interesting comparison none-the-less.

Paul Smith

unread,

Feb 14, 2012, 8:34:52 PM2/14/12

to parfa...@googlegroups.com

On 12 February 2012 11:12, Paul Cowan <pa...@custardsource.com> wrote:

Ooh, forgot the minor scolding: until this is production-ready, should probably be done in a separate branch (not 'default'). Tsk tsk. :)

err, 0.3.0 isn't released, so it's just 'master' branch right.. Checking stuff in that doesn't WORK, is a different story though...

Paul Smith

unread,

Feb 14, 2012, 8:46:07 PM2/14/12

to parfa...@googlegroups.com

On 12 February 2012 11:10, Paul Cowan <pa...@custardsource.com> wrote:

Somehow the slicing/positioning isn't working as expected, because when I connect pmchart to pcp and plot one of the generated counters, it displays correctly with the global lock, but I get 0's per second when it switches the test to perMetricLock.

It's this bit:

ByteBuffer metricByteBufferSlice = dataFileBuffer.slice();

metricByteBufferSlice.position(bufferPosition);

From the docs for 'slice':

The content of the new buffer will start at this buffer's current position.

So, now for my Tales of WOH. I spent far too long on this, I hate you author-of-ByteBuffer-I-damn-you-to-all-of-eternity. What is happening is that the .slice() call indeed returns a buffer backed by the parent, but of a DIFFERENT BYTE ORDER than the parent. yes, WTF? Here's code to prove it:

import java.nio.ByteBuffer;

import java.nio.ByteOrder;

public class BufferSlicing {

public static void main(String[] args) {

// create a buffer, writing incrementing numbers, and then

// print the 2nd value via a sliced buffer

ByteBuffer buffer = ByteBuffer.allocate(8 * 8);

bufferLoop(buffer);

outputItemSlice(buffer, 2);

System.out.println("------");

// do exactly the same thing, but this time

// the parent buffer change to native Byte Ordering (default is BigEndian)

buffer = ByteBuffer.allocate(8 * 8);

buffer.order(ByteOrder.nativeOrder());

bufferLoop(buffer);

outputItemSlice(buffer, 2);

}

private static void outputItemSlice(ByteBuffer buffer, int itemPos) {

buffer.position(itemPos * 8);

ByteBuffer slice = buffer.slice();

System.out.printf("long @ pos %d is %d\n", itemPos, slice.getLong());

}

private static void bufferLoop(ByteBuffer buffer) {

int number = 8;

for (int i = 0; i < number; i++) {

buffer.putLong(i);

}

buffer.position(0);

for (int i = 0; i < number; i++) {

System.out.println(buffer.getLong());

buffer.position((i + 1) * 8);

}

Output:

0

1

2

3

4

5

6

7

long @ pos 2 is 2

------

0

1

2

3

4

5

6

7

long @ pos 2 is 144115188075855872

The only way I worked this out was by breakpoint debugging and checking the values being written (which were right) and then reading them out again from the main buffer instance, which was indeed coming out with whackily large numbers (hence why PCP was reporting "!"... indeed!). I wondered by the size of it whether it was being byte ordered and sure enough, the values were being written round the wrong way.

Further step debugging shows the 'endianness' and byteOrdering properties of the ByteBuffer are different of the slice(). Simply changing the slice()'d buffers byteOrder to force it to be what the parent's was fixes the issue.

grrrr.

Paul Cowan

unread,

Feb 14, 2012, 10:19:35 PM2/14/12

to parfa...@googlegroups.com

On 15 February 2012 12:46, Paul Smith <tallp...@gmail.com> wrote:

So, now for my Tales of WOH. I spent far too long on this, I hate you author-of-ByteBuffer-I-damn-you-to-all-of-eternity. What is happening is that the .slice() call indeed returns a buffer backed by the parent, but of a DIFFERENT BYTE ORDER than the parent.

Ahahahhaha oh my god that's evil.

The docs for ByteBuffer say: "The initial order of a byte buffer is always BIG_ENDIAN" but I don't think "that doesn't apply if you call .slice()" is unreasonable. That's certainly what I would have assumed.

Nice detective work!

cow

Paul Smith

unread,

Feb 14, 2012, 10:26:50 PM2/14/12

to parfa...@googlegroups.com

it's just totally non-sensical, I cannot fathom under what use case it's ever valid? I too found that documentation about "...always BIG_ENDIAN" a little too late...

Paul

Reply all

Reply to author

Forward