Parfait synchronization

14 views
Skip to first unread message

Paul Smith

unread,
Feb 10, 2012, 10:23:40 PM2/10/12
to parfa...@googlegroups.com
Hey, I just wrote a quick and dirty micro-benchmark for parfait which I've done as a sub-module for re-running before/after changes. I wanted to get a baseline of impacts of the synchronisation improvements we've been thinking about for ages. I run X threads, for Y iterations, with Z # counters that each thread increments to simulate what we're using. I then run it twice, once with the MMV writer not started, so all updates are just dropped to the floor, and again with it started. I output the total incrementRate (sum(counter_values)/total time) and report the total amount of blocked count/time from the ThreadInfos. This should give us an indication of the bottleneck improvements if any. I should run this on a few pieces of different hardware/OS/JVM to get a variance too, but for my 2010 MBP:

pcpStarted: false iterations: 10000 numThreads: 8 numCounters: 1000 incrementRate(/sec): 65,146,579.80 blockedCount: 27 blockedTime: 162
pcpStarted: true iterations: 10000 numThreads: 8 numCounters: 1000 incrementRate(/sec): 2,788,136.48 blockedCount: 246664 blockedTime: 124435


Fairly heavy hit there as we'd expect. So, I was thinking about the synchronisation, and I remember you saying that we could just create a lock object per metric, BUT, can't we already use the PcpValueInfo object we're using here to synch?

@SuppressWarnings("unchecked")
protected final void updateValue(PcpValueInfo info, Object value) {
@SuppressWarnings("rawtypes")
TypeHandler rawHandler = info.getTypeHandler();
synchronized (dataFileBuffer) {
dataFileBuffer.position(rawHandler.requiresLargeStorage() ? info.getLargeValue()
.getOffset() : info.getOffset());
rawHandler.putBytes(dataFileBuffer, value);
}
}

I was thinking we can't actually create a lock per metric because right now the synch is needed to protect the dataFileBuffer positioning for the update. I think instead we need to create a ByteBuffer.slice() per PcpValueInfo which allows independent positioning (and indeed, could just be positioned once at init time). hen we don't need any synchronization at all do we?

thoughts?

Paul

Paul Cowan

unread,
Feb 10, 2012, 10:37:29 PM2/10/12
to parfa...@googlegroups.com
On 11 February 2012 14:23, Paul Smith <psm...@aconex.com> wrote:
I was thinking we can't actually create a lock per metric because right now the synch is needed to protect the dataFileBuffer positioning for the update.  I think instead we need to create a ByteBuffer.slice() per PcpValueInfo which allows independent positioning (and indeed, could just be positioned once at init time). hen we don't need any synchronization at all do we?

thoughts?

+1, please do. Attaching a slice() to the PcpValueInfo seems the most sensible option - mind you the documentation is not 100% clear what we need to do here (the thread-safety of ByteBuffer seems oddly underdocumented). I would assume that byte-buffer itself is actually threadsafe (in terms of writing to completely different areas, modulo repositioning the offset), but the other reason for the synchronization here is to provide happens-before relationships so two (chronologically) in-order updates don't happen out-of-order (which would cause all sorts of shenanigans with expected-monotonic counters going backwards, etc).

I think the slice() idea is a brilliant one, firstly because we don't actually need to keep offsets for the metric (just always write at position 0) and secondly because it does reinforce that thread-safety, prevent accidental overwrites, etc. That's an awesome approach, I say go for it. I do think we still want to sync on the metric or the slice though to provide happens-before (or do something with volatile, etc).

cow

Paul Smith

unread,
Feb 10, 2012, 11:09:25 PM2/10/12
to parfa...@googlegroups.com
Yes, yes, silly me.  ORIGINALLY I was thinking "Hey, we don't need a new lock, the PcpValueInfo is distinct enough here for the lock" and then realised that wouldn't help with the byteBuffer and focused too much on that.  If we have a per-metric ByteBuffer slice, and on each update synch on the PcpValueInfo object we should be fine.  Let me see what difference this does. 

Paul Smith

unread,
Feb 11, 2012, 7:56:25 AM2/11/12
to parfa...@googlegroups.com
I've just pushed to the master my benchmark module, together with a few tweaks to the BasePCPWriter to allow switching behaviour between the old global lock and the new per-metric lock behaviour.  Alas, this wasn't TDD, this class is hard to do that, so highly likely my code is completely wrong, I can't speak for it being functionally correct, actually I can, because it isn't, so I've been bad here and 'broken' it, but I need help there..  Somehow the slicing/positioning isn't working as expected, because when I connect pmchart to pcp and plot one of the generated counters, it displays correctly with the global lock, but I get 0's per second when it switches the test to perMetricLock.

Having said that, while the position the buffer is writing too is probably incorrect, I think the results are still valid and interesting.  I tested across my MBP, our load test machine, our current production hardware, and the new one about to be commisioned, and all but my MBP show vastly increased performance (snickering from the non-Mac folk).

Taking our new production hardware which is a dual-hexa-core r710, reporting 24 processors (12 processors, with HT).  I ran the benchmark with 32 threads just to overload it a bit.

There are 3 tests (without PCP started, with it started but with Global Lock, and started but with perMetricLocks), and then runs these 3 again, but reverse just to prevent any ordering (unfairly or more fairly treating a run after JVM warm up).




Host: app3.syd.acx     Java: 1.6.0_26

Thread Contention Supported: true, Enabled by default: false

numThreads: 32, numCounters=1000, iterations=10000

pcpStarted: false     perMetricLock: false     incrementRate(/sec):  131,958,762.89      blockedCount: 3      blockedTime: 36

pcpStarted: true     perMetricLock: false     incrementRate(/sec):    1,880,279.72      blockedCount: 0      blockedTime: 0

pcpStarted: true     perMetricLock: true     incrementRate(/sec):   13,451,595.28      blockedCount: 2435900      blockedTime: 107871

pcpStarted: true     perMetricLock: true     incrementRate(/sec):   14,482,259.23      blockedCount: 2468246      blockedTime: 110775

pcpStarted: true     perMetricLock: false     incrementRate(/sec):    1,962,068.48      blockedCount: 0      blockedTime: 0

pcpStarted: false     perMetricLock: false     incrementRate(/sec):   86,253,369.27      blockedCount: 2      blockedTime: 5


I do not know why the blockedCount & time count out 0 for the global lock run (row 2 & 2nd to last), but it is consistent when I run it.  On other hosts it does report, here's our current production hardware, quite a bit older:


app1.syd (current production app server - ALTHOUGH, using Update 26)


numThreads: 8, numCounters=1000, iterations=10000

pcpStarted: false     perMetricLock: false     incrementRate(/sec):   64,777,327.94      blockedCount: 14      blockedTime: 22

pcpStarted: true     perMetricLock: false     incrementRate(/sec):    1,593,466.79      blockedCount: 9731521      blockedTime: 185599

pcpStarted: true     perMetricLock: true     incrementRate(/sec):    6,963,788.30      blockedCount: 239336      blockedTime: 77

pcpStarted: true     perMetricLock: true     incrementRate(/sec):    7,303,934.99      blockedCount: 266851      blockedTime: 113

pcpStarted: true     perMetricLock: false     incrementRate(/sec):    1,512,516.07      blockedCount: 10133819      blockedTime: 195594

pcpStarted: false     perMetricLock: false     incrementRate(/sec):   85,378,868.73      blockedCount: 2      blockedTime: 2


So I'm presuming that the rate increase in counter increments is due to vastly reduced blockedTime which makes total sense.  


So, I think this is definitely a better way than before, I just need to get the code to work properly and actually write to the buffers properly.  Hopefully whatever change is needed to get this working doesn't materially affect the benchmark data, but I can run it again anyway.


Cowan, can you review the recent commits and pay closer attending to the BasePcpWriter changes?  Something in the buffer slicing and positioning contract I'm missing from the docs.


tah,


Paul

Paul Smith

unread,
Feb 11, 2012, 7:57:53 AM2/11/12
to parfa...@googlegroups.com
Oh, and one more thing, here's a PCP screenshot of the CPU of the host during the test run.  As you can see with the global lock, CPU is restricted way down, but with the perMetric Lock the CPU utilisation is much much higher (this is for app3, the newer box about to be commissioned).




PCP Charts-3.jpg

Paul Cowan

unread,
Feb 11, 2012, 7:10:02 PM2/11/12
to parfa...@googlegroups.com
Somehow the slicing/positioning isn't working as expected, because when I connect pmchart to pcp and plot one of the generated counters, it displays correctly with the global lock, but I get 0's per second when it switches the test to perMetricLock.

    
It's this bit:

ByteBuffer metricByteBufferSlice = dataFileBuffer.slice();
metricByteBufferSlice.position(bufferPosition);

From the docs for 'slice':

The content of the new buffer will start at this buffer's current position. 

so dataFileBuffer isn't positioned at 0 when you do this (it's just written the header etc I guess, so who knows where it is), so your position(bufferPosition) is being applied relative to the 0 of the slice, which is not actually 0 of the main buffer. You could .position(0) the dataFileBuffer first, but even better would be to .position(bufferPosition) the dataFileBuffer, then .slice(). Then the writing offset for every slice is 0 which is much simpler (don't need to track lastBufferPosition, just .position(0) on each write).

As an added bonus (in fact, for thread-safety we REALLY should do this), also call .limit(int) to limit the writing area of the slice to the correct size for the data type - then all the slices will be non-overlapping (or you'll get an error if you try to overrun the correct data area) which gives me a lot more confidence in this. :)

 
Having said that, while the position the buffer is writing too is probably incorrect, I think the results are still valid and interesting.  I tested across my MBP, our load test machine, our current production hardware, and the new one about to be commisioned, and all but my MBP show vastly increased performance (snickering from the non-Mac folk).

Awesome work, thanks. This is much better. Happy for the per-thread lock option to only be an option temporarily - this should be the default (hell, the only supported one) eventually. 

cow

Paul Cowan

unread,
Feb 11, 2012, 7:12:37 PM2/11/12
to parfa...@googlegroups.com
Ooh, forgot the minor scolding: until this is production-ready, should probably be done in a separate branch (not 'default'). Tsk tsk. :)

cow

Paul Smith

unread,
Feb 14, 2012, 8:34:03 PM2/14/12
to parfa...@googlegroups.com
On 12 February 2012 11:10, Paul Cowan <pa...@custardsource.com> wrote:
Somehow the slicing/positioning isn't working as expected, because when I connect pmchart to pcp and plot one of the generated counters, it displays correctly with the global lock, but I get 0's per second when it switches the test to perMetricLock.

    
It's this bit:

ByteBuffer metricByteBufferSlice = dataFileBuffer.slice();
metricByteBufferSlice.position(bufferPosition);

From the docs for 'slice':

The content of the new buffer will start at this buffer's current position. 

Yes, I actually had this change locally in an uncommitted state because I was still working out why the pcp counter values were not exporting properly (more on that hideous tale shortly). 


As an added bonus (in fact, for thread-safety we REALLY should do this), also call .limit(int) to limit the writing area of the slice to the correct size for the data type - then all the slices will be non-overlapping (or you'll get an error if you try to overrun the correct data area) which gives me a lot more confidence in this. :)


yes, that's done now too.
 
 

Awesome work, thanks. This is much better. Happy for the per-thread lock option to only be an option temporarily - this should be the default (hell, the only supported one) eventually. 


I'll probably do an RC1 style release once I've tried installing this into one of our apps locally and then get our load tester to get it out on the open road, not that the load test currently saturates the CPU like we're currently experiencing but will be an interesting comparison none-the-less.

Paul Smith

unread,
Feb 14, 2012, 8:34:52 PM2/14/12
to parfa...@googlegroups.com
On 12 February 2012 11:12, Paul Cowan <pa...@custardsource.com> wrote:
Ooh, forgot the minor scolding: until this is production-ready, should probably be done in a separate branch (not 'default'). Tsk tsk. :)


err, 0.3.0 isn't released, so it's just 'master' branch right..  Checking stuff in that doesn't WORK, is a different story though... 

Paul Smith

unread,
Feb 14, 2012, 8:46:07 PM2/14/12
to parfa...@googlegroups.com
On 12 February 2012 11:10, Paul Cowan <pa...@custardsource.com> wrote:
Somehow the slicing/positioning isn't working as expected, because when I connect pmchart to pcp and plot one of the generated counters, it displays correctly with the global lock, but I get 0's per second when it switches the test to perMetricLock.

    
It's this bit:

ByteBuffer metricByteBufferSlice = dataFileBuffer.slice();
metricByteBufferSlice.position(bufferPosition);

From the docs for 'slice':

The content of the new buffer will start at this buffer's current position. 


So, now for my Tales of WOH.   I spent far too long on this, I hate you author-of-ByteBuffer-I-damn-you-to-all-of-eternity.  What is happening is that the .slice() call indeed returns a buffer backed by the parent, but of a DIFFERENT BYTE ORDER than the parent.  yes, WTF?  Here's code to prove it:

  

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class BufferSlicing {

    public static void main(String[] args) {
        // create a buffer, writing incrementing numbers, and then
        // print the 2nd value via a sliced buffer
        ByteBuffer buffer = ByteBuffer.allocate(8 * 8);
        bufferLoop(buffer);
        outputItemSlice(buffer, 2);

        System.out.println("------");
        // do exactly the same thing, but this time
        // the parent buffer change to native Byte Ordering (default is BigEndian)
        buffer = ByteBuffer.allocate(8 * 8);
        buffer.order(ByteOrder.nativeOrder());

        bufferLoop(buffer);
        outputItemSlice(buffer, 2);




    }

    private static void outputItemSlice(ByteBuffer buffer, int itemPos) {
        buffer.position(itemPos * 8);
        ByteBuffer slice = buffer.slice();
        System.out.printf("long @ pos %d is %d\n", itemPos, slice.getLong());
    }

    private static void bufferLoop(ByteBuffer buffer) {
        int number = 8;
        for (int i = 0; i < number; i++) {
            buffer.putLong(i);
        }

        buffer.position(0);
        for (int i = 0; i < number; i++) {
            System.out.println(buffer.getLong());
            buffer.position((i + 1) * 8);
        }
    }
}

Output:

0
1
2
3
4
5
6
7
long @ pos 2 is 2
------
0
1
2
3
4
5
6
7
long @ pos 2 is 144115188075855872


The only way I worked this out was by breakpoint debugging and checking the values being written (which were right) and then reading them out again from the main buffer instance, which was indeed coming out with whackily large numbers (hence why PCP was reporting "!"... indeed!).  I wondered by the size of it whether it was being byte ordered and sure enough, the values were being written round the wrong way.

Further step debugging shows the 'endianness' and byteOrdering properties of the ByteBuffer are different of the slice().  Simply changing the slice()'d buffers byteOrder to force it to be what the parent's was fixes the issue.

grrrr.

Paul Cowan

unread,
Feb 14, 2012, 10:19:35 PM2/14/12
to parfa...@googlegroups.com
On 15 February 2012 12:46, Paul Smith <tallp...@gmail.com> wrote:
So, now for my Tales of WOH.   I spent far too long on this, I hate you author-of-ByteBuffer-I-damn-you-to-all-of-eternity.  What is happening is that the .slice() call indeed returns a buffer backed by the parent, but of a DIFFERENT BYTE ORDER than the parent.


Ahahahhaha oh my god that's evil.

The docs for ByteBuffer say: "The initial order of a byte buffer is always BIG_ENDIAN" but I don't think "that doesn't apply if you call .slice()" is unreasonable. That's certainly what I would have assumed.

Nice detective work!

cow


 

Paul Smith

unread,
Feb 14, 2012, 10:26:50 PM2/14/12
to parfa...@googlegroups.com
it's just totally non-sensical, I cannot fathom under what use case it's ever valid? I too found that documentation about "...always BIG_ENDIAN" a little too late...

Paul 
Reply all
Reply to author
Forward
0 new messages