A few tips I figured out the hard way:
* Expect to increase the Caliper time limit significantly; I had to raise mine to 5 minutes to get results. This is likely because anything involving multiple threads and context switches has a large amount of variance, so it takes a while for Caliper to become confident in its results. For example, calling take() on an ArrayBlockingQueue takes ~70 ns on a modern machine if the lock is free and the queue has an item waiting, but if the lock is contended and the thread has to block, that can add 12 microseconds of latency, which is well over a hundred times longer than the operation itself (12,000 ns / 70 ns ≈ 170).
* You will most likely have to disable the allocation instrument when running with multiple threads. Synchronization primitives often allocate objects when they are contended (e.g. a node in the wait queue), which makes your allocation behavior non-deterministic, and Caliper doesn't like that. To measure allocation behavior, write a separate benchmark that uses a single thread (or very carefully interleaves threads to exercise particular cases of contention).
* Some of the things you likely want to measure, such as queue throughput (put-takes per second), are things Caliper can't help you with (at least not in any way that I can figure out).
* If you want to measure how long it takes to add, take, or remove an item from a queue, you will have to write a separate benchmark for each operation, since Caliper can only measure one thing at a time. You will probably also want to measure operations in pairs like put/take or add/remove; otherwise you will either fill your queues or spend so much time allocating that it drowns out the operation you care about. In other words, it is hard to measure an add on its own without also measuring a remove. For example, I wrote my benchmarks to measure how long it takes to push reps items through a queue using various queue operations and numbers of threads.
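
To make the paired put/take idea concrete, here is a minimal sketch of the kind of body such a benchmark could run. It is plain Java with no Caliper wiring; `PutTakePair`, `pushThrough`, and `producerCount` are hypothetical names for illustration, not my actual benchmark code. Note that thread startup is included in the timed work here, so reps needs to be large enough to amortize it (or the threads should be started in setup instead).

```java
import java.util.concurrent.BlockingQueue;

// Hypothetical helper, not part of Caliper: the body a reps-style benchmark method could call.
final class PutTakePair {

    /**
     * Pushes reps items through the queue: producerCount threads call put(),
     * the calling thread calls take(). Returns the sum of the items taken so
     * the work has an observable result and cannot be optimized away.
     */
    static long pushThrough(BlockingQueue<Integer> queue, int reps, int producerCount)
            throws InterruptedException {
        Thread[] producers = new Thread[producerCount];
        for (int p = 0; p < producerCount; p++) {
            // Split reps across producers; the first (reps % producerCount) get one extra item.
            final int share = reps / producerCount + (p < reps % producerCount ? 1 : 0);
            producers[p] = new Thread(() -> {
                try {
                    for (int i = 0; i < share; i++) {
                        // The literal 1 boxes to a cached Integer, so the item itself allocates nothing.
                        queue.put(1); // blocks while the queue is full
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            producers[p].start();
        }
        long sum = 0;
        for (int i = 0; i < reps; i++) {
            sum += queue.take(); // blocks while the queue is empty
        }
        for (Thread t : producers) {
            t.join();
        }
        return sum;
    }
}
```

Because every put is matched by a take, the queue never stays full and the benchmark terminates for any capacity of at least 1. Swapping put/take for offer/poll or add/remove gives the other pairs, though the non-blocking variants need a retry loop.
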
It would be nice if Caliper offered some help with writing these sorts of benchmarks, but I'm not sure how it could :/
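
For the throughput number mentioned in the tips above, one fallback is to time the put/take loop by hand with System.nanoTime(). This is only a rough sketch under my own assumptions (one producer, one consumer, a single warm-up run); `QueueThroughput` and `putTakesPerSecond` are made-up names, and a hand-rolled timer gives none of the statistical checking Caliper does, so treat the result as a ballpark figure.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hand-rolled throughput estimate; names are hypothetical and nothing here uses Caliper.
final class QueueThroughput {

    /** Times `items` put/take pairs between two threads and returns pairs per second. */
    static double putTakesPerSecond(BlockingQueue<Integer> queue, int items)
            throws InterruptedException {
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < items; i++) {
                    queue.put(1); // blocks while the queue is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        long start = System.nanoTime();
        producer.start();
        for (int i = 0; i < items; i++) {
            queue.take(); // blocks while the queue is empty
        }
        long elapsedNanos = System.nanoTime() - start;
        producer.join();
        return items / (elapsedNanos / 1e9);
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(1024);
        putTakesPerSecond(queue, 1_000_000); // warm-up run, result discarded
        System.out.printf("%.0f put-takes/s%n", putTakesPerSecond(queue, 1_000_000));
    }
}
```

Expect the number to swing a lot between runs for the same reason Caliper needs a long time limit: whether the threads block or not dominates the cost.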