To get a little more technical about it, the compile threshold is probably the biggest problem. Since Caliper can't really know what code you're intending to microbenchmark, the best it can do is figure out whether or not _something_ has compiled. So, the microbenchmark instrument will warm up for some amount of time, but since your operations take so long, it's unlikely that the top-level method will actually be invoked the requisite number of times (10,000 by default) for it to be JIT-compiled. Then you'll get compilation in the middle of your timing, which means you'll probably just end up with some ugly, bimodal data.
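To put rough numbers on that, here's a back-of-the-envelope sketch. The 10-second warm-up and 50 ms per operation are made-up figures for illustration (plug in your own), and the class and method names are just mine, not anything from Caliper.

```java
// Rough estimate of whether a slow top-level method can reach the JIT
// compile threshold during warm-up. All inputs are illustrative assumptions.
final class WarmupEstimate {

    // How many times the method can run in the given warm-up window.
    static long invocationsDuringWarmup(long warmupMillis, long millisPerInvocation) {
        return warmupMillis / millisPerInvocation;
    }

    // Warm-up time needed to reach the threshold by invocation count alone.
    static long warmupMillisNeeded(long compileThreshold, long millisPerInvocation) {
        return compileThreshold * millisPerInvocation;
    }

    public static void main(String[] args) {
        long warmupMillis = 10_000;       // assumed warm-up window: 10 s
        long millisPerInvocation = 50;    // assumed cost of one operation
        long compileThreshold = 10_000;   // the default mentioned above

        long invocations = invocationsDuringWarmup(warmupMillis, millisPerInvocation);
        System.out.println("invocations during warm-up: " + invocations);      // 200
        System.out.println("reaches threshold: " + (invocations >= compileThreshold)); // false
        System.out.println("warm-up needed (s): "
                + warmupMillisNeeded(compileThreshold, millisPerInvocation) / 1000);   // 500
    }
}
```

With those numbers you'd get 200 invocations against a threshold of 10,000, so the method would still be getting compiled well into the measurement phase.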
So, if you were to increase the warm-up time, probably decrease the compile threshold and, if the operation allocates at all, increase the heap by a whole ton, you might be able to get some useful results.
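On HotSpot, the knobs for the last two are `-XX:CompileThreshold=<n>` and `-Xms`/`-Xmx`. How those get forwarded to the benchmark VM depends on your Caliper version, so I won't guess at that part. What you can do is sanity-check that they actually arrived. Here's a small sketch (the class name and helper are hypothetical, mine only) that you could call from your benchmark's setup:

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Reports which JIT and heap settings the current JVM was actually started with.
final class JvmSettingsCheck {

    // Returns the value of an explicit -XX:CompileThreshold=N argument,
    // or -1 if the flag was not passed on the command line.
    static long explicitCompileThreshold(List<String> jvmArgs) {
        String prefix = "-XX:CompileThreshold=";
        for (String arg : jvmArgs) {
            if (arg.startsWith(prefix)) {
                return Long.parseLong(arg.substring(prefix.length()));
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Arguments passed to this JVM (not including the main class and its args).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();

        long threshold = explicitCompileThreshold(jvmArgs);
        long maxHeapMiB = Runtime.getRuntime().maxMemory() / (1024 * 1024);

        System.out.println("explicit CompileThreshold: "
                + (threshold < 0 ? "not set" : String.valueOf(threshold)));
        System.out.println("max heap (MiB): " + maxHeapMiB);
    }
}
```

If the threshold prints as "not set" or the heap is smaller than you asked for, the flags never reached the VM that's running your benchmark.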
Hope that helps.