We think we've tracked this down to behavior in the replication plugin that we didn't expect.
We have one replication remote where we set remote.NAME.replicationDelay to 0. It looks like this setting is also used as the reschedule delay when a replication has to be rescheduled because a push is already in flight (see replication/Destination.java, lines 114-115 and 380). With a delay of 0, the event can be rescheduled hundreds of times per second; we observed up to 11 reschedule operations per millisecond for a single replication on our system.
The other thing we noticed is that every time a replication is rescheduled, a ref-replicated failure event is emitted to stream-events. We're not 100% sure what the trigger is doing with replication events, but heap dumps show a correlation between high CPU usage and a data structure holding tens or hundreds of thousands of those ref-replicated events. We later observed CPU usage spike right along with a ref-replicated event storm.
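We're going to change replicationDelay to 1s for this remote. Since the same value appears to be used as the reschedule delay, that should limit rescheduling to roughly once per second per replication, and we expect this will mostly resolve the problem.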
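For anyone who wants to check their own setup, the change we have in mind would look roughly like the fragment below. This is a sketch, not our exact file: it assumes the remote is defined in the plugin's replication.config, that NAME stands in for the real remote name, and that the value is read as whole seconds, as we understand the plugin documentation to say. The rest of the remote block is left out.

```
# replication.config (assumed location: etc/ under the Gerrit site directory)
[remote "NAME"]
	# Was 0. From what we saw in Destination.java, the same value is reused
	# as the reschedule delay, so 0 lets a blocked push reschedule in a tight loop.
	replicationDelay = 1
```

Any non-zero value should break the tight loop; we picked 1 to keep replication latency for this remote as low as we can.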
Thanks to everyone who replied privately off-list.
-Will