The operations I'm doing are on large VM images. Let me describe the setup and then run the benchmarks on it.
qemu-img convert -O raw trusty-server-cloudimg-amd64-disk1.img testy.img
truncate --size 80G testy.img
which gives this on the filesystem:
# ls -lash testy.img
807M -rw-r--r--. 1 root root 80G Feb 10 14:51 testy.img
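For anyone reproducing this: the gap between the two sizes in that listing is what makes the file sparse. `ls -l` reports the apparent size, while `du` (and the first column of `ls -s`) reports what is actually allocated. A minimal sketch, using a hypothetical filename:

```shell
# truncate extends the apparent size without allocating any data blocks,
# giving a fully sparse file.
truncate --size 1G sparse-demo.img
ls -lh sparse-demo.img   # apparent size: 1.0G
du -h sparse-demo.img    # allocated size: ~0
stat -c 'apparent: %s bytes, allocated: %b blocks' sparse-demo.img
```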
After compressing with lz4 as you suggest, I get the following results when decompressing to new images:
# time lz4 -d -f testy.img.lz4 new-image
Successfully decoded 85899345920 bytes
real 3m34.750s
user 1m48.547s
sys 1m21.370s
# time lz4 -c -d testy.img.lz4 | cp /dev/stdin new-image.cp
real 5m13.574s
user 2m30.285s
sys 3m12.671s
# time lz4 -c -d testy.img.lz4 | cp --sparse=always /dev/stdin new-image.sparse
real 3m21.534s
user 2m42.877s
sys 1m15.336s
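For comparison, GNU dd can also sparsify a stream on the fly with conv=sparse, seeking over all-NUL input blocks instead of writing them. A sketch of the idea with /dev/zero standing in for the lz4 output (assuming GNU coreutils; in the pipeline above the generator would be `lz4 -c -d`, and the block size controls the granularity of hole detection):

```shell
# conv=sparse seeks over all-zero blocks rather than writing them,
# so the output file allocates almost nothing.
dd if=/dev/zero bs=1M count=16 2>/dev/null \
  | dd of=sparse-out.img conv=sparse bs=1M 2>/dev/null
# a trailing hole may need the size restored, as with the runs above:
truncate --size 16M sparse-out.img
ls -ls sparse-out.img   # first column: allocated KB, near zero
```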
Compare that with the sparse-aware conversion from qemu-img:
# time qemu-img convert -O raw ~/trusty-server-cloudimg-amd64-disk1.img new-image.qemu && truncate --size 80G new-image.qemu
real 0m9.174s
user 0m7.576s
sys 0m1.314s
Even if you read the entire 80G raw file it is still faster, because the writes are optimised:
# time qemu-img convert -O raw testy.img new-image.qemu2
real 1m19.341s
user 0m15.849s
sys 1m1.690s
Now, qemu-img is of course designed to do this sort of thing from its own image format, but it cannot stream the way lz4 can. I'm hoping we can get lz4 down to somewhere near the qemu-img times for this use case. That would save a lot of write bandwidth on the servers, which in turn reduces wear on the SSDs.
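Until then, one stopgap (assuming util-linux fallocate and a filesystem that supports hole punching) is to dig holes after the fact. That recovers the disk space but not the write bandwidth or SSD wear, which is why streaming sparse support in lz4 would still be the real win:

```shell
# Write real zero blocks, then deallocate them in place.
dd if=/dev/zero of=dense.img bs=1M count=16 2>/dev/null
du -k dense.img                  # ~16384 KB allocated
fallocate --dig-holes dense.img  # punch holes over zeroed ranges
du -k dense.img                  # allocation drops to ~0
```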