LZ4 Possible Speed Optimizations

Shawn

unread,

Apr 9, 2018, 12:02:28 PM4/9/18

to LZ4c

I'm trying to see if it is possible to increase the speed of the lz4 compression algorithm as a project on a AArch64 architecture. I'm thinking about if it is possible to implement SIMD into the compression to increase search speed or other functions. Is there anything I should be watching out for when trying to implement threads into it?

Also, in terms of the optimizations already present, what optimizations are already being done for the AArch64 architecture? I would like to determine if there is anything to increase the speed of lz4 both as a whole or specifically on one architecture, possibly with compiler flags.

Cyan

unread,

Apr 9, 2018, 2:44:10 PM4/9/18

to LZ4c

It's a great project Shawn !

In terms of SIMD, only a few attempts have been made in the past, generally with mixed success.
Best case : a few % were won in the copy or comparison functions, though benefit was also dependent on the data type being compressed.

It might be that Intel has done something on this topic, within their own lz4 flavor :
https://software.intel.com/en-us/articles/building-a-faster-lz4-with-intel-integrated-performance-primitives

though I have no idea how to check that nor if it really involves SIMD.

LZ4 reference source code tries to be portable, and therefore avoids architecture-specific features.

The code is written for the C virtual machine, and tries to provide enough information to the compiler to allow efficient transformations.

As a consequence, the final assembler generated by C compiler may contain some SIMD instructions (generally in the memory copy section, as it's easiest).

That's the best possible outcome : end up with a code which is pure portable C, but is locally compiled into optimized SIMD instructions given the proper compiler and set of flags.

Regards

Peter Steinbach

unread,

Apr 9, 2018, 3:57:15 PM4/9/18

to lz...@googlegroups.com

weird, the intel page that you just mentioned discusses a patch that I
apparently cannot download anywhere. If anyone can provide any pointers,
I'd love to have a look at it. If push comes to shove, we could contact
the intel devs that wrote that article.

Shawn

unread,

Apr 10, 2018, 10:02:39 AM4/10/18

to LZ4c

I have been looking into the main loop within the LZ4_compress_generic function and I have been trying to determine if each match is mutually exclusive, or if each match is dependent on the previous match. I would like to see if multiple threads can be used to do something like testing for the next position while encoding the previous match at the same time.

I am also trying to determine if each loop of the main loop could be made into individual threads as well, or if the previous loop is needed to determine the next loop. Would it be possible to check for multiple matches once and encode each in tandem?

Cyan

unread,

Apr 10, 2018, 2:55:39 PM4/10/18

to LZ4c

It's an interesting direction.

Our own similar attempts have failed, but it is said that the Intel code does something similar.
I suspect it's a hard (but apparently solvable) challenge.

Reply all

Reply to author

Forward