Hi Takayuki
Please find some tentative answers to your questions.
- You don't need to call LZ4_loadDict() before each block.
LZ4_compress_continue() will remember where previous data is, and will look for it right there.
LZ4_loadDict() is meant for a static dictionary scenario,
where LZ4_compress_continue() has had no chance to see the previous data.
Note that your code will work; it's just doing some additional work which is not required.
- You don't need to memset() before each block either.
memset() is only required once, when the LZ4_stream_t object is created.
Using = {0}; is enough.
After that, it's no longer necessary to reset it.
Resetting the LZ4_stream_t object is equivalent to restarting from scratch.
It's one way to create independently compressed blocks using LZ4_compress_continue().
(The other way is to call LZ4_loadDict() with a dictSize of zero, which is much faster).
- Small Buffers (<64KB) are a bit tricky, and you're correct to be cautious about them.
Basically :
+ Small blocks have no impact on LZ4_compress_continue() behavior. Everything works as intended, and you don't need to use LZ4_loadDict() or memset(). You can compress multiple very small data chunks one by one, and LZ4_compress_continue() will produce valid results.
+ Its ability to "remember" where all previous small data chunks are located is limited, though. That's where it can become tricky.
+ If all previous small data chunks are stored next to each other in memory, LZ4_compress_continue() will simply merge them, adding the next block to the previous ones, and consider them a "single dictionary area".
+ If, at some point, there is a separation between 2 consecutive blocks, LZ4_compress_continue() will only remember the "previous block", not those before.
+ There is a problem though, whenever a memory segment registered within the dictionary area is reused as an input area.
In this case, the current algorithm doesn't detect this situation, which can result in data corruption.
It happens when using a small buffer (<=64KB) in rotating mode.
I think it is the case in your example, since you seem to use an 8KB rotating buffer, separated into 8 chunks of 1 KB.
In this case, the work-around is to use LZ4_loadDict(), to reset the dictionary area, as you did by the way !
(memset() is not required, LZ4_loadDict() is enough).
+ I feel this situation could be handled better.
I've not made any effort to support this scenario, since it was not in my target list, but if you believe it is a valid one,
then there might be some ways to directly integrate it into LZ4_compress_continue().
One possibility would be to check the input area on startup, and remove it from the dictionary area whenever the two overlap.
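To make the overlap idea a bit more concrete, here is a minimal sketch in C. It is not the actual LZ4 code: dict_area, overlaps() and trim_dict() are hypothetical names invented for this illustration, and the dictionary is modeled as a single contiguous range. The sketch also assumes that only the part of the dictionary located after the input (the most recent data) is worth keeping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the "dictionary area": one contiguous memory range.
 * This is an illustration only, not the LZ4 internal state. */
typedef struct {
    const char *start;  /* first byte still remembered */
    size_t      size;   /* number of bytes remembered, 0 means empty */
} dict_area;

/* Returns 1 if [in, in + in_size) shares at least one byte with the
 * dictionary area, 0 otherwise. Both ranges are assumed to live in the
 * same buffer, so that comparing the pointers is well defined. */
static int overlaps(const dict_area *d, const char *in, size_t in_size)
{
    assert(d != NULL);
    if (d->size == 0 || in_size == 0) return 0;
    return in < d->start + d->size && d->start < in + in_size;
}

/* Startup check: if the new input overwrites part of the dictionary,
 * keep only the dictionary bytes located after the input.
 * When nothing is left after the input, the dictionary becomes empty. */
static void trim_dict(dict_area *d, const char *in, size_t in_size)
{
    const char *d_end  = d->start + d->size;
    const char *in_end = in + in_size;

    if (!overlaps(d, in, in_size)) return;   /* nothing to do */

    if (in_end < d_end) {
        d->start = in_end;                   /* keep the tail only */
        d->size  = (size_t)(d_end - in_end);
    } else {
        d->size = 0;                         /* no usable tail remains */
    }
}
```

With an 8KB rotating buffer whose 8 chunks of 1 KB are all registered as dictionary, reusing the first chunk as input would leave a 7 KB dictionary starting at the second chunk, instead of a dictionary silently pointing at overwritten data.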
Well, as stated previously, the LZ4_stream_t memory area should be initialized to zero before first use (then it's no longer necessary).
There is a risk that some users might not even do that.
In this case, the idea is to detect it, using the LZ4_dict_t_internal.iniCheck field, and output an error message without attempting compression.
As you said, it's not a guaranteed proof, it's only a quick hint. My expectation is that it should detect the initialization issue in most cases : for example, in debug mode, memory is typically initialized with a value != 0, precisely to detect uninitialized areas.
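As a rough illustration of this kind of check, here is a small sketch in C. It does not reproduce the LZ4 structures: toy_stream, its init_check field, toy_stream_reset() and toy_stream_check() are hypothetical stand-ins, and the 0xCD pattern used in the note below is only an assumed example of a non-zero debug fill value.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for a streaming state. Not the LZ4 layout. */
typedef struct {
    unsigned table[16];   /* placeholder for the real working tables */
    unsigned init_check;  /* expected to be 0 if the state was zeroed */
} toy_stream;

/* The one-time initialization: zero the whole state before first use. */
static void toy_stream_reset(toy_stream *s)
{
    assert(s != NULL);
    memset(s, 0, sizeof *s);
}

/* Quick hint, not a proof: returns 1 when the state looks zero-initialized,
 * 0 when the check field holds garbage. A caller would report an error and
 * skip compression in the second case. Uninitialized memory that happens
 * to contain 0 in this field goes undetected. */
static int toy_stream_check(const toy_stream *s)
{
    assert(s != NULL);
    return s->init_check == 0;
}
```

A state declared with = {0}; or passed through toy_stream_reset() passes the check, while a state filled with a debug pattern such as 0xCD bytes fails it.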
- As a quick comment :
The simple fact that you have questions on how the streaming API works is an important hint that it's not clear enough.
Either in its behavior, or associated documentation, or both.
I'm all ears and very open to comments and suggestions, to improve this situation,
and provide the necessary modifications to make this API as simple as possible to use and understand.
Best Regards
Yann