Hi!
I have just released a package for deduplicating streams. It lets you remove duplicate data across gigabytes and even terabytes of data, at speeds above 1 GB/s.
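For readers new to the idea, here is a minimal sketch of stream deduplication in general. It is not this package's API, algorithm, or file format: the names `dedup_stream` and `restore_stream`, the fixed block size, and the record layout are all hypothetical, chosen only for illustration. It assumes fixed-size blocks identified by their SHA-256 digest.

```python
import hashlib
import io

BLOCK_SIZE = 4  # tiny on purpose; an assumption for illustration only


def dedup_stream(src, block_size=BLOCK_SIZE):
    """Yield ("new", block) for first-seen blocks and ("dup", index) for repeats."""
    seen = {}  # block digest -> index of the first-seen block
    while True:
        block = src.read(block_size)
        if not block:
            break
        digest = hashlib.sha256(block).digest()
        if digest in seen:
            # Repeated block: emit a small reference instead of the data.
            yield ("dup", seen[digest])
        else:
            seen[digest] = len(seen)
            yield ("new", block)


def restore_stream(records):
    """Rebuild the original bytes from the records produced by dedup_stream."""
    blocks = []  # unique blocks, in the order they were first seen
    out = bytearray()
    for kind, value in records:
        if kind == "new":
            blocks.append(value)
            out += value
        else:
            out += blocks[value]
    return bytes(out)


data = b"abcdabcdxyz!abcd"
records = list(dedup_stream(io.BytesIO(data)))
restored = restore_stream(records)
```

In this toy input the block `abcd` appears three times but is stored only once; the two repeats become references to block 0, and `restore_stream` reproduces the original bytes.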
The package is currently in a "stable" state, but it still lacks some corner-case tests, which I will add in the coming days. It has not been extensively battle-tested yet, so if you are interested in helping with that, please let me know.
I do not plan to modify the current API or file format unless there is a fundamental problem I cannot fix without doing so.
Comments, feedback, questions, suggestions, problems are all very welcome.
/Klaus