On 22/06/15 19:14, Klaus Post wrote:
> On Monday, 22 June 2015 19:53:10 UTC+2, Axel Wagner wrote:
>
> It is a pretty hard problem, which is why I think there aren't a lot
> of implementations out there. The difference is basically that you
> assume I know which blocks are corrupted. But in theory, an algorithm
> can find that out itself, as long as the number of errors is below a
> certain threshold. Think of it as losing a hard drive vs. bitrot.
>
> Ah, ok. I haven't heard about that. Definitely sounds interesting.
Erasure codes are usually used when you know you don't have the data at
all. That knowledge is extra information, provided by e.g. a checksum on
each packet or a disk being missing entirely. Reed-Solomon is a fine
erasure code (it is optimal: with n-k parity shards you can recover from
any n-k erasures).
For a Reed-Solomon code, if you can recover from 2k+1 erasures then you
can only correct k errors at unknown locations; roughly, locating an
error costs as much redundancy as correcting it. For example, the
classic (255,223) RS code has 32 parity symbols: it can fill in any 32
erased symbols, but only correct 16 corrupted ones it has to find
itself. There are other codes which do much better than this in
practice. RS has its advantages though, as it is easy to implement in
software over GF(256).
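For a feel of how simple GF(256) is in software, here is a minimal
multiply in Go. This is a sketch: 0x11d is the reducing polynomial most
RS implementations use, but that choice is an assumption, so check your
library.

package main

import "fmt"

// gfMul multiplies two elements of GF(256) by shift-and-add, reducing
// by the polynomial x^8+x^4+x^3+x^2+1 (0x11d). Real libraries usually
// use log/exp tables instead, but this shows the whole trick.
func gfMul(a, b byte) byte {
    var p byte
    for b > 0 {
        if b&1 != 0 {
            p ^= a // "addition" is XOR in GF(2^8)
        }
        carry := a & 0x80
        a <<= 1
        if carry != 0 {
            a ^= 0x1d // reduce: low byte of 0x11d
        }
        b >>= 1
    }
    return p
}

func main() {
    fmt.Printf("0x53 * 0xCA = 0x%02X\n", gfMul(0x53, 0xCA))
}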
If you are interested in RS coding, Russ Cox wrote a good series about
it:
http://research.swtch.com/field
He also wrote a QR code (which uses RS codes) encoder in Go (but not a
decoder?) here:
https://godoc.org/code.google.com/p/rsc/qr
The general case of decoding Reed-Solomon codes is hard. Many years ago
I implemented the Berlekamp-Massey decoder (which was used for decoding
RS codes over satellite links):
https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction#Berlekamp.E2.80.93Massey_decoder
There may be better algorithms now, though.
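For a feel of the algorithm, here is Berlekamp-Massey over GF(2) in Go,
which finds the length of the shortest LFSR generating a bit sequence.
This is a sketch of the core iteration only; the RS decoder runs the
same algorithm over GF(256) on the syndrome sequence to find the error
locator polynomial.

package main

import "fmt"

// berlekampMassey returns the linear complexity of s: the length of
// the shortest LFSR that generates it. c is the current connection
// polynomial, b the one from before the last length change.
func berlekampMassey(s []int) int {
    n := len(s)
    c := make([]int, n)
    b := make([]int, n)
    c[0], b[0] = 1, 1
    L, m := 0, -1
    for i := 0; i < n; i++ {
        // Discrepancy: does the current LFSR predict s[i]?
        d := s[i]
        for j := 1; j <= L; j++ {
            d ^= c[j] & s[i-j]
        }
        if d == 1 {
            t := make([]int, n)
            copy(t, c)
            // Correct c using the older polynomial b, shifted by i-m.
            for j := 0; j+i-m < n; j++ {
                c[j+i-m] ^= b[j]
            }
            if 2*L <= i {
                L = i + 1 - L
                m = i
                b = t
            }
        }
    }
    return L
}

func main() {
    // 1,0,1,0,... satisfies s[n] = s[n-2], so the shortest LFSR has
    // length 2.
    fmt.Println(berlekampMassey([]int{1, 0, 1, 0, 1, 0}))
}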
If I were doing the same thing today, I'd consider a low-density
parity-check (LDPC) code, which performs much better. See David MacKay's
book for a good introduction:
http://www.inference.phy.cam.ac.uk/mackay/itila/
> I considered adding a per-shard hash as part of the library to help
> keep track of shard number and order, but I couldn't come up with
> something I felt could easily be used in a lot of different scenarios,
> so for now it is up to the library user to ensure validity of the
> individual shards.
And that is as it should be for a low-level library. The user might be
packing the data into Ethernet frames and relying on the 32-bit CRC to
catch corruption, or storing it in files; you just don't know...
> Either way, I think that will go into a separate
> package - the only thing I might do is add a simpler way
> of "streaming" data to the encoder.
In the docs you wrote on the Reconstruct function:
//
// The reconstructed shard set is complete, but integrity is not verified.
// Use the Verify function to check if data set is ok.
The Verify function should always pass after you do error decoding, so
it isn't an integrity check in any way.
If you want an actual integrity check then the user of the library
should add one with a CRC/MD5/SHA-1 etc. (sketch below); calling Verify
after Reconstruct is just a waste of CPU cycles!
(In general you can spend a code's redundancy on detecting errors or on
correcting them, but not on both at once!)
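To make the CRC suggestion concrete, here is a minimal sketch in Go
using hash/crc32 from the standard library. The nil-for-missing shard
convention is an assumption about how you'd feed erasures back to a
reconstruction API, not part of any particular library.

package main

import (
    "fmt"
    "hash/crc32"
)

// checksums records a CRC-32 per shard. Stored alongside the shards,
// these detect corruption independently of the RS code.
func checksums(shards [][]byte) []uint32 {
    sums := make([]uint32, len(shards))
    for i, s := range shards {
        sums[i] = crc32.ChecksumIEEE(s)
    }
    return sums
}

// markErasures nils out any shard whose CRC no longer matches, turning
// silent corruption into an erasure the decoder can handle.
func markErasures(shards [][]byte, sums []uint32) {
    for i, s := range shards {
        if s != nil && crc32.ChecksumIEEE(s) != sums[i] {
            shards[i] = nil
        }
    }
}

func main() {
    shards := [][]byte{[]byte("data1"), []byte("data2"), []byte("parity")}
    sums := checksums(shards)

    shards[1][0] ^= 0xff // simulate bitrot
    markErasures(shards, sums)
    fmt.Println(shards[1] == nil) // true: shard 1 is now an erasure
}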