Arbitrary corruption of repeated fields

Stefan

Jan 27, 2010, 11:40:10 PM
to Protocol Buffers
Hello everybody,

I have a small dilemma with regard to protocol buffers. I read the
documentation but I still do not see a clear answer (I only use the
Java version of protocol buffers). I hope I am not missing something
really obvious here ...

I have the following setup:

message Item {
  optional string name = 1;
  optional string description = 2;
}

message Bag {
  repeated Item item = 1;
}

In the code, a Bag (with a very large number of items) gets
serialized to a file. Now, let's suppose the file gets corrupted at an
arbitrary point in the middle. From my experiments, the entire content
would be lost because the Bag cannot be deserialized anymore. To
construct the Bag, I use the parseFrom() method and I get exceptions.
I do not see anything in the documentation that would suggest
mergeFrom() would have a different result either.

I do not expect to be able to recover any individual corrupted items
but it would be nice to be able to recover the rest of the list.

What could I do to reduce the risk of losing the entire list due to
arbitrary corruption? What if corruption only occurs at the end of the
file, would it be simpler to recover all the elements up to the
corruption point?

Thanks for your help!

Kenton Varda

Jan 28, 2010, 12:01:18 AM
to Stefan, Protocol Buffers
Sorry, it is a non-goal of protocol buffers to provide message integrity -- this is left to a higher layer.  One byte of corruption in a protocol message can very easily make it impossible to parse the remainder of the message, or even make the rest of the message appear as parseable garbage.  Therefore, trying to design code which can work around corruption in the message is fraught with peril, and no one has tried.

If you need to be able to recover from corruption without discarding the whole file, the way to do it is by designing your file format to contain multiple protocol buffers framed in some way that allows you to continue reading the others if one is corrupted.  This isn't something protocol buffers can provide, but it would make sense for someone to write a library on top of protobufs that provides it.


Michael Poole

Jan 28, 2010, 12:05:07 AM
to Protocol Buffers
Stefan writes:

> What could I do to reduce the risk of losing the entire list due to
> arbitrary corruption? What if corruption only occurs at the end of the
> file, would it be simpler to recover all the elements up to the
> corruption point?

If you serialize the elements inside the Bag to disk individually,
you could prefix each one with a synchronizing marker and a length. A
marker would typically be a fixed-length pattern that is unlikely to
appear in legitimate data. Starting with a zero byte is a good choice
for Protocol Buffers data, but the marker should also contain some
other (ideally uncommon) bytes for robustness.

By reading the marker, length, message, and checking the next marker,
your program can be reasonably sure that the detected message boundaries
are correct. Recovery then becomes a matter of looking for the next
synchronizing marker, and checking it the same way.
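
The marker, length, and next-marker check described above could be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the class name MarkerFraming, the 4-byte marker value, and the 4-byte big-endian length are all made up for the example, and each payload is assumed to be the bytes of one serialized message (for instance from toByteArray()). It does not detect corruption inside a payload whose boundaries still look right.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; the marker value and record layout are illustrative assumptions.
final class MarkerFraming {
    // Sync marker: a zero byte followed by a few arbitrary, uncommon bytes.
    static final byte[] MARKER = {0x00, (byte) 0xF7, (byte) 0x9C, (byte) 0xE3};

    // Appends marker + 4-byte big-endian length + payload.
    static void writeRecord(ByteArrayOutputStream out, byte[] payload) {
        out.write(MARKER, 0, MARKER.length);
        byte[] len = ByteBuffer.allocate(4).putInt(payload.length).array();
        out.write(len, 0, len.length);
        out.write(payload, 0, payload.length);
    }

    // True if the full marker starts at the given position.
    static boolean markerAt(byte[] data, int pos) {
        if (pos + MARKER.length > data.length) return false;
        for (int i = 0; i < MARKER.length; i++) {
            if (data[pos + i] != MARKER[i]) return false;
        }
        return true;
    }

    // Returns the payloads whose boundaries check out. Whenever a record
    // looks wrong, advances one byte and searches for the next marker.
    static List<byte[]> readRecords(byte[] data) {
        List<byte[]> records = new ArrayList<>();
        int pos = 0;
        while (pos + MARKER.length + 4 <= data.length) {
            if (!markerAt(data, pos)) { pos++; continue; }
            int len = ByteBuffer.wrap(data, pos + MARKER.length, 4).getInt();
            int start = pos + MARKER.length + 4;
            long end = (long) start + len;
            // The record must fit in the data and be followed by a marker or end of data.
            boolean plausible = len >= 0 && end <= data.length
                    && (end == data.length || markerAt(data, (int) end));
            if (!plausible) { pos++; continue; }
            records.add(Arrays.copyOfRange(data, start, (int) end));
            pos = (int) end;
        }
        return records;
    }
}
```

With three records written and the length field of the second one damaged, readRecords would return the first and third payloads. A payload that happens to contain the marker bytes could still confuse the resynchronization, which is one reason to choose an unlikely pattern.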

There is obviously a tradeoff between how much data you can lose with a
corrupted message and the per-message overhead. If you were using the
particular example in your email, you might serialize a Bag that
contains several Items rather than serializing each Item individually.

Michael Poole

Stefan

Jan 28, 2010, 12:57:17 AM
to Protocol Buffers
Kenton, Michael, thanks for your quick answers (that was fast!). The
suggestions are great and to the point (and if I remember correctly
the approach was mentioned before).

So, a possible solution would be to break a big Bag into several
Bags, each with a smaller number of items. Now, I would need a
mechanism to write those smaller Bags delimited by some sort of a
frame or marker.
Next step would be to discard corrupted messages from the file (a
corrupt message is one that does not parse) and seek to the next
marker. If I want to lose no more than one message per corruption, I
would need to write each Item separately but the overhead from the
markers would be bigger. On the other hand, if the Bag has too many
Items then I have the chance of losing too much data on a single
corruption. Aside from the markers, I would get overhead from
collating and separating the lists each time I need to use/save a big
Bag.
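
The splitting and collating step could look roughly like the Java sketch below. It is a generic illustration, not protobuf code: the class name Batches and its methods are hypothetical, and the 12-byte frame size used in the note afterwards is an assumed figure, not something fixed by protocol buffers. With generated classes, each batch would presumably be wrapped in its own Bag, for example via Bag.newBuilder().addAllItem(batch).build().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for trading per-record overhead against loss per corruption.
final class Batches {
    // Splits items into consecutive batches of at most batchSize elements.
    static <T> List<List<T>> split(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            int end = Math.min(i + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(i, end)));
        }
        return batches;
    }

    // Collates the batches back into one list, preserving order.
    static <T> List<T> join(List<List<T>> batches) {
        List<T> all = new ArrayList<>();
        for (List<T> batch : batches) {
            all.addAll(batch);
        }
        return all;
    }

    // Framing bytes amortized over the items of one full batch.
    static double overheadBytesPerItem(int frameBytes, int batchSize) {
        return (double) frameBytes / batchSize;
    }
}
```

Assuming a 12-byte frame per record, a batch size of 1 costs 12 extra bytes per Item but loses at most one Item per damaged record, while a batch size of 100 costs 0.12 bytes per Item but can lose up to 100 Items at once.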

I hope I got the idea correctly. I will give it a try (hopefully it
will not be slow).

Again, thanks for your quick answers.

Madhav Ancha

Jan 28, 2010, 1:19:57 AM
to Stefan, Protocol Buffers
Stefan,

Under what circumstances does persistence occur? If the message does not break into smaller ones naturally and it makes sense to keep the message in one piece, you can also use a checksum algorithm to verify the persisted data.

-Madhav Ancha

Stefan

Jan 28, 2010, 1:33:34 AM
to Protocol Buffers
Madhav,

I am not concerned with the integrity of the file itself. My only goal
is to be able to read back (completely or partially) a message saved
on disk in case file corruption occurs. A checksum would help to
decide if the file got corrupted but would not help at all in
recovering the data if indeed it got corrupted.

On Jan 28, 12:19 am, Madhav Ancha <madhavan...@gmail.com> wrote:
> Stefan,
>
>    Under what circumstances does persistence occur? If the message does not
> break into smaller ones naturally and it makes sense to keep the message in
> one piece, you can also use some checksum algorithm to verify your
> persistence.
>
> -Madhav Ancha

Kenton Varda

Jan 28, 2010, 1:45:46 AM
to Michael Poole, Protocol Buffers
On Wed, Jan 27, 2010 at 9:05 PM, Michael Poole <mdp...@troilus.org> wrote:
> If you serialize the elements inside the Bag to the disk individually,
> you could prefix them with a synchronizing marker and length.  A marker
> would typically be a fixed-length pattern that is unlikely to appear in
> legitimate data -- starting with a zero byte is a good way given
> Protocol Buffers data, it should contain some other (ideally uncommon)
> bytes for robustness.

I'd add that the marker should also contain some sort of checksum, e.g. CRC.  Otherwise, you might not detect corruption when it happens.  It's very easy for a corrupt message to still appear to parse correctly.  In an environment where corruption is a concern, you definitely want to verify all data to make sure you don't accidentally start using garbage!
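
As a rough Java sketch of the checksum idea, a record header could carry a CRC-32 of the payload next to the length, and the reader could reject any record whose recomputed CRC differs. The class name ChecksummedRecord and the 8-byte header layout (length, then CRC) are assumptions made for this example; java.util.zip.CRC32 is the standard JDK class. Only a payload that passes the check would then be handed to parseFrom().

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.zip.CRC32;

// Hypothetical record layout: 4-byte length, 4-byte CRC-32 of the payload, payload.
final class ChecksummedRecord {
    static final int HEADER_SIZE = 8;

    // CRC-32 of the payload, truncated to the low 32 bits for storage.
    static int crcOf(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return (int) crc.getValue();
    }

    // Builds header + payload as one byte array.
    static byte[] encode(byte[] payload) {
        return ByteBuffer.allocate(HEADER_SIZE + payload.length)
                .putInt(payload.length)
                .putInt(crcOf(payload))
                .put(payload)
                .array();
    }

    // Returns the payload, or null if the record is truncated,
    // has an inconsistent length, or fails the CRC check.
    static byte[] decode(byte[] record) {
        if (record.length < HEADER_SIZE) return null;
        ByteBuffer buf = ByteBuffer.wrap(record);
        int len = buf.getInt();
        int storedCrc = buf.getInt();
        if (len < 0 || len != record.length - HEADER_SIZE) return null;
        byte[] payload = Arrays.copyOfRange(record, HEADER_SIZE, record.length);
        return crcOf(payload) == storedCrc ? payload : null;
    }
}
```

In this sketch the CRC covers only the payload; covering the length field as well would be a reasonable variation.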

Kenton Varda

Jan 28, 2010, 1:49:33 AM
to Stefan, Protocol Buffers
On Wed, Jan 27, 2010 at 9:57 PM, Stefan <sne...@gmail.com> wrote:
> Next step would be to discard corrupted messages from the file (a
> corrupt message is one that does not parse)

This is not a good way to detect corruption.  It is very easy for a corrupt message to still parse correctly, and then you end up operating on garbage, which is usually much worse than discarding the corrupt data.
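
A small Java illustration of why a successful parse proves little: an Item with only name set is encoded as the tag byte 0x0A (field 1, length-delimited), a length, and the UTF-8 bytes of the string. The toy decoder below is hypothetical and handles only that one shape with a one-byte length; it is not the protobuf library. It is enough to show that a single flipped bit inside the string still yields a well-formed message with a different value.

```java
import java.nio.charset.StandardCharsets;

// Toy encoder/decoder for an Item that has only field 1 (name) set.
// Hypothetical and deliberately minimal; real code would use the generated classes.
final class TinyItemDecoder {
    // Tag for field number 1 with wire type 2 (length-delimited): (1 << 3) | 2.
    static final byte NAME_TAG = 0x0A;

    static byte[] encodeName(String name) {
        byte[] utf8 = name.getBytes(StandardCharsets.UTF_8);
        if (utf8.length > 127) {
            throw new IllegalArgumentException("sketch supports one-byte lengths only");
        }
        byte[] out = new byte[2 + utf8.length];
        out[0] = NAME_TAG;
        out[1] = (byte) utf8.length;
        System.arraycopy(utf8, 0, out, 2, utf8.length);
        return out;
    }

    // Returns the name, or null if the bytes do not have the expected shape.
    static String decodeName(byte[] data) {
        if (data.length < 2 || data[0] != NAME_TAG) return null;
        int len = data[1] & 0xFF;
        if (len > 127 || len != data.length - 2) return null;
        return new String(data, 2, len, StandardCharsets.UTF_8);
    }
}
```

Encoding the name "bolt" and flipping the lowest bit of the first string byte decodes without complaint as "colt". Only a checksum over the original bytes would reveal the change.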