Dealing with Corrupted Protocol Buffers

julius-schorzman

Jan 20, 2011, 2:48:09 AM
to Protocol Buffers
Hi all --

I have a system with a large number of reads and writes that stores
messages as protocol buffers.

All was working well, until I recently started noticing some corrupt
(I'm assuming they're corrupt at least) protocol buffers. This is
giving me some exceptions like this:

com.google.protobuf.InvalidProtocolBufferException: While parsing a
protocol message, the input ended unexpectedly in the middle of a
field. This could mean either than the input has been truncated or
that an embedded message misreported its own length.

My question is -- can anything be done to retrieve part of the file?
It would be nice to know at which point in the file the problematic
message occurred, and then I could crop to that point or do some
manual exception -- but unfortunately this exception is very general.
I find it hard to believe that a single mis-saved bit makes the whole
file worthless.

I also find it curious that the source provides no way (that I can
tell) to get at any lower level data in the p.b. since whenever I try
to do anything with it it throws an exception. Best I can tell I will
have to write from scratch my own code to decode the p.b. file.

Has anyone found a better way to deal with this?

Thanks so much -J

Evan Jones

Jan 20, 2011, 9:27:37 PM
to julius-schorzman, Protocol Buffers
On Jan 20, 2011, at 2:48 , julius-schorzman wrote:
> My question is -- can anything be done to retrieve part of the file?
> It would be nice to know at which point in the file the problematic
> message occurred, and then I could crop to that point or do some
> manual exception -- but unfortunately this exception is very general.
> I find it hard to believe that a single mis-saved bit makes the whole
> file worthless.

You are correct: not all of your data is worthless, but you will need
some manual intervention at the point of the error to figure out
what is going on.

It is probably possible to figure out the byte offset where this error
occurs. If I recall correctly, CodedInputStream tracks some sort of
bytesRead counter. However, using it to report the offset will require
you to modify the source.


> I also find it curious that the source provides no way (that I can
> tell) to get at any lower level data in the p.b. since whenever I try
> to do anything with it it throws an exception. Best I can tell I will
> have to write from scratch my own code to decode the p.b. file.

The lowest-level tool provided is CodedInputStream. But yes, you will
effectively have to "parse" the message yourself. Look at the code
generated for your message's mergeFrom method to get an idea of how it
works, and read the encoding documentation:

http://code.google.com/apis/protocolbuffers/docs/encoding.html
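
To illustrate the hand-parsing approach, here is a minimal sketch that walks the top-level fields of a serialized message using only the documented wire format and records the byte offset where decoding first fails. It does not use the protobuf library; the class and method names (WireWalker, walk, Result) are made up for this example, and it assumes the whole input fits in a byte array. Groups (wire types 3 and 4) are treated as errors to keep it short.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of protobuf: walks top-level fields by hand. */
final class WireWalker {
    /** A field that decoded cleanly: start offset, field number, wire type. */
    static final class Field {
        final int offset, number, wireType;
        Field(int offset, int number, int wireType) {
            this.offset = offset; this.number = number; this.wireType = wireType;
        }
    }

    /** Fields decoded so far, plus the offset of the first failure (-1 if none). */
    static final class Result {
        final List<Field> fields = new ArrayList<>();
        int errorOffset = -1;
        String error;
    }

    private final byte[] buf;
    private int pos = 0;

    private WireWalker(byte[] buf) { this.buf = buf; }

    /** Base-128 varint: 7 payload bits per byte, high bit means "more bytes follow". */
    private long readVarint() {
        long value = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            if (pos >= buf.length) throw new IllegalStateException("truncated varint");
            byte b = buf[pos++];
            value |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) return value;
        }
        throw new IllegalStateException("varint longer than 10 bytes");
    }

    private void skip(long n) {
        if (n < 0 || n > buf.length - pos) {
            throw new IllegalStateException("field runs past end of input");
        }
        pos += (int) n;
    }

    static Result walk(byte[] data) {
        WireWalker w = new WireWalker(data);
        Result r = new Result();
        while (w.pos < data.length) {
            int start = w.pos;
            try {
                long tag = w.readVarint();
                long number = tag >>> 3;          // upper bits: field number
                int wireType = (int) (tag & 7);   // lower 3 bits: wire type
                if (number == 0 || number > 0x1FFFFFFF) {
                    throw new IllegalStateException("invalid field number");
                }
                switch (wireType) {
                    case 0: w.readVarint(); break;            // varint
                    case 1: w.skip(8); break;                 // fixed 64-bit
                    case 2: w.skip(w.readVarint()); break;    // length-delimited
                    case 5: w.skip(4); break;                 // fixed 32-bit
                    default: throw new IllegalStateException("unsupported wire type " + wireType);
                }
                r.fields.add(new Field(start, (int) number, wireType));
            } catch (IllegalStateException e) {
                r.errorOffset = start;   // offset of the field that failed to decode
                r.error = e.getMessage();
                break;
            }
        }
        return r;
    }

    public static void main(String[] args) {
        // field 1 = varint 150, then field 2 = a 7-byte string cut off after 3 bytes
        byte[] truncated = {0x08, (byte) 0x96, 0x01, 0x12, 0x07, 0x74, 0x65, 0x73};
        Result r = walk(truncated);
        System.out.println(r.fields.size() + " clean field(s); first failure at offset "
                + r.errorOffset + ": " + r.error);
    }
}
```

The reported offset is where the failing field starts, which is a place to begin looking rather than a guaranteed location of the damage.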

You can definitely figure out what is going on, but it will be a bit
of a pain. Good luck,

Evan Jones

--
http://evanjones.ca/

Julius Schorzman

Jan 20, 2011, 10:11:28 PM
to Protocol Buffers
Thanks for the tip on CodedInputStream, Evan! I will explore it, and
if I get anything out of it I will report back my findings for anyone
else dealing with this issue.

Jason Hsueh

Jan 21, 2011, 2:54:49 PM
to Julius Schorzman, Protocol Buffers
It will be rather difficult to correct for the error. The point at which the parse fails may not be the point of corruption: e.g., the corruption may be in a byte that is part of a varint, and the continuation bit may be set when it shouldn't. Similarly you could have a corruption in the length delimiter for a string or nested message field. Both could cause you to read more bytes than you should have for that particular field. The encoding is dense enough that the parser may merrily consume more bytes before encountering an error to complain about.
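
The varint case can be shown with a small sketch. It assumes a toy message containing only varint fields, and the names (VarintCorruptionDemo, firstFailureOffset) are made up for illustration. Flipping the continuation bit of the byte at offset 1 makes the decoder swallow the next field's tag as part of the value, so the first detectable error appears at offset 3.

```java
/** Hypothetical demo, not part of protobuf: one flipped bit, error detected later. */
final class VarintCorruptionDemo {
    /** Decodes a varint starting at pos; returns {value, position after the varint}. */
    static long[] readVarint(byte[] buf, int pos) {
        long value = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            if (pos >= buf.length) throw new IllegalStateException("truncated varint");
            byte b = buf[pos++];
            value |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) return new long[] {value, pos};
        }
        throw new IllegalStateException("varint too long");
    }

    /** Scans varint-only fields; returns the offset where parsing first fails, or -1. */
    static int firstFailureOffset(byte[] buf) {
        int pos = 0;
        while (pos < buf.length) {
            int start = pos;
            try {
                long[] tag = readVarint(buf, pos);
                // This toy format only allows wire type 0 and non-zero field numbers.
                if ((tag[0] & 7) != 0 || (tag[0] >>> 3) == 0) return start;
                pos = (int) readVarint(buf, (int) tag[1])[1];
            } catch (IllegalStateException e) {
                return start;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] clean = {0x08, 0x05, 0x10, 0x07};   // field 1 = 5, field 2 = 7
        byte[] corrupt = clean.clone();
        corrupt[1] ^= (byte) 0x80;                 // set the continuation bit at offset 1
        System.out.println("clean:   " + firstFailureOffset(clean));
        System.out.println("corrupt: " + firstFailureOffset(corrupt)
                + " (field 1 now decodes as " + readVarint(corrupt, 1)[0] + ")");
    }
}
```

In a real message the gap between the damaged byte and the reported error can be much larger, since a bad length delimiter can skip over many valid-looking bytes.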

You can try to mess with the bytes; you might be able to deal with errors using some assumptions about the serialized data based on your protocol. But in general, and going forward, you should write small messages in a container format that allows for error recovery. Various threads from this search discuss this issue.
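
As one possible shape for such a container, here is a minimal sketch in which each serialized message is wrapped in a frame with a sync marker, a length, and a CRC32 of the payload. The layout, the MAGIC value, and the names (FramedLog, frame, readRecords) are assumptions made for this example, not an existing protobuf feature. The reader drops a damaged record and scans forward byte by byte until it finds the next frame that validates.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.CRC32;

/** Hypothetical container: [magic:4][length:4][crc32:4][payload], all big-endian. */
final class FramedLog {
    static final int MAGIC = 0x50424631;  // arbitrary sync marker chosen for this sketch
    static final int HEADER = 12;

    static int crcOf(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return (int) crc.getValue();
    }

    /** Wraps one serialized message (any byte array) in a frame. */
    static byte[] frame(byte[] payload) {
        return ByteBuffer.allocate(HEADER + payload.length)
                .putInt(MAGIC).putInt(payload.length).putInt(crcOf(payload))
                .put(payload).array();
    }

    /** Concatenates the frames for several payloads, as a file would hold them. */
    static byte[] encode(byte[]... payloads) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] p : payloads) {
            byte[] f = frame(p);
            out.write(f, 0, f.length);
        }
        return out.toByteArray();
    }

    /** Returns every payload whose frame validates, skipping damaged regions. */
    static List<byte[]> readRecords(byte[] data) {
        List<byte[]> records = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(data);  // big-endian by default
        int pos = 0;
        while (pos + HEADER <= data.length) {
            int len = buf.getInt(pos + 4);
            if (buf.getInt(pos) == MAGIC && len >= 0 && len <= data.length - pos - HEADER) {
                byte[] payload = Arrays.copyOfRange(data, pos + HEADER, pos + HEADER + len);
                if (crcOf(payload) == buf.getInt(pos + 8)) {
                    records.add(payload);
                    pos += HEADER + len;
                    continue;
                }
            }
            pos++;  // damaged or misaligned: resynchronize on the next valid frame
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = encode("a".getBytes(StandardCharsets.US_ASCII),
                "123456789".getBytes(StandardCharsets.US_ASCII),
                "ccc".getBytes(StandardCharsets.US_ASCII));
        data[25] ^= 0x01;  // damage one payload byte of the second record
        System.out.println("recovered " + readRecords(data).size() + " of 3 records");
    }
}
```

With this layout a single damaged byte costs at most the record it lands in, and each payload can still be handed to the normal generated parser once its checksum matches.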
