Byte Offsets into File with Multiple Messages


jonath...@gree.co.jp

Jun 25, 2013, 5:51:34 PM
to prot...@googlegroups.com
Hi All,

I know the topic of "multiple [protobuf] messages in a single file" has been covered a bunch, but I have a slightly different question.

Most of the answers to "multiple messages in a single file" have been: use a CodedInputStream/CodedOutputStream to do the reading and writing, and write the size of each message before the message in the file. While this works fine, my use case is different: I don't want to always read every single message in a file upon file read. So, to set up this use case, instead of writing the sizes of the messages before each message in the file, I write a "header" message at the top of the file and then the individual messages after that. My header is as follows:

<code>
option optimize_for = LITE_RUNTIME;

package MyPackage;

message ListHeader {
  message Entry {
    optional string name = 1;
    optional uint32 byte_offset_from_header = 2;
    optional uint32 size_bytes = 3;
  }

  repeated Entry entry = 1;
}
</code>

Conceptually, at application start, I read this header for a given file and store it somewhere. After that, when a particular entry is needed from a file (referenced by name), I want to be able to open the file, jump to a given entry (via the entry's byte offset into the file), read the message out, and continue on my merry way. The problem is, none of the messages I read are "valid" (contain correct data). They parse OK, but the data is corrupt. The header message parses fine and contains proper data.

I'm using the LITE_RUNTIME optimization, so I made a subclass of ZeroCopyInputStream which takes in an std::ifstream. When I want to read one of the "entry" messages, I use seekg on the ifstream created from the file, then I create a ZeroCopyInputStream from that stream (via my own class), and then I create a CodedInputStream. I set a limit on the CodedInputStream to be the size of the entry from the header and then parse via ParseFromCodedStream. Is this a valid workflow (using seekg on the stream which I then made a ZeroCopyInputStream and CodedInputStream from)? If not, how can I get the functionality I want?

I calculate the byte offset that I seekg to as the entry's "byte_offset_from_header" plus the CurrentPosition of the coded input stream used to parse the header, which I believe should yield the total byte offset from the beginning of the file.

-Jonathan

Ilia Mirkin

Jun 25, 2013, 5:55:44 PM
to jonath...@gree.co.jp, prot...@googlegroups.com
You also need to keep track of how big your header is. Otherwise when
reading the header, it will happily just keep on going, accumulating
"unknown" tags/etc.

When you write the file, keep track of the actual offsets you write
the protos at. When reading, compare them to what you had when written
out... that should reveal the issue.

jonath...@gree.co.jp

Jun 25, 2013, 6:02:49 PM
to prot...@googlegroups.com, jonath...@gree.co.jp, imi...@alum.mit.edu
On Tuesday, June 25, 2013 2:55:44 PM UTC-7, Ilia Mirkin wrote:
> You also need to keep track of how big your header is. Otherwise when
> reading the header, it will happily just keep on going, accumulating
> "unknown" tags/etc.
>
> When you write the file, keep track of the actual offsets you write
> the protos at. When reading, compare them to what you had when written
> out... that should reveal the issue.



Sorry, I forgot to mention that at the beginning of the file (when I write out the file), I write to a coded output stream and write the size of the header via WriteLittleEndian32. I then write out the "header" protocol buffer message (via the same coded output stream), and then I loop over all "entries" and (via the same coded output stream) write those messages out to the file. So I believe I am setting the limits properly for reading the header, but I'll double check.
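So the resulting file layout, roughly sketched (endOfHeaderOffset is just the name my reading code uses for the position right after the header):

<code>
byte 0                                     endOfHeaderOffset
|                                          |
[uint32 header size][ ListHeader message ][ entry 0 ][ entry 1 ][ ... ]
 (little-endian,      (header-size bytes)  ^
  4 bytes)                                 entry offsets stored in the header
                                           are relative to this point
</code>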

I'll keep track (via trusty ol' pen and paper for now) of the offsets and see what I come up with. Thanks for the info.

jonath...@gree.co.jp

Jun 25, 2013, 6:25:52 PM
to prot...@googlegroups.com
After checking the offsets written to the header in the file against the offsets being read at (what I seekg to), everything seems to match up 100%. The offsets for the entries aren't suspicious either (so there is actual data being written there). Maybe I'm doing something else wrong (does creating a ZeroCopyInputStream or CodedInputStream somehow reset the underlying ifstream back to the start of the file?).

This is a head scratcher. I can paste some more code if that is helpful to anyone.

Ilia Mirkin

Jun 25, 2013, 6:32:05 PM
to jonath...@gree.co.jp, prot...@googlegroups.com
More code certainly couldn't hurt. Are you sure that the data is
making it into the file OK? Are you opening the file with ios::binary
or whatever the flag is?
> This is a head scratcher. I can paste some more code if that is helpful to
> anyone.

jonath...@gree.co.jp

Jun 25, 2013, 8:35:08 PM
to prot...@googlegroups.com, jonath...@gree.co.jp, imi...@alum.mit.edu
Alright, some code incoming... be warned.

This is the code which creates the file; I'll only paste the important parts:

<code>
void WriteEntriesToFile(const string& path /* , ...other params here... */)
{
    ofstream* stream = new ofstream(path.c_str(), ios::out | ios::binary);
    // ... check for stream validity here ...
    OFStreamWriter* writer = new OFStreamWriter(stream); // my own class, a copy of OstreamOutputStream
    ::google::protobuf::io::CodedOutputStream* codedOut = new ::google::protobuf::io::CodedOutputStream(writer);

    // Create new header
    HeaderPB* headerPB = new HeaderPB();

    unsigned int byteOffset = 0;
    HeaderPB::EntryPB* entryPB = NULL;

    // ... begin looping over instances of "MyMessage" to write to the file
    {
        entryPB = headerPB->add_entry();

        entryPB->set_size_bytes(myMessagePB->ByteSize());
        entryPB->set_byte_offset_from_header(byteOffset);

        byteOffset += myMessagePB->ByteSize();
    }

    // Write out the size of the header, then the header itself
    codedOut->WriteLittleEndian32(headerPB->ByteSize());
    headerPB->SerializeToCodedStream(codedOut);
    // PRINT("BYTE OFFSET TO END OF HEADER %d", codedOut->ByteCount());

    // Write out all the messages
    // Again, loop over the messages passed in
    {
        // PRINT("CURRENT OFFSET FROM BEGINNING OF FILE %d", codedOut->ByteCount());
        myMessagePB->SerializeToCodedStream(codedOut);
    }

    delete headerPB;
    delete codedOut;
    delete writer;
    delete stream;
}
</code>

jonath...@gree.co.jp

Jun 25, 2013, 8:51:48 PM
to prot...@googlegroups.com, jonath...@gree.co.jp, imi...@alum.mit.edu
More code. This is the parsing code:

<code>
void GetMessagesFromFile(const string& path)
{
    ifstream* stream = new ifstream(path.c_str(), ios::in | ios::binary);
    unsigned int endOfHeaderOffset = 0;

    // read_header_from_file just reads the HeaderPB and computes endOfHeaderOffset
    // via the CodedInputStream's CurrentPosition once parsing of the header size
    // and header is done. Yes, read_header_from_file uses limits on the coded
    // stream so it doesn't keep parsing past the header.
    HeaderPB* header = read_header_from_file(path, stream, endOfHeaderOffset);

    // Loop over the entries in the header
    if ( 0 < header->entry_size() )
    {
        MyMessagePB* myMessagePB = NULL;
        IFStreamReader* reader = NULL;
        ::google::protobuf::io::CodedInputStream* coded = NULL;

        // Loop from an index of 0 to header->entry_size()
        {
            const HeaderPB::EntryPB& entryProtoBuff = header->entry(entryIndex);
            ::google::protobuf::uint32 entrySize = entryProtoBuff.size_bytes();
            ::google::protobuf::uint32 entryOffset = entryProtoBuff.byte_offset_from_header();

            // Put the stream at the byte offset, then create the IFStreamReader and CodedInputStream
            // PRINT("BYTE OFFSET ON READING FROM BEGINNING OF FILE %d", endOfHeaderOffset + entryOffset);
            stream->seekg(endOfHeaderOffset + entryOffset, stream->beg);
            reader = new IFStreamReader(stream);
            coded = new ::google::protobuf::io::CodedInputStream(reader);

            ::google::protobuf::io::CodedInputStream::Limit oldLimit = coded->PushLimit(entrySize);

            myMessagePB = new MyMessagePB();
            if ( myMessagePB->ParseFromCodedStream(coded) )
            {
                // Code reaches this point, but all data in myMessagePB is null / not
                // available via the "has_xxxxx" functions
                // ... do other stuff here ...
            }
            coded->PopLimit(oldLimit);

            delete coded;
            delete reader;
        }
    }
    delete header;
    delete stream;
}
</code>
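read_header_from_file isn't pasted above, but for completeness, here is a minimal sketch of what it does (names and error handling are illustrative, not my verbatim code; it assumes the same IFStreamReader class and the WriteLittleEndian32 size prefix from the writing code):

<code>
// Hypothetical sketch of read_header_from_file, assuming the file layout from
// WriteEntriesToFile: a little-endian uint32 header size, then the HeaderPB.
HeaderPB* read_header_from_file(const string& path, ifstream* stream,
                                unsigned int& endOfHeaderOffset)
{
    IFStreamReader* reader = new IFStreamReader(stream);
    ::google::protobuf::io::CodedInputStream* coded =
        new ::google::protobuf::io::CodedInputStream(reader);

    // Read the fixed-width size prefix written by WriteLittleEndian32
    ::google::protobuf::uint32 headerSize = 0;
    coded->ReadLittleEndian32(&headerSize);

    // Limit parsing to the header's bytes so it doesn't run on into the entries
    ::google::protobuf::io::CodedInputStream::Limit oldLimit = coded->PushLimit(headerSize);
    HeaderPB* header = new HeaderPB();
    header->ParseFromCodedStream(coded);
    coded->PopLimit(oldLimit);

    // Entry offsets are relative to this position (size prefix + header)
    endOfHeaderOffset = coded->CurrentPosition();

    delete coded;
    delete reader;
    return header;
}
</code>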

jonath...@gree.co.jp

Jun 25, 2013, 9:29:02 PM
to prot...@googlegroups.com, jonath...@gree.co.jp, imi...@alum.mit.edu
The number of entries in the header is correct. The printouts of the byte offsets when encoding and when parsing are exactly the same. The parsing from the CodedInputStream doesn't fail. I'm at a loss as to what I can do here. For the time being, I can work around my problem by doing the "standard" approach of reading all messages from the file and putting the message size before each message, but I would prefer to be able to (and I don't see why I can't) parse via a known byte offset into the file.

jonath...@gree.co.jp

Jun 26, 2013, 2:57:09 PM
to prot...@googlegroups.com
Just a little more info. Apparently, if I take out the seekg and use only one coded stream for the entire file (since the example code is just looping over all messages in a vector, there is no real *NEED* for a byte offset from the header, though it IS a feature that I want in the future), everything is peachy. Using the same concept (only one ZeroCopyInputStream and CodedInputStream for the whole file instead of creating new ones for each MyMessage), but putting the seekg back in (which should really just move the ifstream's position to where it currently is), doesn't work. Using new ZeroCopyInputStream and CodedInputStream instances per MyMessage parse and not using seekg also fails.

So:

1) One CodedInputStream for the whole file. No seekg. = WORKS
2) One CodedInputStream for the whole file. With seekg (which should seek to the same byte). = FAILS (message parsing fails for some; parses OK but invalid data for others)
3) A new CodedInputStream per MyMessage. No seekg. = FAILS (messages parse properly, invalid data)
4) A new CodedInputStream per MyMessage. With seekg. = FAILS (messages parse properly, invalid data)

So the fact that #2 fails when #1 works properly is interesting to me. I'm not sure why that would be. The result of #3 might be expected depending on how construction of the CodedInputStream is implemented. #4 is basically a combination of the two things we know fail, so failing is expected there.

Regardless of #1 working above, I would like to avoid reading the complete contents of the file when I know the byte offset and byte size of the message I want to parse. There must be a way.

jonath...@gree.co.jp

Jun 27, 2013, 1:17:18 PM
to prot...@googlegroups.com, imi...@alum.mit.edu
A little bit more info:

So, if I loop over the entries when parsing the file and create a new CodedInputStream per entry (so that I don't hit any byte limit warnings) AND don't use seekg, everything is OK. Strangely, if I also move construction of a new IFStreamReader (which is an exact copy of IstreamInputStream except for the name) into the same loop, thus making a new ZeroCopyInputStream and a new CodedInputStream per message, the parsing fails.

So, in this case, constructing the new IstreamInputStream resets some state of some sort and causes problems.

Another problem is when I use seekg to change the underlying ifstream. Even when only using a single IFStreamReader (a setup that worked previously), using seekg to skip to the same byte location the stream "should be at" causes problems.

jonath...@gree.co.jp

Jun 28, 2013, 8:34:33 PM
to prot...@googlegroups.com, jonath...@gree.co.jp, imi...@alum.mit.edu
So I eventually figured out how to do what I want to do. This is just an FYI for anyone looking to do the same.

To get the functionality I wanted (being able to read from an arbitrary offset), I did the following.

1) Read in the header info.

2) Sort the messages you want to parse by increasing byte offset (we later use ZeroCopyInputStream's Skip method, which doesn't take negative numbers).

3) Use only a single ZeroCopyInputStream (in my case, something similar to IstreamInputStream), but, to avoid memory warnings when reading a large file, construct a new CodedInputStream for every message you read.

4) When "seeking" to a byte offset, use the ZeroCopyInputStream method Skip, like so: ZeroCopyStream->Skip(MessageOffset - ZeroCopyStream->ByteCount()); Just moving the underlying ifstream via seekg ignores the fact that the ZeroCopyInputStream knows how many bytes of the last block it read were not used by the last message parsed (the block read was larger than the rest of the message): its "backed up" bytes. Calling seekg alone does not clear the backed-up bytes and doesn't produce the expected results. Somehow, however, just making a new ZeroCopyInputStream per message (right before making the CodedInputStream) ended up hitting the end of the file too quickly, for some reason I didn't dive into too deeply. A sketch of the resulting read loop is below.
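Here is a minimal sketch of that read loop (not my verbatim code; it assumes the IFStreamReader class and the HeaderPB/MyMessagePB types from the earlier posts, and needs <vector> and <algorithm>):

<code>
// Hypothetical sketch of the working approach: one ZeroCopyInputStream for the
// whole file, a fresh CodedInputStream per message, and Skip instead of seekg.
void ReadMessagesByOffset(ifstream* stream)
{
    // Step 3: a single ZeroCopyInputStream for the entire file
    IFStreamReader reader(stream);

    // Step 1: read the size-prefixed header with a scoped CodedInputStream
    HeaderPB header;
    unsigned int endOfHeaderOffset = 0;
    {
        ::google::protobuf::io::CodedInputStream coded(&reader);
        ::google::protobuf::uint32 headerSize = 0;
        coded.ReadLittleEndian32(&headerSize);
        ::google::protobuf::io::CodedInputStream::Limit limit = coded.PushLimit(headerSize);
        header.ParseFromCodedStream(&coded);
        coded.PopLimit(limit);
        endOfHeaderOffset = coded.CurrentPosition();
    } // destroying the CodedInputStream backs its unread buffered bytes up into the reader

    // Step 2: sort (offset, size) pairs by increasing offset, since Skip can't move backwards
    std::vector<std::pair<unsigned int, unsigned int> > entries;
    for (int i = 0; i < header.entry_size(); ++i)
        entries.push_back(std::make_pair(header.entry(i).byte_offset_from_header(),
                                         header.entry(i).size_bytes()));
    std::sort(entries.begin(), entries.end());

    for (size_t i = 0; i < entries.size(); ++i)
    {
        // Step 4: Skip forward relative to what the reader has actually consumed;
        // ByteCount() accounts for the backed-up bytes, which seekg knows nothing about
        reader.Skip((int)(endOfHeaderOffset + entries[i].first - reader.ByteCount()));

        // Step 3: a fresh CodedInputStream per message, to avoid the total-bytes warning
        ::google::protobuf::io::CodedInputStream coded(&reader);
        ::google::protobuf::io::CodedInputStream::Limit limit = coded.PushLimit(entries[i].second);

        MyMessagePB message;
        if ( message.ParseFromCodedStream(&coded) )
        {
            // ... use the message ...
        }
        coded.PopLimit(limit);
    }
}
</code>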

Took me a while to figure this out. I hope this helps someone out there :)