Finding the starting point of a PB block in a file that contains noise at the beginning.

Angel Cervera Claudio

Oct 4, 2020, 1:16:10 PM
to Protocol Buffers

I am trying to read chunks of a file that contains a sequence of PB blocks. Is there a way to detect where a block starts?

A little bit of context:
It is a huge file (around 60 GB).
The file format is a sequence of [[Block header][Block content]] pairs. In reality it is a little more complex, but this is enough as an example.
The [Block header] contains the length of the next [Block content].
So the file has to be read sequentially.
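
To make the layout concrete, here is a minimal sketch of such a sequential reader. It assumes, purely for illustration, that the [Block header] is nothing more than a 4-byte big-endian length; the real header is more complex, and the name `iter_blocks` is hypothetical.

```python
import io
import struct

# Assumption for illustration: the header is only a 4-byte big-endian length.
HEADER = struct.Struct(">I")


def iter_blocks(stream):
    """Yield the content of each block, reading strictly front to back."""
    while True:
        header = stream.read(HEADER.size)
        if not header:
            return  # clean end of file
        if len(header) < HEADER.size:
            raise EOFError("truncated block header")
        (size,) = HEADER.unpack(header)
        content = stream.read(size)
        if len(content) < size:
            raise EOFError("truncated block content")
        yield content


# Build a tiny in-memory file with two blocks and read it back.
raw = b"".join(HEADER.pack(len(p)) + p for p in (b"first", b"second"))
blocks = list(iter_blocks(io.BytesIO(raw)))
```

Each length is only meaningful if the reader is already positioned on a header, which is why starting in the middle of the file is the hard part.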

I wrote a Spark connector. The first version reads the file sequentially as well.

In the next version, I want to process the file in splits, as Spark provides them, so I will get chunks of the file.
I need to find where a [Block header] starts in order to read sequentially from that point.
So, how can I find this first block? Any ideas?

Adam Cozzette

Oct 5, 2020, 3:53:03 PM
to Angel Cervera Claudio, Protocol Buffers
The protobuf binary format doesn't provide any mechanism for determining where a message begins or ends, so I don't think this is possible. Maybe the only way to do it would be to introduce your own metadata header spaced out at regular intervals (e.g. every 1 GiB), and have this special header indicate where the next block begins.
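
One way the metadata-header idea could look, as a rough sketch rather than a definitive design: an 8-byte pointer record is written at every multiple of `interval` bytes, and it stores the file offset of the next block header. Everything here is an assumption made for illustration, including the 4-byte length header, the pointer layout, the `NO_BLOCK` sentinel, and all function names. The interval is tiny so the example fits in memory; the suggestion above was something like 1 GiB. It also requires control over how the file is written.

```python
import struct

POINTER = struct.Struct(">Q")  # hypothetical pointer record, one per interval
LENGTH = struct.Struct(">I")   # hypothetical 4-byte block header
NO_BLOCK = 2**64 - 1           # sentinel: no block starts after this pointer


def to_physical(logical_offset, interval):
    """Map an offset in the pointer-free stream to an offset in the file."""
    data_per_interval = interval - POINTER.size
    pointers_before = logical_offset // data_per_interval + 1
    return logical_offset + pointers_before * POINTER.size


def write_with_pointers(payloads, interval):
    """Serialize blocks, inserting a pointer record at every interval boundary."""
    logical, starts = bytearray(), []
    for payload in payloads:
        starts.append(len(logical))
        logical += LENGTH.pack(len(payload)) + payload
    data_per_interval = interval - POINTER.size
    out = bytearray()
    for chunk_start in range(0, len(logical), data_per_interval):
        nxt = next((s for s in starts if s >= chunk_start), None)
        out += POINTER.pack(NO_BLOCK if nxt is None else to_physical(nxt, interval))
        out += logical[chunk_start:chunk_start + data_per_interval]
    return bytes(out)


def first_block_after(data, offset, interval):
    """Follow the first pointer record at or after offset; None if there is none."""
    pointer_offset = -(-offset // interval) * interval  # round up to a boundary
    if pointer_offset + POINTER.size > len(data):
        return None
    (target,) = POINTER.unpack_from(data, pointer_offset)
    return None if target == NO_BLOCK else target


def read_physical(data, offset, n, interval):
    """Read n stream bytes from a file offset, skipping pointer records."""
    out = bytearray()
    while len(out) < n:
        if offset % interval == 0:
            offset += POINTER.size  # step over the pointer record
        if offset >= len(data):
            raise EOFError("truncated block")
        boundary = (offset // interval + 1) * interval
        take = min(n - len(out), boundary - offset)
        out += data[offset:offset + take]
        offset += take
    return bytes(out), offset


def read_block(data, offset, interval):
    """Read one block at a file offset; return (content, offset of next block)."""
    header, offset = read_physical(data, offset, LENGTH.size, interval)
    (size,) = LENGTH.unpack(header)
    return read_physical(data, offset, size, interval)


payloads = [b"alpha", b"bravo-bravo", b"c", b"delta" * 5]
data = write_with_pointers(payloads, interval=16)
start = first_block_after(data, 1, interval=16)  # a split that begins mid-block
content, next_offset = read_block(data, start, interval=16)
```

A split that begins at an arbitrary offset would call `first_block_after` once and then read sequentially with `read_block`. The cost is that every reader has to know about the pointer records and skip them.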
