I am trying to read chunks of a file that contains a sequence of PB blocks. Is there a way to detect where a block starts?
A little bit of context:
It is a huge file (around 60GB).
The file format is a sequence of [[Block header][Block content]] pairs. In reality it is a little more complex, but this is enough as a sample.
The [Block header] contains the length of the [Block content] that follows it.
So the file has to be read sequentially.
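To make the layout concrete, here is a minimal sketch of that sequential read. It assumes the header is just a 4-byte big-endian length, which is a simplification: the real header is more complex, and the names read_blocks and write_block are made up for this illustration.

```python
import io
import struct

HEADER_SIZE = 4  # assumption: header is only a 4-byte big-endian length


def read_blocks(stream):
    """Yield (offset, content) for each block, reading strictly in order."""
    while True:
        offset = stream.tell()
        header = stream.read(HEADER_SIZE)
        if not header:
            return  # clean end of file
        if len(header) < HEADER_SIZE:
            raise EOFError("truncated block header")
        (length,) = struct.unpack(">I", header)
        content = stream.read(length)
        if len(content) < length:
            raise EOFError("truncated block content")
        yield offset, content


def write_block(stream, content):
    """Write one block: length header followed by the content bytes."""
    stream.write(struct.pack(">I", len(content)))
    stream.write(content)


# Build a tiny in-memory file with three blocks and read it back.
buf = io.BytesIO()
for payload in (b"first", b"second block", b""):
    write_block(buf, payload)
buf.seek(0)
blocks = list(read_blocks(buf))
```

Each block's start offset is only known after the previous header has been read, which is why an arbitrary offset in the middle of the file cannot be interpreted directly.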
I wrote a Spark connector. The first version also reads the file sequentially.
In the next version, I want to process the file in splits, as Spark provides them, so I will get chunks of the file.
I need to find where a [Block header] starts within a chunk, so that I can read sequentially from that point.
So, how can I find this first block? Any ideas?