Flatbuffers as a storage format for large arrays

J R Sandesh Bhat

unread,

Jan 2, 2022, 4:50:43 AM1/2/22

to FlatBuffers

Hello,

I need to store large arrays where each element has a fixed schema on an object storage like GCS or S3. I am interested in a sequential read of a random interval of the data. For example, for an array of size 100000, I might want to read only elements from 53 to 100, followed by a read of elements from 2300 to 3000. In this case, I would like to only read those subsets of the file from storage without having load the whole file.

The type of schema of each element of the array is also rather constrained. First of all, each element of the stream is fixed, but decided dynamically. For example, if I had two video streams and one sensor stream, the video stream schema would be a function of the resolution: (w, h, color_channels) -> [float: w*h*color_channels] (where I would manually handle the conversion of the single dimensional array into a multi dimensional one). The IMU stream would be a simple float. In this case, combining them would yield:

struct frame struct {

img1: float [w1*h1*color_channels];

img2: float [w2*h2*color_channels];

sensor_val: float;

}

I was thinking of using the dynamic feature of flatbuffers: flexbuffer to achieve this. Is this a reasonable usage of the library?

mikkelfj

unread,

Jan 2, 2022, 3:58:49 PM1/2/22

to FlatBuffers

I'm not sure what exactly you mean by dynamic in this context, maybe just that array sizes are not known in advance?

But if you have a write once, read many scenario, you can use FlatBuffers to read select parts of the array, but you might have to go slightly off the official approach which assumes that you have the entire buffer available.

In your example, all the array elements (or vector elements in FlatBuffer speak) are fixed size. Therefore the location of any array element can be computed as an offset to the address of the first element. Reading the container struct essentially amounts to reading the first part of the buffer to find the offset to the root table presumably holding the struct, and from that gain access to the array start offset. You typically won't get an offset from the FlatBuffer API, but you can convert a pointer into an offset using a bit of math.

You would also potentially have to convert endinanness if you access the array directly, but in all likelyhood you will need it unconverted since little endian is the norm these days.

However, if I were you, I would break down the information in multiple smaller buffers and then load the entire buffer or buffers as needed, or I would store the metadata in FlatBuffers and the raw arrays as simple flat files.

As to FlexBuffers, I don't know them very well, but they are generally best suited when you don't know the schema in advance more than they about changing content after the fact. They are also somewhat slower if that matters.

Mikkel

Message has been deleted

Dimitris Mantas

unread,

May 13, 2022, 9:31:07 PM5/13/22

to FlatBuffers

Hi. Total noob here with the same use case as the OP.

Could you please explain how to select part of a serialized file a bit more? Your post has got me very confused!

I have serialized a list of objects (each one has a different size) and ideally want to deserialize only the i-th object (in the original list order). How excalty would I go about doing that?

Thanks.

mikkelfj

unread,

May 14, 2022, 4:19:46 AM5/14/22

to FlatBuffers

Hi Dimitri,

The simplest approach is to use memory mapped files and just read the file as you normally would. Since I/O cannot happen efficiently in very small units, memory mapping should be just as efficient as reading a specific location of the file.

But it will not be efficient if your data is not localized and therefore you really need to understand flatbuffers data some deeper level to be successful.

You can also navigate data directly as I hinted at, but it is not for beginners, and it works best if data all have the same size so you don't have to chase offsets (pointers in the file).

The file format is presented here, but it isn't really a beginner topic:

https://github.com/dvidelabs/flatcc/blob/master/doc/binary-format.md

I have worked with large data sets using multiple buffers where one buffer contains and index into an array in other larger buffers. These were all memory mapped.

It does work, but if you data is fragmented, it can take hours for the filesystem and buffer cache to stabilize on the dataset that you hit frequently if you have a lot of traffic.

Mikkel

Dimitris Mantas

unread,

May 14, 2022, 12:38:42 PM5/14/22

to FlatBuffers

Hi Mikkel,

Thanks for replying!

Would it be possible to save the offset of each object as I serialize it (i.e., the integer value returned by the corresponding create() method), and then use that to deserialize the object in the resulting byte array? For instance, if the object at index 0 has an offset of 68 bytes, and the the object at index 1 144, would it be a valid assumption that the zero-th object is located 68 bytes from the end of the serialized data and has a length of 76?

Thanks.

Dimitris

mikkelfj

unread,

May 14, 2022, 1:37:58 PM5/14/22

to FlatBuffers

I can only speak for the C implemention in FlatCC.

Here the buffer is constructed via two modules: the builder and the emitter. The builder handles the interface to the user and any stacks necessary to track generating the correct data. The emitter act as a kind of mini-filesystem where data by default is added to a ring buffer. Once the buffer is completed, the data is copied from one or more pages in the buffer into the final contiguous memory buffer allocated or provided by the user.

It is relatively easy to replace the emitter with your own implementation and there is an example in the test directory where is not stored, but debug data printed instead.
You can hook into this system to know where data is placed and simple forward calls to an instance of the default emitter.
This requires some coordination: you need to know when data is emitted so you can correlate data with what the emitter is seeing. For example, adding an integer field will not emit data, but ending a table will. You can also create an emitter that stores data differently so it is no longer really a flatbuffer, but more suitable to your purposes. The idea with emitter was for example to enable compression or data transmission without materializing the buffer until a later stage.

A simpler approach may be to just look at the 'ref' variables returned by the builder. For example, with an end table call, you get a ref to that table. This ref is a negative virtual address relative to the end of the table. (FlatBuffers are constructed back to front for historical reasons). The 'ref' is never actually stored anywhere, but if you need to add a sub table to a table, or provide the root table to the end buffer call, you use the 'ref' as a handle. Now, you can translate this 'ref' into an absolute offset once you know the able size, or you can just keep it as is and compute the location relative to the end.

There is one caveat: FlatCC stores vtables at the very end of the buffer, at a positive virtual address range. This means the 'ref' is in fact relative to somewhere close to the end, but not really the end. The reason is the clustered vtables at the end are more cache efficient. It can be disabled by calling a method on the builder so you don't get this effect. Or, you can get the ref of the root table, then read the offset to the root table as the first 4 bytes in the final FlatBuffer (AFAIR). Then you can take the difference between this offset and the 'ref' for the root. This delta can be applied to all other 'refs' to get the absolute file offset.

FlatCC also generates a common reader header which includes logic for translating and offset to a pointer while traversing the buffer as it is being read. You can intercept this logic by replacing the common header with your own and thereby read on demand, for example to load a table from a database and copy it to a buffer, etc..

It is vastly simpler to use memory mapped files to achieve something similar, but it depends on the use case.

Mikkel

mikkelfj

unread,

May 14, 2022, 1:50:20 PM5/14/22

to FlatBuffers

BTW:

Googles flatc tool recently added support for a hex dump of a binary buffer, similar to the example in the binary format document I linked.
You could parse this hex dump to get locations.

You could also index the buffer by doing something similar yourself: simply write some code to traverse all the data in the buffer and convert the pointers to offsets relative to the buffer start.

Reply all

Reply to author

Forward