`Size Efficiency` Good Practices


Mehdi AMINI

Sep 26, 2016, 1:14:58 AM
to FlatBuffers
Hi, 

I've been experimenting with FlatBuffers recently, and I found it pretty nice to work with, but the output buffer size is fairly large, so I'm trying to see how to reduce it. I wonder if there is any general advice on this aspect?

There are a few generic things that I noticed. For instance, sometimes I have unions like:

union AnyObject { SomeComplexType, SomeOtherComplexType, SomeSimpleType }
table SomeSimpleType { value : uint }
table Object {
  Impl : AnyObject;
}

Now if "SomeSimpleType" contains only a single integer, I'd like to be able to "inline" it in the Object table instead of paying an indirection. This is already doable in practice to stick the integer instead of the offset, but it will fail the verifier. I could imagine a syntax like: union AnyObject { SomeComplexType, SomeOtherComplexType, SomeSimpleType : uint, ... }
where SomeSimpleType would always be "inlined".

Another thing I'd like is some "small vector" optimization: table Object { elts : [uint] }
Now if I have a lot of Objects with 1 or 2 elements in the vector, I'd be better off with: table Object { elt0 : uint, elt1 : uint, elts : [uint] } ; and use elt0 and elt1 unless I have more elements, at which point I leave these at 0 and use the vector instead.
Of course there needs to be some boilerplate to handle this in the writer/reader (a reader sketch follows below), and I wonder if such an optimization could be handled (with an opt-in in the schema somehow) with a more extended vtable.
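A reader-side helper for this scheme could look roughly like this (a C++ sketch; it assumes flatc-generated accessors elt0(), elt1() and elts() for the schema above, with 0 meaning "unused" as described):

#include <cstdint>
#include <vector>

std::vector<uint32_t> GetElts(const Object *obj) {
  std::vector<uint32_t> out;
  if (auto *vec = obj->elts()) {
    // Overflow case: the real vector is present.
    out.assign(vec->begin(), vec->end());
  } else {
    // Common case: up to two elements stored inline, 0 meaning "unused".
    if (obj->elt0()) out.push_back(obj->elt0());
    if (obj->elt1()) out.push_back(obj->elt1());
  }
  return out;
}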

Also, on the performance side, ~40% of the serialization time is spent in EndTable() right now. I suspect something could be done on this side as well, but I have not looked too closely yet.

Thanks,

-- 
Mehdi

mikkelfj

Sep 26, 2016, 9:00:40 AM
to FlatBuffers
Hi Mehdi,

I think it would be difficult to do optimizations like inlining integers, because they quickly become a lot of special cases that are slow to handle, although I understand your thinking.

I believe experience has shown that these kinds of optimizations are better avoided in favor of a post-compression algorithm. For example, JSON + gzip is competitive with msgpack both in terms of size and speed. FlatBuffers is much faster than this, but does not compress as well on large buffers. This is a tradeoff, and if size is really important beyond what zipping can do, Protocol Buffers is probably a better choice.

That said, I have considered a number of compression schemes that could be applied to flatbuffers, and I think this is doable, but you lose portability and direct access. In the flatcc project (for C), there is a pluggable emitter object, and one motivation for this separate step was compression. In practice, only very specialized applications would bother - such as streaming lots of data samples: if, for example, all elements of a 32-bit integer vector can be stored in 16 bits, then you can obviously do something there.

As for the EndTable performance: the reason this is time intensive is that vtables are being built, and optionally reused, at this step. If performance is really important, use structs instead. FlatBuffers takes about 10x longer to build than to read, and that is probably mostly due to table building.
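For example, in schema terms (a sketch; names made up) - a struct has a fixed layout, is stored inline, and needs no vtable:

struct Vec3 { x : float; y : float; z : float; }  // fixed layout, no vtable
table Monster { pos : Vec3; }                     // Vec3 is stored inline in Monster's data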

Mehdi AMINI

Sep 26, 2016, 12:58:25 PM
to mikkelfj, FlatBuffers
2016-09-26 6:00 GMT-07:00 mikkelfj <mik...@dvide.com>:
Hi Mehdi,

I think it would be difficult to do optimizations like inlining integers because it quickly becomes a lot of special cases that become slow to handle, although I understand your thinking.

Can you clarify what you mean by "slow to handle" here?
I can understand that it may be tricky for the "inline vector" case, but what about the union case? Inlining scalar unions (except long/ulong) seems both faster to read/write and more size efficient.


I believe experience has shown that these kinds of optimizations are better avoided in favor of a post-compression algorithm. For example, JSON + gzip is competitive with msgpack both in terms of size and speed. FlatBuffers is much faster than this, but does not compress as well on large buffers. This is a tradeoff, and if size is really important beyond what zipping can do, Protocol Buffers is probably a better choice.

Right, I'm comparing FlatBuffers to a custom VBR encoding right now (the flatbuffers are 2 times larger). I'm producing buffers that range from 1MB to 1GB, and both size and speed (encoding and decoding) are fairly important.
 

That said, I have considered a number of compression schemes that could be applied to flatbuffers, and I think this is doable, but you lose portability and direct access. In the flatcc project (for C), there is a pluggable emitter object, and one motivation for this separate step was compression. In practice, only very specialized applications would bother - such as streaming lots of data samples: if, for example, all elements of a 32-bit integer vector can be stored in 16 bits, then you can obviously do something there.

As for the EndTable performance: the reason this is time intensive is that vtables are being built, and optionally reused, at this step. If performance is really important, use structs instead. FlatBuffers takes about 10x longer to build than to read, and that is probably mostly due to table building.

Right, I wonder if this code could be made more efficient. For instance, PushElement does not seem to be inlined, and in a sequence of two PushElement<voffset_t> calls the second shouldn't have to realign the buffer.
Also, I wonder if starting from the end of the vtables vector would help a little: when writing a vector with 1M elements, the first elements add a new vtable at the end, and all the future elements are likely to reuse it, but right now it sits at the end and all the previous vtables have to be scanned first. Maybe another scheme with a hash table for storing the vtables could help. Or moving the last matched vtable to the front of the vector (temporal reuse)?

While I'm at it: vtables include the tail padding in the object size. I don't know if there is a good reason for that, but it prevents deduplicating otherwise identical ones.

Thanks,

-- 
Mehdi

mikkelfj

Sep 26, 2016, 3:31:32 PM
to FlatBuffers, mik...@dvide.com


On Monday, September 26, 2016 at 6:58:25 PM UTC+2, Mehdi AMINI wrote:


2016-09-26 6:00 GMT-07:00 mikkelfj <mik...@dvide.com>:
Hi Mehdi,

I think it would be difficult to do optimizations like inlining integers, because they quickly become a lot of special cases that are slow to handle, although I understand your thinking.

Can you clarify what you mean by "slow to handle" here?
 
It was a general consideration; I have not given a lot of thought to specific cases. But the general idea is that the generated code knows exactly the type and exactly what to do, so it only needs to check for presence in the vtable and do its thing with no branching. As soon as an additional if condition enters the code path, performance might start to drop 10%-50% depending on how frequent the access is. This may then be offset by avoiding an indirection via a pointer reference - and I am definitely not saying the current flatbuffer design is optimal - but any time you add complexity you add branching logic, and that is rarely a good thing. It is also worth noting that the ca. 10 field benchmark buffer is read in about 30ns. You can read a single integer in about 0.2ns (rough estimate), but if you miss the level 1 cache, you pay on the order of 30ns no matter what you do. So for anything but the smallest buffers you are not likely to see a lot of read performance gain no matter what you do - you will be constrained by memory bandwidth.

I can understand that it may be tricky for the "inline vector" case, but what about the union case? Inlining scalar unions (except long/ulong) seems both faster to read/write and more size efficient.
 
I was never a fan of the union design - I can see how it was done, building on existing primitives. But one thing is to implement unions differently, such as allowing inline struct unions; another is to make special cases depending on the data stored. A design I'd prefer would be the union type embedded within the table to avoid a secondary lookup, and to allow lists of unions. But then there is portability - and is it really an issue after all? I believe unions were designed to branch out at the top level for things like different tables in a message envelope.

Note: you can easily do your own custom inline unions with an enum and different structs in different fields where only one is present (this has been mentioned several times on this list, btw). This will have zero impact on performance and portability.
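A minimal sketch of that pattern (names made up); the reader checks the tag and only reads the matching struct field:

enum Kind : byte { None, Int, Pair }
struct IntVal { value : uint; }
struct PairVal { a : ushort; b : ushort; }
table Object {
  kind : Kind;
  int_val : IntVal;
  pair_val : PairVal;
}

Absent struct fields cost only a vtable slot, so only the active member takes space in the table.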
 


I believe experience has shown that these kinds of optimizations are better avoided in favor of a post-compression algorithm. For example, JSON + gzip is competitive with msgpack both in terms of size and speed. FlatBuffers is much faster than this, but does not compress as well on large buffers. This is a tradeoff, and if size is really important beyond what zipping can do, Protocol Buffers is probably a better choice.

Right, I'm comparing FlatBuffers to a custom VBR encoding right now (the flatbuffers are 2 times larger). I'm producing buffers that range from 1MB to 1GB, and both size and speed (encoding and decoding) are fairly important.
 
If you are interested in working on an encoder for the flatcc backend, I am all ears! I think the LZ4 compression scheme could easily be adapted to work on flatbuffer fields. Decoding would have to be segmented. Generated reader code could be modified to intercept reads outside a decoded range, but at a cost.


As for the EndTable performance: the reason this is time intensive is that vtables are being built, and optionally reused, at this step. If performance is really important, use structs instead. FlatBuffers takes about 10x longer to build than to read, and that is probably mostly due to table building.

Right, I wonder if this code could be made more efficient. For instance, PushElement does not seem to be inlined, and in a sequence of two PushElement<voffset_t> calls the second shouldn't have to realign the buffer.

I did the implementation for C and Wouter did the original implementation for C++. We build the tables differently - I build front to back (I believe this gives more consistent padding and vtable reuse, but that is a guess), and I also build and patch vtable offsets differently. Both implementations run between 600 and 700ns for building the ca. 10 field mixed-type benchmark buffer. I'm sure this can be optimized more, but the fact that two different implementations are so close suggests most of it is down to the vtable design (which has a lot of benefits). Sometimes it might make sense to turn off vtable reuse because it takes time to compute and vtables can be small.

The flatcc API also allows you to send structs as flatbuffers - I don't think the C++ version fully supports this. It can be used in high performance applications where you need a portable data layout, but not schema evolution.

Another approach is to use the low-level flatcc builder API, which allows you to construct a table with an explicit vtable. I have only ever used this to implement the normal vtable reuse logic, but say you know your data intimately: then you can build your own vtable and just tell the builder to use it repeatedly. No building of vtables, no checks for matches. This should be a whole lot faster - but this is in the untyped API, normally called by generated code.


 
Also, I wonder if starting from the end of the vtables vector would help a little: when writing a vector with 1M elements, the first elements add a new vtable at the end, and all the future elements are likely to reuse it, but right now it sits at the end and all the previous vtables have to be scanned first. Maybe another scheme with a hash table for storing the vtables could help. Or moving the last matched vtable to the front of the vector (temporal reuse)?

I don't think I understand what you are saying here, but let me explain how I see vtables:
A vtable is constructed along with the table being written. It is then checked against a runtime hash table, and if it misses, the vtable is written somewhere in the buffer and the table gets that location. If the hash table hits, the existing location is used instead and the recently built vtable is discarded. When you get to 1M elements, the vtable will often sit comfortably in a single level 1 cache line, and it doesn't matter where in the buffer it is stored. You might be referring to the C++ implementation's choice of storing a vtable before one table, but as soon as CPU caching gets to work, it matters little.

However, flatcc differs from Google's C++ implementation because by default it stores all vtables at the end of the buffer (except for nested flatbuffers, where this is not possible). This makes it very likely that the level 1, or at least level 2, cache is hot not only for a matching vtable but also for the next vtable being looked up. And during network transmission it is possible to make sure all vtables are shipped at once, so any table can be accessed without the entire buffer being available.


While I'm at it: vtables are including the tail padding in the object size, I don't know if there is a good reason for that, but it prevents from deduplicating otherwise identical ones.
 
You could argue that. It has also occurred to me. I believe the idea is related to prefetching data from a stream (and this is also likely why vtables are placed before a table in C++), but I think that in practice this isn't really a major use case compared to other problems related to streaming flatbuffers.
Regardless, I think it is good information to have, and not likely to matter much wrt. vtable reuse. The vtables are small, and a 1M vector will likely have only a few different vtables. What matters is that you don't have 1M vtables.
 
I have also had the opposite thought: if vtables were restricted to tables of the same type, you could add some interesting extensions that would otherwise be difficult.

Mikkel

mikkelfj

Sep 26, 2016, 4:13:48 PM
to FlatBuffers, mik...@dvide.com


On Monday, September 26, 2016 at 9:31:32 PM UTC+2, mikkelfj wrote:
 
Also, I wonder if starting from the end of the vtables vector would help a little: when writing a vector with 1M elements, the first elements add a new vtable at the end, and all the future elements are likely to reuse it, but right now it sits at the end and all the previous vtables have to be scanned first. Maybe another scheme with a hash table for storing the vtables could help. Or moving the last matched vtable to the front of the vector (temporal reuse)?

I don't think I understand what you are saying here, but let me explain how I see vtables:

Ahh, sorry, now I think I understand - I forgot that the C++ implementation actually scans a list of vtables instead of using a hash table. So yes, you could use a hash table instead, as flatcc does and as I described before. No one has really compared C and C++ performance on really large buffers. As long as the vtable count is reasonably low, scanning the vtables is not necessarily worse, because it also costs to compute a hash function, but ultimately a hash table is more scalable. flatcc also has a concept of a maximum vtable cache size, after which it starts to re-emit vtables - this can be significant in embedded systems with memory constraints.
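For illustration, a hash-based vtable cache along these lines could be as simple as the following sketch (not flatcc's actual code; keyed on the raw vtable bytes):

#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

struct VtableCache {
  // Maps raw vtable bytes to the buffer offset where that vtable was emitted.
  std::unordered_map<std::string, uint32_t> offsets;

  // Returns the offset of an identical, already-emitted vtable, or emits
  // this one via `emit` and remembers its offset.
  template <typename EmitFn>
  uint32_t Intern(const uint8_t *vt, size_t size, EmitFn emit) {
    std::string key(reinterpret_cast<const char *>(vt), size);
    auto it = offsets.find(key);
    if (it != offsets.end()) return it->second;  // hit: reuse, discard the new vtable
    uint32_t off = emit(vt, size);               // miss: write the vtable to the buffer
    offsets.emplace(std::move(key), off);
    return off;
  }
};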

 

Mehdi AMINI

Sep 26, 2016, 4:17:41 PM
to mikkelfj, FlatBuffers
2016-09-26 12:31 GMT-07:00 mikkelfj <mik...@dvide.com>:


On Monday, September 26, 2016 at 6:58:25 PM UTC+2, Mehdi AMINI wrote:


2016-09-26 6:00 GMT-07:00 mikkelfj <mik...@dvide.com>:
Hi Mehdi,

I think it would be difficult to do optimizations like inlining integers, because they quickly become a lot of special cases that are slow to handle, although I understand your thinking.

Can you clarify what you mean by "slow to handle" here?
 
It was a general consideration; I have not given a lot of thought to specific cases. But the general idea is that the generated code knows exactly the type and exactly what to do, so it only needs to check for presence in the vtable and do its thing with no branching. As soon as an additional if condition enters the code path, performance might start to drop 10%-50% depending on how frequent the access is. This may then be offset by avoiding an indirection via a pointer reference - and I am definitely not saying the current flatbuffer design is optimal - but any time you add complexity you add branching logic, and that is rarely a good thing. It is also worth noting that the ca. 10 field benchmark buffer is read in about 30ns. You can read a single integer in about 0.2ns (rough estimate), but if you miss the level 1 cache, you pay on the order of 30ns no matter what you do. So for anything but the smallest buffers you are not likely to see a lot of read performance gain no matter what you do - you will be constrained by memory bandwidth.

I can understand that it may be tricky for the "inline vector" case, but what about the union case? Inlining scalar unions (except long/ulong) seems both faster to read/write and more size efficient.
 
I was never a fan of the union design - I can see how it was done, building on existing primitives. But one thing is to implement unions differently, such as allowing inline struct unions; another is to make special cases depending on the data stored. A design I'd prefer would be the union type embedded within the table to avoid a secondary lookup, and to allow lists of unions. But then there is portability - and is it really an issue after all? I believe unions were designed to branch out at the top level for things like different tables in a message envelope.

Note: you can easily do your own custom inline unions with an enum and different structs in different fields where only one is present (this has been mentioned several times on this list, btw). This will have zero impact on performance and portability.

Right, I know I can do a bunch of custom things like the enum + struct fields. But I like having a DSL to describe my data structures, and this seems to go in the opposite direction. There is no real reason this structure (enum + fields) couldn't be generated by flatc for me.

 
 


I believe experience has shown that these kinds of optimizations are better avoided in favor of a post-compression algorithm. For example, JSON + gzip is competitive with msgpack both in terms of size and speed. FlatBuffers is much faster than this, but does not compress as well on large buffers. This is a tradeoff, and if size is really important beyond what zipping can do, Protocol Buffers is probably a better choice.

Right, I'm comparing FlatBuffers to a custom VBR encoding right now (the flatbuffers are 2 times larger). I'm producing buffers that range from 1MB to 1GB, and both size and speed (encoding and decoding) are fairly important.
 
If you are interested in working on an encoder for the flatcc backend, I am all ears! I think the LZ4 compression scheme could easily be adapted to work on flatbuffer fields. Decoding would have to be segmented. Generated reader code could be modified to intercept reads outside a decoded range, but at a cost.


As for the EndTable performance: the reason this is time intensive is that vtables are being built, and optionally reused, at this step. If performance is really important, use structs instead. FlatBuffers takes about 10x longer to build than to read, and that is probably mostly due to table building.

Right, I wonder if this code could be made more efficient. For instance, PushElement does not seem to be inlined, and in a sequence of two PushElement<voffset_t> calls the second shouldn't have to realign the buffer.

I did the implementation for C and Wouter did the original implementation for C++. We build the tables differently - I build front to back (I believe this gives more consistent padding and vtable reuse, but that is a guess), and I also build and patch vtable offsets differently. Both implementations run between 600 and 700ns for building the ca. 10 field mixed-type benchmark buffer. I'm sure this can be optimized more, but the fact that two different implementations are so close suggests most of it is down to the vtable design (which has a lot of benefits). Sometimes it might make sense to turn off vtable reuse because it takes time to compute and vtables can be small.

The flatcc API also allows you to send structs as flatbuffers - I don't think the C++ version fully supports this. It can be used in high performance applications where you need a portable data layout, but not schema evolution.

Another approach is to use the low-level flatcc builder API, which allows you to construct a table with an explicit vtable. I have only ever used this to implement the normal vtable reuse logic, but say you know your data intimately: then you can build your own vtable and just tell the builder to use it repeatedly. No building of vtables, no checks for matches. This should be a whole lot faster - but this is in the untyped API, normally called by generated code.



Oh sweet: custom handling of vtables is definitely something that could help a lot here!
 
 
Also, I wonder if starting from the end of the vtables vector would help a little: when writing a vector with 1M elements, the first elements add a new vtable at the end, and all the future elements are likely to reuse it, but right now it sits at the end and all the previous vtables have to be scanned first. Maybe another scheme with a hash table for storing the vtables could help. Or moving the last matched vtable to the front of the vector (temporal reuse)?

I don't think I understand what you are saying here, but let me explain how I see vtables:
A vtable is constructed along with the table being written. It is then checked against a runtime hash table, and if it misses, the vtable is written somewhere in the buffer and the table gets that location. If the hash table hits, the existing location is used instead and the recently built vtable is discarded. When you get to 1M elements, the vtable will often sit comfortably in a single level 1 cache line, and it doesn't matter where in the buffer it is stored. You might be referring to the C++ implementation's choice of storing a vtable before one table, but as soon as CPU caching gets to work, it matters little.

However, flatcc differs from Google's C++ implementation because by default it stores all vtables at the end of the buffer (except for nested flatbuffers, where this is not possible). This makes it very likely that the level 1, or at least level 2, cache is hot not only for a matching vtable but also for the next vtable being looked up. And during network transmission it is possible to make sure all vtables are shipped at once, so any table can be accessed without the entire buffer being available.


Let me explain a bit more. Imagine that I have written 200 vtables already. They are scattered "somewhere" in the buffer (with the C++ implementation), and the runtime keeps a vector of offsets to these already-emitted vtables.
Now I'm about to emit a vector of 1M elements. It means that the runtime will repeat this sequence of operations 1M times:
1) build a vtable for the element I just emitted;
2) go linearly through the 200 vtables, and for the ones that have the same size, do a memcmp;
3) if a matching vtable is found, pop the one just created out of the vector and write the offset to the found one.

Let's imagine that my 1M elements in the array will generate ~5 variants of the vtable, but none of these variants matches any of the 200 already emitted that the runtime keeps track of.
The first time we emit these 5 vtables, they will stay in the buffer and be reused for the future elements. For this purpose they are added *at the end* of the vector containing the offsets to all the vtables already emitted.

If we agree with this description, I come to the issue I mentioned: for every new element I push to this array, a new vtable is created, and it is compared to the first 200 in the vector before hitting the right one towards the end.

This is why I mentioned "temporal reuse": favor looking first at the most recently emitted vtables (starting from the end of the vector), or reorganize the vector to check the most recently matched first.
Make sense?

 


While I'm at it: vtables include the tail padding in the object size. I don't know if there is a good reason for that, but it prevents deduplicating otherwise identical ones.
 
You could argue that. It has also occurred to me. I believe the idea is related to prefetching data from a stream (and this is also likely why vtables are placed before a table in C++),

Except that:

1) You don't need to prefetch the tail padding to read an object, so why include it?
2) Vtables are placed before the first table that uses them in C++, but for all the reuse cases they are much further away. (Indeed, if I'd designed this, I'd probably try to keep the vtables "on the side" until the buffer is done and emit them first. The vtable offset would then be from the start of the buffer, and 16 bits would probably be enough.)

-- 
Mehdi


 

mikkelfj

Sep 26, 2016, 4:19:57 PM
to FlatBuffers, mik...@dvide.com
for storing the vtables could help. Or moving the last matched vtable to the front of the vector (temporal reuse)?

The flatcc hash table actually uses a move-to-front strategy: if there is a hash collision, the most recently found match is moved to the front of the searched list. In the C++ implementation, it would probably be very simple to cache the most recently matched vtable and compare against it first, for significant gains. Perhaps even faster than using a hash table in most cases.
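Roughly like this (a sketch of the idea, not the actual flatbuffers C++ code):

#include <cstdint>
#include <cstring>
#include <vector>

struct VtableRef { const uint8_t *bytes; size_t size; uint32_t offset; };

struct VtableScanner {
  std::vector<VtableRef> vtables;
  size_t last_hit = 0;  // index of the most recently matched vtable

  // Returns the offset of a matching vtable, or 0 if there is none
  // (the caller then emits a new vtable and appends it).
  uint32_t Find(const uint8_t *candidate, size_t size) {
    auto matches = [&](const VtableRef &v) {
      return v.size == size && std::memcmp(v.bytes, candidate, size) == 0;
    };
    // Temporal reuse: check the most recently matched vtable first.
    if (last_hit < vtables.size() && matches(vtables[last_hit]))
      return vtables[last_hit].offset;
    for (size_t i = 0; i < vtables.size(); ++i) {
      if (matches(vtables[i])) { last_hit = i; return vtables[i].offset; }
    }
    return 0;
  }
};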

Maxim Zaks

Oct 3, 2016, 5:55:22 AM
to FlatBuffers
I believe the main problem with your proposal is that you lose backward and forward compatibility.
What you suggest is basically a union of structs. In this case the type/enum in front can tell how long the actual type is, but you can't do schema evolution with such a layout.

The main problem I have with unions is that the type/enum is 1 byte and the offset is 4 bytes = 5 bytes, which will in most cases get 3 bytes of padding.
I would rather have a 4-byte enum, making 8 bytes for every union. I used unions as a way to encode different types of commands for backend/frontend communication. To be honest, 128 values (a signed byte) can become insufficient in very big cases.
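To spell out the arithmetic, a union field typically ends up laid out like this (a worst-case illustration; actual padding depends on the neighboring fields):

  type tag : 1 byte   (union type enum)
  padding  : 3 bytes  (to align the following offset)
  offset   : 4 bytes  (to the union member table)
  total    : 8 bytes, of which only 5 carry information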

mikkelfj

Oct 3, 2016, 7:01:19 AM
to FlatBuffers

The main problem I have with unions is that the type/enum is 1 byte and the offset is 4 bytes = 5 bytes, which will in most cases get 3 bytes of padding.
I would rather have a 4-byte enum, making 8 bytes for every union. I used unions as a way to encode different types of commands for backend/frontend communication. To be honest, 128 values (a signed byte) can become insufficient in very big cases.

There is some effort put into placing the type byte at the end of the table when possible, so it is not all wasted, but I agree it is not ideal.

I developed the type hash identifiers and have half an implementation for C++, but it's only fully implemented in flatcc for C.
These take the namespace and name of a table and produce a hashed 32-bit value that can be used as a flatbuffer identifier. Or it can actually be used anywhere an unsigned 32-bit value is used (I wish the integers were named uint32; I can't remember the other stuff).

In this way you can avoid the union altogether when the purpose is to send different top-level table types across the wire. There is a risk of conflicts, but it could be checked - or, normally, it would generate a compile error in the generated code, at least for C/C++.
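For reference, a sketch of such a type hash (flatcc hashes the fully qualified table name down to 32 bits; I believe it uses FNV-1a, but treat the details here as illustrative rather than the exact flatcc algorithm):

#include <cstdint>
#include <string>

// FNV-1a over the fully qualified table name, e.g. "MyGame.Monster".
uint32_t TypeHash(const std::string &qualified_name) {
  uint32_t hash = 2166136261u;  // FNV offset basis
  for (unsigned char c : qualified_name) {
    hash ^= c;
    hash *= 16777619u;          // FNV prime
  }
  return hash;
}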

Mehdi AMINI

Oct 3, 2016, 11:26:20 AM
to Maxim Zaks, FlatBuffers
2016-10-03 2:55 GMT-07:00 'Maxim Zaks' via FlatBuffers <flatb...@googlegroups.com>:
I believe the main problem with your proposal is that you lose backward and forward compatibility.

Forward compatibility is not preserved, obviously, but why not backward compatibility?
 
What you suggest is basically a union of structs. In this case the type/enum in front can tell how long the actual type is, but you can't do schema evolution with such a layout.

I only suggested offering inlining when size(struct) <= size(offset). The size is unchanged.
Also, I believe we can't have variable-sized records inside a vector, right?

-- 
Mehdi

 


Wouter van Oortmerssen

Oct 3, 2016, 4:19:13 PM
to Mehdi AMINI, Maxim Zaks, FlatBuffers
I agree the way unions are forced to work with tables is inconvenient and something that could have been designed better. Making changes that affect the binary encoding at this point is problematic, and I doubt it is something we'll do unless the benefits are wide-ranging and very compelling.

A typed system like FlatBuffers is not great at emulating "variant" style data using unions, which is one of the reasons I am working on a schemaless variant (see ongoing work here: https://github.com/google/flatbuffers/commit/f136590c9f06ffeca87689ec920b215aed0f294b).
It will encode unions of simple types much more compactly than FlatBuffers does, but it will take a while to mature and become available in all languages.

Maxim Zaks

Oct 5, 2016, 6:45:58 AM
to FlatBuffers, maxim...@googlemail.com
Imagine you have a table T1:

table T1 {
  b1 : bool;
  b2 : bool;
}

The data inside T1 is only 2 bytes big.
So if we reference T1 in T2:

table T2 {
  t1 : T1;
}

it would be better to store the content of T1 inline (2 bytes) instead of a reference to T1 (4 bytes), and a compiler could figure out such cases and generate T2 so that it stores T1 inline.
However, if we change T1:

table T1 {
  b1 : bool;
  b2 : bool;
  i1 : int;
}

inlining T1 becomes a bad idea. The compiler will generate a T2 which does not inline T1 but stores a reference to it. Here comes the compatibility problem.
Backward compatibility means new code can read old data. New code for T2 will expect a 4-byte reference pointer, but old data will have only 2 bytes of inlined T1 data. So we have a bug.
Forward compatibility means old code can read new data. Old T2 code will read 2 bytes out of the 4-byte pointer and interpret them as two boolean values. So this is also a bug.

I gave a talk on FlatBuffers last month:

On slide 23 you can see a schema, and on slide 27 how a small example would be stored under the hood.
I hope it helps you gain an overview of how FlatBuffers stores data internally.

Cheers,

Maxim

Mehdi AMINI

Oct 5, 2016, 11:24:07 AM
to Maxim Zaks, FlatBuffers
By the way, for more information on my use case: I reported on my experiment comparing FlatBuffers with our custom variable-sized-integer based format on our project mailing list, here: http://lists.llvm.org/pipermail/llvm-dev/2016-September/105273.html

-- 
Mehdi



mikkelfj

Oct 5, 2016, 12:28:12 PM
to FlatBuffers, maxim...@googlemail.com


On Wednesday, October 5, 2016 at 5:24:07 PM UTC+2, Mehdi AMINI wrote:
By the way, for more information on my use case: I reported on my experiment comparing FlatBuffers with our custom variable-sized-integer based format on our project mailing list, here: http://lists.llvm.org/pipermail/llvm-dev/2016-September/105273.html

Interesting writeup.

If size is a concern, custom compression dictionaries by the author of LZ4 might be worth investigating. It does require segmented reading, which may or may not work well.

Wouter van Oortmerssen

Oct 5, 2016, 1:09:41 PM
to Mehdi AMINI, Maxim Zaks, FlatBuffers
What you allude to at the end may be worth exploring: use FlatBuffers for the overall structure, and a custom encoding for the "leaf" data. That way you presumably get more compactness, and you still get the advantages of random access.
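For instance (a hypothetical sketch, not an existing API): bulky integer payloads could be stored as a LEB128-style varint stream in an ordinary [ubyte] field, keeping FlatBuffers tables for the structure around them:

#include <cstdint>
#include <vector>

// Append v to out as a LEB128-style varint (7 bits per byte, high bit = "more").
void VarintEncode(uint64_t v, std::vector<uint8_t> &out) {
  while (v >= 0x80) { out.push_back(uint8_t((v & 0x7f) | 0x80)); v >>= 7; }
  out.push_back(uint8_t(v));
}

// Decode one varint, advancing p past the bytes consumed.
uint64_t VarintDecode(const uint8_t *&p) {
  uint64_t v = 0;
  int shift = 0;
  while (*p & 0x80) { v |= uint64_t(*p++ & 0x7f) << shift; shift += 7; }
  v |= uint64_t(*p++) << shift;
  return v;
}

The tradeoff is losing random access within the leaf vector, while the surrounding tables keep theirs.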

As for the efficiency of de-duplication: yes, this was optimized for small buffers, as they have relatively little variety in vtables, so a linear search is fastest (and, importantly, causes fewer allocations). For large buffers it would be better to use a map or unordered_map. These could be added as options.