Hi Mehdi,

I think it would be difficult to do optimizations like inlining integers because it quickly becomes a lot of special cases that become slow to handle, although I understand your thinking.

I believe experience has shown that these kinds of optimizations are better avoided in favor of a post-compression algorithm. For example, JSON + gzip is competitive with msgpack both in size and speed. FlatBuffers is much faster than this, but does not compress as well on large buffers. This is a tradeoff, and if size is really important beyond what zipping can do, Protocol Buffers is probably a better choice.

That said, I have considered a number of compression schemes that could be applied to FlatBuffers. I think this is doable, but you lose the portability and direct access. In the flatcc project (for C), there is a pluggable emitter object, and one motivation for this separate step was compression. In practice, only very specialized applications would bother - such as streaming lots of data samples. If, for example, all elements of a 32-bit integer vector can be stored in 16 bits, then you can obviously do something there.

As for the EndTable performance: the reason this is time-intensive is that vtables are being built and optionally reused at this step. If performance is really important, use structs instead. A FlatBuffer takes about 10x longer to build than to read, and that is probably mostly due to table building.
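To illustrate the "use structs instead" advice, a hypothetical schema sketch (names invented, not from the thread): structs have a fixed inline layout, so serializing them involves no vtable construction at all.

```
// Hypothetical sketch: a struct is fixed-layout and stored inline,
// so writing it requires no vtable building or reuse checks.
struct Sample {
  timestamp : ulong;
  value     : float;
}
table Batch {
  samples : [Sample];   // vector of inline structs: no per-element vtable work
}
```

The tradeoff is that structs cannot evolve: fields can never be added or removed, which is exactly the schema-evolution flexibility that tables (and their vtables) pay for.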
2016-09-26 6:00 GMT-07:00 mikkelfj <mik...@dvide.com>:

> Hi Mehdi, I think it would be difficult to do optimizations like inlining integers because it quickly becomes a lot of special cases that become slow to handle, although I understand your thinking.

Can you clarify what you mean by "slow to handle" here?
I can understand that it may be tricky for the "inline vector" case, but what about the union case? Inlining a scalar union (except long/ulong) seems both faster to read/write and more size-efficient.
> I believe experience has shown that these kinds of optimizations are better avoided in favor of a post-compression algorithm. For example, JSON + gzip is competitive with msgpack both in size and speed. FlatBuffers is much faster than this, but does not compress as well on large buffers. This is a tradeoff, and if size is really important beyond what zipping can do, Protocol Buffers is probably a better choice.

Right, I'm comparing FlatBuffers to a custom VBR encoding right now (the FlatBuffers are 2 times larger). I'm producing buffers that range from 1MB to 1GB, and both size and speed (encoding and decoding) are fairly important.
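The custom variable-sized-integer encoding mentioned here is not spelled out in the thread, but the general idea can be sketched as a minimal LEB128-style varint (an assumption, not Mehdi's actual format): each byte carries 7 payload bits plus a continuation bit, so small values take a single byte instead of a fixed 4 or 8.

```cpp
#include <cstdint>
#include <vector>

// Minimal LEB128-style varint sketch (hypothetical, for illustration only):
// each output byte holds 7 bits of the value; the high bit signals that
// more bytes follow.
std::vector<uint8_t> EncodeVarint(uint64_t v) {
  std::vector<uint8_t> out;
  do {
    uint8_t byte = v & 0x7F;
    v >>= 7;
    if (v) byte |= 0x80;  // continuation bit: more bytes follow
    out.push_back(byte);
  } while (v);
  return out;
}

uint64_t DecodeVarint(const std::vector<uint8_t>& in) {
  uint64_t v = 0;
  int shift = 0;
  for (uint8_t byte : in) {
    v |= uint64_t(byte & 0x7F) << shift;
    shift += 7;
  }
  return v;
}
```

This is the kind of scheme that can halve buffer size when most values are small, at the cost of losing the fixed-layout direct access that FlatBuffers provides.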
> As for the EndTable performance: the reason this is time-intensive is that vtables are being built and optionally reused at this step. If performance is really important, use structs instead. A FlatBuffer takes about 10x longer to build than to read, and that is probably mostly due to table building.

Right, I wonder if this code could be made more efficient. For instance, PushElement does not seem to be inlined, and in a sequence of two PushElement<voffset_t> calls the second shouldn't have to realign the buffer.
Also I wonder if starting from the end of the vtables vector would not help a little: when writing a vector with 1M elements, the first elements add a new vtable at the end, but all the future elements are likely to reuse it; right now it sits at the end, and all the previous vtables have to be scanned first. Maybe another scheme with a hash table for storing the vtables could help. Or moving the last matched vtable to the front of the vector (temporal reuse)?
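The hash-table idea suggested here can be sketched as follows (a hypothetical stand-alone illustration, not the actual FlatBuffers implementation): key each serialized vtable by its bytes so that finding a reusable copy is an O(1) lookup rather than a linear scan over all previously written vtables.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch of hash-table vtable deduplication.
// A vtable is modeled as its sequence of 16-bit entries.
class VTableCache {
 public:
  // Returns the buffer offset of an identical vtable seen earlier,
  // or records `offset` as the canonical copy and returns it.
  uint32_t Intern(const std::vector<uint16_t>& vtable, uint32_t offset) {
    std::string key(reinterpret_cast<const char*>(vtable.data()),
                    vtable.size() * sizeof(uint16_t));
    auto it = cache_.find(key);
    if (it != cache_.end()) return it->second;  // hit: reuse existing vtable
    cache_.emplace(std::move(key), offset);     // miss: remember this one
    return offset;
  }

 private:
  std::unordered_map<std::string, uint32_t> cache_;
};
```

With 1M identical table writes this turns 1M linear scans into 1M hash lookups; whether it wins in practice depends on how many distinct vtables a real buffer actually contains.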
While I'm at it: vtables include the tail padding in the object size. I don't know if there is a good reason for that, but it prevents deduplicating otherwise identical ones.
On Monday, September 26, 2016 at 6:58:25 PM UTC+2, Mehdi AMINI wrote:

> > I think it would be difficult to do optimizations like inlining integers because it quickly becomes a lot of special cases that become slow to handle, although I understand your thinking.
>
> Can you clarify what you mean by "slow to handle" here?

It was a general consideration; I have not given a lot of thought to specific cases. But the general idea is that the generated code knows exactly the type and exactly what to do, so it only needs to check for presence in the vtable and do its thing with no branching. As soon as an additional if condition enters the code path, performance might start to drop 10%-50% depending on how frequent the access is. This may then be offset by avoiding an indirection via a pointer reference - and I am definitely not saying the current FlatBuffers design is optimal, but any time you add complexity you add branching logic, and that is rarely a good thing. It is also worth noting that reading the ca. 10-field benchmark buffer is done in about 30ns. You can read a single integer in about 0.2ns (rough estimate), and if you miss the level 1 cache, you hit 30ns no matter what you do. So for anything but the smallest buffers, you are not likely to see a lot of read performance gain no matter what you do - you will be constrained by memory bandwidth.

> I can understand that it may be tricky for the "inline vector" case, but what about the union case? Inlining a scalar union (except long/ulong) seems both faster to read/write and more size-efficient.

I was never a fan of the union design - I can see how it was done, building on existing primitives. But one thing is to implement unions differently, such as allowing inline struct unions; another thing is to make special cases depending on the data stored.
Another design I'd prefer would be the union type embedded within the table to avoid a secondary lookup and to allow lists of unions. But then there is portability, and is it really an issue after all? I believe unions were designed to branch out at the top level for things like different tables in a message envelope.

Note: you can easily do your own custom inline unions with an enum and different struct fields where only one is present (this has been mentioned several times on this list, btw). This will have zero impact on performance and portability.
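The hand-rolled inline "union" described here could look something like this hypothetical schema sketch (names invented): an enum discriminates which of the struct fields carries the payload.

```
// Hypothetical sketch of a manual inline union: `kind` selects which
// struct field is meaningful; the others are simply left absent.
enum ShapeKind : byte { NONE, CIRCLE, RECT }
struct Circle { radius : float; }
struct Rect   { w : float; h : float; }
table Shape {
  kind   : ShapeKind;
  circle : Circle;   // stored inline when present
  rect   : Rect;     // an absent struct field costs only a vtable slot
}
```

Because structs are stored inline in the table, reading the selected payload needs no extra indirection, unlike the built-in union's offset-based layout.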
> Right, I'm comparing FlatBuffers to a custom VBR encoding right now (the FlatBuffers are 2 times larger). I'm producing buffers that range from 1MB to 1GB, and both size and speed (encoding and decoding) are fairly important.

If you are interested in working on an encoder on the flatcc backend, I am all ears! I think the LZ4 compression scheme is easily adapted to work on FlatBuffers fields. Decoding would have to be segmented. Generated reader code could be modified to intercept reads outside a decoded range, but at a cost.

> Right, I wonder if this code could be made more efficient. For instance, PushElement does not seem to be inlined, and in a sequence of two PushElement<voffset_t> calls the second shouldn't have to realign the buffer.

I did the implementation for C and Wouter did the original implementation for C++. We build the tables differently - I go front to back (I believe this gives more consistent padding and vtable reuse, but that is a guess), and I also build and patch vtable offsets differently. The resulting performance of the two implementations both runs between 600 and 700ns for building the ca. 10-field mixed-type benchmark buffer.
I'm sure this can be optimized more, but the fact that the two different implementations are so close suggests most of it is down to the vtable design (which has a lot of benefits). Sometimes it might make sense to turn off vtable reuse because it takes time to compute and vtables can be small.

The flatcc API also allows you to send structs as FlatBuffers - I don't think the C++ version fully supports this. But this can be used in high-performance applications where you need a portable data layout, but not schema evolution.

Another approach is to use the low-level flatcc builder API, which allows you to construct a table with an explicit vtable. I have only ever used this to implement the normal vtable reuse logic, but say you know your data intimately: then you can build your own vtable and just tell the builder to use that vtable repeatedly. No building of vtables, no checks for matches. This should be a whole lot faster - but this is in the untyped API, normally called by generated code.
> Also I wonder if starting from the end of the vtables vector would not help a little... Maybe another scheme with a hash table for storing the vtables could help. Or moving the last matched vtable to the front of the vector (temporal reuse)?

I don't think I understand what you are saying here, but let me explain how I see vtables: a vtable is constructed along with the table being written. It is then checked against a runtime hash table, and if it misses, the vtable is written somewhere in the buffer and the table gets that location. If the hash table hits, the existing location is used instead and the recently built vtable is discarded. When you get to 1M elements, the vtable will often sit comfortably in a single level 1 cache line, and it doesn't matter where in the buffer it is stored. You might be referring to the C++ implementation's choice of storing a vtable before one table, but as soon as CPU caching gets to work, it matters little.

However, flatcc differs from Google's C++ implementation because it by default stores all vtables at the end of the buffer (except for nested FlatBuffers, where this is not possible). This makes it very likely that the level 1, or at least level 2, cache is hot not only for a matching vtable but also for the next vtable being looked up. And, during network transmission, it is possible to make sure all vtables are shipped at once, so any table can be accessed without the entire buffer being available.
> While I'm at it: vtables include the tail padding in the object size. I don't know if there is a good reason for that, but it prevents deduplicating otherwise identical ones.

You could argue that. It has also occurred to me. I believe the idea is related to prefetching data from a stream (and this is also likely why vtables are placed before a table in C++), but I think that in practice this isn't really a major use case compared to other problems related to streaming FlatBuffers. Regardless, I think it is good information to have, and not likely to matter much wrt. vtable reuse. The vtables are small, and a 1M-element vector will likely have only a few different vtables. What matters is that you don't have 1M vtables.

I have also had the opposite thought: if vtables were restricted to tables of the same type, you could add some interesting extensions that would otherwise be difficult.

Mikkel
--
You received this message because you are subscribed to the Google Groups "FlatBuffers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to flatbuffers+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
> I would rather have a 4-byte enum, and thus 8 bytes for every union. I used unions as a way to encode different types of commands for backend/frontend communication. To be honest, 128 (signed byte) can become insufficient in very big cases.
I believe the main problem with your proposal is that you lose backward and forward compatibility. What you suggest is basically a union of structs. In this case the type/enum in front can tell how long the actual type is, but you can't do schema evolution with such a layout.
The main problem I have with unions is that the type/enum is 1 byte and the offset is 4 bytes = 5 bytes, which will in most cases get 3 bytes of padding. I would rather have a 4-byte enum, and thus 8 bytes for every union. I used unions as a way to encode different types of commands for backend/frontend communication. To be honest, 128 (signed byte) can become insufficient in very big cases.
On Monday, September 26, 2016 at 07:14:58 UTC+2, Mehdi AMINI wrote:

> Hi,
>
> I've been experimenting with FlatBuffers recently, and I found it pretty nice to work with, but the output buffer size is fairly large, so I'm trying to see how to reduce it. I wonder if there is any generic advice on this aspect?
>
> There are a few generic things that I noticed. For instance, sometimes I have unions like:
> union AnyObject { SomeComplexType, SomeOtherComplexType, SomeSimpleType }
> table SomeSimpleType { value : uint }
> table Object { Impl : AnyObject; }
>
> Now if "SomeSimpleType" contains only a single integer, I'd like to be able to "inline" it in the Object table instead of paying for an indirection. This is already doable in practice by sticking the integer in place of the offset, but it will fail the verifier. I could imagine a syntax like:
>
> union AnyObject { SomeComplexType, SomeOtherComplexType, SomeSimpleType : uint, ... }
>
> where SomeSimpleType would always be "inlined".
>
> Another thing I'd like is some "small vector" optimization:
>
> table Object { elts : [uint] }
>
> Now if I have a lot of Objects with 1 or 2 elements in the vector, I'd be better off with:
>
> table Object { elt0 : uint, elt1 : uint, elts : [uint] }
>
> and use elt0 and elt1 unless I have more, at which point I leave these at 0 and use the vector instead. Of course there needs to be some boilerplate to handle this in the writer/reader, and I wonder if such an optimization could be handled (with an opt-in in the schema somehow) with a more extended vtable.
>
> Also, on the performance side, ~40% of the serialization time is spent in EndTable() right now. I suspect something could be done on this side as well, but I have not looked too closely yet.
>
> Thanks,
>
> --
> Mehdi
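The reader-side boilerplate for the "small vector" idea quoted above could look like this minimal stand-alone sketch (plain structs stand in for generated FlatBuffers accessors; names are invented): prefer the inline fields and fall back to the out-of-line vector only when there are more elements.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a decoded Object table: up to two elements
// live in the inline fields; larger vectors use `rest` instead.
struct ObjectView {
  uint32_t elt0 = 0, elt1 = 0;
  uint8_t inline_count = 0;     // 0 means "use the vector", 1-2 means inline
  std::vector<uint32_t> rest;   // stands in for the [uint] vector field
};

uint32_t ElementAt(const ObjectView& o, size_t i) {
  if (o.inline_count > 0)              // inline fast path: no indirection
    return i == 0 ? o.elt0 : o.elt1;
  return o.rest[i];                    // general path: out-of-line vector
}
```

The point of the optimization is that for the common 1-2 element case the reader never touches the vector's separate allocation, saving both the offset indirection and the vector's size/alignment overhead.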
table T1{
b1 : bool;
b2 : bool;
}
table T2{
t1 : T1;
}
table T1{
b1 : bool;
b2 : bool;
i1 : int;
}
By the way, for more information on my use case, I reported my experiment comparing FlatBuffers with our custom variable-sized-integer based format on our project mailing list here: http://lists.llvm.org/pipermail/llvm-dev/2016-September/105273.html