Schemaless FlatBuffers

1,270 views
Skip to first unread message

mikkelfj

unread,
Aug 15, 2016, 3:05:44 PM8/15/16
to FlatBuffers
Continuing from other thread

Hi Wouter, thanks for explaining.

> I feel that in-place reading of data with no parsing / object allocation is a significant enough improvement over msgpack.

I figured as much, but I wonder in terms of practical use - memory and speed - but you'll never know without trying - its just that there are so many formats that is becoming a major issue.

I think looking into a shadow vtable with types and names may also be an option, extending the current format and resctricting vtables to same type tables. Disregarding schemaless, the fb format would benefit from being able to decide if a voffset is a specific reference type for copying data etc. But preferably with front to back support.

You suggestion about storing the root offset last is in line with what I wrote earlier: front to back is insufficient, you also need each reference to point to the tail and have the header placed there. So yes, this may sound weird, but is necessary.

In terms of caching, I believe modern intel achitectures work equally well scanning back to front and front to back. Random access beyond one page is never efficient because prefetching is disabled to prevent unnecessary page misses - or something along those lines. For simpler devices there are no prefetching and back to front has same performance. So caching isn't a big deal wrt. front to back vs back to front. In fact, binary seach has been shown to be faster than linear scanning in nearly all array sizes provided an efficient implementation (which is rarely seen though).

As to top down, there is a need for buffering and when this is done, you still get you offset size optimization because you can adjust at the end (I presume without studying details) - this can also be done with integer arrays. In my FlatCC implementation I also buffer so voffsets can be computed correctly. Offset sizes can also be adjusted. The exception is when the input vector is given entirely in one call. This is a major weakness for very large vectors that are constructed piecemeal, but in most cases the temporary storage can happen on a reusable stack that typically consumes less space than application space equivalent temporary storage. Notably, this tempory stack will be in hot cache, which is a big deal.

The overhead of this extra buffering step is not large due to hot cache and because it is optional: if you provide the entire vector as input, it is not required, and compared to the time needed for constructing a buffer vs. memory bandwidth, it also suggests low overhead (fb write is roughly 10x slower than fb read).

I'm sort of tempted to support a fragmented vector type where each fragment can have its own offset size or element size. This solves the issue of buffering very large vectors which is useful for sensor nodes streaming data on the wire, but it is also an additional complexity that will rarely be needed. The alternative is to do this at the application level.


On Monday, August 15, 2016 at 8:00:24 PM UTC+2, Wouter van Oortmerssen wrote:
Mikkel / Hải: the whole reason to even consider a binary schemaless format is to be able to make significant improvements over the state of the art (e.g. msgpack), not simply to extend interop options.

I feel that in-place reading of data with no parsing / object allocation is a significant enough improvement over msgpack.

You're right that a new format could totally be written front to back. The one disadvantage I can see is that you need to reserve a large (32bit?) root pointer at the start and back-patch it, whereas now it is written at the end and can always be a byte. Unlike typed FB, there's no need for this to be limited to 32bit either, 64bit buffers are possible with the current design.

One way around that would be to say that the root offset is the LAST thing in the buffer, though that may be strange as well? Certainly, I want formats to be optimized for reading at the expense of writing where needed, and reading from the front sounds generally more useful and cache friendly.

As for bottom up, I think here the reasons for doing bottom up are stronger in SFB than TFB (we need a better name! :): The code actually decides on the fly whether to encode offsets and data of its children as 8/16/32/64 quantities, and to do that it needs to have all children already serialized. Being able to do this automatically and optimally without too much overhead is a huge win (many small vectors can use 8bit offsets).

Yes, feel free to move this to a new thread :)



On Sun, Aug 14, 2016 at 6:06 PM, Hải Nguyễn Trung <hain...@gmail.com> wrote:
From the document and the source code, I learned that the benefit of schemaless FB over Msgpack is that SL FB has quick access to sub values, saying values inside a vector or nested object. But the performance gain is not so much, IMHO.
I am with you on considering top down approach rather than bottom up.
Another thing about FB in general (including Schemaless FB) is that FB designates for read heavy operation, meanwhile for dynamic storage, an equal read-write ratio should be considered.
We have some existed choices in our shopping list for schemaless design: Amazon Ion, ArangoDB Velocypack (they both claim to be more accessible than Msgpack), and of course Msgpack for this widely adopted.
By the way, we should move the topic of schemaless into another post, leaving this topic for its main purpose: What do people use FlatBuffers for?


On Saturday, August 13, 2016 at 3:10:00 PM UTC+7, mikkelfj wrote:
Is there any very significant benefit to this format over msgpack?

Yes msgpack is sort of parsed, and it isn't super fast, but I think tighter msgpack integration should be considered before crearting a new format just becuase. And a schemaless design will make flatbuffers less portable because implemetations will lack for a long time, whereas msgpack will be available. Now, if there really are great benefits to fb schemaless, this may be worthwhile.

That said, for dynamic storage of data in a database, it can be very tricky to manage compiled flatbuffer schema, and msgpack may not be fast enough for all purposes.

The following relates to the layout of the current flatbuffer system - and I don't know the schemaless approach well enough to know if this applies here. But some learnings:

Bottom up:

I did the C implementation which showed that top down can be done fast without preventing bottom up use, but it does add some implementation complexity. I also adds some memory overhead, but likely less than application space memory otherwise needed for bottom up. Having worked with flatbuffers in both C and Go, I must say the bottom up approach has some limitations, although of course I am biased: If you read in a number of external optional string fields, you must first create all the flatbuffer strings, then subsequently add the strings to the buffer. But since the data is optional, you end up with a lot of temporary variables and repeated tests for presence in separate stages of building the buffer. With top down you can just test field presence once, and create the string and add it one step. Top down is also less error prone with respect to violating flatbuffer construction rules.

Back to front:

We have seen issues with streaming data and flushing data to disk. With back to front this requires and extra step to reverse incremental flushes to disk. And in terms of building the buffer or speed of access, these benefits are at most minor or non-existent. Front to back doesn't solve all problems because length fields would also have to be stored after, say, a vector, mening the vector should be referenced by pointing to a tail header. With this, memory constrained devices can more efficiently build buffers in partial steps and flush as data arrives and data can be streamed to disk or network without reordering.

I understand all of this from a historical perspective, but I think the bottom up and and back to front approach should be reconsidered in a new design.


On Thursday, August 11, 2016 at 1:26:06 AM UTC+2, Wouter van Oortmerssen wrote:
Hải:

Here is a very first attempt at Schemaless FlatBuffers (branch schemaless):

This is far from finished. It's something that be can be used to give feedback on its design.

It comes with a basic implementation, some tests, and a rationale doc that should explain more of the direction this is going.

Let me know what you think :)

On Tue, Aug 2, 2016 at 8:16 PM, Hải Nguyễn Trung <hain...@gmail.com> wrote:
Hello Wouter,
How the Schemaless is going? May I have an access to the early design (wire format) for this one? 
Thank,
Hai.

On Saturday, July 30, 2016 at 12:14:35 AM UTC+7, Wouter van Oortmerssen wrote:
Maxim: that is trying to emulate a dynamically typed system in FlatBuffers, and will be rather inefficient. I'm talking about something that does that natively.

Hải: yes, it is zero-copy like FlatBuffers, but it knows about types in-line.

On Fri, Jul 29, 2016 at 1:46 AM, Hải Nguyễn Trung <hain...@gmail.com> wrote:
Awesome! Does it support zero serialization by using offset to value (like FlatBuffers)? Does it support schemaless using meta tag preceding the real value, like JSON?
Cannot wait to see it out! 
By the way, I don't like the name schemaless, it sounds like "She-male-les" :)


On Wednesday, July 27, 2016 at 11:54:35 PM UTC+7, Wouter van Oortmerssen wrote:
Yes, it would be nice to get a first version of schemaless FlatBuffers out, so y'all can comment on it. Maybe I'll put it in a seperate branch for now. I'll see what I can do.

On Tue, Jul 26, 2016 at 7:16 AM, Hải Nguyễn Trung <hain...@gmail.com> wrote:
Hello Wouter,
I have the similar interest in schemaless solution with zero copy serialization functionality. May I join the early development of this schemaless FlatBuffers, if this version towards to community driven?
I have some prior experience with schema design of Capnproto, FlatBuffers, ProtoBuf, as well as schemaless one like MsgPack, Amazon Ion. I think of getting the best of both worlds: flexibility of schema-less, as well as speed of schema design.
My main goal is a format of object data with reflection forn mutation,  and certainly high speed.


On Tuesday, July 26, 2016 at 1:31:43 AM UTC+7, Wouter van Oortmerssen wrote:
srinisingha: FlatBuffers is really designed for strongly typed (schema) use cases. Though we have some reflection functionality, it is clumsy compared to a system that was designed for it. The object API does not improve this situation.

If you want to store any data (schema-less), you may be better off with a different serialization solution for the moment. I have been working on a schema-less version of FlatBuffers, but that may take a while longer to be ready.

agallego: currently that is not possible, but I agree it should be possible to substitute your own data-types. Whatever string you replace it with would need to correspond to the STL API though.

On Fri, Jul 22, 2016 at 11:58 AM, agallego <galleg...@gmail.com> wrote:
Hi Wouter, 

Is there a way to define the string type used in the schema file? ie.:

typedef string my::String  

This object API works really well for my RPC. You can simply use the type and I'll build it and put it on a flatbuffer and ship it downstream. 

Thanks.



On Wednesday, July 20, 2016 at 8:47:49 PM UTC-4, Wouter van Oortmerssen wrote:

On Wed, Jul 20, 2016 at 11:20 AM, Wouter van Oortmerssen <w...@google.com> wrote:
The first version will be out shortly :)

On Mon, Jul 18, 2016 at 5:52 PM,  <srini...@gmail.com> wrote:
Great to hear that. Thank you Wouter.
May I take a preview about new API, hence to evaluate the efficiency to use it as data structure for the database?
Thank you. Srini.


On Tuesday, July 19, 2016 at 3:30:52 AM UTC+7, Wouter van Oortmerssen wrote:
Yes, the reflection API does automatic resizing when mutating strings etc. It is not ideal, but it is useful for patchups.

The new API has an optional way to unpack a FlatBuffer into C++ data structures, which can then be mutated and serialized back with a single call. Of course this not in-line, and thus not as efficient as the base flatbuffer reading/writing code.

On Sun, Jul 17, 2016 at 6:30 PM,  <srini...@gmail.com> wrote:
Thank you Wouter, I will take a look of Reflection API.
In any case, update in place with new value that bigger than old one cause the whole buffer to be moved to larger buffer to better suit the new value. What is forthcoming API you mentioned? Can we overcome this re-allocation issue with an affordable price?
I saw in CapnProto, it use multiple arenas (like multiple memory zones  - I guess) which seem reasonable. What do you think about that?
Thank. Srini.

On Thursday, July 14, 2016 at 11:50:58 PM UTC+7, Wouter van Oortmerssen wrote:
srinisingha: JSON is even less mutable than FlatBuffers (you can't modify it in place), so if JSON is sufficient for a database, FlatBuffers will be better.

FlatBuffers can be modified in C++ using the reflection API, but that isn't easy. There's also a forthcoming object API that will make mutation a lot easier. If you're using a language other than C++, this may take a while to arrive.

On Thu, Jul 14, 2016 at 2:38 AM,  <srini...@gmail.com> wrote:
I have an idea of storing flatbuffer instead of JSON in a data store (Kind of MongoDB). But it seems FB is intended for immutability (read heavily), not for mutable operation (write/update). For example, update a string field may case performance degradation because the total memory layout has been changed. I Flatbuffer suitable to use as data layout for database?


On Saturday, May 28, 2016 at 9:38:53 AM UTC+7, David Budworth wrote:
Not actually in use yet, but I'm likely to use it in a trading app where we have to transmit a high speed data set from our exchange interface, written in go, to a set of back trade idea finding servers, written in java.

Looking at flatbuffers for both the  minimal allocations (read: gc overhead) and fast encode/decode.



On Wednesday, May 20, 2015 at 11:06:52 AM UTC-7, Wouter van Oortmerssen wrote:
We've seen a lot of people use FlatBuffers, both here and on GitHub, yet we have no idea what products FlatBuffers ends up in. At Google we'd like to know, to see what impact FlatBuffers is having.

So, what do you use FlatBuffers for?
What products will it be used for?
How is it improving on whatever you used before?

I would expect many of you can't say so publicly, in which case I would appreciate a private email: wvo at "thatsearchengine" dot com.

-- 
You received this message because you are subscribed to the Google Groups "FlatBuffers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to flatbuffers...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

-- 
You received this message because you are subscribed to the Google Groups "FlatBuffers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to flatbuffers...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

-- 
You received this message because you are subscribed to the Google Groups "FlatBuffers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to flatbuffers...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


-- 
You received this message because you are subscribed to the Google Groups "FlatBuffers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to flatbuffers...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

-- 
You received this message because you are subscribed to the Google Groups "FlatBuffers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to flatbuffers...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

-- 
You received this message because you are subscribed to the Google Groups "FlatBuffers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to flatbuffers...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

-- 
You received this message because you are subscribed to the Google Groups "FlatBuffers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to flatbuffers...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

-- 
You received this message because you are subscribed to the Google Groups "FlatBuffers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to flatbuffers...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Wouter van Oortmerssen

unread,
Aug 17, 2016, 1:04:00 PM8/17/16
to mikkelfj, FlatBuffers
Mikkel,

> I feel that in-place reading of data with no parsing / object allocation is a significant enough improvement over msgpack.

I figured as much, but I wonder in terms of practical use - memory and speed - but you'll never know without trying - its just that there are so many formats that is becoming a major issue.

Sorry, I don't buy that reasoning. Whenever someone introduces a new X, there's always complaints we already have too many X (languages, operating systems, file formats, standards..).

The reality is that it doesn't matter how many X we have, we need many X for the best to rise to the top. There is no cost to having many X, since we'll never standardize on a single X anyway. Most importantly, new X are needed to advance the state of the art (bump into the pareto frontier).

accessing schemaless binary data without unpacking is an important use case for many fields, that is not being addressed anywhere else right now. It makes a lot of sense for FlatBuffers to address it.

I think looking into a shadow vtable with types and names may also be an option, extending the current format and resctricting vtables to same type tables. Disregarding schemaless, the fb format would benefit from being able to decide if a voffset is a specific reference type for copying data etc. But preferably with front to back support.

I looked into simply writing type/name data along with the current FlatBuffer data, but you'd end up with something way less optimal and flexible than the Schemaless system I am proposing. Schemaless doesn't need vtables since it doesn't deal with optional/missing fields. But what it does need is open ended dictionaries which FlatBuffers doesn't support well. It is too much of a mismatch.

I wouldn't want to make this new format less optimal for legacy reasons, and accessing it won't be compatible with existing FlatBuffers anyway. I do want to share implementation parts and interop.
 
You suggestion about storing the root offset last is in line with what I wrote earlier: front to back is insufficient, you also need each reference to point to the tail and have the header placed there. So yes, this may sound weird, but is necessary.

Ok, so front to back can be considered.
 
In terms of caching, I believe modern intel achitectures work equally well scanning back to front and front to back. Random access beyond one page is never efficient because prefetching is disabled to prevent unnecessary page misses - or something along those lines. For simpler devices there are no prefetching and back to front has same performance. So caching isn't a big deal wrt. front to back vs back to front. In fact, binary seach has been shown to be faster than linear scanning in nearly all array sizes provided an efficient implementation (which is rarely seen though).

What about ARM?
But yes, this is probably not the biggest deal. What could be an issue if a Schemaless buffer sit inside a regular FlatBuffer (a use case which I anticipate will be useful).

As to top down, there is a need for buffering and when this is done, you still get you offset size optimization because you can adjust at the end (I presume without studying details) - this can also be done with integer arrays. In my FlatCC implementation I also buffer so voffsets can be computed correctly. Offset sizes can also be adjusted. The exception is when the input vector is given entirely in one call. This is a major weakness for very large vectors that are constructed piecemeal, but in most cases the temporary storage can happen on a reusable stack that typically consumes less space than application space equivalent temporary storage. Notably, this tempory stack will be in hot cache, which is a big deal.

Top down buffering can be done, but requires more temporary storage (need a whole stack rather than just the current object), which may get unwieldy when mixing dictionaries with other objects, etc.).

Can you give an example of what you think top-down construction of some nested objects would look like, with a variety of data types? I am guessing I am failing to see the advantage.
 
The overhead of this extra buffering step is not large due to hot cache and because it is optional: if you provide the entire vector as input, it is not required, and compared to the time needed for constructing a buffer vs. memory bandwidth, it also suggests low overhead (fb write is roughly 10x slower than fb read).

I'm sort of tempted to support a fragmented vector type where each fragment can have its own offset size or element size. This solves the issue of buffering very large vectors which is useful for sensor nodes streaming data on the wire, but it is also an additional complexity that will rarely be needed. The alternative is to do this at the application level.

We still want to keep in-place access fast, so a fragmented vector doesn't sound great to me. 
 

mikkelfj

unread,
Aug 17, 2016, 1:59:01 PM8/17/16
to FlatBuffers, mik...@dvide.com


Sorry, I don't buy that reasoning. Whenever someone introduces a new X, there's always complaints we already have too many X (languages, operating systems, file formats, standards..).
I tend to agree, but in this specific area, there really are a lot, and a lack of restraint. But I agree, access with allocation is desireable.
 
I looked into simply writing type/name data along with the current FlatBuffer data, but you'd end up with something way less optimal and flexible than the Schemaless system I am proposing. Schemaless doesn't need vtables since it doesn't deal with optional/missing fields. But what it does need is open ended dictionaries which FlatBuffers doesn't support well. It is too much of a mismatch.
 
Make sense, thanks for explaining.
 
Ok, so front to back can be considered.
 
Great!
 
In terms of caching, I believe modern intel achitectures work equally well scanning back to front and front to back. Random access beyond one page is never efficient because prefetching is disabled to prevent unnecessary page misses - or something along those lines. For simpler devices there are no prefetching and back to front has same performance. So caching isn't a big deal wrt. front to back vs back to front. In fact, binary seach has been shown to be faster than linear scanning in nearly all array sizes provided an efficient implementation (which is rarely seen though).

What about ARM?
Older ARM tech had AFAIK no prefetching. Modern ARM is similar to Intel I believe, at least it does have descending prefetch according to this page:
 
But yes, this is probably not the biggest deal. What could be an issue if a Schemaless buffer sit inside a regular FlatBuffer (a use case which I anticipate will be useful).

As to top down, there is a need for buffering and when this is done, you still get you offset size optimization because you can adjust at the end (I presume without studying details) - this can also be done with integer arrays. In my FlatCC implementation I also buffer so voffsets can be computed correctly. Offset sizes can also be adjusted. The exception is when the input vector is given entirely in one call. This is a major weakness for very large vectors that are constructed piecemeal, but in most cases the temporary storage can happen on a reusable stack that typically consumes less space than application space equivalent temporary storage. Notably, this tempory stack will be in hot cache, which is a big deal.

Top down buffering can be done, but requires more temporary storage (need a whole stack rather than just the current object), which may get unwieldy when mixing dictionaries with other objects, etc.).
 
I have done most of this already in FlatCC and in particular JSON parsing  where the low-level typeless write api manages some 5+ stacks, most rather small and the name lookup happens in a ternary trie implemented in generated code, but could have been a binary search in a map instead (though here the situation is reverse since write does lookup while schemaless would have read lookup). And it is not trivial to do, but performance and overhead is not a big deal if done right and the typeless builder with stacks are still only about 1000 lines of C or so, and most of this is alignment and vtable management. JSON parsing is only 4 times slower than direct buffer building, and direct buffer building is on par with C++ +/- some percentages.


Can you give an example of what you think top-down construction of some nested objects would look like, with a variety of data types? I am guessing I am failing to see the advantage.

That actual top-down api can be debated - in FlatCC I found it useful to a have a typeless runtime api, then a generated typed method such as append to a string vector, then a field name specific method so, for example, you can append to a given named field of a given table type when the field is a vector. The user can choose which api level to use. Each has pros and cons, and some of this has nothing to do with top-down, but together with top-down, it becomes simpler to modularize buffer construction without tempory storage at the application level.

The following is pseudo code quickly mashed up for illustration, so there may be flaws, but should get the point accross.

table Baz {
  names [string];
}
table FooBar {
  names: [string];
  intent: string;
  author: string;
table All {
  x: FooBar;
  y: Baz;
}

function AddNames(builder, names)
  for name in names
    if len(name) > 0
      builder.appendStringElem(builder.createString(name))
end

pseudo code in top down

function doFooBar(names, author, intent)
  builder.beginFooBar
  if len(intent) < 0
    builder.addIntent(builder.createString(intent))
  builder.beginNames()
  addNames(builder, names)
  builder.endNames()
  if len(author) > 0
    builder.addAuthor(builder.createString(author))
  return builder.endFooBar
end

function doBaz(builder, names)
  builder.beginBaz()
  builder.beginNames()
  addNames(builder, names)
  builder.endNames()
  return builder.endBaz()
end

function doBuffer()
  names = ["Julia", "George"]
  builder.beginBufferAsAll
  builder.addBaz(doBaz(builder, names, "John "Doe", "")
  builder.addFooBar(doFooBar, names)
  builder.endBuffer
end

(Note that in FlatCC it is actually not necessary to create each string before appending to the vector or storing as a field, since there are _str method variants to do this directly, but the following pseudocode has no such support, even so, such _str methods would be difficult to support in a bottom up only api).

I'm sort of tempted to support a fragmented vector type where each fragment can have its own offset size or element size. This solves the issue of buffering very large vectors which is useful for sensor nodes streaming data on the wire, but it is also an additional complexity that will rarely be needed. The alternative is to do this at the application level.

We still want to keep in-place access fast, so a fragmented vector doesn't sound great to me. 

Well, I am not convinced fragmented vectors are worth the complexity, but performance is not a concern to me, compared to the benefits. Fragmentation would be a fixed size 4K 8K, or 16K element count with a parent vector indexing into each fragment, similar to a STL Deque datatype. Since at these sizes we end up hitting 2nd level cache, the indirection is fairly cheap since this parent vector will reside in hot cache similar to a vtable. There is no point in fragmentation for small vectors, and fragmentation should always be trivial to index, i.e. not variable sized fragments. Fragmented vectors would not be an overhead to vectors that are not fragmented since you have to branch out on offset  size in the schemaless design, and could branch out on fragmented and non-fragmented access at the same time (roughly, need to see implementation in detail). Memory is totally fragmented anyway down to 64 bytes cachelines typically, but, placement of fragments may affect prefetch if the fragments are smaller than page sizes, which they should not be.

mikkelfj

unread,
Aug 17, 2016, 2:02:28 PM8/17/16
to FlatBuffers, mik...@dvide.com


On Wednesday, August 17, 2016 at 7:59:01 PM UTC+2, mikkelfj wrote:

I tend to agree, but in this specific area, there really are a lot, and a lack of restraint. But I agree, access with allocation is desireable.

Of course I meant without allocation. 

Wouter van Oortmerssen

unread,
Aug 22, 2016, 6:48:18 PM8/22/16
to mikkelfj, FlatBuffers
I have done most of this already in FlatCC and in particular JSON parsing  where the low-level typeless write api manages some 5+ stacks, most rather small and the name lookup happens in a ternary trie implemented in generated code, but could have been a binary search in a map instead (though here the situation is reverse since write does lookup while schemaless would have read lookup). And it is not trivial to do, but performance and overhead is not a big deal if done right and the typeless builder with stacks are still only about 1000 lines of C or so, and most of this is alignment and vtable management. JSON parsing is only 4 times slower than direct buffer building, and direct buffer building is on par with C++ +/- some percentages.

If top-down would require more intermediate structures than bottom-up, I would go with bottom-up. I don't think it does though. I am not sure why we'd require that many stacks/maps etc.
Ok.. I am not sure how I see that is top down. It is completely equivalent to current bottom-up code, if the "begin" calls end up not doing anything important.

The way I see it, you have one stack, and every time you call "add" you just add an item to the stack. Whenever you encounter an "end", it looks at all the items between that and the corresponding start (which I would use a local variable for, but I guess this is one additional stack in your case), and then a) figure out how compact the vector/map can be encoded, and b) sort the items if it is a map. Then replace all the items on the stack with an offset to the newly created object.
 
(Note that in FlatCC it is actually not necessary to create each string before appending to the vector or storing as a field, since there are _str method variants to do this directly, but the following pseudocode has no such support, even so, such _str methods would be difficult to support in a bottom up only api).
 
Well, I am not convinced fragmented vectors are worth the complexity, but performance is not a concern to me, compared to the benefits. Fragmentation would be a fixed size 4K 8K, or 16K element count with a parent vector indexing into each fragment, similar to a STL Deque datatype. Since at these sizes we end up hitting 2nd level cache, the indirection is fairly cheap since this parent vector will reside in hot cache similar to a vtable. There is no point in fragmentation for small vectors, and fragmentation should always be trivial to index, i.e. not variable sized fragments. Fragmented vectors would not be an overhead to vectors that are not fragmented since you have to branch out on offset  size in the schemaless design, and could branch out on fragmented and non-fragmented access at the same time (roughly, need to see implementation in detail). Memory is totally fragmented anyway down to 64 bytes cachelines typically, but, placement of fragments may affect prefetch if the fragments are smaller than page sizes, which they should not be.

Right, they could be their own type. But what is the typical use case? 

mikkelfj

unread,
Aug 23, 2016, 10:16:42 AM8/23/16
to FlatBuffers, mik...@dvide.com


On Tuesday, August 23, 2016 at 12:48:18 AM UTC+2, Wouter van Oortmerssen wrote:

If top-down would require more intermediate structures than bottom-up, I would go with bottom-up. I don't think it does though. I am not sure why we'd require that many stacks/maps etc.
 
The schemaless would likely need less support datastructures than typed flatbuffers. I don't recall the details but the following is needed: a tempory table to collect the offsets added (because this is not done inline in the generated buffer). A temporary vtable to track the signature of the vtable and create a new if needed. A collection of cached vtables (because access to the final buffer is not assumed). A buffer to track generated offsets for translation once the base offset is known. A frame buffer to track the nesting level and frame type (for error reporting). A buffer to track the hashes of vtables. Some are stacks, others are dynamically growing buffers. In addition, there is user controlled stack that can add extra context. The JSON parser uses this to manage very complex scenarios including type field following data field of unions.

Some of these could be done in-place but then nesting wouldn't work, and flatcc allows buffers to be flushed without access to the entire buffer, so in-place would not work well. Most of these buffers are very small and sometimes not used at all.
... 
Ok.. I am not sure how I see that is top down. It is completely equivalent to current bottom-up code, if the "begin" calls end up not doing anything important.

You cannot create a string and add to table, then create another string and add to the table because when you create the second string, the buffer is blocked by the current table being built. This extends to adding child tables. In top down you have to create them all in advance before adding to the table.
 
 
The way I see it, you have one stack, and every time you call "add" you just add an item to the stack. Whenever you encounter an "end", it looks at all the items between that and the corresponding start (which I would use a local variable for, but I guess this is one additional stack in your case), and then a) figure out how compact the vector/map can be encoded, and b) sort the items if it is a map. Then replace all the items on the stack with an offset to the newly created object.

Yes, this is largely how it is. Its just that there are a lot of extra details in typed flatbuffers that might not be needed in untyped buffers.
In many cases flatcc indeed has a begin operation that does next to nothing, but even then, it sometimes need to something in big-endian.
 
Right, they could be their own type. But what is the typical use case? 

First of all, the premise is that the buffer can be emitted partially so you don't need memory for the full buffer. If you have large vectors, such as sensor samples, you need to periodically emit data. Yes, you could sent a buffer every time you feel like it, but it may be more convenient and bandwidth efficient, not to mention radio power use, to keep larger series together, but the vector won't fit into RAM on sensor nodes.

Another use case is simply draining data from a database query and transmitting those data as a flatbuffer. In a high performance, highly concurrent system, you cannot afford to use very large memory buffers so the problem is the same as for sensor nodes, only the scale is different.

Finally, the 2GB limit may be a problem in this context because it forces you to be aware of this limit even if could get away with draining data using only say 16KB of working memory.

mikkelfj

unread,
Aug 23, 2016, 10:20:55 AM8/23/16
to FlatBuffers, mik...@dvide.com
Correction:  In top down you have to create them all in advance before adding to the table.
Should read In bottom up you have to ...

Wouter van Oortmerssen

unread,
Aug 24, 2016, 6:18:24 PM8/24/16
to mikkelfj, FlatBuffers
You cannot create a string and add to table, then create another string and add to the table because when you create the second string, the buffer is blocked by the current table being built. This extends to adding child tables. In top down you have to create them all in advance before adding to the table.

Not necessarily, because the way I see it, nothing of the current table gets built until all children have been collected on the stack and "end" is reached, so writing strings/sub-objects is totally fine.

The way I see it, you have one stack, and every time you call "add" you just add an item to the stack. Whenever you encounter an "end", it looks at all the items between that and the corresponding start (which I would use a local variable for, but I guess this is one additional stack in your case), and then a) figure out how compact the vector/map can be encoded, and b) sort the items if it is a map. Then replace all the items on the stack with an offset to the newly created object.

Yes, this is largely how it is. Its just that there are a lot of extra details in typed flatbuffers that might not be needed in untyped buffers.
In many cases flatcc indeed has a begin operation that does next to nothing, but even then, it sometimes need to something in big-endian.

Yeah, I don't think anything is needed here.

First of all, the premise is that the buffer can be emitted partially so you don't need memory for the full buffer. If you have large vectors, such as sensor samples, you need to periodically emit data. Yes, you could sent a buffer every time you feel like it, but it may be more convenient and bandwidth efficient, not to mention radio power use, to keep larger series together, but the vector won't fit into RAM on sensor nodes.

Another use case is simply draining data from a database query and transmitting those data as a flatbuffer. In a high performance, highly concurrent system, you cannot afford to use very large memory buffers so the problem is the same as for sensor nodes, only the scale is different.

Finally, the 2GB limit may be a problem in this context because it forces you to be aware of this limit even if could get away with draining data using only say 16KB of working memory.

SFB does not have a 2GB limit.

It could even support writing the buffer in pieces if we do forward construction, since all it retains from past written items is a temporary absolute offset (which it then writes out as a relative offset).

Brian Corrigan

unread,
Aug 24, 2016, 9:07:19 PM8/24/16
to FlatBuffers
For what it's worth, I think this would be pretty neat.  I think fixed schema makes a lot of sense when you control both sides of the API (client, server).  However, particularly when offering something up publicly, you tend to want very flexible API endpoints, and arbitrary schema comes up almost immediately.  

mikkelfj

unread,
Aug 25, 2016, 9:29:06 AM8/25/16
to FlatBuffers, mik...@dvide.com


On Thursday, August 25, 2016 at 12:18:24 AM UTC+2, Wouter van Oortmerssen wrote:
You cannot create a string and add to table, then create another string and add to the table because when you create the second string, the buffer is blocked by the current table being built. This extends to adding child tables. In top down you have to create them all in advance before adding to the table.

Not necessarily, because the way I see it, nothing of the current table gets built until all children have been collected on the stack and "end" is reached, so writing strings/sub-objects is totally fine.

 
The difference is if the user must postpone adding parent fields before creating children, or children can be added after adding some fields to a parent or a parents parent. The latter approach allows for much more modular construction of buffers. I don't believe this is possible in todays interfaces, except for C, but if it will be for schema less flatbuffers, then I'm all for it.

Wouter van Oortmerssen

unread,
Aug 26, 2016, 12:14:19 PM8/26/16
to mikkelfj, FlatBuffers
Brian: thanks! Yes, that's what I'm learning too :)

Mikkel: It generally will not be possible to add fields to an object after you call "end" on it. In theory it can be done, but in practice means that potentially the whole object needs to be re-encoded, and moved (and the same for ALL of its parents!). That is an expensive operation. Maybe we'll support it eventually just because we can, but it will be marked clearly as expensive and only something you should do if you really need it.

--
You received this message because you are subscribed to the Google Groups "FlatBuffers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to flatbuffers+unsubscribe@googlegroups.com.

mikkelfj

unread,
Aug 26, 2016, 12:26:04 PM8/26/16
to FlatBuffers, mik...@dvide.com


On Friday, August 26, 2016 at 6:14:19 PM UTC+2, Wouter van Oortmerssen wrote:
Brian: thanks! Yes, that's what I'm learning too :)

Mikkel: It generally will not be possible to add fields to an object after you call "end" on it.

No, that shouldn't be possible, but you should be able to continue to add to an object before and after you call end on a child object. This likely still requires caching and can be prohibitive for very large tables. Copying will still be required, but this operation is very fast from hot cache. This already happens in FlatCC.

Wouter van Oortmerssen

unread,
Aug 26, 2016, 12:55:13 PM8/26/16
to mikkelfj, FlatBuffers
Yeah, I don't see why adding fields to an object (of any structure) before or after a child object requires extra work. This all comes for free when using a stack to build objects.

mikkelfj

unread,
Aug 26, 2016, 1:09:03 PM8/26/16
to FlatBuffers, mik...@dvide.com


On Friday, August 26, 2016 at 6:55:13 PM UTC+2, Wouter van Oortmerssen wrote:
Yeah, I don't see why adding fields to an object (of any structure) before or after a child object requires extra work. This all comes for free when using a stack to build objects.
 
Right :) 

Wouter van Oortmerssen

unread,
Sep 21, 2016, 9:32:31 PM9/21/16
to mikkelfj, FlatBuffers
Ok, the current WiP of Schemaless FlatBuffers now builds front to back:

It's a LOT more feature complete than the previous version (all supported typed except for fixed sized vectors have been implemented), so be good if interested parties had a look :)

The test should give an impression of the API. I tried to keep it simple and elegant. The reading API in particular never fails, i.e. if you access values of the wrong type, those values will either be converted or you get defaults. This allows for processing of dynamic data in a less brittle way, with less error checking needed (error checking is still possible if desired).


--

mikkelfj

unread,
Sep 26, 2016, 9:26:08 AM9/26/16
to FlatBuffers, mik...@dvide.com
Great, I will get back once I get a chance to give it a closer look.
To unsubscribe from this group and stop receiving emails from it, send an email to flatbuffers...@googlegroups.com.

Wouter van Oortmerssen

unread,
Oct 3, 2016, 9:12:21 PM10/3/16
to mikkelfj, FlatBuffers

To unsubscribe from this group and stop receiving emails from it, send an email to flatbuffers+unsubscribe@googlegroups.com.

Maxim Zaks

unread,
Oct 4, 2016, 9:35:11 AM10/4/16
to FlatBuffers, mik...@dvide.com
This looks great!!!
I haven't looked in to implementation details yet, but would it be nicer to write something like:

slb.Map([&]() {
  slb.KeyWithVector("vec", [&]() {
    slb.Int(-100);
    slb.String("Fred");
    slb.IndirectFloat(4.0f);
  });
  slb.KeyWithUInt("foo", 100);
});

I think it exposes the intent much clearer.

Keep up the great work!

Cheers,

Maxim

Wouter van Oortmerssen

unread,
Oct 5, 2016, 12:08:05 PM10/5/16
to Maxim Zaks, FlatBuffers, mikkelfj
Thanks Maxim :)
I thought about that, but there's quite a lot of methods for values, and we'd need a second version of each. Then again I guess that is worth it.


To unsubscribe from this group and stop receiving emails from it, send an email to flatbuffers+unsubscribe@googlegroups.com.

Maxim Zaks

unread,
Oct 5, 2016, 12:32:10 PM10/5/16
to FlatBuffers, maxim...@googlemail.com, mik...@dvide.com
In Swift I would try to do something like:

slb.add { slb in
 slb
.add("vec"){ slb in
   slb
.add(-100) // or if you like operators slb << -100
   slb
.add("
Fred"// or if you like operators slb << "Fred"
      
slb.add(4.0// or if you like operators slb << 4.0
 }
}

The notation looks fancy because of "Trailing closure syntax". Other wise it should be same in C++.

This however implies polymorphic dispatch which can be slow. I would try to go around it in Swift with Generics, if I am lucky the compiler will generate multiple different functions dependent on the usage. Not sure if it would be possible to go around performance penalty with templates in C++. With the operator notation ... I feel more bad than good about it :)

Maxim Zaks

unread,
Oct 5, 2016, 12:59:59 PM10/5/16
to FlatBuffers, maxim...@googlemail.com, mik...@dvide.com
Forgot to mention.

I thought the builder could infer if you adding multiple values inside of a closure, that they are all part of vector. But I am not sure any more if it would be ambiguous.

Wouter van Oortmerssen

unread,
Oct 5, 2016, 1:32:30 PM10/5/16
to Maxim Zaks, FlatBuffers, mikkelfj
You could certainly do it with overloading in C++, but Add is not really any shorter than Int, and the latter avoids you writing an unintended wrong type.


To unsubscribe from this group and stop receiving emails from it, send an email to flatbuffers+unsubscribe@googlegroups.com.

Wouter van Oortmerssen

unread,
Oct 5, 2016, 4:32:38 PM10/5/16
to Maxim Zaks, FlatBuffers, mikkelfj
https://github.com/google/flatbuffers/commit/6f37a52b740d540da4166804190c4ad86d955071

That has methods that take a an optional key, and also a generic Add method that will take any type, including vectors and maps :)

mikkelfj

unread,
Oct 8, 2016, 6:46:06 AM10/8/16
to FlatBuffers, maxim...@googlemail.com, mik...@dvide.com
Here is another traversible binary schemaless format: https://github.com/liteserver/binn

Wouter van Oortmerssen

unread,
Oct 10, 2016, 11:57:00 AM10/10/16
to mikkelfj, FlatBuffers, Maxim Zaks
There's a lot of binary schemaless formats out there. But none are readable without parsing (and thus allocation), which is what I am trying to achieve here.

To unsubscribe from this group and stop receiving emails from it, send an email to flatbuffers+unsubscribe@googlegroups.com.

mikkelfj

unread,
Oct 10, 2016, 2:17:11 PM10/10/16
to FlatBuffers, mik...@dvide.com, maxim...@googlemail.com
I have no opinion on binn, I just accidentally found it - it seems to load the data into memory as you say, although I'm not sure why that is necessary.

As to SFB, I have only briefly read through the new documentation and not looked at the C++ interface.

A few points:
The vector has the size field early. This prevents streaming of on the fly generated data such as sensor nodes with little memory, and requires a seek on larger mmap'ed files. Notably SSD needs to update a lot to change little.

The vector also has an intelligent adaptive size field, which may not be a problem for streaming if it can be hinted, but if it has to learned, the vector cannot be streamed before the entire vector is known even if the size is last. I did not look at how vectors are referenced, but I presume it goes to the head rather than the tail. It needs to be the tail if streaming is to be supported. It looks odd, but performance should not suffer noticeably and can easily be translated while reading.

The map interface and key/string/blob values. It is not entirely clear if/when strings are used as blobs and if they are then zero terminated beyond their size.
For keys, it is extremely useful to support integer keys. I have become aware of a binary search implementation that is about faster than linear scan down to 4 elements or so. This means maps can be highly efficient and there are certain data structures that would directly benefit from such a format. But if all keys must adhere to strcmp performance is all gone. Still nice compared to JSON, but orders of magnitude slower.

The relation to TFB (typed FB) and import/export functionality is unclear to me, after a very brief first reading.
Also, are array elements homogenous, meaning you can't ever mix floats and strings as you can in JSON in the general case?

Wouter van Oortmerssen

unread,
Oct 10, 2016, 4:10:31 PM10/10/16
to mikkelfj, FlatBuffers, Maxim Zaks
On Mon, Oct 10, 2016 at 11:17 AM, mikkelfj <mik...@dvide.com> wrote:
I have no opinion on binn, I just accidentally found it - it seems to load the data into memory as you say, although I'm not sure why that is necessary.

As to SFB, I have only briefly read through the new documentation and not looked at the C++ interface.

A few points:
The vector has the size field early. This prevents streaming of on the fly generated data such as sensor nodes with little memory, and requires a seek on larger mmap'ed files. Notably SSD needs to update a lot to change little.

This is generally not a streaming friendly format (much like TFB), since it is a DAG with offsets that are not guaranteed to be neatly contained.

Also, much like TFB, it will want the entire buffer in memory before it is finished in the general case. That is because it may still be referring to earlier keys or key vectors, or other shared data objects.

That said, it is very suitable for streaming in case where you do not want to hold on the items you're streaming, since those can be written into the buffer first, and the vector that refers to them later (unlike TFB, there are no inline structs, vectors always come after all of its non-scalar content).
 
The vector also has an intelligent adaptive size field, which may not be a problem for streaming if it can be hinted, but if it has to learned, the vector cannot be streamed before the entire vector is known even if the size is last. I did not look at how vectors are referenced, but I presume it goes to the head rather than the tail. It needs to be the tail if streaming is to be supported. It looks odd, but performance should not suffer noticeably and can easily be translated while reading.

Can you be precise about the exact use case, because I'm not seeing when this is necessary. Are you talking about streamed writes (as what I am responding to above) or streamed reads? Again same caveats apply that in the general case this format is not designed for streaming internally to the buffers.
 
The map interface and key/string/blob values. It is not entirely clear if/when strings are used as blobs and if they are then zero terminated beyond their size.

Currently when a string is a blob is up to the user, and will always be followed a 0 terminator. I'm considering making blob its own separate type, as this may be cleaner. That depends on encoding though, as the current main type field is 4 bits, and we're running out of enums. I may merge it with the 2 bit vector type field to gain more room.
 
For keys, it is extremely useful to support integer keys. I have become aware of a binary search implementation that is about faster than linear scan down to 4 elements or so. This means maps can be highly efficient and there are certain data structures that would directly benefit from such a format. But if all keys must adhere to strcmp performance is all gone. Still nice compared to JSON, but orders of magnitude slower.

Integer key-ed maps may indeed by nice to have, but I would simply make this its own type, i.e. it would use a typed vector of ints as its key vector and use a custom binary search.
 
The relation to TFB (typed FB) and import/export functionality is unclear to me, after a very brief first reading.
Also, are array elements homogenous, meaning you can't ever mix floats and strings as you can in JSON in the general case?

They are independent (an incompatible) formats, but one can be embedded in the other (typically SFB inside a TFB, but who knows there are use cases for the reverse). They would be convertable into eachother much like JSON (and also into/from JSON).

mikkelfj

unread,
Oct 11, 2016, 2:59:58 AM10/11/16
to FlatBuffers, mik...@dvide.com, maxim...@googlemail.com


On Monday, October 10, 2016 at 10:10:31 PM UTC+2, Wouter van Oortmerssen wrote:
 
This is generally not a streaming friendly format (much like TFB), since it is a DAG with offsets that are not guaranteed to be neatly contained.

Also, much like TFB, it will want the entire buffer in memory before it is finished in the general case. That is because it may still be referring to earlier keys or key vectors, or other shared data objects. 

That said, it is very suitable for streaming in case where you do not want to hold on the items you're streaming, since those can be written into the buffer first, and the vector that refers to them later (unlike TFB, there are no inline structs, vectors always come after all of its non-scalar content).
 
Correct, this means that it is largely irrelevant where the size field is located in offset vectors, but for vectors with inline data it still holds.
 
The vector also has an intelligent adaptive size field, which may not be a problem for streaming if it can be hinted, but if it has to learned, the vector cannot be streamed before the entire vector is known even if the size is last. I did not look at how vectors are referenced, but I presume it goes to the head rather than the tail. It needs to be the tail if streaming is to be supported. It looks odd, but performance should not suffer noticeably and can easily be translated while reading.

Can you be precise about the exact use case, because I'm not seeing when this is necessary. Are you talking about streamed writes (as what I am responding to above) or streamed reads? Again same caveats apply that in the general case this format is not designed for streaming internally to the buffers.
 
I am referring to streamed writes, which is when you would have to provide the hint. For example a sensor node might write humidity to a vector every two seconds for half a day and given the value is in percentage, it has a pretty good idea about the range. 

But in general I am also referring to situations like map/reduce processing really large files that don't sit well in memory, such as extracting the area codes of phone lists. Here it is very expensive to load the data into a database and a buffering format won't work well. Since a process may have multiple steps, it is equally important that the format can be read partially while generating the new output. Currently CSV is about the only format that I am aware of, that can do this.

It is sort of acceptable to require that offset vectors be buffered because you can segment the raw data.

An inline vector of integer pairs can be handled as two vectors, as in column oriented db storage, which is very efficient for reads. For streamed writes it won't work. Hence inline struct vectors with tail stored size fields may still be desireable.

Currently when a string is a blob is up to the user, and will always be followed a 0 terminator. I'm considering making blob its own separate type, as this may be cleaner. That depends on encoding though, as the current main type field is 4 bits, and we're running out of enums. I may merge it with the 2 bit vector type field to gain more room.
 
I did not look into the type encoding, so can't speak of that but msgpack v5 apparently needed a separate utf8 string.
 
For keys, it is extremely useful to support integer keys. I have become aware of a binary search implementation that is about faster than linear scan down to 4 elements or so. This means maps can be highly efficient and there are certain data structures that would directly benefit from such a format. But if all keys must adhere to strcmp performance is all gone. Still nice compared to JSON, but orders of magnitude slower.

Integer key-ed maps may indeed by nice to have, but I would simply make this its own type, i.e. it would use a typed vector of ints as its key vector and use a custom binary search.

If you mean format supported type, then this is fine, if you mean the user can simple create two vectors, then no so much:
this would be hard to communicate in a general format. Similar to recent discussions about DIY struct unions in TFB.
  
The relation to TFB (typed FB) and import/export functionality is unclear to me, after a very brief first reading.
Also, are array elements homogenous, meaning you can't ever mix floats and strings as you can in JSON in the general case?

They are independent (an incompatible) formats, but one can be embedded in the other (typically SFB inside a TFB, but who knows there are use cases for the reverse). They would be convertable into eachother much like JSON (and also into/from JSON).

This didn't answer my question about mixed types (and I didn't bother to read spec carefully enough) - but  can an offset reference different types in different vector elements? This will be closer to encode all of JSON and further away from being convertible to TFB. Not saying one is better than the other.

Actually I think TFB should have vectors of unions where the union type is embedded in the tables vtable, if that were an option.

Wouter van Oortmerssen

unread,
Oct 12, 2016, 5:11:11 PM10/12/16
to mikkelfj, FlatBuffers, Maxim Zaks
Mikkel,
 
I am referring to streamed writes, which is when you would have to provide the hint. For example a sensor node might write humidity to a vector every two seconds for half a day and given the value is in percentage, it has a pretty good idea about the range. 

Ahh I see what you're saying. Basically the size field may cause the vector to need
a bigger bit-width, which is only known at the end.

Storing the size in the same bit-width as the data was a conscious choice to keep
things simple, fast, and reduce the amount of space needed for type info. The majority of sizes tend to be pretty small so this works well.
 
But in general I am also referring to situations like map/reduce processing really large files that don't sit well in memory, such as extracting the area codes of phone lists. Here it is very expensive to load the data into a database and a buffering format won't work well. Since a process may have multiple steps, it is equally important that the format can be read partially while generating the new output. Currently CSV is about the only format that I am aware of, that can do this.

Having a buffer partially in memory is a use case I would want people to use thru mmap, though I guess special purpose streaming readers that work directly with files or networking could exist.
 
It is sort of acceptable to require that offset vectors be buffered because you can segment the raw data.

An inline vector of integer pairs can be handled as two vectors, as in column oriented db storage, which is very efficient for reads. For streamed writes it won't work. Hence inline struct vectors with tail stored size fields may still be desireable.

And those two vectors would sit in two separate SFBs? Since you can't write to
two vectors at once.

I'll take your word for it that this is a common use case, though I am not sure I want to flip vectors for it. It is an additional indirection that is not generally needed.
 
I did not look into the type encoding, so can't speak of that but msgpack v5 apparently needed a separate utf8 string.

It matters most for languages like Java where you have to unpack the utf-8 to access it as a string, something you don't want to do with blobs.

Integer key-ed maps may indeed by nice to have, but I would simply make this its own type, i.e. it would use a typed vector of ints as its key vector and use a custom binary search.

If you mean format supported type, then this is fine, if you mean the user can simple create two vectors, then no so much:
this would be hard to communicate in a general format. Similar to recent discussions about DIY struct unions in TFB.

Yes, a type directly supported by format & API, we'd have the current MAP type and a INTMAP type.
 
They are independent (an incompatible) formats, but one can be embedded in the other (typically SFB inside a TFB, but who knows there are use cases for the reverse). They would be convertable into eachother much like JSON (and also into/from JSON).

This didn't answer my question about mixed types (and I didn't bother to read spec carefully enough) - but  can an offset reference different types in different vector elements? This will be closer to encode all of JSON and further away from being convertible to TFB. Not saying one is better than the other.

It supports both untyped and typed vectors, so yes. It is meant to be a super-set of JSON. Converting to TFB has the same caveats as converting from JSON to TFB has.
 
Actually I think TFB should have vectors of unions where the union type is embedded in the tables vtable, if that were an option.

I would love to overhaul TFB's union support, but sadly at this point that would be very disruptive.

Wouter

mikkelfj

unread,
Oct 20, 2016, 3:18:50 AM10/20/16
to FlatBuffers, mik...@dvide.com, maxim...@googlemail.com


On Wednesday, October 12, 2016 at 11:11:11 PM UTC+2, Wouter van Oortmerssen wrote:
Mikkel,
 
I am referring to streamed writes, which is when you would have to provide the hint. For example a sensor node might write humidity to a vector every two seconds for half a day and given the value is in percentage, it has a pretty good idea about the range. 

Ahh I see what you're saying. Basically the size field may cause the vector to need
a bigger bit-width, which is only known at the end.

I wasn't actually thinking about that, but that is a valid point too. Without being familiar with the api, my thinking was that you keep appending integers to the vector and it decides at the end what is the smallest possible representation. If the vector can be flushed during this process, it is too late to make that decision at the end.
 
But in general I am also referring to situations like map/reduce processing really large files that don't sit well in memory, such as extracting the area codes of phone lists. Here it is very expensive to load the data into a database and a buffering format won't work well. Since a process may have multiple steps, it is equally important that the format can be read partially while generating the new output. Currently CSV is about the only format that I am aware of, that can do this.

Having a buffer partially in memory is a use case I would want people to use thru mmap, though I guess special purpose streaming readers that work directly with files or networking could exist.
 
mmap is definitely of interest, especially for reading. But for writing large amounts of data you need to flush it occasionally, and mmaped or not, it needs to the next address range in the buffer, or you have to rearrange the buffer later as is the case today with TFB. This lack of partial writing appears to be a major showstopper for some potential flatbuffer users.

 
It is sort of acceptable to require that offset vectors be buffered because you can segment the raw data.

An inline vector of integer pairs can be handled as two vectors, as in column oriented db storage, which is very efficient for reads. For streamed writes it won't work. Hence inline struct vectors with tail stored size fields may still be desireable.

And those two vectors would sit in two separate SFBs? Since you can't write to
two vectors at once.
 
That is exactly why I wrote it won't work for writes. No I wasn't thinking of two separate buffers - although of course, in some scenarios it may not be a bad idea to have a separate index buffer.


I'll take your word for it that this is a common use case, though I am not sure I want to flip vectors for it. It is an additional indirection that is not generally needed.
 
Umm I'd rather not be the judge of what a common use case is - common use cases also tend to reflect what is practical with the technology at hand. All I can say is that structured data is hard to write outside of text formats and databases when you cannot buffer everything, and that flatbuffers is close to solving it, but not quite.
  
Integer key-ed maps may indeed by nice to have, but I would simply make this its own type, i.e. it would use a typed vector of ints as its key vector and use a custom binary search.

If you mean format supported type, then this is fine, if you mean the user can simple create two vectors, then no so much:
this would be hard to communicate in a general format. Similar to recent discussions about DIY struct unions in TFB.

Yes, a type directly supported by format & API, we'd have the current MAP type and a INTMAP type.
Good
  
They are independent (an incompatible) formats, but one can be embedded in the other (typically SFB inside a TFB, but who knows there are use cases for the reverse). They would be convertable into eachother much like JSON (and also into/from JSON).

This didn't answer my question about mixed types (and I didn't bother to read spec carefully enough) - but  can an offset reference different types in different vector elements? This will be closer to encode all of JSON and further away from being convertible to TFB. Not saying one is better than the other.

It supports both untyped and typed vectors, so yes. It is meant to be a super-set of JSON. Converting to TFB has the same caveats as converting from JSON to TFB has.

OK, that makes sense. I still think there is space for tighter integration where you want it, just as with JSON. Say, if you limit you data set to valid schema data, you can do such and such - for example parsing JSON to flatbuffers as we already can.

Mikkel

Wouter van Oortmerssen

unread,
Oct 21, 2016, 7:28:14 PM10/21/16
to mikkelfj, FlatBuffers, Maxim Zaks
 Mikkel,
 
I wasn't actually thinking about that, but that is a valid point too. Without being familiar with the api, my thinking was that you keep appending integers to the vector and it decides at the end what is the smallest possible representation. If

Yup.
 
the vector can be flushed during this process, it is too late to make that decision at the end.
 
mmap is definitely of interest, especially for reading. But for writing large amounts of data you need to flush it occasionally, and mmaped or not, it needs to the next address range in the buffer, or you have to rearrange the buffer later as is the case today with TFB. This lack of partial writing appears to be a major showstopper for some potential flatbuffer users.

I don't think either TFB or SFB are great for streaming data larger than what fits in memory, by their very definition (being offset based) it is not a good match. Many binary formats (e.g. Protobuf) support streaming much better out of the box.

I feel that for both systems, if you want to stream large things, you should be streaming a list of buffers, not expect to stream inside a single buffer.

So maybe SFB could be coaxed into supporting that (hey, at least it supports 64bit sizes unlike TFB), but why? The whole point of offsets is random access, which entails a single buffer should be all memory resident or at least mmap-ed.

Umm I'd rather not be the judge of what a common use case is - common use cases also tend to reflect what is practical with the technology at hand. All I can say is that structured data is hard to write outside of text formats and databases when you cannot buffer everything, and that flatbuffers is close to solving it, but not quite.

To design anything, you have to have some use cases in mind, and you will suck at others. No format out there is universally great at everything. TFB was designed with game developers in mind, and I am very happy it is finding a lot of use cases
outside of that, but that is to some extend a coincidence.

I don't see how FB is close to solving that, it is actually one of the worst formats for "cannot buffer everything" scenarios. But that is assuming you insist on it being a single buffer. To me a stream of buffers is a very clean solution. I guess the only bad thing is if you wanted to stream very small records, where both TFB and SFB have some "startup cost".
 
OK, that makes sense. I still think there is space for tighter integration where you want it, just as with JSON. Say, if you limit you data set to valid schema data, you can do such and such - for example parsing JSON to flatbuffers as we already can.

Yes, agreed. Also SFB -> TFB will generally be higher fidelity and very quick. 
 

mikkelfj

unread,
Oct 22, 2016, 12:50:10 AM10/22/16
to FlatBuffers, mik...@dvide.com, maxim...@googlemail.com

I don't think either TFB or SFB are great for streaming data larger than what fits in memory, by their very definition (being offset based) it is not a good match. Many binary formats (e.g. Protobuf) support streaming much better out of the box. 
I feel that for both systems, if you want to stream large things, you should be streaming a list of buffers, not expect to stream inside a single buffer.

Assuming the recent create buffer with size prefix call aligns the buffers correctly, this certainly is an option. But you still need to index the buffers separately if you need random access, such as looking up a phone record.

The point is, you can add a final buffer to the stream with the offsets to all the other buffers, and store the location of the buffer somewhere. If you replace buffer with table and last buffer with vector of tables, you have a standard portable way of doing all that. And there isn't anything intrinsic preventing that from happening in flatbuffers. But it depends on the use case, sometimes it really is better with separate buffers. Either way, you need to limit the active data set in memory while writing, but keeping smaller arrays in a single buffer is much simpler than maintaining a separate non-portable index.
 
So maybe SFB could be coaxed into supporting that (hey, at least it supports 64bit sizes unlike TFB), but why? The whole point of offsets is random access, which entails a single buffer should be all memory resident or at least mmap-ed.
 
Well exactly, but reading and writing are separate steps - so why prevent efficient mmap reading of all data for reading because because the buffer is split into separate sub buffers due to a technicality during write. Also, why would you use another supposedly better streaming format that is at least 100x slower, has no mmap access for read, and requires much more memory all data should be accessible at once, and/or requires a much larger code base for a library or database, when it isn't really necessary? But this perhaps is more relevant to a modified TFB since I have no clue about time/space metrics of SFB.
 
Umm I'd rather not be the judge of what a common use case is - common use cases also tend to reflect what is practical with the technology at hand. All I can say is that structured data is hard to write outside of text formats and databases when you cannot buffer everything, and that flatbuffers is close to solving it, but not quite.

To design anything, you have to have some use cases in mind, and you will suck at others. No format out there is universally great at everything. TFB was designed with game developers in mind, and I am very happy it is finding a lot of use cases
outside of that, but that is to some extend a coincidence.

A lot of technology is used, adapted and optimized outside the originally intended use case because the technology has unforeseen benefits. Granted you can't design for the unforeseen, but you can adapt.


I don't see how FB is close to solving that, it is actually one of the worst formats for "cannot buffer everything" scenarios. But that is assuming you insist on it being a single buffer. To me a stream of buffers is a very clean solution. I guess the only bad thing is if you wanted to stream very small records, where both TFB and SFB have some "startup cost".
 
I already answered above with this sentence in mind. You can easily make flatbuffers mostly streamable so only the high level arrays need to stay in memory. But it possibly requires a top down stack approach - with is already present in flatcc but may not be as natural in other implementations? The startup cost is a significant concern, especially due to alignment overhead. But the lack of portable indexing of buffers is also a concern although it is a workable solution.

Sam Paioletti

unread,
Oct 24, 2016, 1:02:25 PM10/24/16
to FlatBuffers, mik...@dvide.com, maxim...@googlemail.com
I was digging through the schemaless fork a little bit, one thought that came up is would it be feasible to separate the type/offset data from the payload data.

Thinking being if you were sending a bunch of (identical schema, different values) SFBs you could just send the schema info once and then send the SFB w/o schema the remaining times so your not sending the typedata each time.

Sort of like dynamically creating an IDL

Could also do some checking against a dynamic IDL at runtime to make sure it was compatible w/o sending data, might be able to optimize reuse, etc.

Might not be worth the trouble (or just a bad idea, i'm not prideful)...just was a thought as I was thinking about use cases.

Sam

Wouter van Oortmerssen

unread,
Oct 24, 2016, 2:53:03 PM10/24/16
to Sam Paioletti, FlatBuffers, mikkelfj, Maxim Zaks
That is generally an idea that could work, but in many cases doesn't make data smaller, since you now need a way to refer to that separate type/offset data, using an index of some sort.

Types are already a single byte, which is hard to beat.

You can make a binary schema is you define an object with a fixed set of fields, and then refer to that binary schema N times. But that is already assuming a relatively static use case, SFB is meant for very dynamic cases, where potentially each object is different. If your data is really regular such that you can define a schema for it (binary or otherwise), you should probably be using regular FlatBuffers instead.

--
You received this message because you are subscribed to the Google Groups "FlatBuffers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to flatbuffers+unsubscribe@googlegroups.com.

Sam Paioletti

unread,
Oct 24, 2016, 2:56:08 PM10/24/16
to Wouter van Oortmerssen, FlatBuffers, mikkelfj, Maxim Zaks

Sounds good. It would only make sense if it made it smaller… Thanks for the reply.

To unsubscribe from this group and stop receiving emails from it, send an email to flatbuffers...@googlegroups.com.

Wouter van Oortmerssen

unread,
Oct 24, 2016, 3:34:20 PM10/24/16
to mikkelfj, FlatBuffers, Maxim Zaks
Mikkel,

Assuming the recent create buffer with size prefix call aligns the buffers correctly,

Yup, it takes particular care of that. That is easy in TFB because things are built backwards, in SFB if you wanted to length-prefix a buffer after the fact that would be inefficient, though I suppose you can pre-allocate one at the start of the buffer.
 
this certainly is an option. But you still need to index the buffers separately if you need random access, such as looking up a phone record.

True. Though I'd expect (yes, I am going to make use-case assumptions again!) that many uses of streaming are when the items are pretty loosely related, such as with logging, time-series data, or multiplayer networking messages.

The point is, you can add a final buffer to the stream with the offsets to all the other buffers, and store the location of the buffer somewhere. If you replace buffer with table and last buffer with vector of tables, you have a standard portable way of doing all that. And there isn't anything intrinsic preventing that from happening in flatbuffers. But it depends on the use case, sometimes it really is better with separate buffers. Either way, you need to limit the active data set in memory while writing, but keeping smaller arrays in a single buffer is much simpler than maintaining a separate non-portable index.

It seems to me that if you can't keep data in memory or mmap them, a vector of offsets to them is not all that useful, since you're not that likely to want to access them in random order (more likely to just incrementally process them).

Also when are you going to send this vector of offsets? at the "end"? Many uses of streaming have no clear end. And if this vector could be sent intermittendly, then now you need to distinguish between two kinds of buffers.

In all, I think, if needed, this kind of structure is really easy to build on top of SFB (or TFB), allowing each use case to make their own trade-offs.

So maybe SFB could be coaxed into supporting that (hey, at least it supports 64bit sizes unlike TFB), but why? The whole point of offsets is random access, which entails a single buffer should be all memory resident or at least mmap-ed.
 
Well exactly, but reading and writing are separate steps - so why prevent efficient mmap reading of all data for reading because because the buffer is split into separate sub buffers due to a technicality during write. Also, why would you use another supposedly better streaming format that is at least 100x slower, has no mmap access for read, and requires much more memory all data should be accessible at once, and/or requires a much larger code base for a library or database, when it isn't really necessary? But this perhaps is more relevant to a

Sure, I see this. But pretty easy to achieve with a steam of buffers and your index vector.

Or, like I said in the past, this particular use case can also be streamed in a single buffer if a) you don't expect to share strings between two consecutive streamed items, and b) if the vector of offsets to all items can fit in memory and c) we'd modify the writing code to not assume a std::vector.
 
modified TFB since I have no clue about time/space metrics of SFB.

To model "objects" in SFB you have 2 choices: maps or untyped vectors. The former uses binary search for lookup, so is naturally not particularly fast, though is more compact than the binary search feature in TFB. Untyped vectors should be pretty fast, and only a very small constant step slower than TFB, and often more compact than TFB (because of variable bit-width and smaller header).

A lot of technology is used, adapted and optimized outside the originally intended use case because the technology has unforeseen benefits. Granted you can't design for the unforeseen, but you can adapt.

Sure, and I love opening up new "markets", but adapting a binary format can sometimes be hard when set in stone :)

I already answered above with this sentence in mind. You can easily make flatbuffers mostly streamable so only the high level arrays need to stay in memory. But it possibly requires a top down stack approach - with is already present in flatcc but may not be as natural in other implementations? The startup cost is a significant concern, especially due to alignment overhead. But the lack of portable indexing of buffers is also a concern although it is a workable solution.

Maybe have a look at the current C++ code and docs, it's not particularly big, and it has a stack. "top down stack" is ambiguous to me :)

Wouter
 

mikkelfj

unread,
Oct 24, 2016, 4:18:25 PM10/24/16
to FlatBuffers, mik...@dvide.com, maxim...@googlemail.com
I guess we are going around in circles now :)

My main point is just that when you have choice, make it as streamable as possible when there are no real downsides to it.
I understand things can "easily" by done in other ways, and I realize TFB is already shrink-wrapped, but SFB is not.


On Monday, October 24, 2016 at 9:34:20 PM UTC+2, Wouter van Oortmerssen wrote:
Mikkel,

Assuming the recent create buffer with size prefix call aligns the buffers correctly,

Yup, it takes particular care of that. That is easy in TFB because things are built backwards, in SFB if you wanted to length-prefix a buffer after the fact that would be inefficient, though I suppose you can pre-allocate one at the start of the buffer.

You reserve a single re-write for the size and the table object, and possile make it optional to store the descriptor separately when it is not feasible to go back and update. If the offset is stored at the end, you only need the size.

Doing this has the interesting property that you get a transactional confirmation that the write succeeded. I did this a long time ago abusing the WAV file format for a generic interpreted file format.

BTW I'm mostly done adding the size prefix to flatcc TFB as we speak.

  
this certainly is an option. But you still need to index the buffers separately if you need random access, such as looking up a phone record.

True. Though I'd expect (yes, I am going to make use-case assumptions again!) that many uses of streaming are when the items are pretty loosely related, such as with logging, time-series data, or multiplayer networking messages.
 
Exactly.


The point is, you can add a final buffer to the stream with the offsets to all the other buffers, and store the location of the buffer somewhere. If you replace buffer with table and last buffer with vector of tables, you have a standard portable way of doing all that. And there isn't anything intrinsic preventing that from happening in flatbuffers. But it depends on the use case, sometimes it really is better with separate buffers. Either way, you need to limit the active data set in memory while writing, but keeping smaller arrays in a single buffer is much simpler than maintaining a separate non-portable index.

It seems to me that if you can't keep data in memory or mmap them, a vector of offsets to them is not all that useful, since you're not that likely to want to access them in random order (more likely to just incrementally process them).
 
Nope, because the writing side may be vastly different from the reader for many reasons. A sensor node has little memory. A big server may have lots of memory but can only afford to dedicate a lot of ram if it can be lost, so it can mmap read-only but not allocate a full write buffer without partial commits.
 

Also when are you going to send this vector of offsets? at the "end"? Many uses of streaming have no clear end. And if this vector could be sent intermittendly, then now you need to distinguish between two kinds of buffers.
 
You have to eventually define an end,  just like syslog eventually rolls over. So yes - you could roll over at every item in a buffer, but you want to process large chunks. Still, I think one buffer at time is great in many cases but not always. For example search a time interval of logs - it helps to have flatbuffer indexing on top.


In all, I think, if needed, this kind of structure is really easy to build on top of SFB (or TFB), allowing each use case to make their own trade-offs.
I totally agree. And you can't chop off partially writes in SFB or modified TFB if you don't somehow segment data. The point is, however, that you can keep the overlay indexing in the same flatbuffer structure. You cannot just stupidly create one big vector. You can't even do that in main memory (usually).

So maybe SFB could be coaxed into supporting that (hey, at least it supports 64bit sizes unlike TFB), but why? The whole point of offsets is random access, which entails a single buffer should be all memory resident or at least mmap-ed.
 
Well exactly, but reading and writing are separate steps - so why prevent efficient mmap reading of all data for reading because because the buffer is split into separate sub buffers due to a technicality during write. Also, why would you use another supposedly better streaming format that is at least 100x slower, has no mmap access for read, and requires much more memory all data should be accessible at once, and/or requires a much larger code base for a library or database, when it isn't really necessary? But this perhaps is more relevant to a

Sure, I see this. But pretty easy to achieve with a steam of buffers and your index vector.
 
Yep - but why not allow it within if possible? Even if it sometimes makes sense to just stream buffers. You can build complex data structures in FB.

Or, like I said in the past, this particular use case can also be streamed in a single buffer if a) you don't expect to share strings between two consecutive streamed items, and b) if the vector of offsets to all items can fit in memory and c) we'd modify the writing code to not assume a std::vector.
 
Well - I think this is what I'm arguing, just such that partial data can flushed to disk which a few details prevents right now.

modified TFB since I have no clue about time/space metrics of SFB.

To model "objects" in SFB you have 2 choices: maps or untyped vectors. The former uses binary search for lookup, so is naturally not particularly fast, though is more compact than the binary search feature in TFB. Untyped vectors should be pretty fast, and only a very small constant step slower than TFB, and often more compact than TFB (because of variable bit-width and smaller header).
 
You can actually make SFB maps really fast if you add a hierarchy where each level has few members. The following binary search can be repeated four times to search 16 elements with no loop at all: a is the vector (pointer), k is the size, b is a pointer, h is an index. Post cond: k indexes last match if present, otherwise largest smaller, or first entry.

#define BINARY_SEARCH_ITERATION                                             \
        h = k / 2;                                                          \
        b = a + h;                                                          \
        a = *b <= m ? b : a;                                                \
        k -= h;

 
A lot of technology is used, adapted and optimized outside the originally intended use case because the technology has unforeseen benefits. Granted you can't design for the unforeseen, but you can adapt.

Sure, and I love opening up new "markets", but adapting a binary format can sometimes be hard when set in stone :)
 
As long as it doesn't become a Kodak moment :) 
 
Maybe have a look at the current C++ code and docs, it's not particularly big, and it has a stack. "top down stack" is ambiguous to me :)
I'm not sure its necessary with top-down - you just need a vector to collect the offsets. However, with top down in the API, you can start a table, then start adding to a vector, then end the table.
If there are multiple levels of vectors this may be very convient to have. The problem with the current C++ interface if I understand correctly, is that you are not allow to construct the outermost object before you have complete finalized all members - it is less important if it is on a managed stack or you stuff objects on your own vector, as you have hinted at earlier. 

Mikkel
Reply all
Reply to author
Forward
0 new messages