Mikkel / Hải: the whole reason to even consider a binary schemaless format is to be able to make significant improvements over the state of the art (e.g. msgpack), not simply to extend interop options.I feel that in-place reading of data with no parsing / object allocation is a significant enough improvement over msgpack.You're right that a new format could totally be written front to back. The one disadvantage I can see is that you need to reserve a large (32bit?) root pointer at the start and back-patch it, whereas now it is written at the end and can always be a byte. Unlike typed FB, there's no need for this to be limited to 32bit either, 64bit buffers are possible with the current design.One way around that would be to say that the root offset is the LAST thing in the buffer, though that may be strange as well? Certainly, I want formats to be optimized for reading at the expense of writing where needed, and reading from the front sounds generally more useful and cache friendly.As for bottom up, I think here the reasons for doing bottom up are stronger in SFB than TFB (we need a better name! :): The code actually decides on the fly whether to encode offsets and data of its children as 8/16/32/64 quantities, and to do that it needs to have all children already serialized. Being able to do this automatically and optimally without too much overhead is a huge win (many small vectors can use 8bit offsets).Yes, feel free to move this to a new thread :)On Sun, Aug 14, 2016 at 6:06 PM, Hải Nguyễn Trung <hain...@gmail.com> wrote:From the document and the source code, I learned that the benefit of schemaless FB over Msgpack is that SL FB has quick access to sub values, saying values inside a vector or nested object. But the performance gain is not so much, IMHO.I am with you on considering top down approach rather than bottom up.Another thing about FB in general (including Schemaless FB) is that FB designates for read heavy operation, meanwhile for dynamic storage, an equal read-write ratio should be considered.We have some existed choices in our shopping list for schemaless design: Amazon Ion, ArangoDB Velocypack (they both claim to be more accessible than Msgpack), and of course Msgpack for this widely adopted.By the way, we should move the topic of schemaless into another post, leaving this topic for its main purpose: What do people use FlatBuffers for?
On Saturday, August 13, 2016 at 3:10:00 PM UTC+7, mikkelfj wrote:Is there any very significant benefit to this format over msgpack?Yes msgpack is sort of parsed, and it isn't super fast, but I think tighter msgpack integration should be considered before crearting a new format just becuase. And a schemaless design will make flatbuffers less portable because implemetations will lack for a long time, whereas msgpack will be available. Now, if there really are great benefits to fb schemaless, this may be worthwhile.That said, for dynamic storage of data in a database, it can be very tricky to manage compiled flatbuffer schema, and msgpack may not be fast enough for all purposes.The following relates to the layout of the current flatbuffer system - and I don't know the schemaless approach well enough to know if this applies here. But some learnings:Bottom up:I did the C implementation which showed that top down can be done fast without preventing bottom up use, but it does add some implementation complexity. I also adds some memory overhead, but likely less than application space memory otherwise needed for bottom up. Having worked with flatbuffers in both C and Go, I must say the bottom up approach has some limitations, although of course I am biased: If you read in a number of external optional string fields, you must first create all the flatbuffer strings, then subsequently add the strings to the buffer. But since the data is optional, you end up with a lot of temporary variables and repeated tests for presence in separate stages of building the buffer. With top down you can just test field presence once, and create the string and add it one step. Top down is also less error prone with respect to violating flatbuffer construction rules.Back to front:We have seen issues with streaming data and flushing data to disk. With back to front this requires and extra step to reverse incremental flushes to disk. And in terms of building the buffer or speed of access, these benefits are at most minor or non-existent. Front to back doesn't solve all problems because length fields would also have to be stored after, say, a vector, mening the vector should be referenced by pointing to a tail header. With this, memory constrained devices can more efficiently build buffers in partial steps and flush as data arrives and data can be streamed to disk or network without reordering.I understand all of this from a historical perspective, but I think the bottom up and and back to front approach should be reconsidered in a new design.
On Thursday, August 11, 2016 at 1:26:06 AM UTC+2, Wouter van Oortmerssen wrote:Hải:Here is a very first attempt at Schemaless FlatBuffers (branch schemaless):This is far from finished. It's something that be can be used to give feedback on its design.It comes with a basic implementation, some tests, and a rationale doc that should explain more of the direction this is going.Let me know what you think :)On Tue, Aug 2, 2016 at 8:16 PM, Hải Nguyễn Trung <hain...@gmail.com> wrote:Hello Wouter,How the Schemaless is going? May I have an access to the early design (wire format) for this one?Thank,Hai.
On Saturday, July 30, 2016 at 12:14:35 AM UTC+7, Wouter van Oortmerssen wrote:Maxim: that is trying to emulate a dynamically typed system in FlatBuffers, and will be rather inefficient. I'm talking about something that does that natively.Hải: yes, it is zero-copy like FlatBuffers, but it knows about types in-line.On Fri, Jul 29, 2016 at 1:46 AM, Hải Nguyễn Trung <hain...@gmail.com> wrote:Awesome! Does it support zero serialization by using offset to value (like FlatBuffers)? Does it support schemaless using meta tag preceding the real value, like JSON?Cannot wait to see it out!By the way, I don't like the name schemaless, it sounds like "She-male-les" :)
On Wednesday, July 27, 2016 at 11:54:35 PM UTC+7, Wouter van Oortmerssen wrote:Yes, it would be nice to get a first version of schemaless FlatBuffers out, so y'all can comment on it. Maybe I'll put it in a seperate branch for now. I'll see what I can do.On Tue, Jul 26, 2016 at 7:16 AM, Hải Nguyễn Trung <hain...@gmail.com> wrote:Hello Wouter,I have the similar interest in schemaless solution with zero copy serialization functionality. May I join the early development of this schemaless FlatBuffers, if this version towards to community driven?I have some prior experience with schema design of Capnproto, FlatBuffers, ProtoBuf, as well as schemaless one like MsgPack, Amazon Ion. I think of getting the best of both worlds: flexibility of schema-less, as well as speed of schema design.My main goal is a format of object data with reflection forn mutation, and certainly high speed.
On Tuesday, July 26, 2016 at 1:31:43 AM UTC+7, Wouter van Oortmerssen wrote:srinisingha: FlatBuffers is really designed for strongly typed (schema) use cases. Though we have some reflection functionality, it is clumsy compared to a system that was designed for it. The object API does not improve this situation.If you want to store any data (schema-less), you may be better off with a different serialization solution for the moment. I have been working on a schema-less version of FlatBuffers, but that may take a while longer to be ready.agallego: currently that is not possible, but I agree it should be possible to substitute your own data-types. Whatever string you replace it with would need to correspond to the STL API though.On Fri, Jul 22, 2016 at 11:58 AM, agallego <galleg...@gmail.com> wrote:Hi Wouter,Is there a way to define the string type used in the schema file? ie.:typedef string my::StringThis object API works really well for my RPC. You can simply use the type and I'll build it and put it on a flatbuffer and ship it downstream.Thanks.
On Wednesday, July 20, 2016 at 8:47:49 PM UTC-4, Wouter van Oortmerssen wrote:On Wed, Jul 20, 2016 at 11:20 AM, Wouter van Oortmerssen <w...@google.com> wrote:The first version will be out shortly :)On Mon, Jul 18, 2016 at 5:52 PM, <srini...@gmail.com> wrote:Great to hear that. Thank you Wouter.May I take a preview about new API, hence to evaluate the efficiency to use it as data structure for the database?Thank you. Srini.
On Tuesday, July 19, 2016 at 3:30:52 AM UTC+7, Wouter van Oortmerssen wrote:Yes, the reflection API does automatic resizing when mutating strings etc. It is not ideal, but it is useful for patchups.The new API has an optional way to unpack a FlatBuffer into C++ data structures, which can then be mutated and serialized back with a single call. Of course this not in-line, and thus not as efficient as the base flatbuffer reading/writing code.On Sun, Jul 17, 2016 at 6:30 PM, <srini...@gmail.com> wrote:Thank you Wouter, I will take a look of Reflection API.In any case, update in place with new value that bigger than old one cause the whole buffer to be moved to larger buffer to better suit the new value. What is forthcoming API you mentioned? Can we overcome this re-allocation issue with an affordable price?I saw in CapnProto, it use multiple arenas (like multiple memory zones - I guess) which seem reasonable. What do you think about that?Thank. Srini.
On Thursday, July 14, 2016 at 11:50:58 PM UTC+7, Wouter van Oortmerssen wrote:srinisingha: JSON is even less mutable than FlatBuffers (you can't modify it in place), so if JSON is sufficient for a database, FlatBuffers will be better.FlatBuffers can be modified in C++ using the reflection API, but that isn't easy. There's also a forthcoming object API that will make mutation a lot easier. If you're using a language other than C++, this may take a while to arrive.On Thu, Jul 14, 2016 at 2:38 AM, <srini...@gmail.com> wrote:I have an idea of storing flatbuffer instead of JSON in a data store (Kind of MongoDB). But it seems FB is intended for immutability (read heavily), not for mutable operation (write/update). For example, update a string field may case performance degradation because the total memory layout has been changed. I Flatbuffer suitable to use as data layout for database?
On Saturday, May 28, 2016 at 9:38:53 AM UTC+7, David Budworth wrote:Not actually in use yet, but I'm likely to use it in a trading app where we have to transmit a high speed data set from our exchange interface, written in go, to a set of back trade idea finding servers, written in java.Looking at flatbuffers for both the minimal allocations (read: gc overhead) and fast encode/decode.
On Wednesday, May 20, 2015 at 11:06:52 AM UTC-7, Wouter van Oortmerssen wrote:We've seen a lot of people use FlatBuffers, both here and on GitHub, yet we have no idea what products FlatBuffers ends up in. At Google we'd like to know, to see what impact FlatBuffers is having.So, what do you use FlatBuffers for?What products will it be used for?How is it improving on whatever you used before?I would expect many of you can't say so publicly, in which case I would appreciate a private email: wvo at "thatsearchengine" dot com.--
You received this message because you are subscribed to the Google Groups "FlatBuffers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to flatbuffers...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.--
You received this message because you are subscribed to the Google Groups "FlatBuffers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to flatbuffers...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.--
You received this message because you are subscribed to the Google Groups "FlatBuffers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to flatbuffers...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.--
You received this message because you are subscribed to the Google Groups "FlatBuffers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to flatbuffers...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.--
You received this message because you are subscribed to the Google Groups "FlatBuffers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to flatbuffers...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.--
You received this message because you are subscribed to the Google Groups "FlatBuffers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to flatbuffers...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.--
You received this message because you are subscribed to the Google Groups "FlatBuffers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to flatbuffers...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.--
You received this message because you are subscribed to the Google Groups "FlatBuffers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to flatbuffers...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
> I feel that in-place reading of data with no parsing / object allocation is a significant enough improvement over msgpack.I figured as much, but I wonder in terms of practical use - memory and speed - but you'll never know without trying - its just that there are so many formats that is becoming a major issue.
I think looking into a shadow vtable with types and names may also be an option, extending the current format and resctricting vtables to same type tables. Disregarding schemaless, the fb format would benefit from being able to decide if a voffset is a specific reference type for copying data etc. But preferably with front to back support.
You suggestion about storing the root offset last is in line with what I wrote earlier: front to back is insufficient, you also need each reference to point to the tail and have the header placed there. So yes, this may sound weird, but is necessary.
In terms of caching, I believe modern intel achitectures work equally well scanning back to front and front to back. Random access beyond one page is never efficient because prefetching is disabled to prevent unnecessary page misses - or something along those lines. For simpler devices there are no prefetching and back to front has same performance. So caching isn't a big deal wrt. front to back vs back to front. In fact, binary seach has been shown to be faster than linear scanning in nearly all array sizes provided an efficient implementation (which is rarely seen though).
As to top down, there is a need for buffering and when this is done, you still get you offset size optimization because you can adjust at the end (I presume without studying details) - this can also be done with integer arrays. In my FlatCC implementation I also buffer so voffsets can be computed correctly. Offset sizes can also be adjusted. The exception is when the input vector is given entirely in one call. This is a major weakness for very large vectors that are constructed piecemeal, but in most cases the temporary storage can happen on a reusable stack that typically consumes less space than application space equivalent temporary storage. Notably, this tempory stack will be in hot cache, which is a big deal.
The overhead of this extra buffering step is not large due to hot cache and because it is optional: if you provide the entire vector as input, it is not required, and compared to the time needed for constructing a buffer vs. memory bandwidth, it also suggests low overhead (fb write is roughly 10x slower than fb read).I'm sort of tempted to support a fragmented vector type where each fragment can have its own offset size or element size. This solves the issue of buffering very large vectors which is useful for sensor nodes streaming data on the wire, but it is also an additional complexity that will rarely be needed. The alternative is to do this at the application level.
Sorry, I don't buy that reasoning. Whenever someone introduces a new X, there's always complaints we already have too many X (languages, operating systems, file formats, standards..).
I looked into simply writing type/name data along with the current FlatBuffer data, but you'd end up with something way less optimal and flexible than the Schemaless system I am proposing. Schemaless doesn't need vtables since it doesn't deal with optional/missing fields. But what it does need is open ended dictionaries which FlatBuffers doesn't support well. It is too much of a mismatch.
Ok, so front to back can be considered.
In terms of caching, I believe modern intel achitectures work equally well scanning back to front and front to back. Random access beyond one page is never efficient because prefetching is disabled to prevent unnecessary page misses - or something along those lines. For simpler devices there are no prefetching and back to front has same performance. So caching isn't a big deal wrt. front to back vs back to front. In fact, binary seach has been shown to be faster than linear scanning in nearly all array sizes provided an efficient implementation (which is rarely seen though).What about ARM?
But yes, this is probably not the biggest deal. What could be an issue if a Schemaless buffer sit inside a regular FlatBuffer (a use case which I anticipate will be useful).As to top down, there is a need for buffering and when this is done, you still get you offset size optimization because you can adjust at the end (I presume without studying details) - this can also be done with integer arrays. In my FlatCC implementation I also buffer so voffsets can be computed correctly. Offset sizes can also be adjusted. The exception is when the input vector is given entirely in one call. This is a major weakness for very large vectors that are constructed piecemeal, but in most cases the temporary storage can happen on a reusable stack that typically consumes less space than application space equivalent temporary storage. Notably, this tempory stack will be in hot cache, which is a big deal.Top down buffering can be done, but requires more temporary storage (need a whole stack rather than just the current object), which may get unwieldy when mixing dictionaries with other objects, etc.).
Can you give an example of what you think top-down construction of some nested objects would look like, with a variety of data types? I am guessing I am failing to see the advantage.
I'm sort of tempted to support a fragmented vector type where each fragment can have its own offset size or element size. This solves the issue of buffering very large vectors which is useful for sensor nodes streaming data on the wire, but it is also an additional complexity that will rarely be needed. The alternative is to do this at the application level.We still want to keep in-place access fast, so a fragmented vector doesn't sound great to me.
I tend to agree, but in this specific area, there really are a lot, and a lack of restraint. But I agree, access with allocation is desireable.
I have done most of this already in FlatCC and in particular JSON parsing where the low-level typeless write api manages some 5+ stacks, most rather small and the name lookup happens in a ternary trie implemented in generated code, but could have been a binary search in a map instead (though here the situation is reverse since write does lookup while schemaless would have read lookup). And it is not trivial to do, but performance and overhead is not a big deal if done right and the typeless builder with stacks are still only about 1000 lines of C or so, and most of this is alignment and vtable management. JSON parsing is only 4 times slower than direct buffer building, and direct buffer building is on par with C++ +/- some percentages.
(Note that in FlatCC it is actually not necessary to create each string before appending to the vector or storing as a field, since there are _str method variants to do this directly, but the following pseudocode has no such support, even so, such _str methods would be difficult to support in a bottom up only api).
Well, I am not convinced fragmented vectors are worth the complexity, but performance is not a concern to me, compared to the benefits. Fragmentation would be a fixed size 4K 8K, or 16K element count with a parent vector indexing into each fragment, similar to a STL Deque datatype. Since at these sizes we end up hitting 2nd level cache, the indirection is fairly cheap since this parent vector will reside in hot cache similar to a vtable. There is no point in fragmentation for small vectors, and fragmentation should always be trivial to index, i.e. not variable sized fragments. Fragmented vectors would not be an overhead to vectors that are not fragmented since you have to branch out on offset size in the schemaless design, and could branch out on fragmented and non-fragmented access at the same time (roughly, need to see implementation in detail). Memory is totally fragmented anyway down to 64 bytes cachelines typically, but, placement of fragments may affect prefetch if the fragments are smaller than page sizes, which they should not be.
If top-down would require more intermediate structures than bottom-up, I would go with bottom-up. I don't think it does though. I am not sure why we'd require that many stacks/maps etc.
Ok.. I am not sure how I see that is top down. It is completely equivalent to current bottom-up code, if the "begin" calls end up not doing anything important.
The way I see it, you have one stack, and every time you call "add" you just add an item to the stack. Whenever you encounter an "end", it looks at all the items between that and the corresponding start (which I would use a local variable for, but I guess this is one additional stack in your case), and then a) figure out how compact the vector/map can be encoded, and b) sort the items if it is a map. Then replace all the items on the stack with an offset to the newly created object.
Right, they could be their own type. But what is the typical use case?
You cannot create a string and add to table, then create another string and add to the table because when you create the second string, the buffer is blocked by the current table being built. This extends to adding child tables. In top down you have to create them all in advance before adding to the table.
The way I see it, you have one stack, and every time you call "add" you just add an item to the stack. Whenever you encounter an "end", it looks at all the items between that and the corresponding start (which I would use a local variable for, but I guess this is one additional stack in your case), and then a) figure out how compact the vector/map can be encoded, and b) sort the items if it is a map. Then replace all the items on the stack with an offset to the newly created object.Yes, this is largely how it is. Its just that there are a lot of extra details in typed flatbuffers that might not be needed in untyped buffers.In many cases flatcc indeed has a begin operation that does next to nothing, but even then, it sometimes need to something in big-endian.
First of all, the premise is that the buffer can be emitted partially so you don't need memory for the full buffer. If you have large vectors, such as sensor samples, you need to periodically emit data. Yes, you could sent a buffer every time you feel like it, but it may be more convenient and bandwidth efficient, not to mention radio power use, to keep larger series together, but the vector won't fit into RAM on sensor nodes.Another use case is simply draining data from a database query and transmitting those data as a flatbuffer. In a high performance, highly concurrent system, you cannot afford to use very large memory buffers so the problem is the same as for sensor nodes, only the scale is different.Finally, the 2GB limit may be a problem in this context because it forces you to be aware of this limit even if could get away with draining data using only say 16KB of working memory.
You cannot create a string and add to table, then create another string and add to the table because when you create the second string, the buffer is blocked by the current table being built. This extends to adding child tables. In top down you have to create them all in advance before adding to the table.Not necessarily, because the way I see it, nothing of the current table gets built until all children have been collected on the stack and "end" is reached, so writing strings/sub-objects is totally fine.
--
You received this message because you are subscribed to the Google Groups "FlatBuffers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to flatbuffers+unsubscribe@googlegroups.com.
Brian: thanks! Yes, that's what I'm learning too :)Mikkel: It generally will not be possible to add fields to an object after you call "end" on it.
Yeah, I don't see why adding fields to an object (of any structure) before or after a child object requires extra work. This all comes for free when using a stack to build objects.
--
To unsubscribe from this group and stop receiving emails from it, send an email to flatbuffers...@googlegroups.com.
To unsubscribe from this group and stop receiving emails from it, send an email to flatbuffers+unsubscribe@googlegroups.com.
slb.Map([&]() { slb.KeyWithVector("vec", [&]() { slb.Int(-100); slb.String("Fred"); slb.IndirectFloat(4.0f); }); slb.KeyWithUInt("foo", 100);});To unsubscribe from this group and stop receiving emails from it, send an email to flatbuffers+unsubscribe@googlegroups.com.
slb.add { slb in
slb.add("vec"){ slb in
slb.add(-100) // or if you like operators slb << -100
slb.add("Fred") // or if you like operators slb << "Fred"
slb.add(4.0) // or if you like operators slb << 4.0 }
}To unsubscribe from this group and stop receiving emails from it, send an email to flatbuffers+unsubscribe@googlegroups.com.
To unsubscribe from this group and stop receiving emails from it, send an email to flatbuffers+unsubscribe@googlegroups.com.
I have no opinion on binn, I just accidentally found it - it seems to load the data into memory as you say, although I'm not sure why that is necessary.As to SFB, I have only briefly read through the new documentation and not looked at the C++ interface.A few points:The vector has the size field early. This prevents streaming of on the fly generated data such as sensor nodes with little memory, and requires a seek on larger mmap'ed files. Notably SSD needs to update a lot to change little.
The vector also has an intelligent adaptive size field, which may not be a problem for streaming if it can be hinted, but if it has to learned, the vector cannot be streamed before the entire vector is known even if the size is last. I did not look at how vectors are referenced, but I presume it goes to the head rather than the tail. It needs to be the tail if streaming is to be supported. It looks odd, but performance should not suffer noticeably and can easily be translated while reading.
The map interface and key/string/blob values. It is not entirely clear if/when strings are used as blobs and if they are then zero terminated beyond their size.
For keys, it is extremely useful to support integer keys. I have become aware of a binary search implementation that is about faster than linear scan down to 4 elements or so. This means maps can be highly efficient and there are certain data structures that would directly benefit from such a format. But if all keys must adhere to strcmp performance is all gone. Still nice compared to JSON, but orders of magnitude slower.
The relation to TFB (typed FB) and import/export functionality is unclear to me, after a very brief first reading.Also, are array elements homogenous, meaning you can't ever mix floats and strings as you can in JSON in the general case?
This is generally not a streaming friendly format (much like TFB), since it is a DAG with offsets that are not guaranteed to be neatly contained.Also, much like TFB, it will want the entire buffer in memory before it is finished in the general case. That is because it may still be referring to earlier keys or key vectors, or other shared data objects.
That said, it is very suitable for streaming in case where you do not want to hold on the items you're streaming, since those can be written into the buffer first, and the vector that refers to them later (unlike TFB, there are no inline structs, vectors always come after all of its non-scalar content).
The vector also has an intelligent adaptive size field, which may not be a problem for streaming if it can be hinted, but if it has to learned, the vector cannot be streamed before the entire vector is known even if the size is last. I did not look at how vectors are referenced, but I presume it goes to the head rather than the tail. It needs to be the tail if streaming is to be supported. It looks odd, but performance should not suffer noticeably and can easily be translated while reading.Can you be precise about the exact use case, because I'm not seeing when this is necessary. Are you talking about streamed writes (as what I am responding to above) or streamed reads? Again same caveats apply that in the general case this format is not designed for streaming internally to the buffers.
Currently when a string is a blob is up to the user, and will always be followed a 0 terminator. I'm considering making blob its own separate type, as this may be cleaner. That depends on encoding though, as the current main type field is 4 bits, and we're running out of enums. I may merge it with the 2 bit vector type field to gain more room.
For keys, it is extremely useful to support integer keys. I have become aware of a binary search implementation that is about faster than linear scan down to 4 elements or so. This means maps can be highly efficient and there are certain data structures that would directly benefit from such a format. But if all keys must adhere to strcmp performance is all gone. Still nice compared to JSON, but orders of magnitude slower.Integer key-ed maps may indeed by nice to have, but I would simply make this its own type, i.e. it would use a typed vector of ints as its key vector and use a custom binary search.
The relation to TFB (typed FB) and import/export functionality is unclear to me, after a very brief first reading.Also, are array elements homogenous, meaning you can't ever mix floats and strings as you can in JSON in the general case?They are independent (an incompatible) formats, but one can be embedded in the other (typically SFB inside a TFB, but who knows there are use cases for the reverse). They would be convertable into eachother much like JSON (and also into/from JSON).
I am referring to streamed writes, which is when you would have to provide the hint. For example a sensor node might write humidity to a vector every two seconds for half a day and given the value is in percentage, it has a pretty good idea about the range.
But in general I am also referring to situations like map/reduce processing really large files that don't sit well in memory, such as extracting the area codes of phone lists. Here it is very expensive to load the data into a database and a buffering format won't work well. Since a process may have multiple steps, it is equally important that the format can be read partially while generating the new output. Currently CSV is about the only format that I am aware of, that can do this.
It is sort of acceptable to require that offset vectors be buffered because you can segment the raw data.An inline vector of integer pairs can be handled as two vectors, as in column oriented db storage, which is very efficient for reads. For streamed writes it won't work. Hence inline struct vectors with tail stored size fields may still be desireable.
I did not look into the type encoding, so can't speak of that but msgpack v5 apparently needed a separate utf8 string.
Integer key-ed maps may indeed by nice to have, but I would simply make this its own type, i.e. it would use a typed vector of ints as its key vector and use a custom binary search.If you mean format supported type, then this is fine, if you mean the user can simple create two vectors, then no so much:this would be hard to communicate in a general format. Similar to recent discussions about DIY struct unions in TFB.
They are independent (an incompatible) formats, but one can be embedded in the other (typically SFB inside a TFB, but who knows there are use cases for the reverse). They would be convertable into eachother much like JSON (and also into/from JSON).This didn't answer my question about mixed types (and I didn't bother to read spec carefully enough) - but can an offset reference different types in different vector elements? This will be closer to encode all of JSON and further away from being convertible to TFB. Not saying one is better than the other.
Actually I think TFB should have vectors of unions where the union type is embedded in the tables vtable, if that were an option.
Mikkel,I am referring to streamed writes, which is when you would have to provide the hint. For example a sensor node might write humidity to a vector every two seconds for half a day and given the value is in percentage, it has a pretty good idea about the range.Ahh I see what you're saying. Basically the size field may cause the vector to needa bigger bit-width, which is only known at the end.
But in general I am also referring to situations like map/reduce processing really large files that don't sit well in memory, such as extracting the area codes of phone lists. Here it is very expensive to load the data into a database and a buffering format won't work well. Since a process may have multiple steps, it is equally important that the format can be read partially while generating the new output. Currently CSV is about the only format that I am aware of, that can do this.Having a buffer partially in memory is a use case I would want people to use thru mmap, though I guess special purpose streaming readers that work directly with files or networking could exist.
It is sort of acceptable to require that offset vectors be buffered because you can segment the raw data.An inline vector of integer pairs can be handled as two vectors, as in column oriented db storage, which is very efficient for reads. For streamed writes it won't work. Hence inline struct vectors with tail stored size fields may still be desireable.And those two vectors would sit in two separate SFBs? Since you can't write totwo vectors at once.
I'll take your word for it that this is a common use case, though I am not sure I want to flip vectors for it. It is an additional indirection that is not generally needed.
Integer key-ed maps may indeed by nice to have, but I would simply make this its own type, i.e. it would use a typed vector of ints as its key vector and use a custom binary search.If you mean format supported type, then this is fine, if you mean the user can simple create two vectors, then no so much:this would be hard to communicate in a general format. Similar to recent discussions about DIY struct unions in TFB.Yes, a type directly supported by format & API, we'd have the current MAP type and a INTMAP type.
They are independent (an incompatible) formats, but one can be embedded in the other (typically SFB inside a TFB, but who knows there are use cases for the reverse). They would be convertable into eachother much like JSON (and also into/from JSON).This didn't answer my question about mixed types (and I didn't bother to read spec carefully enough) - but can an offset reference different types in different vector elements? This will be closer to encode all of JSON and further away from being convertible to TFB. Not saying one is better than the other.It supports both untyped and typed vectors, so yes. It is meant to be a super-set of JSON. Converting to TFB has the same caveats as converting from JSON to TFB has.
I wasn't actually thinking about that, but that is a valid point too. Without being familiar with the api, my thinking was that you keep appending integers to the vector and it decides at the end what is the smallest possible representation. If
the vector can be flushed during this process, it is too late to make that decision at the end.
mmap is definitely of interest, especially for reading. But for writing large amounts of data you need to flush it occasionally, and mmaped or not, it needs to the next address range in the buffer, or you have to rearrange the buffer later as is the case today with TFB. This lack of partial writing appears to be a major showstopper for some potential flatbuffer users.
Umm I'd rather not be the judge of what a common use case is - common use cases also tend to reflect what is practical with the technology at hand. All I can say is that structured data is hard to write outside of text formats and databases when you cannot buffer everything, and that flatbuffers is close to solving it, but not quite.
OK, that makes sense. I still think there is space for tighter integration where you want it, just as with JSON. Say, if you limit you data set to valid schema data, you can do such and such - for example parsing JSON to flatbuffers as we already can.
I don't think either TFB or SFB are great for streaming data larger than what fits in memory, by their very definition (being offset based) it is not a good match. Many binary formats (e.g. Protobuf) support streaming much better out of the box.
I feel that for both systems, if you want to stream large things, you should be streaming a list of buffers, not expect to stream inside a single buffer.
So maybe SFB could be coaxed into supporting that (hey, at least it supports 64bit sizes unlike TFB), but why? The whole point of offsets is random access, which entails a single buffer should be all memory resident or at least mmap-ed.
Umm I'd rather not be the judge of what a common use case is - common use cases also tend to reflect what is practical with the technology at hand. All I can say is that structured data is hard to write outside of text formats and databases when you cannot buffer everything, and that flatbuffers is close to solving it, but not quite.To design anything, you have to have some use cases in mind, and you will suck at others. No format out there is universally great at everything. TFB was designed with game developers in mind, and I am very happy it is finding a lot of use casesoutside of that, but that is to some extend a coincidence.
I don't see how FB is close to solving that, it is actually one of the worst formats for "cannot buffer everything" scenarios. But that is assuming you insist on it being a single buffer. To me a stream of buffers is a very clean solution. I guess the only bad thing is if you wanted to stream very small records, where both TFB and SFB have some "startup cost".
--
You received this message because you are subscribed to the Google Groups "FlatBuffers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to flatbuffers+unsubscribe@googlegroups.com.
Sounds good. It would only make sense if it made it smaller… Thanks for the reply.
To unsubscribe from this group and stop receiving emails from it, send an email to flatbuffers...@googlegroups.com.
Assuming the recent create buffer with size prefix call aligns the buffers correctly,
this certainly is an option. But you still need to index the buffers separately if you need random access, such as looking up a phone record.
The point is, you can add a final buffer to the stream with the offsets to all the other buffers, and store the location of the buffer somewhere. If you replace buffer with table and last buffer with vector of tables, you have a standard portable way of doing all that. And there isn't anything intrinsic preventing that from happening in flatbuffers. But it depends on the use case, sometimes it really is better with separate buffers. Either way, you need to limit the active data set in memory while writing, but keeping smaller arrays in a single buffer is much simpler than maintaining a separate non-portable index.
So maybe SFB could be coaxed into supporting that (hey, at least it supports 64bit sizes unlike TFB), but why? The whole point of offsets is random access, which entails a single buffer should be all memory resident or at least mmap-ed.Well exactly, but reading and writing are separate steps - so why prevent efficient mmap reading of all data for reading because because the buffer is split into separate sub buffers due to a technicality during write. Also, why would you use another supposedly better streaming format that is at least 100x slower, has no mmap access for read, and requires much more memory all data should be accessible at once, and/or requires a much larger code base for a library or database, when it isn't really necessary? But this perhaps is more relevant to a
modified TFB since I have no clue about time/space metrics of SFB.
A lot of technology is used, adapted and optimized outside the originally intended use case because the technology has unforeseen benefits. Granted you can't design for the unforeseen, but you can adapt.
I already answered above with this sentence in mind. You can easily make flatbuffers mostly streamable so only the high level arrays need to stay in memory. But it possibly requires a top down stack approach - with is already present in flatcc but may not be as natural in other implementations? The startup cost is a significant concern, especially due to alignment overhead. But the lack of portable indexing of buffers is also a concern although it is a workable solution.
Mikkel,Assuming the recent create buffer with size prefix call aligns the buffers correctly,Yup, it takes particular care of that. That is easy in TFB because things are built backwards, in SFB if you wanted to length-prefix a buffer after the fact that would be inefficient, though I suppose you can pre-allocate one at the start of the buffer.
this certainly is an option. But you still need to index the buffers separately if you need random access, such as looking up a phone record.True. Though I'd expect (yes, I am going to make use-case assumptions again!) that many uses of streaming are when the items are pretty loosely related, such as with logging, time-series data, or multiplayer networking messages.
The point is, you can add a final buffer to the stream with the offsets to all the other buffers, and store the location of the buffer somewhere. If you replace buffer with table and last buffer with vector of tables, you have a standard portable way of doing all that. And there isn't anything intrinsic preventing that from happening in flatbuffers. But it depends on the use case, sometimes it really is better with separate buffers. Either way, you need to limit the active data set in memory while writing, but keeping smaller arrays in a single buffer is much simpler than maintaining a separate non-portable index.It seems to me that if you can't keep data in memory or mmap them, a vector of offsets to them is not all that useful, since you're not that likely to want to access them in random order (more likely to just incrementally process them).
Also when are you going to send this vector of offsets? at the "end"? Many uses of streaming have no clear end. And if this vector could be sent intermittendly, then now you need to distinguish between two kinds of buffers.
In all, I think, if needed, this kind of structure is really easy to build on top of SFB (or TFB), allowing each use case to make their own trade-offs.
So maybe SFB could be coaxed into supporting that (hey, at least it supports 64bit sizes unlike TFB), but why? The whole point of offsets is random access, which entails a single buffer should be all memory resident or at least mmap-ed.Well exactly, but reading and writing are separate steps - so why prevent efficient mmap reading of all data for reading because because the buffer is split into separate sub buffers due to a technicality during write. Also, why would you use another supposedly better streaming format that is at least 100x slower, has no mmap access for read, and requires much more memory all data should be accessible at once, and/or requires a much larger code base for a library or database, when it isn't really necessary? But this perhaps is more relevant to aSure, I see this. But pretty easy to achieve with a steam of buffers and your index vector.
Or, like I said in the past, this particular use case can also be streamed in a single buffer if a) you don't expect to share strings between two consecutive streamed items, and b) if the vector of offsets to all items can fit in memory and c) we'd modify the writing code to not assume a std::vector.
modified TFB since I have no clue about time/space metrics of SFB.To model "objects" in SFB you have 2 choices: maps or untyped vectors. The former uses binary search for lookup, so is naturally not particularly fast, though is more compact than the binary search feature in TFB. Untyped vectors should be pretty fast, and only a very small constant step slower than TFB, and often more compact than TFB (because of variable bit-width and smaller header).
A lot of technology is used, adapted and optimized outside the originally intended use case because the technology has unforeseen benefits. Granted you can't design for the unforeseen, but you can adapt.Sure, and I love opening up new "markets", but adapting a binary format can sometimes be hard when set in stone :)
Maybe have a look at the current C++ code and docs, it's not particularly big, and it has a stack. "top down stack" is ambiguous to me :)