FlatBuffers as Directed graph

1,099 views
Skip to first unread message

Maxim Zaks

unread,
May 26, 2016, 7:13:58 AM5/26/16
to FlatBuffers
Hi everyone,

I wrote a blog post, how I managed to encode and decode directed graphs with cycles in FlatBuffersSwift implementation.

Feedback is very much appreciated.

mikkelfj

unread,
May 27, 2016, 3:12:28 PM5/27/16
to FlatBuffers
I am not sure exactly what you are trying to accomplish but I suspect you intend to store cycles as native FlatBuffer references.

The FlatBuffer format can only store acyclic graphs as direct object references. An offset is allways positive. If you introduce cycles you need a negative offset, and such a buffer will never pass validation by a valid FlatBuffer verifier either because it doesn't like negative offsets, or because a negative offset becomes a very large out of bounds offset.

Having an acyclic requirement vastly simplifies validating buffers.

If you need more complex graph structures, you should use an index table:

Enumerate all persons in a single vector and store all friends as an integer array. It is the same problem to solve, but it avoids violating the FlatBuffers format.

The reflection.fbs schema uses this approach.

Maxim Zaks

unread,
May 28, 2016, 3:40:13 AM5/28/16
to FlatBuffers
Offset is specified as Int32 not UInt32. At least this is what I see in C# Implementation.
Ah I see that in C++ there are two types of offset one signed and one unsigned:

Well, this is unfortunate :)

As far as I know, there is no rule in FBS which prohibits user to declare a recursive type, like the one that I use in the example.
Recursive table definitions can result in DAG but also in cycles. Btw. flatc crashes when feed with recursive struct definition. Which is fine because recursive structs are physically impossible.

Using an index table is a "walking stick". Same as describing DAG or complex networks in JSON or XML. I believe we can do better with FlatBuffers.

I haven't looked at the validator yet, but it should not be more complicated than what I mentioned for decoding. Keep a set of already visited table offsets, if you see that FBS has recursive definitions. It makes the process a bit slower, but again I think it is totally worth it. The users would be able to do so much more with FlatBuffers than with other serialisation formats.

mikkelfj

unread,
May 28, 2016, 7:13:32 AM5/28/16
to FlatBuffers


I haven't looked at the validator yet, but it should not be more complicated than what I mentioned for decoding. Keep a set of already visited table offsets, if you see that FBS has recursive definitions.
 
That is vastly more complicated. It requires dynamic allocation and hurts high performance gateways. Personally I can see the usefulness of cycles, but there are many potential problems, and the fact remains that it is non-portable with respect to existing implementations.

You cannot use C# as a reference for hvad is a valid format. The language neutral datatype uoffset_t is used for offsets, and by default it is a 32-bit unsigned integer pointing forward, although other sizes and directions can be configured as long as the direction is always the same.



Maxim Zaks

unread,
May 29, 2016, 5:21:29 AM5/29/16
to FlatBuffers
I used C# as reference because this was the one that I was most familiar with. However it implies that C# code will not be able to decode binaries which are encoded with C++ and have very high offset numbers. This is a bug.
Now for the next version of FlatBuffers, I think it would be interesting to debate if having offsets as signed int32 would make sense because of the possibility to represent cycles.

As to the complexity during validation. As I mentioned before the overhead comes only if user defines recursive table definitions. If high performance is your goal, avoid recursion.
Actually for real high throughput applications, people should stick with structs as much as possible. However if it is about flexibility cycles is as important feature as being able to define unions or almost as important as backwards and forwards compatibility.

Maxim Zaks

unread,
May 29, 2016, 5:29:27 AM5/29/16
to FlatBuffers
Btw. there is another caveat to being able to represent cycles. We will not be able to export binaries to JSON if they have cycles.
However if the binary represents a graph and not hierarchical tree, the JSON export becomes not really suitable anyways.

mikkelfj

unread,
May 29, 2016, 3:09:35 PM5/29/16
to FlatBuffers


On Sunday, May 29, 2016 at 11:29:27 AM UTC+2, Maxim Zaks wrote:
Btw. there is another caveat to being able to represent cycles. We will not be able to export binaries to JSON if they have cycles.
However if the binary represents a graph and not hierarchical tree, the JSON export becomes not really suitable anyways.

Correct, that was one of the other issues I was referring to. If you export binary flatbuffer schema to JSON, as I do in my sample project


you will notice a graph expansion for the above reason because type objects are shared, or something like that. This can worst case be exponential and is actually a security concern wrt. denial of service attacks.

If you control the schema, this can be mitigated by using integer references in all places, or you can put a limit ot the JSON output size. I think I have an option to limit JSON output size in part because of this expansion problem. Note that I here refer to flatcc's JSON feature, not the above example which is unrelated to that feature.

That said, it is not necessarily a problem with shared sub-graphs in JSON. Lots of JSON has redundancies like that.

Shared sub-graphs and lack of sufficient validation related to overlapping memory regions are two reasons why flatcc does not support in-place mutation. (It does support sorting vectors in-place, but that is intended for the one creationg the buffer in the first place, and can thus be trusted).

mikkelfj

unread,
May 29, 2016, 3:28:56 PM5/29/16
to FlatBuffers

On Sunday, May 29, 2016 at 11:21:29 AM UTC+2, Maxim Zaks wrote:
I used C# as reference because this was the one that I was most familiar with. However it implies that C# code will not be able to decode binaries which are encoded with C++ and have very high offset numbers. This is a bug.
 
I am not familiar with the C# implementation, but Wouter mentioned something related to this recently.
There is a difference between the offset type in the binary buffer and the data type used to represent the data. Thus you can store an unsigned offset in a signed integer. You are correct  that that this limits the represenation size but FlatBuffers should never be issumed to be larger than 2GB, also because vtable offsets are signed offsets and those cannot span more than 2GB. I would not call this a bug, but it means you cannot use an implementation as a definitive format guide, which is what I was referring to earlier.

In addition, constrained platforms may impose much stricter limits on sizes, but that just means that these plaforms are only capable of reading small buffers, which is reasonable. For example, the flatcc reader uses `size_t` to represent vector counts amongst other things. A 16-bit bit compiler may sue 16 bit integers for this. Presumably such a system would never receive a buffer larger than 32K and then it is a non-issue. For builders, flatcc checks that sizes don't exceed the representatble size for the platform (at least in theory).
 
Now for the next version of FlatBuffers, I think it would be interesting to debate if having offsets as signed int32 would make sense because of the possibility to represent cycles.
I have also made suggestions along those lines, but I am not totally convinced it is a good idea. What I definitely would like though, is offsets that go the other way (front to back buffers) because you can the build buffers while appending to a file. The reason this isn't so is purely historical.

As to the complexity during validation. As I mentioned before the overhead comes only if user defines recursive table definitions. If high performance is your goal, avoid recursion.

Nto good enough. It is perfectly valid to have a hierarchy of graphical objects that refer to child level objects of same type. The only way to prevent recursion is to validate there is none and to limit recursion depth - given some adversersary trying to create problematic buffers.
 
Actually for real high throughput applications, people should stick with structs as much as possible.
No, you use tables for schema evolution, and for recursive types as in the above example, which cannot be expressed with structs alone, unless you use integer indexing.
 
However if it is about flexibility cycles is as important feature as being able to define unions or almost as important as backwards and forwards compatibility.

But integer indexing for this purpose is not at all a bad idea and works well with JSON, and gives you the opportunity to sort things in all sorts of interesting ways.

Still, I agree cycles can be useful. But they are also problematic.

Maxim Zaks

unread,
May 29, 2016, 4:40:53 PM5/29/16
to FlatBuffers
Hm Maybe I don't get something again, sorry for being stubborn, but in JSON expansion you can keep a set of already expanded absolute offsets. 
If you hit already expanded table you can decide or configure what to do. 
If it is a closed system and you know you don't have any cycles, let it make duplications. 
If it can have cycles you have to detect them. In this case you need a queue which represents current path. On cycle reject the input.
If it is an open system and can be targeted by DoS attack. reject the input. 
I would enforce schemas, which result in strictly hierarchical data representation if it meant to be converted to JSON.
It would be reasonable to write a schema analysis tool. Which would help users to avoid writing "fragile schemas".

As for mutating in place. I actually made it part of the reader. Reader provides setters for scalar values on tables and in scalar vectors. It also checks if the scalar was set to default and is absent. I don't see how it can be dangerous. Well with vectors I don't have sorted vectors yet. I would have to restrict in place replacement to non sorted vectors only.

Maxim Zaks

unread,
May 29, 2016, 4:57:01 PM5/29/16
to FlatBuffers
So by using int32 in Swift implementation I am actually safe because of the 2GB limit :)
Restricting or I would rather say configuring the offset type might be a pretty good Idea anyways. 
It could be part of the fbs, if you know that you send small payloads there is no need in 4 bytes offsets.
Your comment on struct is totally true I was over simplistic sorry about that.
Keeping limitation of 2GB in mind there is almost no need for using uint32 as offset type.
Having it as int32 allows cycles, building buffers while appending to files, and general in place mutation of buffers.
If you want to replace a non scalar property which has a different size than the given one, add it upfront and exchange the offset.
It is not ideal because the previous child might be a zombie, but I guess there are some cases which might be ok with such situation.
E.g. have a look at "Flattening using FlatBuffers" paragraph of this book:

Back to cycles. For me it is just about user adoption. FlatBuffers becomes an intriguing option for application state representation, if it would support cycles out of the box.

Wouter van Oortmerssen

unread,
May 31, 2016, 5:22:51 PM5/31/16
to Maxim Zaks, FlatBuffers
An table/string/vector offset in FlatBuffers is an unsigned type, and represented as such in languages that support it. If you try to create cycles, this will break in most implementations, and should be avoided. FlatBuffers only supports DAGs. If you wanted a graph, you'd have to use indices to a vector of tables instead.

--
You received this message because you are subscribed to the Google Groups "FlatBuffers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to flatbuffers...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Maxim Zaks

unread,
Jun 1, 2016, 2:53:41 AM6/1/16
to FlatBuffers, maxim...@googlemail.com
Hi Wouter,

I checked all Builder implementations. 
C++, Python and GO do it correctly they distinguish between signed and unsigned offsets.

PHP and JavaScript can't distinguish because of the nature of the language. However I found this article from Mozilla (https://developer.mozilla.org/en/docs/Web/JavaScript/Typed_arrays) - not sure how portable it is. And PHP could use the pack function http://us2.php.net/manual/en/function.pack.php

Java and C# do it wrong, even though uint is a type in those languages.

So the result is 3 do it wright and 4 do it "wrong". Even though, IMHO it should not be a problem, see my last reply to Mikkel.

Now what I am interested more is the rational why cycles has to be prohibited?
I understand that if in next version of flatbuffers will switch to signed offset, it will become not forwards compatible. 
Is this the only reason, or are there any performance concerns that I am not aware of.

Thanks in advance,

Maxim

Wouter van Oortmerssen

unread,
Jun 2, 2016, 1:00:12 PM6/2/16
to Maxim Zaks, FlatBuffers
Java and C# do it wrong, even though uint is a type in those languages.

Java has no uint.

Now what I am interested more is the rational why cycles has to be prohibited?
I understand that if in next version of flatbuffers will switch to signed offset, it will become not forwards compatible. 
Is this the only reason, or are there any performance concerns that I am not aware of.

Yes, this will not be backwards compatible. Signed table offsets are outside of the FlatBuffers spec, and thus produce corrupt binaries that will crash many implementations, present and future.

Even if we were to make a FlatBuffers v2 format that allows signed offsets, I am not sure if we'd want to. Currently the C++ verifier for example is simple & fast (because no cycle tracking required), and can guarantee that traversals are finite, and within a certain bound of steps. That is very important for applications of FlatBuffers that are susceptible to DOS attacks.

Most serialization formats only allow trees. FlatBuffers allows DAGs which is already pretty cool, and allows cyclic graphs using indices into vectors. I don't feel native cyclic graph support is worth it.

Wouter

Maxim Zaks

unread,
Jun 3, 2016, 8:23:06 AM6/3/16
to FlatBuffers, maxim...@googlemail.com


Am Donnerstag, 2. Juni 2016 19:00:12 UTC+2 schrieb Wouter van Oortmerssen:
Java and C# do it wrong, even though uint is a type in those languages.

Java has no uint.
Oh my :)  I feel a bit embarrassed, specifically being long time Java developer :)

Now what I am interested more is the rational why cycles has to be prohibited?
I understand that if in next version of flatbuffers will switch to signed offset, it will become not forwards compatible. 
Is this the only reason, or are there any performance concerns that I am not aware of.

Yes, this will not be backwards compatible. Signed table offsets are outside of the FlatBuffers spec, and thus produce corrupt binaries that will crash many implementations, present and future.
Isn't it just forwards compatibility which will suffer? As we have the restriction of 2GB which as far as I understand is imposed by reusability of vTables (vTable offsets are signed offsets).
With this restriction, every old client will produce offsets which will fit into signed int32 even thought they are stored as uint32.
Only new binaries produced by new clients with an offset pointing to the right and not to the left, will be not valid. Old clients will try to read outside of the occupied space and fail with offset assertion.

Even if we were to make a FlatBuffers v2 format that allows signed offsets, I am not sure if we'd want to. Currently the C++ verifier for example is simple & fast (because no cycle tracking required), and can guarantee that traversals are finite, and within a certain bound of steps. That is very important for applications of FlatBuffers that are susceptible to DOS attacks.
As I discussed with Mikkel, I think those applications should not have a recursive table definition in the first place. We also could introduce a special attribute allow_cycles. If the attribute is not present, verifier can still perform simple & fast validation.

Most serialisation formats only allow trees. FlatBuffers allows DAGs which is already pretty cool, and allows cyclic graphs using indices into vectors. I don't feel native cyclic graph support is worth it.

As far as I understand using indices into vector would look something like following

table Root {
 people
:[Person];
}
table
Person {
 name
: string;
 friends
: [uint];
}

root_type Root;


In this case we still need to do late binding as the index of the person will be determined only after the person is added to the vector.
So first create a vector of zeros (size of friends) for every person. 
Than create person tables pointing to the vector of friends. 
Create vector of people. 
Now replace values inside every vector of friends according to there position inside of vector of people.
This makes encoding hard to implement, and decoding will also have an additional step.

Another possibility would be:

table Root {
  people
: [Person];
  relationships
: [Relationship]
}

table
Person {
  name
: string;
}

table
Relationship {
  freinds
: [uint];
}

root_type
Root;

Now we can first serialise all person tables than build up people vector than build all relationship tables put them in relationships vector etc...
The encoding becomes more obvious to implement but still very tedious and it will be impossible to generate API which will hide the tedious part away. The decoding is now even more complicated.

IMHO 
table Person (allow_cycles) {
 name
: string;
 friends
: [Person];
}
root_type
Person;

Is the most straight forward representation of application state there is. The decoding is amazingly simple. Encoding also makes sense even if we don't generate API which traverses the graph for you and does late bindings.
 

Wouter
 

For me it is a question of usability. 
Application states and also game states often have cycles because this is the nature of the world :)
It would be great if we could represent it in a sane manner even if we would have to explicitly opt in, because of verification or some other reasons.

Regards,

Maxim

Wouter van Oortmerssen

unread,
Jun 6, 2016, 8:00:32 PM6/6/16
to Maxim Zaks, FlatBuffers
allow_cycles would have to behave such that it refuses to generate code for implementations that don't support it.. and you'd have to commit to supporting this for all languages (optionally), which is a fair bit of work.

As I already explained, I don't see this being super useful given the complications.

--

Maxim Zaks

unread,
Jun 9, 2016, 5:53:10 AM6/9/16
to FlatBuffers, maxim...@googlemail.com
The steps that are involved are:

1. Change offset to int32 making all language implementations consistent and also consistent with 2 GB restriction.
2. Generate an additional methods for tables who allow_cycles. Something like auto cursor = monster_builder.reserve_friend(); this method will return the offset from the end of the buffer (I call them cursor) which can be replaced in the late binding phase. Something similar has to be done for unions and addition of elements to vector of tables.
3. Add method to builder class which can replace offsets given a cursor builder.replaceOffset(cursor, offset); This is what a user calls in late binding phase.
4. Either skip validation for tables marked as allow_cycles or build up a set of already visited table offsets and if given property table offset is negative check if it points to a valid table offset.

The implementation in FlatBuffersSwift took just a couple of hours. I guess point (4) is the tricky one and I don't have it in Swift, but I feel like it should not be very "expensive" to implement.

Wouter van Oortmerssen

unread,
Jun 17, 2016, 2:15:21 PM6/17/16
to Maxim Zaks, FlatBuffers
As I said before, it is not quite that simple, since we need to support this for all languages, which is a lot of work. Having a feature which only works in some languages when a certain flag is on is just not that great, given the minor benefits. People have yet to fully utilize the DAG functionality, which is a much bigger deal.
Reply all
Reply to author
Forward
0 new messages