Manual unions vs table wrapping


Matthias Vallentin

Mar 7, 2017, 2:14:48 PM3/7/17
to FlatBuffers
I'm in the process of mapping a data model to flatbuffers and would like to have your opinion about the design options. Here are the two approaches I can't decide between:

enum DataType : ubyte {
  NoneType,
  BooleanType,
  IntegerType,
  CountType,
  RealType,
  TimestampType,
  TimespanType,
  StringType,
  VectorType,
  SetType,
  MapType
}

table MapEntry {
  key:Data;
  value:Data;
}

table Data {
  which:DataType = NoneType;
  boolean:bool = false;
  integer:long = 0;
  count:ulong = 0;
  real:double = 0.0;
  timestamp:ulong = 0;
  timespan:ulong = 0;
  str:string;
  vector:[Data];
  set:[Data];
  map:[MapEntry];
}



This is effectively a manual union. The alternative would be using the union construct directly:

table Boolean {
  data:bool = false;
}

table Integer {
  data:long = 0;
}

table Count {
  data:ulong = 0;
}

table Real {
  data:double = 0.0;
}

table Timestamp {
  data:ulong = 0;
}

table Timespan {
  data:ulong = 0;
}

table String {
  data:string;
}

table Vector {
  data:[Data];
}

table Set {
  data:[Data];
}

table MapEntry {
  key:Data;
  value:Data;
}

table Map {
  data:[MapEntry];
}

union DataUnion {
  Boolean,
  Integer,
  Count,
  Real,
  Timestamp,
  Timespan,
  String,
  Vector,
  Set,
  Map
}

table Data {
  data:DataUnion;
}

What I'm concerned about is two things:
  1. Which approach results in a more space-efficient representation?
  2. Which approach incurs lower access costs at runtime?
Your advice would be much appreciated.

mikkelfj

Mar 8, 2017, 7:12:55 AM3/8/17
to FlatBuffers

What I'm concerned about is two things:
  1. Which approach results in a more space-efficient representation?
  2. Which approach incurs lower access costs at runtime?
The manual union will likely be the most efficient in terms of time and space. Certainly in time because there is less indirection. The type lookup is essentially the same for both representations.

With the manual union there is a cost because the vtable has a non-trivial size. If there are multiple tables of different types, each type will have a different vtable. So if you have a few different types, the cost of vtables may be significant. If you have a lot of tables, many will have the same type and share vtables. If you limit the number of possible types, the vtables will become smaller and make the approach more feasible. Each extra field costs 2 bytes in vtable space.
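For example (my arithmetic, using the numbers above): a vtable holds a 2-byte vtable length, a 2-byte table length, and one 2-byte entry per field, so a unique vtable covering all 11 fields of the manual Data table takes 4 + 11 * 2 = 26 bytes (less when trailing fields are absent), while a single-field wrapper table like Boolean needs only 4 + 1 * 2 = 6 bytes plus padding.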

With a proper union, you will have a reference to a separate table, and that table will have its own small vtable and a reference to that vtable. The two extra references total 8 bytes, and the small vtable consumes about another 10 bytes plus padding. So those roughly 18 bytes per value may cost more than the vtable overhead of the manual approach. Extra space is also "wasted" on extra padding because there are more moving parts.

Overall there is no clear answer. I would say use proper unions when the data you model is complex, and use manual unions for simplicity when you have relatively few mostly simple types.

Note that support is being added for vectors of unions, but these are to my knowledge not yet widely supported. Until then, manual unions would be the most practical way to deal with vectors of unions.

Matthias Vallentin

Mar 8, 2017, 12:30:29 PM3/8/17
to mikkelfj, FlatBuffers
> Overall there is no clear answer. I would say use proper unions when the
> data you model is complex, and use manual unions for simplicity when you
> have relatively few mostly simple types.

Thanks for the detailed explanations, that made the tradeoff clear. I
have one more question about vtable sharing, as it is mentioned in the
"internals" section of the flatbuffers site [1]:

Vtables are shared between any objects that happen to have the same
vtable values.

What does that mean? Say I have this schema:

table T {
x: A;
y: B;
}

table Ts {
xs: [T];
}

Does it mean that the vtable of the elements of xs (instances of T) will
be shared among a subset having the same active elements, e.g., shared
among all instances of T where x and y is present?

> Note that support is being added for vectors of unions, but these are to my
> knowledge not yet widely supported. Otherwise, manual unions would the most
> practical way to deal with vectors of unions.

Great, looking forward to seeing the current restrictions being reduced.

[1] https://google.github.io/flatbuffers/md__internals.html

mikkelfj

Mar 8, 2017, 12:41:11 PM3/8/17
to FlatBuffers, mik...@dvide.com, vall...@icir.org

Does it mean that the vtable of the elements of xs (instances of T) will
be shared among a subset having the same active elements, e.g., shared
among all instances of T where x and y is present? 

Yes, if you have one instance of T only storing x, and another only storing y, these will be different, but if several tables follow the same pattern, e.g. storing only x, or storing both x and y, then they will share the vtable. At least, this is the ideal situation. Some implementations might not do this for speed or simplicity.

Actually, if you store T values that only hold an x value, you can also have Ts share the same vtable as all the T's. This is because the vtable in both cases has an entry for field id 0. Although Ts and T have a different number of fields, there is no need to make the vtable larger than the highest active field, which is field 0 for both T and Ts. I'm not sure if this is done in C++ but I believe it is done in the C implementation AFAIR (I wrote it).

Note that the sharing also depends on default values. If T.x is assigned its default value, the field is normally not written at all and its vtable entry stays empty, unless the value is stored with the force-defaults option.
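In the C++ API that switch looks roughly like this (a minimal sketch; ForceDefaults does exist on the builder, the surrounding comments are my reading of its effect):

    #include "flatbuffers/flatbuffers.h"

    flatbuffers::FlatBufferBuilder builder;
    builder.ForceDefaults(true); // write fields even when they equal the default
    // A field set to its default now still gets a vtable entry, which can
    // change which tables end up sharing vtables.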

Wouter van Oortmerssen

Mar 8, 2017, 4:28:04 PM3/8/17
to mikkelfj, FlatBuffers, vall...@icir.org
The "manual union" produces much larger vtables, many of which will be unique, so that is significant overhead. Your schema for regular unions also has one level of indirection that is not necessary: you could remove table Data, and make MapEntry directly use DataUnion twice. I'd personally recommend regular unions unless the amount of union options was really small.

But the bigger question is whether you need this at all. You are using a strongly typed system (FlatBuffers), and then you write a schema that undoes this by emulating a dynamically typed data structure.

I'd recommend seeing if your data can be made to be more strongly typed.

If you really want to store "any data" as your schema suggests, you'd be better off with FlexBuffers, as that essentially implements exactly this kind of "union of everything" in a much more compact and friendly way.


Matthias Vallentin

Mar 8, 2017, 5:27:49 PM3/8/17
to Wouter van Oortmerssen, mikkelfj, FlatBuffers
> The "manual union" produces much larger vtables, many of which will be
> unique, so that is significant overhead. Your schema for regular unions
> also has one level of indirection that is not necessary: you could remove
> table Data, and make MapEntry directly use DataUnion twice. I'd personally
> recommend regular unions unless the number of union options is really
> small.
>
> But the bigger question is whether you need this at all. You are using a
> strongly typed system (FlatBuffers), and then you write a schema that
> undoes this by emulating a dynamically typed data structure.


Here's some background on my use case: My C++ project has a type-erased
class, call it Data, which is essentially a std::variant<T, U, ...> (a
discriminated union) over the types I've mentioned. There are a few
further types, like IP addresses, TCP ports, etc.

In order to create an efficient wire format, my idea was to simply
create a 1-to-1 mapping between this type-erased structure and a
flatbuffers schema. The reason why I want to use flatbuffers is that
my application spends most of its time (de)serializing these data
instances. They get pumped through the system at rates on the order of
2M/sec, indexed, and queried. I also want to memory-map large sequences
of vectors of these data elements for random access.

Therefore, the static version appeared to be the right solution. Given
the restrictions on what types can go into unions (tables only, at this
point), I ended up with this "manual union," which is unfortunately
space-inefficient due to the large vtable size.

> If you really want to store "any data" as your schema suggests, you'd be
> better off with FlexBuffers, as that essentially implements exactly this
> kind of "union of everything" in a much more compact and friendly way.

Flexbuffers look like the right tool---almost. Users can dispatch on
Reference::GetType() for type-safe visitation of the contained type.
That's what I'm currently emulating. But since my data model is a bit
richer, I would have to encode it into flexbuffers regardless, by
introducing a new discriminator for non-integral types (like an IP
address, for example) and storing all of them in, say, a flexbuffers
string.

What would you say makes more sense in this scenario? Beefed up
flexbuffers or the more elaborate route of flatbuffers?

Maxim Zaks

Mar 10, 2017, 4:02:24 AM3/10/17
to FlatBuffers
May I suggest a third alternative?

table MapEntry {
  key:Data;
  value:Data;
}

table Data {
  boolean:Bool;
  integer:Long;
  count:ULong;
  real:Double;
  timestamp:ULong;
  timespan:ULong;
  str:string;
  vector:[Data];
  set:[Data];
  map:[MapEntry];
}

struct Bool {
  value:bool;
}

struct Long {
  value:long;
}

struct ULong {
  value:ulong;
}

struct Double {
  value:double;
}

As far as I understand, Data can have only one of the fields set. So during loading you just have to do a null check while going through the fields. Or you could still have an enum Type if you don't want to waste time falling through the different cases.
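For illustration, that null-check dispatch might look like this in C++ (a sketch assuming flatc-generated accessors for the schema above, where struct, string, and vector fields all come back as nullable pointers):

    #include "data_generated.h" // hypothetical output of `flatc --cpp`

    const char* DescribeData(const Data& d) {
      if (d.boolean())   return "bool";      // struct fields: nullptr if absent
      if (d.integer())   return "long";
      if (d.count())     return "ulong";
      if (d.real())      return "double";
      if (d.timestamp()) return "timestamp";
      if (d.timespan())  return "timespan";
      if (d.str())       return "string";    // string/vector fields likewise
      if (d.vector())    return "vector";
      if (d.set())       return "set";
      if (d.map())       return "map";
      return "none";
    }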

This design means that there could be up to 10 different vtable instances for the Data table. Each takes 24 bytes (<240 bytes of overhead; if you send kilobytes or megabytes of data, this cost seems negligible to me). If you have repeating tables of the same "type", the vtable gets reused. IMHO it basically depends on how many Data instances you have per buffer and how heterogeneous they are.

That all said, I agree with Wouter that you should try FlexBuffers first, as they were designed specifically for such a use case.
BTW, complex / non-integral types can be expressed as a map or as a vector in FlexBuffers.

Matthias Vallentin

Mar 10, 2017, 10:47:05 AM3/10/17
to Maxim Zaks, FlatBuffers
> May I suggest a third alternative?

I'm not entirely sure I understand the main point of your suggested
schema: is it the absence of the artificial enum, or the wrapping of
some integral types into structs? It sounds like it's the former, but
I'm not sure how the latter fits into the picture, since it wouldn't
reduce the vtable size.

> As far as I understand Data can have only one of the fields set.

Right. It's effectively a sum type (or variant in C++ speak).

> If you have repeating tables of the same "type", the vtable
> gets reused.

Yeah, I wondered about that. Where does FlatBuffers keep track of the
vtables? Is there some map in the context of the FlatBufferBuilder
that gets looked up on construction when encountering duplicates? Say I
have:

table Batch {
xs: [Data];
}

Then with very large batch sizes (>100K), the vtable size may be
negligible, assuming I indeed have at most N unique vtable
instantiations, where N is the number of unique types in Data.

> That all said, I agree with Wouter that you should try FlexBuffers first,
> as they were designed specifically for such a use case.
> BTW, complex / non-integral types can be expressed as a map or as a vector
> in FlexBuffers.

Yeah, I'll do some space comparisons eventually. FlexBuffers seems like
the right tool for the job; it's just a bit more work to manually
translate non-integral types, e.g., the map (which can hold Data as keys
as well, as opposed to the string-keyed Flexbuffers map):

auto x = make_map(); // instance of map<data, data>.
auto e = DataType::Map; // type discriminator.
flexbuffers::Builder fbb;
fbb.UInt(e); // first store discriminator
fbb.Vector([&] { /* then encode map here as data pairs */ });
fbb.Finish();

That sort of encoding is a one-time effort, though. Intuitively, I would
imagine that the unique vtable optimization would achieve greater space
reductions compared to the manual FlexBuffers encoding...but that's just
a hunch right now.

Matthias

Wouter van Oortmerssen

Mar 10, 2017, 1:24:47 PM3/10/17
to Wouter van Oortmerssen, mikkelfj, FlatBuffers
You could try encoding a representative set of data in both. I bet FlexBuffers would be a lot smaller, but yes, FlatBuffers will be type-safer, especially when dealing with custom types.

If you have to be able to distinguish between a uint32_t which contains an IPv4 address and one which holds an arbitrary other kind of value, then yes, you'd have to do your own tagging, which will be a bit wasteful. At the very minimum you'd have to wrap it in a 2-element vector as a (tag, value) pair. Another way would be to manually store 3 vectors of keys, values, and tags.
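With the FlexBuffers C++ builder, the (tag, value) wrapping might look like this (a sketch; the tag constant is a hypothetical application-level value):

    const uint64_t kTagIPv4 = 1; // hypothetical tag

    flexbuffers::Builder fbb;
    fbb.Vector([&] {
      fbb.UInt(kTagIPv4);   // tag
      fbb.UInt(0xC0A80166); // 192.168.1.102 packed into a uint32
    });
    fbb.Finish();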

Maxim Zaks

Mar 10, 2017, 2:36:57 PM3/10/17
to FlatBuffers, maxim...@googlemail.com, vall...@icir.org
Vtables store offsets to the values, relative to the start of the table. So if you have tables which are structurally the same, a vtable can be reused. This is done by the FlatBufferBuilder.
If you are interested in internals I have an article here:
And a transcript of a talk I did last year:
Slide 27 shows a diagram of how a simple example with three tables and one vtable reuse looks as a graph.
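Conceptually the bookkeeping boils down to interning, something like this sketch (my illustration of the idea, not the actual FlatBufferBuilder code):

    #include <cstdint>
    #include <map>
    #include <vector>

    using VTable = std::vector<uint16_t>; // the would-be vtable contents

    // Return the offset of an identical vtable written earlier, or record
    // the new one; sharing falls out of the lookup.
    uint32_t InternVTable(std::map<VTable, uint32_t>& seen,
                          const VTable& vt, uint32_t new_offset) {
      auto it = seen.find(vt);
      if (it != seen.end()) return it->second; // reuse existing vtable
      seen.emplace(vt, new_offset);
      return new_offset;
    }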

Matthias Vallentin

Mar 15, 2017, 7:57:12 PM3/15/17
to FlatBuffers, maxim...@googlemail.com
> Vtables store offsets to the values, relative to the start of the table. So if you
> have tables which are structurally the same, a vtable can be reused. This is
> done by the FlatBufferBuilder.

Good, that's what I was hoping.

> If you are interested in internals I have an article here:
> And a transcript of a talk I did last year:
> Slide 27 shows a diagram of how a simple example with three tables and one
> vtable reuse looks as a graph.

Very helpful, thanks for sharing. I hope these find their way into the
official tutorial eventually. (At least the documentation could link to them.)

Matthias Vallentin

Mar 15, 2017, 8:05:46 PM3/15/17
to FlatBuffers, w...@google.com, mik...@dvide.com
> You could try encoding a representative set of data in both. I bet
> FlexBuffers would be a lot smaller, but yes, FlatBuffers will be
> type-safer, especially when dealing with custom types.

I've done an initial size check after reducing my Data model to the following:

    table Data {
      which:DataType = NoneType;
      integer:long = 0;
      count:ulong = 0;
      real:double = 0.0;
      bytes:[ubyte];
      vector:[Data];
      map:[MapEntry];
    }

Now consider this tab-separated input record, a vector of Data of 208 bytes in ASCII:

1258531221.486539       Pii6cUUq1v4     192.168.1.102   68      192.168.1.1     67      udp     -       0.163820        301     300     SF      -       0       Dd      1       329     1       328     (empty)

Mapping this to flatbuffers produces a 616 byte buffer. I got this value as follows:

    flatbuffers::FlatBufferBuilder builder;
    auto offset = ...; // build Data
    builder.Finish(offset);
    auto size = builder.GetSize(); // gives me 616

Does this ~3x increase over ASCII sound reasonable for the given table above?

Wouter van Oortmerssen

Mar 15, 2017, 9:23:25 PM3/15/17
to Matthias Vallentin, FlatBuffers, mikkelfj
Sadly that is probably reasonable, given that this kind of dynamic typing emulation is really inefficient in FlatBuffers. Try the same in FlexBuffers, it should be a lot closer.

Matthias Vallentin

Mar 18, 2017, 11:19:38 PM3/18/17
to FlatBuffers, matt...@vallentin.net, mik...@dvide.com, Dominik Charousset

Try the same in FlexBuffers, it should be a lot closer.

Unfortunately not much. They end up in the same ballpark. Here's a summary of my benchmarks:

(1) ASCII: 1.0 (baseline)
(2) Raw binary serialization: 2.60
(3) Packed binary serialization: 0.99
(4) FlatBuffers: 3.71
(5) FlexBuffers (manual): 3.23
(6) FlexBuffers (serializer): 3.83
 
Some explanations: the data I'm using is attached to this email. (1) represents the raw ASCII data, with bytes counted via wc -c, but lines starting with '#' excluded. (2) represents a simple binary serializer that only does byte-swapping of integers and varint encoding of the length of sequences. This is the default serializer in the C++ Actor Framework (CAF) without modifications. In fact, I played with my variant implementation and switched the discriminator type from a size_t to uint8_t, which got this down to 1.6. In my workload, everything is a type-erased structure implemented as a variant. (3) is a twist of (2) that does varint and zig-zag coding of all integral types, effectively bringing the size back to that of the ASCII representation. (4) was my attempt to coerce the data into a flatbuffers structure, with the table below. (5) is the same as (4), but with flexbuffers. (6) implements the CAF serializer API on top of flexbuffers, whereas (5) was the hand-rolled version comparable to (4).

Here's the FlatBuffers table for (4), where the which field is a giant enum.

table Data {
  which:DataType = NoneType;
  integer:long = 0;
  count:ulong = 0;
  real:double = 0.0;
  bytes:[ubyte];
  vector:[Data];
}


My conclusion: with flat- and flexbuffers and a workload of type-erased, heterogeneous data instances, the price is within 3-4x. With massive data volumes, this is not tractable without compression. However, compression defeats the purpose in my case because I want to memory-map large sequences of data and then perform random access.

I wonder what would happen if I used filesystems with transparent compression, such as ZFS. While it wouldn't bring down the virtual memory size, it may have a drastic effect on storage, plus some extra cost incurred by the FS transparently decompressing blocks.
Attachment: conn.log

mikkelfj

Mar 19, 2017, 3:42:08 AM3/19/17
to FlatBuffers, matt...@vallentin.net, mik...@dvide.com, dominik.c...@haw-hamburg.de


On Sunday, March 19, 2017 at 4:19:38 AM UTC+1, Matthias Vallentin wrote:

Try the same in FlexBuffers, it should be a lot closer.

Unfortunately not much. They end up in the same ballpark. Here's a summary of my benchmarks:

(1) ASCII: 1.0 (baseline)
(2) Raw binary serialization: 2.60
(3) Packed binary serialization: 0.99
(4) FlatBuffers: 3.71
(5) FlexBuffers (manual): 3.23
(6) FlexBuffers (serializer): 3.83

Some more pointers from my own and others' tests, from memory:

gzip msgpack and gzip json are about the same, with gzip json possibly being faster.
LZ4 flatbuffers vs LZ4 json: both save about 50%, but flatbuffers is double the size of json on large data and half of json on small data.
gzip is about half the size of LZ4 (and slower).
The schema-driven flatbuffers <-> json conversion I implemented with flatcc makes the json path about a factor of 4 slower than building flatbuffers directly. Thus, converting to and from json and compressing with gzip or zstd (if you can live with the patent clause) might be an option if you can manage buffers for expansion.

I once tested my own implementation of a length-prefixed integer format where the first few bits encode the length of the rest in a single format up to very long strings. I expected the smaller size to pay off in terms of disk I/O and memory bandwidth, but it was a tie due to the overhead of encoding/decoding. That was a few years ago and did not consider multi-core aspects.
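For concreteness, a toy C++ version of that idea (my interpretation only: it spends a whole tag byte on the length, whereas the format described presumably packs value bits into the first byte as well):

    #include <cstddef>
    #include <cstdint>

    // Encode v as one tag byte holding the count of value bytes (0..4),
    // followed by that many little-endian value bytes.
    size_t encode(uint32_t v, uint8_t out[5]) {
      size_t n = 0;
      for (uint32_t t = v; t; t >>= 8) ++n; // bytes needed for v
      out[0] = static_cast<uint8_t>(n);     // the length prefix
      for (size_t i = 0; i < n; ++i)
        out[1 + i] = static_cast<uint8_t>(v >> (8 * i));
      return 1 + n; // bytes written
    }

    size_t decode(const uint8_t* in, uint32_t& v) {
      size_t n = in[0];
      v = 0;
      for (size_t i = 0; i < n; ++i)
        v |= static_cast<uint32_t>(in[1 + i]) << (8 * i);
      return 1 + n; // bytes consumed
    }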

I wonder what would happen if I use filesystems with transparent compression, such as ZFS. While it wouldn't bring down the virtual memory size, it may have a drastic effect on storage, plus some extra cost incurred by the FS transparently decompression blocks.

I think this could be a practical approach, but it depends on how well you can exploit locality of reference to avoid the cost of excessive decompression, and on the choice of compression. It could use a lot of RAM.

Another option, if you have vectors of data, is to use the Arrow format's Union type, which may or may not be supported in the on-disk Feather format. It uses flatbuffers for metadata only.

A union is a table of types (like the new flatbuffers union vectors), but the data is stored in one of several type-specific tables with a bitmap for presence/absence at each row.
Each table is likely much more compressible than separate unions.

Personally I would be inclined to use a database, or compress offline data in suitable chunks, then expand into memory-mapped files (on disk or in memory) and mmap as needed in flatbuffers format - this gives you fast access and simplicity without paying excessive disk cost - ZFS might do this for you. But for advanced database storage layouts, I wouldn't use the flatbuffers format in direct form at scale, only as a schema and external transport. The more compact formats you have been exploring are fine for online high-volume message exchanges in a serial stream, but not for lookup, and due to encoding overhead I would probably still choose flatbuffers, possibly LZ4-compressed, or implement a custom flatbuffers compressor.

Maxim Zaks

Mar 20, 2017, 3:48:54 PM3/20/17
to FlatBuffers, matt...@vallentin.net, mik...@dvide.com, dominik.c...@haw-hamburg.de
First of all, what is the idea behind transforming this data into FlatBuffers or FlexBuffers?
IMHO the killer feature of FlatBuffers and FlexBuffers is random value access.
If this is also the desired goal in your case, I would start by analysing the data and seeing how it can be represented in a more concise way.
So let's walk through the columns and see how they can be represented:
ts: represents a timestamp and looks to me like a typical double - 8 bytes
uuid: it looks like a base64-encoded number to me, but as I guess it would be hard to convert, let's keep it as a string - 12 bytes + size (the size of the size depends on whether you choose Flat or Flex buffers)
id.orig_h: is either an IPv4 or IPv6 address; there are multiple ways it can be represented, but I guess the simplest solution is an array of bytes - 4-16 bytes + size
id.orig_p: is a port; AFAIK it should be representable in 2 bytes
id.resp_h and id.resp_p: are the same
proto: is an enum (tcp/udp), so 1 byte
service: the definition says it should be a string, but honestly it is an enum (unknown/dns/http/ssl), so again 1 byte
duration: looks like a float, but let's be generous and make it a double - 8 bytes
orig_bytes: the schema says count; at first glance it looks like a short, but again let's be generous: uint - 4 bytes
resp_bytes: same as orig_bytes
conn_state: is a string with just a couple of characters; could potentially be an enum, but well - count*bytes + 1 (null termination) + size
local_orig: is a bool - 1 byte
missed_bytes: this is again a count, so let's take uint - 4 bytes
history: string
orig_pkts: count
orig_ip_bytes: count
resp_pkts: count
resp_ip_bytes: count
tunnel_parents: that is a tricky one; the schema says table[string], but effectively in the whole data it is just null, so I don't know what to do about it :)

Having a FlatBuffer which represents this table in a more accurate way should give you a much better compaction rate, especially if you can figure out that some count values don't have to be a uint but maybe just a short, and if you can get rid of some strings in favor of enums, which is a huge win. Generally, strings are quite wasteful when you transfer them from a text-based format into something like FlatBuffers, because they cost the same number of bytes + null termination (1 byte) + size (4 bytes) + memory alignment (up to 7 bytes).
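To make that concrete, a tailored schema along these lines might look as follows (field names and types are my guesses from the analysis above, not a verified schema):

    enum Proto : ubyte { tcp, udp }
    enum Service : ubyte { unknown, dns, http, ssl }

    table Conn {
      ts:double;
      uid:string;
      orig_h:[ubyte];   // 4 or 16 bytes
      orig_p:ushort;
      resp_h:[ubyte];
      resp_p:ushort;
      proto:Proto;
      service:Service;
      duration:double;
      orig_bytes:uint;
      resp_bytes:uint;
      conn_state:string;
      local_orig:bool;
      missed_bytes:uint;
      history:string;
      orig_pkts:uint;
      orig_ip_bytes:uint;
      resp_pkts:uint;
      resp_ip_bytes:uint;
      tunnel_parents:[string];
    }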

Regarding FlexBuffers, I would suggest the following: if it is technically possible for you, store the data not as an array of dictionaries but as a dictionary of arrays.
This is what almost every performance-oriented tabular serialisation format does.

So in your case you would have only one map with column names as keys, where each value is an array of:
doubles, strings, byte arrays, etc.

This means that the keys are stored only once, the data is stored contiguously and therefore should have nice memory alignment, and everything is still nicely accessible like:
table["proto"][25]

I did not take the time to write code to validate my assumptions, so take these as ideas, not confirmed facts.

Wouter van Oortmerssen

Mar 20, 2017, 8:00:29 PM3/20/17
to Maxim Zaks, FlatBuffers, Matthias Vallentin, mikkelfj, dominik.c...@haw-hamburg.de
I don't see how FlexBuffers can be 3.23x bigger than ascii. Looking at your ascii data, that appears to be completely unstructured data, so that is hard to compare against more structured file formats. Is the FlexBuffers version just a single vector of values, much like in ascii, or what?


Maxim Zaks

Mar 21, 2017, 1:27:57 PM3/21/17
to FlatBuffers, maxim...@googlemail.com, matt...@vallentin.net, mik...@dvide.com, dominik.c...@haw-hamburg.de
I made a small script which uses FlexBuffers Swift to create a vector of maps and a map of vectors:

Bear in mind that in my Swift implementation I only implemented "BUILDER_FLAG_SHARE_KEYS", not the other ones.

As you can see in the OUTPUT section
Plain text takes: 1026256 bytes
FlexBuffer vector of maps: 2802582 bytes
So 2.8x more
FlexBuffer map of vectors: 990679 bytes
So: close to 1x

It might be that both binaries can be optimised more. I see that in my implementation a vector of strings is not strongly typed, and as I mentioned, I don't have string sharing implemented either.

I also attach the binary files so you can check them out.



Attachments: log_dict_of_vect.flx, log_vect_of_dict.flx

Maxim Zaks

Mar 22, 2017, 5:40:15 AM3/22/17
to FlatBuffers, maxim...@googlemail.com, matt...@vallentin.net, mik...@dvide.com, dominik.c...@haw-hamburg.de
A quick and dirty implementation of string sharing results in:
vector of maps: 2743574 bytes
map of vectors: 857371 bytes
So now FlexBuffers starts getting smaller than the csv, and it still provides direct random access without parsing.

This exercise reminded me of our discussion about data frames and factors.
I think especially in this case a factor would make sense.
A factor internally consists of two arrays:
1. a labels array, which represents a disjoint set of values (values can be strings but also numeric values)
2. an index array, which states the label index for each actual array value
In the best-case scenario the labels array has fewer than 256 entries and therefore the index array can be an array of bytes.
Current vectors with reuse achieve similar results, but I think having a special type still makes sense, because the Factor accessor class could have methods which let the user access labels directly or ask for all occurrences of a label in the index array. This implies linear search, but it is a very convenient API for tabular data.
Also it would be a no-brainer for users who have tabular data: just create a map of factors.
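A toy version of that factor encoding, just to pin the idea down (the names are mine; this is not an existing FlexBuffers type):

    #include <cstdint>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct Factor {
      std::vector<std::string> labels; // disjoint set of values
      std::vector<uint8_t> index;      // label index per row (< 256 labels)
    };

    Factor MakeFactor(const std::vector<std::string>& column) {
      Factor f;
      std::unordered_map<std::string, uint8_t> pos;
      for (const auto& v : column) {
        auto it = pos.find(v);
        if (it == pos.end()) {
          it = pos.emplace(v, static_cast<uint8_t>(f.labels.size())).first;
          f.labels.push_back(v);
        }
        f.index.push_back(it->second);
      }
      return f;
    }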

Matthias Vallentin

Mar 22, 2017, 12:28:16 PM3/22/17
to mikkelfj, FlatBuffers, dominik.c...@haw-hamburg.de
> I once tested my own implementation of a length-prefixed integer format
> where the first few bits encode the length of the rest in a single format
> up to very long strings. I expected the smaller size to pay off in terms of
> disk I/O and memory bandwidth, but it was a tie due to the overhead of
> encoding/decoding. That was a few years ago and did not consider multi-core
> aspects.

Interesting, that's another route I'm considering: use flatbuffers for
the light-weight framing and a custom encoding for the more spacious
structures.

> > I wonder what would happen if I use filesystems with transparent
> > compression, such as ZFS. While it wouldn't bring down the virtual memory
> > size, it may have a drastic effect on storage, plus some extra cost
> > incurred by the FS transparently decompression blocks.
> >
>
> I think this could be a practical approach, but it depends on how well you
> can exploit locality of reference to avoid the cost of excessive decompression,
> and on the choice of compression. It could use a lot of RAM.

What do you have in mind when you say "it could use a lot of RAM"?
Because the OS keeps the decompressed blocks for the accessed pages in
an opaque cache?

> Another option, if you have vectors of data, is to use the Arrow format's
> Union type, which may or may not be supported in the on-disk Feather format.
> It uses flatbuffers for metadata only.

If I understand it correctly, Arrow makes most sense in a columnar
format. The example schema I attached is just one of many files. While
each one individually makes sense to represent in columnar fashion, what
happens in reality is an interleaved stream of various entries, each of
which has a different type. Each record/row has a unique ID, which a
secondary index spits out during access. The access pattern is random
and lookups are highly selective. Therefore, I was thinking that a
(row-order) flatbuffers format would be a good fit, because it provides
random access and I/O costs would be proportional to the query
selectivity (assuming memory-mapping). While it would be possible to
maintain an Arrow struct for every type and then create a sparse
columnar representation, I would imagine it wouldn't yield the I/O
benefits. That said, I haven't measured it.

Switching to Arrow on the data plane is on my medium-term agenda for
when actual analytics on the data becomes of interest. Right now, my
primary use case is plain search, and I'm trying to minimize the I/O
cost through random access and memory mapping. Hence the idea to go
with flatbuffers.

> Personally I would be inclined to use a database, or compress offline data
> in suitable chunks, then expand into memory-mapped files (on disk or in
> memory) and mmap as needed in flatbuffers format - this gives you fast
> access and simplicity without paying excessive disk cost - ZFS might do
> this for you.

Yeah, the chunk-based approach is exactly what I'm currently doing
manually. The trade-off becomes touching more chunks than needed for the
data vs. the I/O gains you get from smaller data [1].

Matthias

[1] https://www.percona.com/blog/2016/03/09/evaluating-database-compression-methods/

Matthias Vallentin

Mar 22, 2017, 12:37:31 PM3/22/17
to Maxim Zaks, FlatBuffers, mik...@dvide.com, dominik.c...@haw-hamburg.de
Thanks for digging into this!

> IMHO the killer feature of FlatBuffers and FlexBuffers is random value access.
> If this is also the desired goal in your case, I would start by analysing the
> data and seeing how it can be represented in a more concise way.

(Per my other mail, random-access is indeed the desired use case,
motivated by the reduction of I/O cost.)

> tunnel_parents: that is a tricky one; the schema says table[string], but
> effectively in the whole data it is just null, so I don't know what to do
> about it :)

table[string] is an unfortunate description bug: this is effectively a
set and not a mapping, with a custom separator specified at the
beginning of the log file.

> Having a FlatBuffer which represents this table in a more accurate way
> should give you much better compaction rate.

Agreed. In the provided example, modeling the flatbuffer in column-order
may seem the logical thing to do (i.e., a poor-man's version of Arrow).
But wouldn't that spread out the data and destroy locality of reference?

Matthias

Matthias Vallentin

Mar 22, 2017, 12:48:54 PM3/22/17
to Wouter van Oortmerssen, Maxim Zaks, FlatBuffers, mikkelfj, dominik.c...@haw-hamburg.de
> I don't see how FlexBuffers can be 3.23x bigger than ascii. Looking at your
> ascii data, that appears to be completely unstructured data, so that is
> hard to compare against more structured file formats. Is the FlexBuffers
> version just a single vector of values, much like in ascii, or what?

Some more context: each row in this log file maps to a recursive C++
data structure, akin to JSON but with more types and more flexibility
(e.g., all types form a total order and can therefore be used as keys in
maps). For the log file that I attached, you'd roughly get something
like this as the internal C++ representation:

[1258531221.486539, "Pii6cUUq1v4", [192.168.1.102, 68, 192.168.1.1, 67], "udp", -, 0.163820s, ...]

Essentially, a nested, heterogeneous sequence of data. This is
implemented as a discriminated union, which gets encoded in Flexbuffers
as a UInt for the discriminator and then whatever the type demands.
Here's the generic serializer that I used for this experiment:

https://gist.github.com/mavam/e762acbea8ef5d9166dc6aaddd4503e8

By sequentially encoding row after row, I ended up with ~3.8x. Because
each type can sometimes be represented more efficiently (e.g., IP
addresses by default use a 16-byte IPv4-within-IPv6 mapping for uniform
representation), I tried to optimize for these cases and hand-optimized
the serialization routines for the discriminated union. That led to the
factor of 3.23.

Matthias

Maxim Zaks

Mar 23, 2017, 4:51:28 AM3/23/17
to FlatBuffers, w...@google.com, maxim...@googlemail.com, mik...@dvide.com, dominik.c...@haw-hamburg.de
In my first attempt, a FlexBuffers vector of maps, I also ended up with about 3.8x overhead; then I turned on key sharing and it went down to 2.8x. If I implemented key-vector sharing, it would probably go down even more.

If you are going with FlatBuffers or FlexBuffers, I would not do any kind of mapping. The main benefit of going with Flat/FlexBuffers is that the data is already stored in a contiguous array and values can be accessed directly. Your data is read-only. If you need some convenience methods, wrap the FlatBuffers classes or the FlexBuffers buffer in a class which provides such an API. If you want to analyse, visualise, or search the data, try to do it in a streaming way. The buffer/object graph is already in memory; there is no need to replicate it in additional data structures, e.g. recreating an array of all UUIDs. It is already accessible in memory; you can just write a function which runs through the Flex/FlatBuffers vector. So I guess by streaming I mean zero copy :).
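For example, scanning one column straight out of a mapped buffer, with no intermediate copies (a sketch; it assumes the map-of-vectors layout from earlier in the thread, with buf/size pointing at the mmap'ed bytes):

    auto root = flexbuffers::GetRoot(buf, size).AsMap();
    auto uids = root["uid"].AsVector(); // assumed column name
    for (size_t i = 0; i < uids.size(); ++i) {
      flexbuffers::String s = uids[i].AsString();
      // ... match s against the query; nothing is copied out ...
    }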

Now about storing the data as a map of vectors vs. a vector of maps: I would say it depends on what you will do with the data. If you are searching for IPs, for example, then storing it as a map of vectors is actually quite good, because linear search on a tightly packed contiguous array is very efficient. Even visualising data could be efficient if your UI can be tweaked to build up column by column instead of row by row. E.g. you create the visible row containers and populate them with data column by column. If this kind of thing is not possible and you have to work row by row, this might lead to cache misses, but here I would rather measure. It comes down to the good old memory vs. computation time trade-off. We measured that storing data column-wise gives you a 3x smaller memory footprint. Now you have to measure how much slower the computation becomes. Then it will be easier to decide which trade-off you want to favour.

Hope it helps,

Maxim

Maxim Zaks

Mar 23, 2017, 6:21:13 PM3/23/17
to Matthias Vallentin, Maxim Zaks, FlatBuffers, Wouter van Oortmerssen, mik...@dvide.com, dominik.c...@haw-hamburg.de
As you said, I don't think manually shredding records in flatbuffers is
the right approach, as it attempts to emulate existing technology more
optimized for this use case (e.g., Apache Arrow).


Hm, I looked a bit into Arrow. I am much more familiar with Flat- and FlexBuffers because I had the time to reimplement them. For me Arrow seems like a generic approach to column-based random-access data.
Which is a great deal if you have a heterogeneous setup and data should flow through multiple systems.
However I would be hesitant to say optimised, for now. IMHO the best-optimised solutions are the ones that you tailor to your needs. E.g. in the Arrow base requirements (https://github.com/apache/arrow/blob/master/format/Layout.md) I see two points which make me a bit careful:
1. All contiguous memory buffers are aligned at 64-byte boundaries and padded to a multiple of 64 bytes. This is probably ok, but it could introduce some wasting of memory.
2. Arrays are relocatable (e.g. for RPC/transient storage) without pointer swizzling. This is great for taking parts of the data out and resending them; however, as far as I understand, it means no data reuse. So if I have, for example, strings which occur a lot throughout the data set, they will be duplicated, because arrays must relocate without references outside of the array.

If you know your data and don't have a zoo of third-party systems which you have to integrate, it should be possible to tailor the schema in a way which is more efficient than a stock solution. However, how much more efficient it will be, and how much time you are willing to spend on it, depends on your use case :)

So take my advice with a grain of salt.

mikkelfj

Mar 24, 2017, 3:06:02 AM3/24/17
to FlatBuffers, mik...@dvide.com, dominik.c...@haw-hamburg.de, matt...@vallentin.net


On Wednesday, March 22, 2017 at 5:28:16 PM UTC+1, Matthias Vallentin wrote:

>> (On ZFS compression...)

> I think this could be a practical approach, but it depends on how well you
> can exploit locality of reference to avoid the cost of excessive decompression,
> and on the choice of compression. It could use a lot of RAM.

What do you have in mind when you say "it could use a lot of RAM"?
Because the OS keeps the decompressed blocks for the accessed pages in
an opaque cache?
Yes.
 
> Another option, if you have vectors of data, is to use the Arrow format's
> Union type, which may or may not be supported in the on-disk Feather format.
> It uses flatbuffers for metadata only.

If I understand it correctly, Arrow makes most sense in a columnar
format. The example schema I attached is just one of many files. While
each one individually makes sense to represent in columnar fashion, what
happens in reality is an interleaved stream of various entries, each of
which has a different type.

Arrow has support for unions that branch out to separate column types. If all your fields are variants, a record could also be multiple single-field columns of variants. Not saying it is preferable, but possible, and likely more compact. Columns compress better. You don't have to store all data in a single column; you can split it into chunks.
 
Each record/row has a unique ID, which a
secondary index spits out during access. The access pattern is random
and lookups are highly selective. Therefore, I was thinking that a
(row-order) flatbuffers format would be a good fit, because it provides
random-access and I/O costs would be proportional to the query
selectivity (assuming memory-mapping).

Yes. Max also mentioned the tradeoffs, but it also depends on the search pattern before selecting, and the likelihood of needing neighbours soon after via a hot cache.

While it would be possible to
maintain an Arrow struct for every type and then create a sparse
columnar representation, I would imagine it wouldn't yield the I/O
benefits. That said, I haven't measured it.
 
Hard to tell, but I wouldn't recommend doing it; as I said, I just mentioned it as a possibility.