Serialize/De-serialize large data using Flatbuffers, how to?

hardik patel

Jan 13, 2016, 3:42:09 PM
to FlatBuffers
Say, following is my sample schema file:

<email_header.fbs>

namespace EmailNamespace;

table Recipient {
  display_name:string;
  email:string;
}

table EmailHeader {
  recipients:[Recipient];
}

root_type EmailHeader;
</email_header.fbs>


For the email analytics system I am developing, I am relying on FlatBuffers to serialize and deserialize header-related data. I went through the samples and the test code to figure out the best way to accomplish this, but could not work out the following:

1) Say I have 1000 recipients in a given email; how do I serialize them into FlatBuffers?
2) I tried taking CreateUninitializedVector() approach as follows:
Recipient *recip_buf = nullptr;
std::vector<Recipient> recip_data;
auto recip_1K = fbb.CreateUninitializedVector<Recipient>(1000, &recip_buf);
// For this memcpy line to work, I have to fill recip_data with all 1000 recipients first.
// Since recip_data holds Recipient objects, I would need a way to populate 1000 Recipient instances before even calling memcpy().
memcpy(recip_buf, recip_data, 1000);

Question # 1) Is it okay to modify the header generated by flatc.exe to add setters/getters for the fields of the Recipient type (set_display_name(), set_email())?
Question # 2) Is there a better alternative for achieving this?

Any pointers/help is greatly appreciated.

Thank you,
Hardik

hardik patel

Jan 14, 2016, 6:53:55 AM
to FlatBuffers
//This is how I implemented this, but I still have a few issues/questions. Please help!

#define VECTOR_SIZE 5 //I initially set this to 1000, then reduced it to 5 for the issue that I have encountered.
//To start creating a buffer, create an instance of FlatBufferBuilder
flatbuffers::FlatBufferBuilder fbb;

//How do I populate &part_buf programmatically using application data (1000 elements)
fbb.StartVector(VECTOR_SIZE, sizeof(Recipient));
printf("\nsizeof(Recipient)=%d", sizeof(Recipient));
std::string display_name = "d#";
std::string email = "u-";
std::string email_domain = "@nothing.com";
std::string i_str;

for (int i = 0; i < VECTOR_SIZE; i++)
{
i_str = std::to_string(i);
display_name.append(i_str);
email.append(i_str);
email.append(email_domain);

fbb.PushBytes(reinterpret_cast<const uint8_t *>(display_name.c_str()), display_name.length());
fbb.PushBytes(reinterpret_cast<const uint8_t *>(email.c_str()), email.length());
display_name.clear();
email.clear();
display_name = "d#";
email = "u-";
}
auto partVector = fbb.EndVector(VECTOR_SIZE);
EmailHeaderBuilder emailHeader(fbb);
emailHeader.add_recipients(partVector);

auto ehLoc = emailHeader.Finish();
FinishEmailHeaderBuffer(fbb,ehLoc);

printf("\nSize of buffer=%d",fbb.GetSize());

const char *fileName = "emailheaderData.bin";

bool result = flatbuffers::SaveFile(fileName, (const char *)fbb.GetBufferPointer(), (size_t)fbb.GetSize(), true); //I verified this by opening emailHeaderData.bin file and it is written appropriately.
printf("\nSaveFile Result = %d", result);

std::string buffer;
result = flatbuffers::LoadFile(fileName, true, &buffer);
printf("\nLoadFile Result = %d", result);

printf("\nLength of buffer(read)=%d", buffer.length()); //This result matches to the result returned by fbb.GetSize()

const EmailHeader *emailHeader = GetEmailHeader((buffer.c_str()));
const flatbuffers::Vector<flatbuffers::Offset<Recipient>> *recipLocation = emailHeader->recipients();

for (size_t i = 0; i < VECTOR_SIZE; i++)
{
const Recipient *aRecipLoc = recipLocation->Get(i);
std::cout << "[" << i << "]-Display Name" << aRecipLoc->display_name() << std::endl; //crashes here with this (A)
std::cout << "[" << i << "]-Email" << aRecipLoc->email()->c_str() << std::endl;
}

(A) Unhandled exception at 0x000F4141 in PlaywithFB.exe: 0xC0000005: Access violation reading location 0x756FF420. It occurs inside the flatbuffers.h header, line #146, at the call to EndianScalar() inside the ReadScalar() function. I tried various alternatives to fix this without any luck.
(B) I also tried using 'auto' types when collecting the results of the GetEmailHeader() and emailHeader->recipients() calls, but the issue remained.
(C) Also, do I have to stick to PushBytes(...) calls in order to build the vectors dynamically? Is there an alternative approach?

Could you help me figure out the possible issue in this case?

Thank you,
Hardik

hardik patel

Jan 15, 2016, 10:28:59 AM
to FlatBuffers

hardik patel

Jan 15, 2016, 10:33:31 AM
to FlatBuffers
(sorry hit 'Post' too soon by mistake)
Update: I tried directly using fbb ( of type flatbuffers::FlatBufferBuilder) to read the contents of the vector by skipping file write and read operation in between. It still throws:

(A) Unhandled exception at 0x000F4141 in PlaywithFB.exe: 0xC0000005: Access violation reading location 0x756FF420. at this code section in flatbuffers.h header file:
<flatbuffers.h>
template<typename T> T ReadScalar(const void *p) {
  return EndianScalar(*reinterpret_cast<const T *>(p)); //it crashes at this function call.
}
</flatbuffers.h>

This indicates that the issue lies in the buffer creation itself and has nothing to do with the file read/write or any data loss there. I am still trying to understand what could have gone wrong in the buffer creation logic.

Again including logic in this post as follows:

<core-logic-to-build-Flatbuffer>
</core-logic-to-build-Flatbuffer>

hardik patel

Jan 15, 2016, 10:40:36 AM
to FlatBuffers
To avoid cluttering the questions with code, I have created a public gist for the code block I am referring to: https://gist.github.com/hardikrpatel/f614e91e0647a23fc8ad

Patrick Julien

Jan 15, 2016, 11:21:31 AM
to FlatBuffers
A buffer under construction does not keep any information about its components; you need to remember the offsets yourself while writing.

Also, you can have arrays of arrays; all your PushBytes() calls look like a single array.

Patrick Julien

Jan 15, 2016, 11:50:18 AM
to FlatBuffers
Here is your example, debugged. I suggest you look at the wiki and the examples:

#include <iostream>
#include <string>
#include <vector>

#include "email_header_generated.h" // assumed name of the flatc output for email_header.fbs

using namespace EmailNamespace;

int main()
{
  const int VECTOR_SIZE = 5; // initially 1000, reduced to 5 while debugging
  // To start creating a buffer, create an instance of FlatBufferBuilder
  flatbuffers::FlatBufferBuilder fbb;

  const std::string email_domain = "@nothing.com";

  // Remember the offset of each Recipient object as it is written
  std::vector<flatbuffers::Offset<Recipient>> recipients;

  for (int i = 0; i < VECTOR_SIZE; i++)
  {
    std::string i_str = std::to_string(i);
    std::string display_name = "d#" + i_str;
    std::string email = "u-" + i_str + email_domain;

    recipients.push_back(CreateRecipient(fbb, fbb.CreateString(display_name), fbb.CreateString(email)));
  }

  FinishEmailHeaderBuffer(fbb, CreateEmailHeader(fbb, fbb.CreateVector(recipients)));

  auto bb = fbb.GetBufferPointer();
  const EmailHeader *emailHeader = GetEmailHeader(bb);
  const flatbuffers::Vector<flatbuffers::Offset<Recipient>> *recipLocation = emailHeader->recipients();

  for (auto recipient : *recipLocation)
  {
    // display_name() returns a flatbuffers::String pointer, so print it via c_str()
    std::cout << "Display Name: " << recipient->display_name()->c_str() << std::endl;
    std::cout << "Email: " << recipient->email()->c_str() << std::endl;
  }

  return 0;
}

hardik patel

Jan 15, 2016, 4:06:59 PM
to FlatBuffers
Thank you Patrick, this thing worked for me!

However, I see that you used the Create*() helper functions rather than the builder API that sets the fields individually. All this while I was using the latter and kept hitting assert(!nested).

Appreciate your help on this.

Patrick Julien

Jan 15, 2016, 4:38:53 PM
to FlatBuffers
That assertion is telling you that you were creating a new object before the preceding one was finished. You cannot have more than one object under construction at a time. For example, this works:

  const int VECTOR_SIZE = 5;
  flatbuffers::FlatBufferBuilder fbb;
  std::string display_name = "d#";
  std::string email = "u-";
  std::string email_domain = "@nothing.com";
  std::vector<flatbuffers::Offset<Recipient>> recipients;

  recipients.reserve(VECTOR_SIZE);

  for (int i = 0; i < VECTOR_SIZE; i++) {
    std::string i_str = std::to_string(i);

    display_name.append(i_str);
    email.append(i_str);
    email.append(email_domain);

    auto display_name_offset = fbb.CreateString(display_name);
    auto email_offset = fbb.CreateString(email);
    RecipientBuilder builder(fbb);
    builder.add_email(email_offset);
    builder.add_display_name(display_name_offset);
    recipients.push_back(builder.Finish());

    // recipients.push_back(CreateRecipient(fbb, fbb.CreateString(display_name), fbb.CreateString(email)));
    display_name.clear();
    email.clear();
    display_name = "d#";
    email = "u-";
  }

  FinishEmailHeaderBuffer(fbb, CreateEmailHeader(fbb, fbb.CreateVector(recipients)));

  auto bb = fbb.ReleaseBufferPointer();
  auto emailHeader = GetEmailHeader(bb.get());

  for (auto recipient : *emailHeader->recipients()) {
    std::cout << "Display Name: " << recipient->display_name()->c_str() << std::endl;
    std::cout << "Email: " << recipient->email()->c_str() << std::endl;
  }

  return 0;


but this would not:

    RecipientBuilder builder(fbb);
    builder.add_email(fbb.CreateString(email));
    builder.add_display_name(fbb.CreateString(display_name));
    recipients.push_back(builder.Finish());

because the strings are being created after the construction of the recipient has started.
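
The ordering rule can be illustrated with a toy model. The sketch below is not FlatBuffers code: ToyBuilder, its methods, and the exception it throws are hypothetical stand-ins (the real library reports the problem through an assertion). It only demonstrates the constraint that every string must be created before the table that refers to it is opened.

```cpp
#include <cassert>    // for callers that want to assert on the results
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Toy stand-in for a builder that allows only one object under construction.
// Hypothetical: none of these names are part of the FlatBuffers API.
class ToyBuilder {
public:
  // Writes a string into the buffer; refused while a table is open.
  std::size_t CreateString(const std::string &s) {
    RequireNotNested();
    data_.insert(data_.end(), s.begin(), s.end());
    return data_.size();  // toy "offset": total bytes written so far
  }

  void StartTable() {
    RequireNotNested();
    nested_ = true;
  }

  // Storing an already-created offset is fine while the table is open.
  void AddOffset(std::size_t offset) {
    if (!nested_) throw std::logic_error("no table under construction");
    fields_.push_back(offset);
  }

  // Closes the table and returns how many fields it stored.
  std::size_t EndTable() {
    if (!nested_) throw std::logic_error("no table under construction");
    nested_ = false;
    const std::size_t n = fields_.size();
    fields_.clear();
    return n;
  }

private:
  void RequireNotNested() const {
    if (nested_) throw std::logic_error("object already under construction");
  }

  bool nested_ = false;
  std::vector<char> data_;
  std::vector<std::size_t> fields_;
};

// Correct order: strings first, then the table that refers to them.
inline std::size_t GoodOrderFieldCount() {
  ToyBuilder b;
  const std::size_t name = b.CreateString("d#0");
  const std::size_t email = b.CreateString("u-0@nothing.com");
  b.StartTable();
  b.AddOffset(name);
  b.AddOffset(email);
  return b.EndTable();
}

// Wrong order: a string is created while the table is still open.
inline bool BadOrderIsRejected() {
  ToyBuilder b;
  b.StartTable();
  try {
    b.AddOffset(b.CreateString("u-0@nothing.com"));
  } catch (const std::logic_error &) {
    return true;
  }
  return false;
}
```

In this model GoodOrderFieldCount() completes normally, while BadOrderIsRejected() trips the nesting check, mirroring the difference between the two RecipientBuilder snippets.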

hardik patel

Jan 15, 2016, 4:49:37 PM
to FlatBuffers
Very well explained. Thank you Patrick for taking time to explain this!

Cheers,
Hardik

Wouter van Oortmerssen

Jan 19, 2016, 8:01:37 PM
to hardik patel, FlatBuffers
Nice one, Patrick.

Hardik: anything about the documentation that made you think you'd need CreateUninitializedVector for this?

hardik patel

Jan 28, 2016, 11:27:43 AM
to FlatBuffers, hardik...@gmail.com
Wouter, sorry for the late response. I actually started with the documentation, and while exploring how to populate a vector of 1000 elements I ended up reading some of the discussions on the FlatBuffers group, where I found references to the CreateUninitializedVector() method for building a vector on the fly.

Are you asking because this is not an API that consumers of FlatBuffers should use under any circumstances?

Thanks,
Hardik

hardik patel

Jan 28, 2016, 12:31:50 PM
to FlatBuffers, hardik...@gmail.com
Hello Patrick/Wouter,

I have a follow up question on this approach:

Say I have a huge list of BIG objects (each with a significant number of fields). With this approach, when I create the vector, it seems I cannot serialize the buffer to a file until I call FinishEmailHeaderBuffer(...) on my root object (EmailHeader in this case). My questions are:

1) Does FlatBuffers hold everything in memory until you call the Finish...() method?
2) If yes, is there a better alternative where I keep serializing elements to the file as they are encountered, instead of accumulating everything in the vector and serializing afterwards? Something like stream writing?

Ours is a desktop application and we are trying to reduce the memory footprint of the process by externalizing/serializing large lists of BIG objects onto the disk. In this case, instead of relying on native Java/C++ serialization we are trying to use Flatbuffers for faster reads/writes.

Appreciate your help and guidance on this.

Thanks,
Hardik

mikkelfj

Jan 28, 2016, 2:32:13 PM
to FlatBuffers, hardik...@gmail.com


The flatcc C generator can to some extent stream to disk if you add your own emitter (not difficult; there is already a paged design in the default implementation), but you would need a final pass to reorder the flushed data blocks because of the back-to-front writing, or alternatively fix this when reading the data back.

Any vector must be in temporary memory (a stack maintained by the builder) unless it is provided in full as input to the builder. If the vector contains tables or strings, only the references are kept in memory until the vector is done, so the elements can be appended one at a time and land on disk.

Any table must also be kept in memory until complete (but not its strings, vectors, or sub-table fields; these can be flushed ad hoc).

The Go language interface actually builds vectors back to front, which makes the interface efficient, if perhaps a bit inconvenient, and it would allow flushing the vector to disk before it is complete. It might be worth considering for a future version of flatcc for the sake of large data. I have been considering this already, but am not convinced it is a good idea.
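
To make the back-to-front idea concrete, here is a minimal sketch of a buffer that grows from its end toward its start, so the bytes pushed last end up first. BackToFrontBuffer and BuildForward are hypothetical names for illustration only; this is not the flatcc, Go, or C++ builder, just the general technique under the assumption of a fixed capacity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Fixed-capacity buffer that is filled from the back toward the front.
// Hypothetical illustration, not part of any FlatBuffers implementation.
class BackToFrontBuffer {
public:
  explicit BackToFrontBuffer(std::size_t capacity)
      : storage_(capacity), head_(capacity) {}

  // Prepends `len` bytes; returns false if the capacity is exhausted.
  bool Push(const void *bytes, std::size_t len) {
    if (len > head_) return false;
    head_ -= len;
    if (len > 0) std::memcpy(storage_.data() + head_, bytes, len);
    return true;
  }

  // Start of the used region; it moves toward lower addresses on each Push.
  const std::uint8_t *data() const { return storage_.data() + head_; }
  std::size_t size() const { return storage_.size() - head_; }

private:
  std::vector<std::uint8_t> storage_;
  std::size_t head_;  // index of the first used byte
};

// Pushes the elements last-to-first so that they read first-to-last.
inline std::vector<std::uint8_t> BuildForward(const std::vector<std::uint8_t> &elems) {
  BackToFrontBuffer buf(elems.size());
  for (std::size_t i = elems.size(); i > 0; --i) {
    buf.Push(&elems[i - 1], 1);
  }
  return std::vector<std::uint8_t>(buf.data(), buf.data() + buf.size());
}
```

Because the first bytes of the finished buffer are the last ones written, blocks flushed to disk early come out in reverse order, which is why a reordering pass (or a reader that compensates) is needed.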

Overall, I think FlatBuffers is best suited to small and medium-sized data; for larger data sets, store multiple FlatBuffers in some overlay file format or database.

Wouter van Oortmerssen

Jan 29, 2016, 3:34:46 PM
to mikkelfj, FlatBuffers, hardik patel
Hardik:

Re CreateUninitializedVector: No, just curious if there's something we need to improve about our documentation.

Re streaming: Currently not possible. It will accumulate the entire FlatBuffer in memory. If that is prohibitive, then separating data into multiple FlatBuffers is the only solution.
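
One possible way to separate data into multiple FlatBuffers is to store each finished buffer as its own length-prefixed record, so records can be appended and read back one at a time. The sketch below shows only that framing, using the standard library. The 4-byte little-endian prefix and the names AppendRecord, ReadRecord, and RoundTripCount are assumptions made for illustration, not part of FlatBuffers; each payload would be the bytes between fbb.GetBufferPointer() and fbb.GetBufferPointer() + fbb.GetSize() of one finished buffer.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Appends one record: a 4-byte little-endian length, then the payload bytes.
// Hypothetical framing; assumes each payload is smaller than 4 GiB.
inline void AppendRecord(std::ostream &out, const std::string &payload) {
  const std::uint32_t n = static_cast<std::uint32_t>(payload.size());
  const char prefix[4] = {
      static_cast<char>(n & 0xFF), static_cast<char>((n >> 8) & 0xFF),
      static_cast<char>((n >> 16) & 0xFF), static_cast<char>((n >> 24) & 0xFF)};
  out.write(prefix, 4);
  out.write(payload.data(), static_cast<std::streamsize>(payload.size()));
}

// Reads the next record into `payload`; returns false at end of stream
// or when the record is truncated.
inline bool ReadRecord(std::istream &in, std::string &payload) {
  unsigned char prefix[4];
  if (!in.read(reinterpret_cast<char *>(prefix), 4)) return false;
  const std::uint32_t n = static_cast<std::uint32_t>(prefix[0]) |
                          (static_cast<std::uint32_t>(prefix[1]) << 8) |
                          (static_cast<std::uint32_t>(prefix[2]) << 16) |
                          (static_cast<std::uint32_t>(prefix[3]) << 24);
  payload.resize(n);
  if (n == 0) return true;
  return static_cast<bool>(in.read(&payload[0], static_cast<std::streamsize>(n)));
}

// Round-trips three payloads through an in-memory stream and counts them.
inline int RoundTripCount() {
  std::stringstream file(std::ios::in | std::ios::out | std::ios::binary);
  AppendRecord(file, "first");
  AppendRecord(file, std::string("a\0b", 3));  // payloads may contain zero bytes
  AppendRecord(file, "");
  int count = 0;
  std::string payload;
  while (ReadRecord(file, payload)) ++count;
  return count;
}
```

With a real file, the same two functions would take a std::ofstream opened in binary append mode and a std::ifstream opened in binary mode. Only one record needs to be in memory at a time, and a reader handling untrusted files should check the length prefix against a sane upper bound before resizing.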

hardik patel

Jan 29, 2016, 6:01:30 PM
to FlatBuffers, hardik...@gmail.com
Thank you mikkelfj for the detailed explanation.

hardik patel

Jan 29, 2016, 6:14:06 PM
to FlatBuffers, mik...@dvide.com, hardik...@gmail.com
Hi Wouter,

Thank you for your response.

Re - "separating data into multiple FlatBuffers is the only solution" 

I saw in one of your posts where you proposed use of nested_flatbuffer annotation as follows:

<your-comment>
If that's not possible and you wanted to serialize them individually, you could still combine them into a single buffer later on like this:

table Buffers { buffers:[Buffer]; }
table Buffer { buffer:[ubyte] (nested_flatbuffer: "MyRoot"); }

The nested_flatbuffer annotation creates a convenient accessor to the nested data.
</your-comment>

Based on this, I went through the sample/test code of the FlatBuffers library but could not find sample code that accesses a nested_flatbuffer field.

Say the following is my schema: how would the writing logic append one object at a time to a file, and how would the reader then read one object at a time?

<sample-schema>
table RecipientList {
  recipients:[Recipient];
}

table Recipient {
  buffer:[ubyte] (nested_flatbuffer: "Recipient");
}

root_type Recipient;
</sample-schema>

Thank you once again for explaining to me the feasibility and design of some of these approaches.

Appreciate it,
Hardik