Strings and Arenas


Austin Schuh

Jan 14, 2016, 9:06:57 PM
to Protocol Buffers
Hi,

I've got an application where I can't allocate memory while using protobufs.  Arenas have been awesome for doing that.  I'm able to allocate a big block of memory at startup time or stack allocate memory for the arena, and then use that for allocating protobufs.  Thanks!

I'd like to be able to allocate strings in the arena.  I'm willing to do the implementation, and wouldn't mind up-streaming if my implementation is complete enough and there is interest.  It looks like I should start by implementing ctype=STRING_PIECE and then allocate memory in the arena to back it.  The class in //src/google/protobuf:arenastring.h looks like the place to do all the operations.  It looks like I need to modify the interface to provide setters and getters to support STRING_PIECE there.

Is that the right place to start?  Is there any more guidance that you can give me?

Thanks,
  Austin

Feng Xiao

Jan 15, 2016, 2:32:39 PM
to Austin Schuh, Protocol Buffers
Hi Austin,

Thanks for contacting us and offering help!

You are looking in the right direction. We actually open-sourced the StringPiece implementation not very long ago:

It's intended to be used to implement "ctype = STRING_PIECE" for string fields, and since it's merely a <const char*, size_t> pair, it can be pointed at a buffer in the arena. Such features are implemented inside Google, but unfortunately they are not open-sourced due to dependency issues. We plan to get them out eventually but haven't had enough time to work on it. Since we already have an internal version, we probably won't be able to accept your contributions. I also can't give a concrete timeline for when our implementation will be open-sourced. Sorry for that...

If you need this soon, I suggest you implement it as simply as possible. It's better to support only the lite runtime with arenas enabled. Some changes you will want to make:
1. Make ArenaStringPtr work with StringPiece, or introduce an ArenaStringPiecePtr, which might be easier to implement.
2. Update the protocol compiler to use ArenaStringPtr/ArenaStringPiecePtr to store ctype=STRING_PIECE fields and expose a StringPiece API:
// proto
message Foo {
  string bar = 1 [ctype = STRING_PIECE];
}
// generated C++ code
class Foo {
 public:
  StringPiece bar() const;
  // Deep-copies the data, because StringPiece doesn't own the underlying data.
  void set_bar(StringPiece value);
  // Makes the field point at the StringPiece data directly. The caller must
  // make sure the underlying data outlives the Foo message.
  void set_alias_bar(StringPiece value);

 private:
  ArenaStringPiecePtr bar_;
};

Look at the string_field.cc implementation in the compiler directory and you can create a string_piece_field.cc implementation based on that. Most of the work will be done here, including not only the generated API but also all the parsing/serialization/copy/constructor/destructor support.
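
To make the difference between set_bar() (deep copy into the arena) and set_alias_bar() (point at caller-owned data) concrete, here is a minimal, self-contained sketch. It is not protobuf code: Piece, ToyArena, and ToyStringPieceField are hypothetical stand-ins for StringPiece, the arena, and ArenaStringPiecePtr, and a real implementation would likely differ in many details.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for StringPiece: a non-owning <pointer, length> pair.
struct Piece {
  Piece() : data(nullptr), size(0) {}
  Piece(const char* d, std::size_t n) : data(d), size(n) {}
  std::string ToString() const { return std::string(data, size); }
  const char* data;
  std::size_t size;
};

// Hypothetical bump allocator over a caller-provided buffer. This is only a
// toy to keep the example self-contained; it is not protobuf's Arena.
class ToyArena {
 public:
  ToyArena(char* buffer, std::size_t capacity)
      : buffer_(buffer), capacity_(capacity), used_(0) {
    assert(buffer != nullptr);
  }

  // Hands out the next n bytes, or nullptr if the buffer is exhausted.
  // It never falls back to the heap.
  char* Allocate(std::size_t n) {
    if (n > capacity_ - used_) return nullptr;
    char* out = buffer_ + used_;
    used_ += n;
    return out;
  }

 private:
  char* buffer_;
  std::size_t capacity_;
  std::size_t used_;
};

// Hypothetical field holder playing the role of ArenaStringPiecePtr.
class ToyStringPieceField {
 public:
  Piece Get() const { return value_; }

  // Like set_bar(): copies the bytes into the arena so the field no longer
  // depends on the caller's buffer. Returns false if the arena is full.
  bool Set(Piece value, ToyArena* arena) {
    char* copy = arena->Allocate(value.size);
    if (copy == nullptr) return false;
    if (value.size != 0) std::memcpy(copy, value.data, value.size);
    value_ = Piece(copy, value.size);
    return true;
  }

  // Like set_alias_bar(): stores the pointer as-is. The caller must keep the
  // underlying bytes alive for as long as the field is used.
  void SetAlias(Piece value) { value_ = value; }

 private:
  Piece value_;
};
```

With this sketch, a value stored through Set() survives later changes to the source buffer, while a value stored through SetAlias() reflects them, which is why the aliasing setter needs the lifetime guarantee.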

That's pretty much all that's needed to support StringPiece in lite-runtime + arena. A lot more work will be needed to support the other combinations (lite-runtime + no arena, full-runtime + arena, full-runtime + no arena), but since you have a specific target platform and we will open-source the StringPiece support eventually, it's probably not worthwhile to invest time in supporting anything you don't actually need right now.

Hope this helps.

Regards,
Feng 



Austin Schuh

Jan 15, 2016, 2:50:57 PM
to Feng Xiao, Protocol Buffers
Hi Feng,

This is very helpful, thanks!  I'm happy to hear that you are going to open source the implementation eventually, and thankful for the suggestions so I can be API compatible where possible.

With careful googling and knowing what I was looking for, I found a StringPiece implementation in re2 years ago :)

When setting ctype = STRING_PIECE, would you remove/replace the void set_foo(const ::std::string &value) calls, or add additional ones?  Since ::std::string can be converted to a StringPiece pretty easily, leaving them there should be easy.

One of my use cases is to take in chunks of data from a data source and put them together to make a string.  Ideally, I would be able to grow a string in constant time (assuming constant time chunks), but that probably isn't practical.  It looks like I should be able to instead allocate a StringPiece (or the data inside it) inside the arena when the pieces start coming in, and then hand ownership to it via the set_alias_bar() call above when the string finishes?  Is there a better way to do what I'm trying to do?
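
A minimal sketch of the chunk-accumulation idea above, assuming the maximum string size is known up front so a single region can be reserved from the arena before the chunks arrive. ChunkAssembler is a hypothetical helper, not a protobuf class; the region it writes into is assumed to have been allocated from the arena, and the resulting <pointer, length> pair is what would be handed to set_alias_bar().

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <utility>

// Hypothetical helper: appends incoming chunks into a fixed region that the
// caller has already reserved (for example, from an arena). Each Append() is
// a bounds check plus one memcpy of the chunk, so no reallocation happens.
class ChunkAssembler {
 public:
  ChunkAssembler(char* region, std::size_t capacity)
      : region_(region), capacity_(capacity), size_(0) {
    assert(region != nullptr);
  }

  // Copies the chunk to the end of the region. Returns false, leaving the
  // contents unchanged, if the chunk does not fit.
  bool Append(const char* chunk, std::size_t n) {
    if (n > capacity_ - size_) return false;
    if (n != 0) std::memcpy(region_ + size_, chunk, n);
    size_ += n;
    return true;
  }

  // Non-owning <pointer, length> view of everything appended so far. This is
  // the pair that an aliasing setter would store.
  std::pair<const char*, std::size_t> View() const {
    return std::make_pair(static_cast<const char*>(region_), size_);
  }

 private:
  char* region_;
  std::size_t capacity_;
  std::size_t size_;
};
```

The trade-off in this sketch is that the region must be sized for the worst case, since a bump-style arena cannot grow an earlier allocation in place.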

I'll need to support full-runtime + arena, but none of the other combinations.  I'll figure something out to make sure the reflection does something sane in my case (CHECK(false) might work for what I want to do, I'll have to try it and see).  The reflection can cheat in my case since I don't care about it not allocating.

Thanks!
  Austin

Feng Xiao

Jan 15, 2016, 4:30:34 PM
to Austin Schuh, Protocol Buffers
For StringPiece, only set_foo(StringPiece str) is needed.

> Since ::std::string can be converted to a StringPiece pretty easily, leaving them there should be easy.
>
> One of my use cases is to take in chunks of data from a data source and put them together to make a string.  Ideally, I would be able to grow a string in constant time (assuming constant time chunks), but that probably isn't practical.  It looks like I should be able to instead allocate a StringPiece (or the data inside it) inside the arena when the pieces start coming in, and then hand ownership to it via the set_alias_bar() call above when the string finishes?

Yes.

> Is there a better way to do what I'm trying to do?

You may have already noticed that we have another ctype for string fields: ctype = CORD. This Cord string type allows you to concatenate strings more efficiently without reallocating buffers, and it can also let the string fields share the underlying data buffer with the input data chunks. Inside Google we rely heavily on this Cord type to avoid string/bytes copies in parsing and serialization. It's in our open-source plan as well.

> I'll need to support full-runtime + arena, but none of the other combinations.  I'll figure something out to make sure the reflection does something sane in my case (CHECK(false) might work for what I want to do, I'll have to try it and see).  The reflection can cheat in my case since I don't care about it not allocating.

I'm pretty sure that with reflection, the proto descriptors will be allocated on the heap. Is that acceptable in your use case?

Austin Schuh

Jan 15, 2016, 7:37:44 PM
to Feng Xiao, Protocol Buffers
And then I'm assuming foo() is modified to return StringPiece?
 
I'm not worried about using reflection in the non-allocation case.  For us, reflection is mostly useful for testing, ShortDebugString, and other places where the user is willing to pay a larger cost to work with the data dynamically.

Thanks!
  Austin

Feng Xiao

Jan 15, 2016, 7:55:29 PM
to Austin Schuh, Protocol Buffers
Right.
 
 
Sounds reasonable. It might not be that hard either. It's mostly a matter of going through all the "switch (ctype)" cases and adding the missing branches that are stripped out in our open-source process.



aram

Jun 7, 2018, 11:01:28 AM
to Protocol Buffers
I am working on the same problem in my project (arena-allocated strings), and came across this topic. It is 2.5 years old now, so I wonder if arena-allocated strings or StringPiece class is going to be included in official protobuf releases any time soon?

Feng Xiao

Jun 7, 2018, 5:35:54 PM
to aram.hamb...@gmail.com, Protocol Buffers
On Thu, Jun 7, 2018 at 8:01 AM aram <aram.hamb...@gmail.com> wrote:
> I am working on the same problem in my project (arena-allocated strings), and came across this topic. It is 2.5 years old now, so I wonder if arena-allocated strings or StringPiece class is going to be included in official protobuf releases any time soon?

Unfortunately this hasn't happened yet. The main issue is how to adopt Abseil in the protobuf library so that we can use the new string_view type and Cord. We haven't figured that out yet. You can follow up on https://github.com/google/protobuf/issues/1896 and we will post updates there.

X Ah

Sep 3, 2020, 6:19:21 AM
to Protocol Buffers
Hi Feng,
I have an API design problem. Since StringPiece doesn't own its data and can't allocate memory itself, how should the field behave if I call ParseFromString and the message is not on an arena? I think the StringPiece field should be empty if the current message doesn't own an arena, but that would be strange for users. Could you describe how StringPiece is used in protobuf inside Google?
Thanks!

Adam Cozzette

Sep 8, 2020, 6:17:26 PM
to X Ah, Protocol Buffers
Our StringPiece type has been made obsolete by the std::string_view type introduced in C++17, so we will eventually get rid of StringPiece and replace it with std::string_view (or absl::string_view, which has the same API but is available in C++11). So if you are making a local modification to support allocating strings on arenas, it would probably be best to go straight to std::string_view and avoid our StringPiece type. To handle both the arena and non-arena case, the simplest solution would be to store both a std::string_view and a std::string. When arenas are used, you can have the string_view point into arena-allocated memory, and when arenas are not used, it can just point to the std::string's data.
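
A minimal sketch of the "store both a std::string_view and a std::string" suggestion, assuming C++17. DualStringField is a hypothetical class, not part of protobuf, and the arena is simplified to a raw block that the caller has already allocated with enough room for the value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>

// Hypothetical field that works with or without an arena. The string_view is
// always what readers see; it points either into arena memory or into the
// std::string member.
class DualStringField {
 public:
  DualStringField() = default;
  // Copying or moving would leave view_ pointing into another object's
  // owned_ buffer, so this sketch simply forbids it.
  DualStringField(const DualStringField&) = delete;
  DualStringField& operator=(const DualStringField&) = delete;

  // arena_block == nullptr means "no arena": the bytes go into owned_.
  // Otherwise the caller guarantees arena_block has room for value.size()
  // bytes and outlives this field.
  void Set(std::string_view value, char* arena_block) {
    if (arena_block != nullptr) {
      if (!value.empty()) {
        std::memcpy(arena_block, value.data(), value.size());
      }
      view_ = std::string_view(arena_block, value.size());
    } else {
      owned_.assign(value.data(), value.size());
      view_ = owned_;
    }
  }

  std::string_view Get() const { return view_; }

  // True when the current value is backed by the std::string member.
  bool UsesOwnedString() const {
    return !view_.empty() && view_.data() == owned_.data();
  }

 private:
  std::string owned_;
  std::string_view view_;
};
```

The cost of this layout is that every field carries an unused std::string in the arena case, which is the price of the simplicity described above.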

zhijiang liu

Jan 18, 2022, 10:12:16 PM
to Protocol Buffers
Hi, I have a question. When are arena-based strings expected to be released?

Mike Kruskal

Jan 25, 2022, 5:17:48 PM
to Protocol Buffers
We have no plans set to support arena-based strings yet.  This would be blocked on finishing the migration to string_view accessors, but *that* is planned for 2022.