[boost] Questions about incorporating my library into a new library within Boost

Jooseong Lee via Boost

unread,

Jul 24, 2022, 11:46:33 PM7/24/22

to bo...@lists.boost.org, Jooseong Lee

I recently wrote a general-purpose STL-like ordered associative B-Tree
based container in C++.
.
Its API and usage is very similar to std::set, std::map, std::multiset,
std::multimap,
and when the number of elements is relatively large, my benchmark shows
that it is 2.5~3 times faster than std::set and its friends.

Repo link: https://github.com/frozenca/BTree

This was better than I thought, so I'd like to contribute to the Boost
community by incorporating it within the list of Boost libraries.
I'd like to know if it's possible or if you think it's worth it.

Best Regards,
Jooseong Lee

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Phil Endecott via Boost

unread,

Jul 25, 2022, 10:29:32 AM7/25/22

to bo...@lists.boost.org, Phil Endecott

Jooseong Lee wrote:
> I recently wrote a general-purpose STL-like ordered associative B-Tree
> based container in C++.
> .
> Its API and usage is very similar to std::set, std::map, std::multiset,
> std::multimap,
> and when the number of elements is relatively large, my benchmark shows
> that it is 2.5~3 times faster than std::set and its friends.
>
> Repo link: https://github.com/frozenca/BTree

I would be interested to see a comparison with Beman Dawes'
B-Tree library from 2010:

https://lists.boost.org/Archives/boost//2010/09/170995.php
https://github.com/Beman/btree

Personally, I often use flat sets/maps. Of course these are best
suited when lookups dominate over modifications. It would be interesting
to know how your library compares for lookups.

I see that you have O(log N) n-th element access, that it is an
interesting feature. Does this mean that every non-leaf node has a
count of its descendants?

I don't see how erase(a,b) can be O(log N), is that a typo?

In the past I have used a lot of memory-mapped, mostly read-only,
binary data structures but these days I'm more likely to use
sqlite. Have you tried to benchmark against that?

I note it is not Boost licensed.

Regards, Phil.

Jooseong Lee via Boost

unread,

Jul 25, 2022, 11:08:22 AM7/25/22

to bo...@lists.boost.org, Jooseong Lee, Phil Endecott

>I see that you have O(log N) n-th element access, that it is an
>interesting feature. Does this mean that every non-leaf node has a
>count of its descendants?

Yes, every non-leaf (and leaf) node has a count of keys in the subtree
where that node is the root.

> I don't see how erase(a,b) can be O(log N), is that a typo?

No, it's not.
In order to explain this, I have to explain split() first:
split(Tree&& tree, key a, key b) splits a tree to two trees, say tree_l and
tree_r, so that
tree_l contains all keys less than a, tree_r contains all keys greater than
b.
By B-Tree's nature, this is O(log N).

erase(a, b) works like this:
1. compute lb = lower_bound(a), ub = upper_bound(b) pair. These are
iterators. O(log N)
2. compute order(lb), order(ub). Therefore, order(ub) - order(lb) is the
number of elements that will be erased. O(log N)
3-1. When order(ub) - order(lb) < 30, the tree erases element one by one
3-2. When it exceeds 30, the algorithm first splits the tree to two trees
by calling spit(std::move(*this), lb, ub)
and joins the two split trees. split() and join() is O(log N), and there is
one split() call and join() call,
so the whole complexity is O(log N)

I plan to benchmark mine with abseil's B-Tree soon.
Thank you for pointing out Beman's implementation and sqlite, I'll list
them too.

I'll add Boost license too, thanks for that

Best,
Jooseong

2022년 7월 25일 (월) 오후 11:29, Phil Endecott via Boost <bo...@lists.boost.org>님이
작성:

Jooseong Lee via Boost

unread,

Jul 26, 2022, 4:05:10 AM7/26/22

to bo...@lists.boost.org, Jooseong Lee

Dear Phil, FYI,

>I would be interested to see a comparison with Beman Dawes'
>B-Tree library from 2010:

https://github.com/Beman/btree seems to depend on an older version of
Boost.Endian,
It doesn't compile with the recent versions of boost. I can't find a proper
version of Boost.Endian which is compatible with.

To compare with Google's 2011 B-Tree implementation, (
https://code.google.com/archive/p/cpp-btree/)
I saw the following results:
https://github.com/frozenca/BTree/tree/295a9ec4b5495a83c546fd3e300d54d4c78d8ad8#Performance

Regards,
Jooseong

2022년 7월 26일 (화) 오전 12:03, Jooseong Lee <froze...@gmail.com>님이 작성:

Phil Endecott via Boost

unread,

Jul 26, 2022, 10:41:10 AM7/26/22

to bo...@lists.boost.org, Phil Endecott

Jooseong Lee wrote:
>> > I don't see how erase(a,b) can be O(log N), is that a typo?
>>
>> No, it's not.
>> In order to explain this, I have to explain split() first:
>> split(Tree&& tree, key a, key b) splits a tree to two trees, say tree_l
>> and tree_r, so that
>> tree_l contains all keys less than a, tree_r contains all keys greater
>> than b.
>> By B-Tree's nature, this is O(log N).
>>
>> erase(a, b) works like this:
>> 1. compute lb = lower_bound(a), ub = upper_bound(b) pair. These are
>> iterators. O(log N)
>> 2. compute order(lb), order(ub). Therefore, order(ub) - order(lb) is the
>> number of elements that will be erased. O(log N)
>> 3-1. When order(ub) - order(lb) < 30, the tree erases element one by one
>> 3-2. When it exceeds 30, the algorithm first splits the tree to two trees
>> by calling spit(std::move(*this), lb, ub)
>> and joins the two split trees. split() and join() is O(log N), and there
>> is one split() call and join() call,
>> so the whole complexity is O(log N)

You have to call the destructors on the elements that you have erased,
so it must be O(N) in general. Your description skips the part where you
drop the portion of the tree between the two splits.

If the element type has a trivial destructor and they are stored contiguously,
then you can try to argue that it is better than O(N). This will depend on
the complexity guarantees of your memory allocator. But your elements
are not
stored contiguously; destroying N elements will require O(N) "pages" to be
released, if your pages are fixed-size. So it is still O(N).

Jooseong Lee via Boost

unread,

Jul 26, 2022, 11:06:44 AM7/26/22

to bo...@lists.boost.org, Jooseong Lee, Phil Endecott

> If the element type has a trivial destructor and they are stored
contiguously,
> then you can try to argue that it is better than O(N). This will depend on
> the complexity guarantees of your memory allocator. But your elements are
not
> stored contiguously; destroying N elements will require O(N) "pages" to be
> released, if your pages are fixed-size. So it is still O(N).

Actually, the keys are stored contiguously:

https://github.com/frozenca/BTree/blob/bc62015a3789e5004ec98be0b82587de4edfd488/fc_btree.h#L145-#L167

For trivially copyable types, the type of key array is std::array;
for other types, the type of key array is std::vector. (in this case, your
argument would be right)

What is not stored contiguously is child, this is stored in
std::vector<std::unique_ptr>.
But when N keys are erased, the number of children erased is not O(N),
because a node can have #(fanout) keys.

Best,
Jooseong

2022년 7월 26일 (화) 오후 11:40, Phil Endecott via Boost <bo...@lists.boost.org>님이
작성:

> Jooseong Lee wrote:

Jooseong Lee via Boost

unread,

Jul 26, 2022, 11:19:35 AM7/26/22

to bo...@lists.boost.org, Jooseong Lee, Phil Endecott

> What is not stored contiguously is child, this is stored in
std::vector<std::unique_ptr>.
> But when N keys are erased, the number of children erased is not O(N),
because a node can have #(fanout) keys.

I have to correct this sentence, the number of children erased is O(N).
But the coefficient is small, because practically B-Trees use large fanouts.

Most technically, a call for erase_range(a, b) would be O(log N) + O(K)
where N is the number of whole keys and K is the number of keys being
erased.
I've updated the repo description, thanks for the catch.

Best,
Jooseong

2022년 7월 27일 (수) 오전 12:06, Jooseong Lee <froze...@gmail.com>님이 작성:

Phil Endecott via Boost

unread,

Jul 28, 2022, 8:58:07 AM7/28/22

to bo...@lists.boost.org, Phil Endecott

Jooseong Lee wrote:
>> What is not stored contiguously is child, this is stored in
>> std::vector<std::unique_ptr>.
>> But when N keys are erased, the number of children erased is not O(N),
>> because a node can have #(fanout) keys.
>
> I have to correct this sentence, the number of children erased is O(N).
> But the coefficient is small, because practically B-Trees use large fanouts.
>
> Most technically, a call for erase_range(a, b) would be O(log N) + O(K)
> where N is the number of whole keys and K is the number of keys being
> erased.
> I've updated the repo description, thanks for the catch.

Good, I'm glad I don't have to give the CS101 complexity lecture :-)

One other comment - how do you deal with strings? To get the locality
benefits of the B-Tree you can't store variable-size data totally
out-of-band.

Regarding the general "how do I get this in Boost?" question, I
encourage you to look back through the mailing list archive at the
last few "Is Boost interested in my XYZ libary?" threads. Some people
post useful replies about what to do next which. Cynically, what
happens most often is that there are one or two emails and then
the library author and any replies disappear. Some persistence is
required to make anything happen. One hint: put something more
descriptive (i.e. "B-Tree") in your subject line!

Jooseong Lee via Boost

unread,

Jul 28, 2022, 9:19:13 AM7/28/22

to bo...@lists.boost.org, Jooseong Lee

> One other comment - how do you deal with strings? To get the locality
> benefits of the B-Tree you can't store variable-size data totally
out-of-band.

For std::string and its friends (i.e. not trivially copyable types) it is
stored as something like std::vector<std::string>, so it is not *that* good
(but still better than std::set<std::string>). Google Abseil's B-Tree
implementation use similar approach.

When I personally use my B-Tree for strings, I use raw char arrays like
std::array<char, 16>

> Regarding the general "how do I get this in Boost?" question, I
> encourage you to look back through the mailing list archive at the
> last few "Is Boost interested in my XYZ libary?" threads.

Thanks, I'll have a look.

Best,
Jooseong

On Thu, Jul 28, 2022, 9:58 PM Phil Endecott via Boost <bo...@lists.boost.org>
wrote:

Rainer Deyke via Boost

unread,

Jul 29, 2022, 9:35:43 AM7/29/22

to bo...@lists.boost.org, Rainer Deyke

On 25.07.22 05:45, Jooseong Lee via Boost wrote:
> I recently wrote a general-purpose STL-like ordered associative B-Tree
> based container in C++.
> .
> Its API and usage is very similar to std::set, std::map, std::multiset,
> std::multimap,
> and when the number of elements is relatively large, my benchmark shows
> that it is 2.5~3 times faster than std::set and its friends.

I've pretty much stopped using std::set and co. What I do use is
boost::unordered_map when order doesn't matter and boost::flat_map (from
Boost.Container) when it does. How does your container compare to
these, performance-wise?

--
Rainer Deyke (rai...@eldwood.com)

Vinnie Falco via Boost

unread,

Jul 29, 2022, 11:26:39 AM7/29/22

to boost@lists.boost.org List, Vinnie Falco, Rainer Deyke

On Fri, Jul 29, 2022 at 6:35 AM Rainer Deyke via Boost
<bo...@lists.boost.org> wrote:
> What I do use is boost::unordered_map

There are some VERY interesting things coming to Boost.Unordered :)

Regards

Jooseong Lee via Boost

unread,

Jul 30, 2022, 12:39:23 AM7/30/22

to bo...@lists.boost.org, Jooseong Lee

Dear Rainer,

Compared with boost::unordered_map and boost::container::flat_set,
I repeated inserting/retrieving/erasing 1,000,000 random std::int64_t in
random order,
and see the following results in my machine: (gcc 11.2 -O3)
(repeated 200 times for each target for B-Tree and unordered_set, repeated
10 times for flat_set)

My B-Tree: insert 196.22 ms, lookup 205.64 ms, erase 233.24 ms
boost::container::flat_set : insert 58402.9 ms, lookup 217.73 ms, erase
59075.2 ms
boost::unordered_set : insert 189.45 ms, lookup 232.19 ms, erase 268.01 ms

Regards,
Jooseong

2022년 7월 29일 (금) 오후 10:35, Rainer Deyke via Boost <bo...@lists.boost.org>님이
작성:

Rainer Deyke via Boost

unread,

Jul 30, 2022, 1:00:36 AM7/30/22

to bo...@lists.boost.org, Rainer Deyke

On 30.07.22 06:38, Jooseong Lee via Boost wrote:
> Dear Rainer,
>
> Compared with boost::unordered_map and boost::container::flat_set,
> I repeated inserting/retrieving/erasing 1,000,000 random std::int64_t in
> random order,
> and see the following results in my machine: (gcc 11.2 -O3)
> (repeated 200 times for each target for B-Tree and unordered_set, repeated
> 10 times for flat_set)
>
> My B-Tree: insert 196.22 ms, lookup 205.64 ms, erase 233.24 ms
> boost::container::flat_set : insert 58402.9 ms, lookup 217.73 ms, erase
> 59075.2 ms
> boost::unordered_set : insert 189.45 ms, lookup 232.19 ms, erase 268.01 ms

OK, I'm convinced. Your container seems to have competitive lookup
times, slightly better even, without the drawbacks of flat_set and
unordered_set (i.e. pathological insert/erase times for flat_set,
undefined iteration order and the need to provide a hash function for
unordered_set).

Jooseong Lee via Boost

unread,

Jul 30, 2022, 1:06:23 AM7/30/22

to bo...@lists.boost.org, Jooseong Lee, Rainer Deyke

Thank you.
Can I request a review for the proposal of the Boost.BTree? What should I
do next?

Repo link: https://github.com/frozenca/BTree

Best, Jooseong

2022년 7월 30일 (토) 오후 2:00, Rainer Deyke via Boost <bo...@lists.boost.org>님이
작성:

> On 30.07.22 06:38, Jooseong Lee via Boost wrote:

Jooseong Lee via Boost

unread,

Aug 11, 2022, 4:33:54 AM8/11/22

to bo...@lists.boost.org, Jooseong Lee

> I recently wrote a general-purpose STL-like ordered associative B-Tree based
container in C++.

> Its API and usage is very similar to std::set, std::map, std::multiset,
std::multimap,
> and when the number of elements is relatively large, my benchmark shows
that it is 2.5~3 times faster than std::set and its friends.

> Repo link: https://github.com/frozenca/BTree
> This was better than I thought, so I'd like to contribute to the Boost
community by incorporating it within the list of Boost libraries.
> I'd like to know if it's possible or if you think it's worth it.

> Best Regards,
> Jooseong Lee

Hello, can I ask if Boost is interested in my library and if I can request
a review?
For performance, I saw the following results:

Inserting/retrieving/erasing 1,000,000 random std::int64_t in random order
in gcc 11.2 -O3, averaged 20~200 times:
(https://github.com/frozenca/BTree#performance)

frozenca::BTreeSet : insert 172ms, lookup 199ms, erase 208ms
Google btree::btree_set : insert 248ms, lookup 229ms, erase 243ms
std::set : insert 912ms, lookup 943ms, erase 1002ms
boost::container::flat_set : insert 58527ms, lookup 208ms, erase 59134ms
boost::unordered_set : insert 194ms, lookup 226ms, erase 272ms

If you're interested, I'd appreciate it if you could take a look.

Jooseong

2022년 7월 30일 (토) 오후 2:00, Rainer Deyke via Boost <bo...@lists.boost.org>님이
작성:

> On 30.07.22 06:38, Jooseong Lee via Boost wrote:

Glen Fernandes via Boost

unread,

Aug 11, 2022, 10:19:24 AM8/11/22

to bo...@lists.boost.org, Glen Fernandes

On Thu, Aug 11, 2022 at 4:34 AM Jooseong Lee wrote:
> Hello, can I ask if Boost is interested in my library and if I can request
> a review?

To gauge interest, simply posting on this list is sufficient. The
process for submission for a formal review is described at:
https://www.boost.org/development/submissions.html

And the review process itself is described at:
https://www.boost.org/community/reviews.html

Glen

Jooseong Lee via Boost

unread,

Aug 11, 2022, 11:05:09 AM8/11/22

to Glen Fernandes, Jooseong Lee, bo...@lists.boost.org

> To gauge interest, simply posting on this list is sufficient. The
> process for submission for a formal review is described at:
> https://www.boost.org/development/submissions.html

> And the review process itself is described at:
> https://www.boost.org/community/reviews.html

Thank you! I wasn't aware of these requirements exactly.
I'll post again after I have worked on this to be fully compatible
with the Boost library conditions and fully backward compatible with at
least C++17.

Jooseong

2022년 8월 11일 (목) 오후 11:19, Glen Fernandes <glen.fe...@gmail.com>님이 작성:

Reply all

Reply to author

Forward