spoilt for choice: choosing the right container

Thomas Lehmann

unread,

Nov 23, 2009, 6:56:33 AM11/23/09

to

Hi!

We have a row base visualization of data and we need a very fast
access to the data
and for updating the individual rows.

Using a sorted vector - initially - everything looks fine as long we
are not forced
to sort on each new added item.

Positive:
- "push_back" is fast
- "distance" is fast
- you can use "reserve"

Negative:
- calling sort very often is bad.
- adding/deleting very often is bad.

Using std::multiset everything looks fine as long we have not to load
big data
at the beginning:

Positive:
- "insert" is (if you like) a "push_back + sort" but fast because of
the
internal binary tree.

Negative:
- there's no way to do a "reserve"
- "distance" is much more expensive

I would need both but how can this be done?

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Jeff Flinn

unread,

Nov 23, 2009, 4:43:50 PM11/23/09

to

Thomas Lehmann wrote:
> Hi!
>
> We have a row base visualization of data and we need a very fast
> access to the data
> and for updating the individual rows.
>
> Using a sorted vector - initially - everything looks fine as long we
> are not forced
> to sort on each new added item.
>
> Positive:
> - "push_back" is fast
> - "distance" is fast
> - you can use "reserve"
>
> Negative:
> - calling sort very often is bad.
> - adding/deleting very often is bad.
>
> Using std::multiset everything looks fine as long we have not to load
> big data
> at the beginning:
>
> Positive:
> - "insert" is (if you like) a "push_back + sort" but fast because of
> the
> internal binary tree.
>
> Negative:
> - there's no way to do a "reserve"

Use a different allocator.

> - "distance" is much more expensive
>
> I would need both but how can this be done?
>

See http://www.boost.org/doc/libs/1_41_0/libs/multi_index/doc/index.html
which allows both set-like and sequential-container-like access.

Jeff

marcin...@gmail.com

unread,

Nov 23, 2009, 4:46:47 PM11/23/09

to

You could write some smart allocator for your class to use with
std::multiset.
Or you could use std::vector for memory management and
boost::intrusive::multiset
for preserving the order. Nothing more comes to my mind ;)

Cheers
Sfider

Nick Hounsome

unread,

Nov 23, 2009, 4:46:07 PM11/23/09

to

There is no way that is best for all uses (or it would almost
certainly already be in the STL)

Firstly you need to check that multiset or vector (or deque) really
aren't fast enough rather than just assume.

Secondly you need to better characterize your usage:
How big is the initial load?
How often do you subsequently add stuff?
How often do you need to calculate distances?

If your data is large and rarely changed it can often make sense to
load into a set and then copy out (or copy pointers out) to a vector
or deque.

If you don't often add stuff and the data set is not large then an
ordered insertion into an ordered vector or deque may be reasonable.
(DO NOT just append and resort - binary search and shift will be
quicker)

I'm a bit ignorant in this area but another idea is the template heap
functions might be useful.

Stephen Howe

unread,

Nov 23, 2009, 4:46:31 PM11/23/09

to

On Mon, 23 Nov 2009 05:56:33 CST, Thomas Lehmann <t.le...@rtsgroup.net> wrote:

>Hi!
>
>We have a row base visualization of data and we need a very fast
>access to the data
>and for updating the individual rows.
>
>Using a sorted vector - initially - everything looks fine as long we
>are not forced
>to sort on each new added item.
>
>Positive:
> - "push_back" is fast
> - "distance" is fast
> - you can use "reserve"
>
>Negative:
> - calling sort very often is bad.

So why do it?
If you have N sorted items in a vector, you add 1 more via push_back(), the idea
of reapplying sort() to the first N items seems
crazy when they are already sorted.
Do either
(i) inplace_merge() for the N and 1 items
(ii) Dont use push_back() but locate where the item would go using lower_bound()
and do an insert(). It remains sorted.

(ii) is not bad for a 1 item, pretty bad for a large number of items

> - adding/deleting very often is bad.

You might be able to mitigate your costs.

If your adding M items to the existing N items, do
push_back() M items, not necessarily sorted
sort() the M items just added
inplace_merge() the N + M items

If your deleting M items, do they have some common pattern?
Because if so you can use the remove-erase idiom which is O(N).

Cheers

Stephen Howe

George Neuner

unread,

Nov 24, 2009, 12:26:17 AM11/24/09

to

On Mon, 23 Nov 2009 05:56:33 CST, Thomas Lehmann
<t.le...@rtsgroup.net> wrote:

>We have a row base visualization of data and we need a very fast
>access to the data and for updating the individual rows.

Database not fast enough?

>Using a sorted vector - initially - everything looks fine as long we
>are not forced to sort on each new added item.
>
>Positive:
> - "push_back" is fast
> - "distance" is fast
> - you can use "reserve"
>
>Negative:
> - calling sort very often is bad.
> - adding/deleting very often is bad.
>
>Using std::multiset everything looks fine as long we have not to load
>big data at the beginning:
>
>Positive:
> - "insert" is (if you like) a "push_back + sort" but fast because
> of the internal binary tree.
>
>Negative:
> - there's no way to do a "reserve"
> - "distance" is much more expensive
>
>I would need both but how can this be done?

Why do you have to use a single data structure? How about using an
unsorted vector for the row data and a separate sorted pointer vector
to index it?

Your "distance" metric is a mystery. Normally distance is a function
of the data and only loosely related to a sorted position based on the
relevant property. If you can explain better what you're trying to do
I or somebody else might have better suggestions.

George

Jerry Coffin

unread,

Dec 3, 2009, 3:34:13 AM12/3/09

to

In article <e3497e4c-5244-4966-a92b-
3630b4...@e7g2000vbi.googlegroups.com>, t.le...@rtsgroup.net
says...

>
> Hi!
>
> We have a row base visualization of data and we need a very fast
> access to the data and for updating the individual rows.
>
> Using a sorted vector - initially - everything looks fine as long we
> are not forced to sort on each new added item.

[ ... ]

> Using std::multiset everything looks fine as long we have not to
> load big data at the beginning:

I'd consider using both. In a typical case, you have a big chunk of
data to start with, and relatively small updates during an individual
execution.

For such cases, you can load your initial data into a vector, but any
added data into a multiset. The multiset will typically stay fairly
small (only the data added during the current execution) so things
like distance over it remain fast. You merge the two when saving the
file, so the next load puts all the existing data into the vector.

There are a few parts that are a bit tricky -- for example, an
overall distance between two items will require adding the distance
from the vector to the distance from the multiset. Most such
situations are _fairly_ straightforward though.

--
Later,
Jerry.