DataItem cache

Lynn McGuire

unread,

Feb 4, 2016, 4:19:51 PM2/4/16

to

I have an object called DataItem that is the basic variant storage unit for our software. In the latest release, there can be tens
of millions of these objects which is using up all our memory in Windows (we run out at 1.9 GB of memory usage). We cannot change to
x64 at the moment so I have decided to build a DataItem cache and use one DataItem for many of the same objects wherever possible. I
will use copy on write mechanism to create new DataItems that are being modified.

I have structured the DataItem cache using a vector inside a vector inside a map:

static std::map <int, std::vector <std::vector <DataItem *>>> g_DataItem_Cache;

In other words, a sparse cube. The outside map is for the identity of the major object type, i.e. SYM_AirCoolerGroup. The middle
vector is for the index of the DataItems in that group, i.e. AIR_DUT. The inside vector is for the various different copies of
DataItems that are referenced by that Data Group type and index. Each Data Group will need to know which version of the Data Item
that it is using.

I am wondering if I can make this more efficient (less memory usage). Any thoughts here? One of my staff is totally for this and
another is not. I am about 25% complete on making the code changes for testing.

Thanks,
Lynn

Paavo Helde

unread,

Feb 4, 2016, 4:29:55 PM2/4/16

to

On 4.02.2016 23:19, Lynn McGuire wrote:
> I have an object called DataItem that is the basic variant storage unit
> for our software. In the latest release, there can be tens of millions
> of these objects which is using up all our memory in Windows (we run out
> at 1.9 GB of memory usage).

AFAIK there is a simple way to increase that to 3 GB.

> We cannot change to x64 at the moment so I
> have decided to build a DataItem cache and use one DataItem for many of
> the same objects wherever possible. I will use copy on write mechanism
> to create new DataItems that are being modified.
>
> I have structured the DataItem cache using a vector inside a vector
> inside a map:
>
> static std::map <int, std::vector <std::vector <DataItem *>>>
> g_DataItem_Cache;
>
> In other words, a sparse cube. The outside map is for the identity of
> the major object type, i.e. SYM_AirCoolerGroup. The middle vector is
> for the index of the DataItems in that group, i.e. AIR_DUT. The inside
> vector is for the various different copies of DataItems that are
> referenced by that Data Group type and index. Each Data Group will need
> to know which version of the Data Item that it is using.
>
> I am wondering if I can make this more efficient (less memory usage).
> Any thoughts here? One of my staff is totally for this and another is
> not.

Totally for what? What is the alternative?

Lynn McGuire

unread,

Feb 4, 2016, 5:03:49 PM2/4/16

to

We will hit the 3 GB barrier also using the current storage methodology.

My other programmer wants to move to x64. At least 1/4 of our customers are still running x86 Windows, not gonna happen yet.

Thanks,
Lynn

Victor Bazarov

unread,

Feb 4, 2016, 5:52:24 PM2/4/16

to

On 2/4/2016 5:03 PM, Lynn McGuire wrote:
> [..]

> My other programmer wants to move to x64. At least 1/4 of our customers
> are still running x86 Windows, not gonna happen yet.

Perhaps a wrong newsgroup (there is comp.software-eng, if you didn't
know), but *why can't you maintain two targets*? Those who have already
moved to x64 should use your x64 application and not be constrained.
Otherwise it seems rather unfair.

V
--
I do not respond to top-posted replies, please don't ask

Marcel Mueller

unread,

Feb 5, 2016, 2:34:15 AM2/5/16

to

On 04.02.16 22.19, Lynn McGuire wrote:
> I have an object called DataItem that is the basic variant storage unit
> for our software. In the latest release, there can be tens of millions
> of these objects which is using up all our memory in Windows (we run out
> at 1.9 GB of memory usage). We cannot change to x64 at the moment so I
> have decided to build a DataItem cache and use one DataItem for many of
> the same objects wherever possible. I will use copy on write mechanism
> to create new DataItems that are being modified.

Whether COW is efficient or not depends on the recombination rate, i.e.
how often happens a modified item to be identical to another instance.
If this is likely COW is bad.

> I have structured the DataItem cache using a vector inside a vector
> inside a map:

If you use COW you do not need a cache at all.
You just need to deal with references that are aware of COW. As soon as
you intend to modify the item a copy is made.

> static std::map <int, std::vector <std::vector <DataItem *>>>
> g_DataItem_Cache;

If you need an index then you probably want to do deduplication rather
than COW. I.e. you seek for identical or matching instances before (or
after) you create new ones.

Because you easily will run into serious race-conditions here I strongly
recommend to use smart pointers here. Intrusive reference count is the
first choice.
Of course, if your application never uses a second thread you are safe.

> In other words, a sparse cube. The outside map is for the identity of
> the major object type, i.e. SYM_AirCoolerGroup. The middle vector is
> for the index of the DataItems in that group, i.e. AIR_DUT.

Note that vector is quite inefficient in dealing with sparse content.
Direct lookup is only efficient for types with a small domain like enum
types where most of the values are really used.

Furthermore std::map is not efficient with respect to memory and memory
cache efficiency too if the number of nodes becomes large. The typical
implementations uses a Red-black tree. You should prefer a B-tree if
memory counts. Almost every database do so. Unfortunately this is not
part of the standard. But there are good public implementations
available. E.g. Google published a quite good Java implementation that
can be ported to C++ with reasonable effort (take care of license issues).

> The inside
> vector is for the various different copies of DataItems that are
> referenced by that Data Group type and index. Each Data Group will need
> to know which version of the Data Item that it is using.

Version?

> I am wondering if I can make this more efficient (less memory usage).
> Any thoughts here? One of my staff is totally for this and another is
> not. I am about 25% complete on making the code changes for testing.

Deduplication can significantly reduce memory usage. Typically for
business database are factors in the order of 10. This is basically one
of the concepts behind in memory databases. I also have achieved factors
up to 100 in some applications.

But there are challenges, too.
Fist of all you strictly need to distinguish between read and write
access. For efficiency reasons writable instances should never make it
to the main index. You should share only immutable instances. Otherwise
you will need to synchronize until death.
I recommend to put this in the type system. I.e. make DataItem immutable
and use LocalDataItem as local, writable copy without deduplication.
LocalDataItem should inherit from DataItem to make reading code to be
able to deal with a mix of a both, i.e. many immutable instances and a
few mutable ones.
At least you should ensure that the shared instances use a compact
memory representation, i.e. no half full vectors and so on. Even
std::string is not the best choice as it is optimized for mutability.

To give further hints, more knowledge about the structure of the
DataItems and your application is required.
What are their properties?
Why do many instances have the same data? (Otherwise your concept would
not work.)
What kind of data do they contain? Strings? Maybe it is easier to
deduplicate them.
What is the typical access pattern to look up the DataItems? Do they
have something like a primary key?
How do changes apply to the data structures? Transactions? Revisions?
Snapshot isolation?
Do you have a database backend?
What about concurrency? Is it likely that the same items are accessed
concurrently? For writing or only for reading?
What about the object lifetime? May you have a memory leak? Since you
deal with raw pointers (i strongly disadvise to do so) this is not that
unlikely.
And last but not least: is it really the space for the DataItems that
clobbers your memory? Or is it management overhead? Or maybe even
fragmentation of the virtual address space?

Marcel

Lynn McGuire

unread,

Feb 5, 2016, 5:27:15 PM2/5/16

to

Answers to your questions:
1. part of the DataItem declaration is below
2. many of the DataItem instances are exactly alike since they are snapshots of a user's workspace
3. all kinds of data: strings, integers, doubles, string arrays, double arrays, integer array, strings larger than 300 characters are
compressed using zlib
4. DataItems are stored in a hierarchical object system using a primary key in DataGroup objects
5. not sure what you are asking
6. no database backend
7. no concurrency (yet)
8. when the storage used is 1.5 GB, the memory leakage is 10 MB (observed)
8a. the lifetime of the objects is controlled by the user by opening a file or closing a file
9. I think that it is DataItems but will not know for sure until completion of the current deduplication project

Here is part of the declaration for the DataItem and DesValue classes. There are no member variables in the ObjPtr class.

class DataItem : public ObjPtr
{
private:

int datatype; // Either #Int, #Real, #String or #Enumerated
int vectorFlag; // Flag indicating value contains an Array.
int descriptorName; // name of Corresponding DataDescriptor
// DataGroup * owner; // The DataGroup instance to which this item belongs
std::vector <DataGroup *> owners; // The DataGroup instance(s) to which this item belongs
DesValue * inputValue; // DesValue containing permanent input value
DesValue * scratchValue; // DesValue containing scratch input value
int writeTag; // a Long representing the object for purposes of reading/writing
int unitsClass; // nil or the symbol of the class
std::string unitsArgs; // a coded string of disallowed units
std::map <int, std::vector <int> > dependentsListMap;
DataDescriptor * myDataDescriptor;
BOOL scratchChangedComVector; // if the scratch value was changed in the changeComVector() method

protected:
virtual void discardInput (DataGroup * ownerDG);

public:
// constructor
DataItem ();
DataItem (const DataItem & rhs);
DataItem & operator = (const DataItem & rhs);

// destructor
virtual ~DataItem ();

// comparison of equality
virtual bool operator == (DataItem const & right) const;
virtual bool operator != (DataItem const & right) const;

virtual int isDataItem () { return true; };

class DesValue : public ObjPtr
{
public:

int datatype; // Either #Int, #Real, or #String.
int vectorFlag; // Flag indicating value contains an Array.
int optionListName; // name of the option list item
int * intValue; // Either nil, an Int, a Real, a String, or an Array thereof.
double * doubleValue;
std::string * stringValue;
std::vector <int> * intArrayValue;
std::vector <double> * doubleArrayValue;
std::vector <std::string> * stringArrayValue;
unsigned char * compressedData;
unsigned long compressedDataLength;
std::vector <unsigned long> uncompressedStringLengths;
int isTouched; // Flag indicating if value, stringValue, or units have been modified since this DesValue was created. Set to
true by setValue, setString, setUnits, and convertUnits.
int isSetFlag; // Flag indicating whether the contents of the DesValue is defined or undefined. If isSet is false, getValue
returns nil despite the contents of value, while getString and getUnits return the empty string despite the contents of stringValue
and units.
int unitsValue; // current string value index in $UnitsList (single or top)
int unitsValue2; // current string value index in $UnitsList (bottom)
std::string errorMessage; // message about last conversion of string to value
std::string unitsArgs; // a coded string of disallowed units

protected:

virtual void deleteValues ();

public:

// constructor
DesValue ();
DesValue (const DesValue & rhs);
DesValue & operator = (const DesValue & rhs);

// destructor
virtual ~DesValue ();

// comparison of equality
virtual bool operator == (DesValue const & right) const;
virtual bool operator != (DesValue const & right) const;

Lynn

Ian Collins

unread,

Feb 5, 2016, 6:02:12 PM2/5/16

to

Lynn McGuire wrote:
> On 2/5/2016 1:34 AM, Marcel Mueller wrote:

>> To give further hints, more knowledge about the structure of the DataItems and your application is required.
>> What are their properties?
>> Why do many instances have the same data? (Otherwise your concept would not work.)
>> What kind of data do they contain? Strings? Maybe it is easier to deduplicate them.
>> What is the typical access pattern to look up the DataItems? Do they have something like a primary key?
>> How do changes apply to the data structures? Transactions? Revisions? Snapshot isolation?
>> Do you have a database backend?
>> What about concurrency? Is it likely that the same items are accessed concurrently? For writing or only for reading?
>> What about the object lifetime? May you have a memory leak? Since you deal with raw pointers (i strongly disadvise to do so) this is
>> not that unlikely.
>> And last but not least: is it really the space for the DataItems that clobbers your memory? Or is it management overhead? Or maybe
>> even fragmentation of the virtual address space?
>>
>>
>> Marcel
>
> Answers to your questions:
> 1. part of the DataItem declaration is below
> 2. many of the DataItem instances are exactly alike since they are snapshots of a user's workspace
> 3. all kinds of data: strings, integers, doubles, string arrays, double arrays, integer array, strings larger than 300 characters are
> compressed using zlib
> 4. DataItems are stored in a hierarchical object system using a primary key in DataGroup objects
> 5. not sure what you are asking
> 6. no database backend

If you have a large number of objects that are suitable for
deduplication, you might be better off using a proper in memory
database. Then you wouldn't have to worry about such things your self.
Being a JSON/BSON fan, I tend towards MongoDB for this type of data.
If the data is relational, I would look to MySQL in memory tables.

> 7. no concurrency (yet)
> 8. when the storage used is 1.5 GB, the memory leakage is 10 MB (observed)
> 8a. the lifetime of the objects is controlled by the user by opening a file or closing a file
> 9. I think that it is DataItems but will not know for sure until completion of the current deduplication project
>
> Here is part of the declaration for the DataItem and DesValue classes. There are no member variables in the ObjPtr class.
>
> class DataItem : public ObjPtr
> {
> private:
>
> int datatype; // Either #Int, #Real, #String or #Enumerated
> int vectorFlag; // Flag indicating value contains an Array.
> int descriptorName; // name of Corresponding DataDescriptor
> // DataGroup * owner; // The DataGroup instance to which this item belongs
> std::vector <DataGroup *> owners; // The DataGroup instance(s) to which this item belongs
> DesValue * inputValue; // DesValue containing permanent input value
> DesValue * scratchValue; // DesValue containing scratch input value
> int writeTag; // a Long representing the object for purposes of reading/writing
> int unitsClass; // nil or the symbol of the class
> std::string unitsArgs; // a coded string of disallowed units
> std::map <int, std::vector <int> > dependentsListMap;

Could these (and the vector above) be fixed size? If not, you might
benefit in both space and performance if you use a custom allocator for
them.

Maps within vectors within maps will probably lead to a very fragmented
heap, wasting both memory and possible cache hits.

> DataDescriptor * myDataDescriptor;
> BOOL scratchChangedComVector; // if the scratch value was changed in the changeComVector() method

BOOL?

>
> protected:
> virtual void discardInput (DataGroup * ownerDG);
>
> public:
> // constructor

Time for Flibble to start a "Gratuitous considered harmful" thread :)

<snip>

> class DesValue : public ObjPtr
> {
> public:
>
> int datatype; // Either #Int, #Real, or #String.
> int vectorFlag; // Flag indicating value contains an Array.
> int optionListName; // name of the option list item
> int * intValue; // Either nil, an Int, a Real, a String, or an Array thereof.
> double * doubleValue;
> std::string * stringValue;
> std::vector <int> * intArrayValue;
> std::vector <double> * doubleArrayValue;
> std::vector <std::string> * stringArrayValue;

Are these local to the class? If so, the allocator comment above might
be relevant.

--
Ian Collins

Ian Collins

unread,

Feb 5, 2016, 6:21:18 PM2/5/16

to

Ian Collins wrote:
>>
>> public:
>> // constructor
>
> Time for Flibble to start a "Gratuitous considered harmful" thread :)

D'oh! "Gratuitous comments considered harmful"

--
Ian Collins

Marcel Mueller

unread,

Feb 6, 2016, 12:11:53 PM2/6/16

to

On 05.02.16 23.26, Lynn McGuire wrote:
> 2. many of the DataItem instances are exactly alike since they are
> snapshots of a user's workspace

You really perform a deep copy to take a snap of your data and you do
this quite often?
No wonder that you run out of memory.

I have some similar requirements. An application that takes a snap of
the entire database on each user action to ensure data consistency. The
snap is nothing but a unique UTC time stamp. Each object instance has a
time stamp too. If an object is too new an older copy is used instead.
For this to work foreign keys never refer directly to the target
instance. They are just small key objects.

> 3. all kinds of data: strings, integers, doubles, string arrays, double
> arrays, integer array, strings larger than 300 characters are compressed
> using zlib

The data in your items looks like you are trying to implement
polymorphism by something like a custom union. Turning into C++
polymorphism could save memory. It looks a bit over-engineering to store
#Int with some hundred bytes.

> 4. DataItems are stored in a hierarchical object system using a primary
> key in DataGroup objects

So you have two levels. The DataGroup and the Key.

> 5. not sure what you are asking
> 6. no database backend
> 7. no concurrency (yet)

You probably never want to implement multi-threading with this data model.

> 8. when the storage used is 1.5 GB, the memory leakage is 10 MB (observed)
> 8a. the lifetime of the objects is controlled by the user by opening a
> file or closing a file
> 9. I think that it is DataItems but will not know for sure until
> completion of the current deduplication project

A memory profiler should give you the answer.

> Here is part of the declaration for the DataItem and DesValue classes. There are no member variables in the ObjPtr class.

The size of the ObjPtr will not be zero, because C++ forbids this.

> class DataItem : public ObjPtr
> {
> private:
>
> int datatype; // Either #Int, #Real, #String or #Enumerated

Polymorphism?

> int vectorFlag; // Flag indicating value contains an Array.

Use one item with all flags rather than individual integers saves Memory.

> int descriptorName; // name of Corresponding DataDescriptor

Redundant? You have DataDescriptor * below.

> // DataGroup * owner; // The DataGroup instance to which this
> item belongs

Is this not always part of caller context?

> std::vector <DataGroup *> owners; // The DataGroup instance(s) to
> which this item belongs

? - Either this or the above is redundant.
If you do not need the uplinks for your business, a reference counter
might be enough.

> DesValue * inputValue; // DesValue containing permanent input value
> DesValue * scratchValue; // DesValue containing scratch input value

Do not clobber the data store with intermediate data used for editing.
Use subclassing or shadow instances for this purpose. I guess only a few
instances have reasonable differences between inputValue and scratchValue.

By the way. What about the memory consumption of the DesValue objects?
They are bulky too.

> int writeTag; // a Long representing the object for purposes of
> reading/writing

No idea. Maybe also better a part of a mutable subclass.

> int unitsClass; // nil or the symbol of the class
> std::string unitsArgs; // a coded string of disallowed units

What about the storage of the strings behind this?

> std::map <int, std::vector <int> > dependentsListMap;

And this one could grow really large.
The nested structure is really not recommended. At least std::multi_map
should be better.

Depending on the number of entries std::map is not the best choice. It
allocates an object for each item. If the number of entries is quite
small or if it changes rarely a sorted vector performs better.

> DataDescriptor * myDataDescriptor;

See above.

> BOOL scratchChangedComVector; // if the scratch value was changed

See flags above.

> virtual int isDataItem () { return true; };

Your class is virtual anyway, so using subclasses for different types is
straight forward.

> class DesValue : public ObjPtr
> {
> public:
>
> int datatype; // Either #Int, #Real, or #String.

Again hand made polymorphism.

> int vectorFlag; // Flag indicating value contains an Array.
> int optionListName; // name of the option list item
> int * intValue; // Either nil, an Int, a Real, a String, or an
> Array thereof.
> double * doubleValue;
> std::string * stringValue;
> std::vector <int> * intArrayValue;
> std::vector <double> * doubleArrayValue;
> std::vector <std::string> * stringArrayValue;
> unsigned char * compressedData;
> unsigned long compressedDataLength;
> std::vector <unsigned long> uncompressedStringLengths;

Really bad design. If there are many objects of this type then you know
where your problem is.

Use sub classes with only one of the fields and avoid the additional
allocations for the referenced value objects.

> int isTouched; // Flag indicating if value, stringValue, or units
> have been modified since this DesValue was created. Set to true by
> setValue, setString, setUnits, and convertUnits.
> int isSetFlag; // Flag indicating whether the contents of the
> DesValue is defined or undefined. If isSet is false, getValue returns
> nil despite the contents of value, while getString and getUnits return
> the empty string despite the contents of stringValue and units.

Join all flags into one field.

> int unitsValue; // current string value index in $UnitsList
> (single or top)
> int unitsValue2; // current string value index in $UnitsList
> (bottom)
> std::string errorMessage; // message about last conversion of
> string to value

This looks again like transitional data that should not be part of the
persistent data model.

> std::string unitsArgs; // a coded string of disallowed units

Adjusting the data model would probably save at least 50% of memory even
without deduplication. Especially the DesValue Objects are unnecessarily
bulky. If you also do not create excessive deep copies of the structures
you should be fine. No need for deduplication as far as I can see. COW
should be sufficient.

It looks like you are trying to reinvent Excel. Well considering the
speed its design is probably similar. ;-)

Marcel

Lynn McGuire

unread,

Feb 6, 2016, 4:11:28 PM2/6/16

to

Thanks !

Lynn

Lynn McGuire

unread,

Feb 8, 2016, 3:15:26 PM2/8/16

to

On 2/6/2016 11:11 AM, Marcel Mueller wrote:

BTW, we first released this software in 1990 on Windows 3.0 as a mixture of C and Smalltalk. This is some of the Smalltalk code that
we converted to C++ in 2002.

And, you are very correct. DesValue needs to go on a severe diet.

Lynn