Issue 34 in stajistics: Make DataSets more compact

staji...@googlecode.com

Aug 25, 2010, 9:45:36 AM
to stajist...@googlegroups.com
Status: Accepted
Owner: lorant.p...@gmail.com
Labels: Type-Enhancement Priority-Medium

New issue 34 by lorant.p...@gmail.com: Make DataSets more compact
http://code.google.com/p/stajistics/issues/detail?id=34

This is related to issue #10: Implement DataSets with enums.

Measuring the memory consumption of Stajistics showed that DataSets occupy
a lot of memory. Because my application keeps a record of measurements over
time, many DataSets accumulate. Looking closer, DataSets have several
specific problems:

* the HashMaps used to store the data use a lot of memory because of their
sparse nature
* HashMaps store values as objects (i.e. Long or Double), which take up
even more space
* accessing fields via String keys is not very elegant, and could be done
much better
* current DataSets are always mutable, so measurements can be "falsified",
and thread-safety issues also arise
* the whole field meta-data architecture seems overcomplicated and takes up
far too much memory

To remedy these, I created the FastDataSet implementations (see attached
project). Description of the classes:

org.stajistics.data:

DataContainer: immutable store for Field->Value pairs
DataSet: well, a data set (currently not much more complicated than the
DataContainer itself)
DataSetBuilder: a mutable DataContainer that is able to build an immutable
DataSet
DataSetBuilderFactory: a factory for a certain type of DataSets (i.e. with
a fixed set of fields)
Field: an interface that all fields have to implement.
FieldFactory: a simple way to create Field objects. (Another valid approach
is to create enums that implement the Field interface.)

The implementations for these classes can be found in the
org.stajistics.data.fast package.

There are some conceptual differences compared to the current DataSet
implementation in Stajistics:

* I removed the MetaData support completely, because I did not understand
what it was for in the first place. It can easily be added back later if
needed. In any case, metadata such as the description of a field can be
stored in the Field implementation itself.

* There is some reduced flexibility in what a DataSet can contain. There
can be different *types* of DataSets: each type with its own set of
predefined Fields.

Each type of DataSet has a corresponding DataSetBuilderFactory that creates
DataSetBuilders (mutable DataSets, basically) which can be passed around
while values are being set. Once all values are set, DataSetBuilder.build()
will convert the data into an immutable DataSet that can be passed around
further, with the added benefits of immutability.

The implementation was designed to be as fast as possible. It uses
ImmutableMaps from Google Collections and separate arrays to store long and
double values as primitives to work around boxing issues with Maps.
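
A minimal sketch of the builder-to-immutable-snapshot idea described above,
not the code from the attached project. The names SketchField, SketchDataSet
and SketchDataSetBuilder are hypothetical, and plain arrays with explicit
slot numbers stand in for the ImmutableMap-based lookup so that the example
needs no external library:

```java
// Hypothetical field set for one type of DataSet; names are illustrative only.
enum SketchField {
    HITS(false, 0), MAX(false, 1), SUM(true, 0);

    final boolean isDouble; // which primitive array holds this field
    final int slot;         // index within that array

    SketchField(boolean isDouble, int slot) {
        this.isDouble = isDouble;
        this.slot = slot;
    }
}

// Immutable snapshot: values are stored unboxed in two private arrays.
final class SketchDataSet {
    private final long[] longs;
    private final double[] doubles;

    SketchDataSet(long[] longs, double[] doubles) {
        // Defensive copies, so later builder changes cannot leak in.
        this.longs = longs.clone();
        this.doubles = doubles.clone();
    }

    long getLong(SketchField f) {
        return f.isDouble ? (long) doubles[f.slot] : longs[f.slot];
    }

    double getDouble(SketchField f) {
        return f.isDouble ? doubles[f.slot] : (double) longs[f.slot];
    }
}

// Mutable counterpart that can be passed around while values are being set.
final class SketchDataSetBuilder {
    private final long[] longs = new long[2];     // HITS, MAX
    private final double[] doubles = new double[1]; // SUM

    SketchDataSetBuilder set(SketchField f, long value) {
        if (f.isDouble) doubles[f.slot] = value; else longs[f.slot] = value;
        return this;
    }

    SketchDataSetBuilder set(SketchField f, double value) {
        if (f.isDouble) doubles[f.slot] = value; else longs[f.slot] = (long) value;
        return this;
    }

    SketchDataSet build() {
        return new SketchDataSet(longs, doubles);
    }
}
```

In this sketch a caller would chain set(...) calls on the builder and then
call build(); changing the builder afterwards leaves the built snapshot
untouched.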

Another change in behavior is that you can set/retrieve values from
DataSets both as long and double -- the required conversion happens
automatically. This might be a disadvantage (as there might be some silent
loss of precision), in which case it is easy to make it throw exceptions
instead.
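
For illustration, the "throw exceptions instead" variant could look roughly
like the sketch below. StrictConversions is a hypothetical helper, not part
of Stajistics, and the 2^53 bound is a deliberately conservative choice: it
rejects some larger longs that would in fact convert exactly.

```java
// Hypothetical helper: conversions that refuse to lose information.
final class StrictConversions {
    // Every long with magnitude up to 2^53 has an exact double representation.
    private static final long MAX_EXACT = 1L << 53;

    private StrictConversions() {}

    static double toDouble(long value) {
        if (value > MAX_EXACT || value < -MAX_EXACT) {
            throw new ArithmeticException(
                    "long " + value + " may not be exactly representable as a double");
        }
        return (double) value;
    }

    static long toLong(double value) {
        // Rejects NaN, infinities, fractional values, and values outside the long range.
        if (value != Math.rint(value) || value < -0x1p63 || value >= 0x1p63) {
            throw new ArithmeticException("double " + value + " is not an exact long");
        }
        return (long) value;
    }
}
```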

There is also a way to access the fields of a DataSet via string IDs to
support access through JMX. This route is somewhat slower than access via
Field objects, and is therefore discouraged.

Please check if you think this implementation can be used (and with what
modifications) in Stajistics instead of the current one.

Attachments:
fastdataset-test.zip 9.2 KB

staji...@googlecode.com

Aug 25, 2010, 9:59:54 AM
to stajist...@googlegroups.com

Comment #1 on issue 34 by lorant.p...@gmail.com: Make DataSets more compact
http://code.google.com/p/stajistics/issues/detail?id=34

One possible improvement to support using enums as Fields is to rename some
of the methods in the Field interface:

getName() -> name()
getKind() -> kind()
getType() -> type()
getDefaultValue() -> defaultValue()

Only the first rename is important (so that the name() of the enum instance
can be used for the name of the field); the rest is just for consistency.

Another improvement might be to drop the type()/getType() method, as there
seems to be little need for it anyway. This would change the Field
interface into this:

public interface Field {

    public enum Kind {
        LONG, DOUBLE
    }

    String name();

    Kind kind();

    Object defaultValue();
}
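
As a sketch of why the name() rename matters: an enum can then implement the
interface without declaring name() itself, because every enum inherits it
from java.lang.Enum. RangeField and its constants are hypothetical, chosen
only for illustration.

```java
// The interface as proposed in this comment.
interface Field {
    enum Kind { LONG, DOUBLE }

    String name();

    Kind kind();

    Object defaultValue();
}

// Hypothetical field set; the constants are illustrative only.
enum RangeField implements Field {
    HITS(Kind.LONG, 0L),
    SUM(Kind.DOUBLE, 0.0);

    private final Kind kind;
    private final Object defaultValue;

    RangeField(Kind kind, Object defaultValue) {
        this.kind = kind;
        this.defaultValue = defaultValue;
    }

    // name() is inherited from java.lang.Enum and satisfies Field.name().

    @Override
    public Kind kind() {
        return kind;
    }

    @Override
    public Object defaultValue() {
        return defaultValue;
    }
}
```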

staji...@googlecode.com

Aug 27, 2010, 3:29:44 PM
to stajist...@googlegroups.com

Comment #2 on issue 34 by troy.kin...@gmail.com: Make DataSets more compact
http://code.google.com/p/stajistics/issues/detail?id=34

I like this effort. As you know, I've been unhappy with the current DataSet
implementation. Why don't you create a new branch, commit this to it, and
start a code review? Then we can hammer out some of the feedback I have.

staji...@googlecode.com

Aug 31, 2010, 12:00:43 PM
to stajist...@googlegroups.com

Comment #3 on issue 34 by lorant.p...@gmail.com: Make DataSets more compact
http://code.google.com/p/stajistics/issues/detail?id=34

Okay, so the initial implementation is on the fast-dataset branch now. I
had no time today to fix the tests (going to work on it tomorrow probably).
I'm sort of happy with where I got with it. There are some problems, though.

1) In many places we use the dataset field default values whereas we should
use default values that belong to the DataRecorder/Session. I mostly fixed
these.

2) I have no idea how to do the error handling. The old implementation was
rather lenient about illegal field names and the like, returning null
whenever something invalid was referenced. Because the new API returns
primitive long and double values, there is usually no sensible default to
return in these cases. Or maybe I just don't see how this could be done. In
any case, currently I simply throw an IllegalArgumentException -- but this
should be changed before the API gets finalized.
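
To make the trade-off in 2) concrete, here is a sketch of two possible
behaviours for a primitive-returning getter. LookupSketch and both method
signatures are hypothetical; the overload with a caller-supplied fallback is
only one option, not something the branch is stated to implement.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical name-based lookup over a primitive array.
final class LookupSketch {
    private final Map<String, Integer> indexByName = new HashMap<>();
    private final long[] values;

    LookupSketch(String[] names, long[] values) {
        for (int i = 0; i < names.length; i++) {
            indexByName.put(names[i], i);
        }
        this.values = values.clone();
    }

    // Strict: a primitive return type cannot signal "missing", so throw.
    long getLong(String name) {
        Integer index = indexByName.get(name);
        if (index == null) {
            throw new IllegalArgumentException("Unknown field: " + name);
        }
        return values[index];
    }

    // Lenient: the caller decides what an unknown field should yield.
    long getLong(String name, long fallback) {
        Integer index = indexByName.get(name);
        return index == null ? fallback : values[index];
    }
}
```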

I'll probably add more problems to this list as I work my way along with
the tests.

staji...@googlecode.com

Aug 31, 2010, 12:38:39 PM
to stajist...@googlegroups.com

Comment #4 on issue 34 by lorant.p...@gmail.com: Make DataSets more compact
http://code.google.com/p/stajistics/issues/detail?id=34

Another question popped up: why do we want to distinguish between normal
and meta fields? Can't they reside in the same fieldset?

staji...@googlecode.com

Aug 31, 2010, 7:12:53 PM
to stajist...@googlegroups.com

Comment #5 on issue 34 by lorant.p...@gmail.com: Make DataSets more compact
http://code.google.com/p/stajistics/issues/detail?id=34

Okay, tests should be working after r214.

staji...@googlecode.com

Jun 16, 2011, 8:57:48 PM
to stajist...@googlegroups.com
Updates:
Status: Fixed

Comment #6 on issue 34 by troy.kin...@gmail.com: Make DataSets more compact
http://code.google.com/p/stajistics/issues/detail?id=34

After reviewing the proposed strategy, my main issue is that it restricts
DataRecorders to working with longs and doubles only, although I agree that
the previous implementation of DataSets was too heavyweight. As a middle
ground, as of revision c8a41b99b324, field-level meta-data has been removed
(as it wasn't all that useful anyway). Data is still stored in a Map, and
there is no restriction on what kind of data can be stored. A separate
meta-data Map exists, in case it is wanted by a DataRecorder
implementation, but it is lazily initialized so that it does not normally
consume resources. The purpose of the meta-data is to allow storage of data
other than the "statistical" data being recorded. Lastly, the previous
meta-data fields 'collection time stamp' and 'drained session' have been
transformed into first-class fields of DataSet (so that they do not demand
a separate meta-data Map).
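
The lazily initialized meta-data Map described above can be sketched as
follows. This is not the code from revision c8a41b99b324: the class name,
the method names, and the types chosen for the two first-class fields are
assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical class; not the actual Stajistics DataSet.
class LazyMetaDataSketch {
    // First-class fields replace the former meta-data entries (types assumed).
    private long collectionTimeStamp;
    private boolean drainedSession;

    // Stays null, and so costs no map allocation, until the first write.
    private Map<String, Object> metaData;

    void setCollectionTimeStamp(long timeStamp) {
        this.collectionTimeStamp = timeStamp;
    }

    long getCollectionTimeStamp() {
        return collectionTimeStamp;
    }

    void setDrainedSession(boolean drained) {
        this.drainedSession = drained;
    }

    boolean isDrainedSession() {
        return drainedSession;
    }

    // Not thread-safe; a real implementation would need to consider that.
    void setMetaField(String name, Object value) {
        if (metaData == null) {
            metaData = new HashMap<>();
        }
        metaData.put(name, value);
    }

    Object getMetaField(String name) {
        return metaData == null ? null : metaData.get(name);
    }

    boolean isMetaDataAllocated() {
        return metaData != null;
    }
}
```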

If there are any issues with the solution to this enhancement, they can be
raised in separate issues.

Closing.
