PooledDataArray variant?

Jeff Bezanson

May 26, 2016, 12:37:32 PM
to julia...@googlegroups.com
Hi all,

I recently ran into a use case for PooledDataArrays: avoiding storing
a huge number of copies of the same string. This is purely for
compression, and works beautifully for that. In my case it's also nice
to be able to use Int8 and Int16, so I'm taking advantage of that.
However, I don't want missing values (type instability being the
biggest problem) and I don't need any categorical behavior.

Would it make sense to have a version of PooledDataArray for this kind
of application? I'm imagining a PooledArrays.jl package just with this
type. Maybe something like this already exists? If not, I'd be happy
to start the legwork if the idea makes sense.

-Jeff

Milan Bouchet-Valat

May 26, 2016, 1:38:06 PM
to julia...@googlegroups.com
Actually, I've recently started to work on John Myles White's
CategoricalData.jl package, which is meant to replace PDAs. At the
moment, I have two different types: CategoricalArray and
NullableCategoricalArray, and the former does not support missing
values.

I wanted to polish several aspects and write some docs before making
the code available, but you can have a look at my fork here:
https://github.com/nalimilan/CategoricalData.jl

Comments welcome! The original discussion of the design is here:
https://github.com/JuliaStats/DataArrays.jl/issues/73


Regards

Jeff Bezanson

May 26, 2016, 3:16:22 PM
to julia...@googlegroups.com
Is this de-coupled from the notion of categorical data? I want
something that just does the pooling optimization automatically for
all types T, without separately defining the pool or adding any new
ordering behavior. It would probably also be good to store the pool
sorted for fast lookup, but that's a bonus.
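
A minimal sketch of the sorted-pool idea, assuming the pool is a plain sorted vector of unique values. The helper name `poolcode` is hypothetical and not part of any existing package; it only shows how a binary search could map a value to its position in the pool without a separate hash table.

```julia
# Hypothetical helper: return the position of `v` in a sorted pool,
# or 0 if `v` is not in the pool.
function poolcode(pool::AbstractVector, v)
    i = searchsortedfirst(pool, v)   # binary search, O(log n)
    return (i <= length(pool) && isequal(pool[i], v)) ? i : 0
end

# Build a sorted pool of unique values from some data.
pool = sort!(unique(["pear", "apple", "fig", "apple"]))

poolcode(pool, "fig")    # 2
poolcode(pool, "kiwi")   # 0 (absent)
```

The trade-off is that adding a new value in the middle of a sorted pool shifts the positions of everything after it, so the stored references would have to be renumbered.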

Milan Bouchet-Valat

May 26, 2016, 3:27:17 PM
to julia...@googlegroups.com
On Thursday, May 26, 2016 at 15:16 -0400, Jeff Bezanson wrote:
> Is this de-coupled from the notion of categorical data? I want
> something that just does the pooling optimization automatically for
> all types T, without separately defining the pool or adding any new
> ordering behavior. It would probably also be good to store the pool
> sorted for fast lookup, but that's a bonus.
It depends on what you mean by "categorical data". CategoricalArray
stores a CategoricalPool, but that's mostly invisible to the user. When
indexed, it returns CategoricalValue objects which are immutable
wrappers storing the value (i.e. the string) and a reference to the
pool. In practice it should be usable as a string in many cases.

Then there's OrdinalArray, which adds an ordering to the values, by
default based on the order of appearance of the levels or on their
insertion order.

Does CategoricalArray suit your needs? The main difference from PDAs is
that it doesn't attempt to act like a standard array by supporting any
operation that the underlying type supports.


Regards

Jeff Bezanson

May 26, 2016, 3:48:47 PM
to julia...@googlegroups.com
> it returns CategoricalValue objects

This is the part I don't want. I want it to behave exactly like a
Vector{T}, just space-optimized.
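
Roughly, a sketch of the kind of type meant here, under some assumptions: the name `PooledVector`, its fields, and the helper `code!` are all hypothetical, and the syntax is that of current Julia rather than any existing package. Indexing returns a plain `T`, with no wrapper object, and the reference type `R` can be as small as `Int8`.

```julia
# Hypothetical sketch, not an existing package type.
struct PooledVector{T,R<:Integer} <: AbstractVector{T}
    refs::Vector{R}     # one small integer code per element
    pool::Vector{T}     # unique values; codes index into this
    index::Dict{T,R}    # value => code, used when storing values
end

# Build from any iterable, pooling values as they are seen.
function PooledVector{T,R}(values) where {T,R<:Integer}
    pv = PooledVector{T,R}(R[], T[], Dict{T,R}())
    for v in values
        push!(pv, v)
    end
    return pv
end

# Look up the code for `v`, adding it to the pool if it is new.
function code!(pv::PooledVector{T,R}, v) where {T,R}
    x = convert(T, v)
    get!(pv.index, x) do
        length(pv.pool) < typemax(R) || error("pool is full for reference type $R")
        push!(pv.pool, x)
        R(length(pv.pool))
    end
end

# Minimal AbstractVector interface: elements come back as plain T.
Base.size(pv::PooledVector) = size(pv.refs)
Base.IndexStyle(::Type{<:PooledVector}) = IndexLinear()
Base.getindex(pv::PooledVector, i::Int) = pv.pool[pv.refs[i]]

function Base.setindex!(pv::PooledVector, v, i::Int)
    pv.refs[i] = code!(pv, v)
    return pv
end

function Base.push!(pv::PooledVector, v)
    push!(pv.refs, code!(pv, v))
    return pv
end

pv = PooledVector{String,Int8}(["a", "b", "a", "a", "b"])
pv[3]             # "a", an ordinary String
length(pv.pool)   # 2
```

This sketch never removes values from the pool when elements are overwritten, and it has no notion of missing values or level ordering.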

Andreas Noack

May 26, 2016, 4:00:48 PM
to julia...@googlegroups.com
Isn't the "categorical" interpretation part of the collection, and therefore less relevant when you consider an element in isolation? What kind of operations do you have in mind for CategoricalValues?