[Numpy-discussion] draft enum NEP

Bryan Van de Ven

Mar 9, 2012, 11:55:00 AM

to Discussion of Numerical Python

Hi all,

I have started working on a NEP for adding an enumerated type to NumPy.
It is on my GitHub:

https://github.com/bryevdv/numpy/blob/enum/doc/neps/enum.rst

It is still very rough, and incomplete in places. But I would like to
get feedback sooner rather than later in order to refine it. In
particular there are a few questions inline in the document that I would
like input on. Any comments, suggestions, questions, concerns, etc. are
very welcome.

Thanks,

Bryan
_______________________________________________
NumPy-Discussion mailing list
NumPy-Di...@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion

Nathaniel Smith

Mar 9, 2012, 3:00:48 PM

to Discussion of Numerical Python

On Fri, Mar 9, 2012 at 4:55 PM, Bryan Van de Ven <bry...@continuum.io> wrote:
> Hi all,
>
> I have started working on a NEP for adding an enumerated type to NumPy.
> It is on my GitHub:
>
> https://github.com/bryevdv/numpy/blob/enum/doc/neps/enum.rst
>
> It is still very rough, and incomplete in places. But I would like to
> get feedback sooner rather than later in order to refine it. In
> particular there are a few questions inline in the document that I would
> like input on. Any comments, suggestions, questions, concerns, etc. are
> very welcome.

Hi Bryan,

That's excellent, an enumerated type would be very useful. From a
quick read, though, what I'd really like to see is some discussion of
the goals here -- like some example situations where you see these
being used, and the problems they're intended to solve? Because for
example, C "enums" are designed to solve a completely different
problem than something like an R "factor", and off the top of my head
I don't know how well either maps onto hdf5 enumerated types. Another
example is that I can't tell from the document what the motivation for
having both "open" and "closed" enums is?

(Also, general question: is there some technical advantage to being
able to represent more complicated dtypes as strings, that justifies
making up these mini-languages like "enum:uint16[A, B, C, D, E:128]"?
It can't be necessary for pickling or anything, right, since AFAICT
there's already no string representation for structured dtypes? It
just seems like it'd be simpler and more elegant to use some Python
syntax like 'dtype(Enum(["a", "b", "c"], storage=np.uint16))' instead
of writing a tiny one-off parser and wedging what's really a data
structure into a string, but I may be missing something.)

-- Nathaniel

David Gowers (kampu)

Mar 9, 2012, 5:48:42 PM

to Discussion of Numerical Python

Hi,

On Sat, Mar 10, 2012 at 3:25 AM, Bryan Van de Ven <bry...@continuum.io> wrote:
> Hi all,
>
> I have started working on a NEP for adding an enumerated type to NumPy.
> It is on my GitHub:
>
> https://github.com/bryevdv/numpy/blob/enum/doc/neps/enum.rst
>
> It is still very rough, and incomplete in places. But I would like to
> get feedback sooner rather than later in order to refine it. In
> particular there are a few questions inline in the document that I would
> like input on. Any comments, suggestions, questions, concerns, etc. are
> very welcome.

"t = np.dtype('enum', map=(n,v))"

^ Is this supposed to be indicating 'this is an enum with values
ranging between n and v'? It could be a bit more clear.

Is it possible to partially define an enum? That is, give the maximum
and minimum values, and only some of the enumeration value:name
mappings?
For example, an enum where 0 means 'n/a', +n means 'Type A Object
#(n-1)' and -n means 'Type B Object #(abs(n) - 1)'. I just want to map
the non-scalar values, while having a way to avoid treating valid
scalar values (eg +64) as out-of-range.
Example of what I mean:

"t = np.dtype('enum[N_A:0]', range = (-127, 127))"
(defined values being printed as a string, undefined being printed as a number.)
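
A minimal sketch of that printing rule, assuming a partial mapping where only 0 is named. `NAMES`, `VALID_RANGE`, and `format_value` are hypothetical names for illustration, not part of the draft NEP or of NumPy:

```python
import numpy as np

# Hypothetical partial mapping: only 0 has a name; other in-range
# values are valid but unnamed.
NAMES = {0: "N_A"}
VALID_RANGE = (-127, 127)

def format_value(v):
    """Return the name of a mapped value, else the number itself as text."""
    lo, hi = VALID_RANGE
    if not lo <= v <= hi:
        raise ValueError(f"{v} is outside the range {VALID_RANGE}")
    return NAMES.get(int(v), str(int(v)))

codes = np.array([0, 64, -3], dtype=np.int8)
labels = [format_value(v) for v in codes]
```

Here 0 is shown by name while 64 and -3 fall back to their numeric form instead of being rejected as out of range.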

David

Wes McKinney

Mar 11, 2012, 7:03:17 PM

to Discussion of Numerical Python

I'll have to think about this (a little brain dump here). I have many
use cases in pandas where this would be useful which are basically
direct translations of R's factor data type. Note that R always
coerces the levels (the unique values) AFAICT to string type. However,
mapping back to a well-dtyped array is important, too. So the
temptation might be to do something like this:

ndarray: dtype storage type (uint32 or something)
mapping : khash with type PyObject* -> uint32

Now, one problem with this is that you want the mapping + dtype to be
invertible (otherwise you're left doing some type inference). The way
that I implement the mapping is to restrict the labeling to be from 0
to N - 1 which makes things easier. If we decide that having an
explicit value mapping

The nice thing about this is that the same set of core algorithms can
be used to fix numpy.unique. For example you would like to be able to
do:

enum_arr = np.enum(arr)

(this seems like a reasonable API to me) and that is a direct
equivalent of R's factor function. You need to be able to pass an
explicit ordering when calling the enum/factor function. If not
specified, you should have an option to either sort or not-- for
example suppose you convert an array of 1 million integers to enum but
you don't particularly care about the uniques (which could be very
large, up to the size of the array) being ordered (no need to pay N
log N for large N).
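
A rough sketch of such a function in plain NumPy, under the assumption that it returns integer codes plus the array of levels. The name `factorize` and its parameters are illustrative only; this is neither the proposed `np.enum` nor the pandas implementation:

```python
import numpy as np

def factorize(values, levels=None, sort=False):
    """Return (codes, levels) such that levels[codes] reproduces values."""
    if levels is not None:
        # Explicit ordering supplied by the caller.
        table = {lvl: i for i, lvl in enumerate(levels)}
        codes = np.array([table[v] for v in values], dtype=np.int64)
        return codes, np.asarray(levels)
    if sort:
        # np.unique sorts the unique values, which costs O(N log N).
        uniques, codes = np.unique(values, return_inverse=True)
        return codes.ravel(), uniques
    # Hash-based pass: levels are numbered in order of first appearance,
    # so no sort is needed.
    table = {}
    codes = np.empty(len(values), dtype=np.int64)
    for i, v in enumerate(values):
        codes[i] = table.setdefault(v, len(table))
    return codes, np.array(list(table))

arr = np.array(["b", "a", "b", "c"])
codes, lv = factorize(arr)                 # first-appearance order
codes_sorted, lv_sorted = factorize(arr, sort=True)
codes_explicit, lv_explicit = factorize(arr, levels=["c", "b", "a"])
```

The unsorted path uses a Python dict as a stand-in for the khash table mentioned above.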

One nice thing about khash is that it can be serialized fairly easily.

Have you looked much at how I use enum-like ideas in pandas? It would
be great if I could offload some of this data algorithmic work to
NumPy.

We will want the enum data type to integrate with text file readers--
if you "factorize as you go" you can drastically reduce the memory
usage of structured array (or pandas DataFrame) columns that hold
long-ish strings with relatively few unique values.

- Wes

David Gowers (kampu)

Mar 11, 2012, 7:50:12 PM

to Discussion of Numerical Python

Hi Wes,

On Mon, Mar 12, 2012 at 9:33 AM, Wes McKinney <wesm...@gmail.com> wrote:
>
> Now, one problem with this is that you want the mapping + dtype to be
> invertible (otherwise you're left doing some type inference). The way
> that I implement the mapping is to restrict the labeling to be from 0
> to N - 1 which makes things easier.

> If we decide that having an explicit value mapping

(...?)

You might want to finish whatever thought that was :)

Mark Wiebe

Mar 13, 2012, 9:44:39 PM

to Discussion of Numerical Python

On Fri, Mar 9, 2012 at 8:55 AM, Bryan Van de Ven <bry...@continuum.io> wrote:

> Hi all,
>
> I have started working on a NEP for adding an enumerated type to NumPy.
> It is on my GitHub:
>
> https://github.com/bryevdv/numpy/blob/enum/doc/neps/enum.rst
>
> It is still very rough, and incomplete in places. But I would like to
> get feedback sooner rather than later in order to refine it. In
> particular there are a few questions inline in the document that I would
> like input on. Any comments, suggestions, questions, concerns, etc. are
> very welcome.

This looks like a great start to me.

I think the open/closed enum distinction will need to be explored a little bit more, because it interacts with dtype immutability/hashability. Do you know if there are any examples of Python objects in the wild that dynamically convert from not being hashable (i.e. raising an exception if used as a dict key) to become hashable?

It might be worth adding a section which briefly compares and contrasts the proposed functionality with enums in various programming languages. Here are two links I found to try and get an idea:

MS on C# enum usage:

http://msdn.microsoft.com/en-us/library/cc138362.aspx

Wikipedia on C++ enum class:

http://en.wikipedia.org/wiki/C%2B%2B11#Strongly_typed_enumerations

For example, the C# enum has a way to enable a "flags" mode, which will create successive powers of 2. This may not be a feature NumPy needs, but if people are finding it useful in C#, maybe it would be useful here too.

Cheers,

Mark

Dag Sverre Seljebotn

Mar 14, 2012, 1:08:38 AM

to numpy-di...@scipy.org

On 03/13/2012 06:44 PM, Mark Wiebe wrote:
> On Fri, Mar 9, 2012 at 8:55 AM, Bryan Van de Ven <bry...@continuum.io
> <mailto:bry...@continuum.io>> wrote:
>
> Hi all,
>
> I have started working on a NEP for adding an enumerated type to NumPy.
> It is on my GitHub:
>
> https://github.com/bryevdv/numpy/blob/enum/doc/neps/enum.rst
>
> It is still very rough, and incomplete in places. But I would like to
> get feedback sooner rather than later in order to refine it. In
> particular there are a few questions inline in the document that I would
> like input on. Any comments, suggestions, questions, concerns, etc. are
> very welcome.
>
>
> This looks like a great start to me.
>
> I think the open/closed enum distinction will need to be explored a
> little bit more, because it interacts with dtype
> immutability/hashability. Do you know if there are any examples of
> Python objects in the wild that dynamically convert from not being
> hashable (i.e. raising an exception if used as a dict key) to become
> hashable?

In Sage, the matrix objects are mutable when constructed, and you can
set_immutable to make them immutable.

The way I look at that though is that it is part of the construction
phase of the object, you'd typically construct, fill it in, then
set_immutable (to finish construction), then use it.

set/frozenset is an example of the opposite, and a design I personally
like better (i.e., "frozen_dtype" :-)).
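
The set/frozenset split can be seen directly in plain Python. This is only an illustration of the hashability difference; the dict value is a placeholder string, not a real dtype:

```python
mutable_levels = {"a", "b"}
frozen_levels = frozenset(mutable_levels)

# A mutable set refuses to be hashed...
try:
    hash(mutable_levels)
    set_is_hashable = True
except TypeError:
    set_is_hashable = False

# ...while the frozen counterpart can be used as a dict key.
cache = {frozen_levels: "dtype for levels a, b"}
lookup = cache[frozenset({"b", "a"})]
```

A "frozen_dtype" in this style would be a separate immutable type, rather than one object that changes from unhashable to hashable over its lifetime.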

Dag

>
> It might be worth adding a section which briefly compares and contrasts
> the proposed functionality with enums in various programming languages.
> Here are two links I found to try and get an idea:
>
> MS on C# enum usage:
> http://msdn.microsoft.com/en-us/library/cc138362.aspx
> Wikipedia on C++ enum class:
> http://en.wikipedia.org/wiki/C%2B%2B11#Strongly_typed_enumerations
>
> For example, the C# enum has a way to enable a "flags" mode, which will
> create successive powers of 2. This may not be a feature NumPy needs,
> but if people are finding it useful in C#, maybe it would be useful here
> too.
>
> Cheers,
> Mark
>
>
> Thanks,
>
> Bryan

Nathaniel Smith

Mar 15, 2012, 7:02:30 PM

to Discussion of Numerical Python

On Wed, Mar 14, 2012 at 1:44 AM, Mark Wiebe <mww...@gmail.com> wrote:
> On Fri, Mar 9, 2012 at 8:55 AM, Bryan Van de Ven <bry...@continuum.io>
> wrote:
>>
>> Hi all,
>>
>> I have started working on a NEP for adding an enumerated type to NumPy.
>> It is on my GitHub:
>>
>> https://github.com/bryevdv/numpy/blob/enum/doc/neps/enum.rst
>>
>> It is still very rough, and incomplete in places. But I would like to
>> get feedback sooner rather than later in order to refine it. In
>> particular there are a few questions inline in the document that I would
>> like input on. Any comments, suggestions, questions, concerns, etc. are
>> very welcome.
>
>
> This looks like a great start to me.
>
> I think the open/closed enum distinction will need to be explored a little
> bit more, because it interacts with dtype immutability/hashability. Do you
> know if there are any examples of Python objects in the wild that
> dynamically convert from not being hashable (i.e. raising an exception if
> used as a dict key) to become hashable?

I haven't run into any...

Thinking about it, I'm not sure I have any use case for this type
being mutable. Maybe someone else can think of one? The first case
that came to mind was in reading a large text file, where you want to
(1) auto-create an enum, (2) use a pre-allocated array, and (3) don't
know ahead of time what the levels are:

a = np.empty(lines_in_file, dtype=np.dtype(Enum()))
for i, line in enumerate(f):
    field = line.split()[0]
    a.dtype.add_level(field)
    a[i] = field
a.dtype.seal()

But really this can be done just as easily and efficiently
without a mutable dtype:

a = np.empty(lines_in_file, dtype=np.int32)
intern_table = {}
next_level = 0
for i, line in enumerate(f):
    field = line.split()[0]
    val = intern_table.setdefault(field, next_level)
    if val == next_level:
        next_level += 1
    a[i] = val
a = a.view(dtype=np.dtype(Enum(map=intern_table)))
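
Since `Enum` here is the proposed dtype and does not exist in NumPy, the same intern-table idea can be tried with what is available today. The sketch below assumes an in-memory list of lines in place of the file and keeps the levels in a separate array instead of in the dtype:

```python
import numpy as np

lines = ["red 1", "green 2", "red 3", "blue 4"]  # stand-in for the file

codes = np.empty(len(lines), dtype=np.int32)
intern_table = {}
for i, line in enumerate(lines):
    field = line.split()[0]
    # A new field gets the next unused integer; a known field reuses its code.
    codes[i] = intern_table.setdefault(field, len(intern_table))

# Invert the mapping so the strings can be recovered from the codes.
levels = np.array(sorted(intern_table, key=intern_table.get))
decoded = levels[codes]
```

The final `levels[codes]` lookup plays the role that the `view` onto an enum dtype would play in the version above.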

I notice that the HDF5 C library has a concept of open versus closed
enums, but I can't tell from the documentation at hand why this is; it
looks like it might just be a limitation of the implementation. (Like,
a workaround for C's lack of a standard mapping type, which makes it
inconvenient to pass in all the mappings in to a single API call.)

> It might be worth adding a section which briefly compares and contrasts the
> proposed functionality with enums in various programming languages. Here are
> two links I found to try and get an idea:
>
> MS on C# enum usage:
> http://msdn.microsoft.com/en-us/library/cc138362.aspx
> Wikipedia on C++ enum class:
> http://en.wikipedia.org/wiki/C%2B%2B11#Strongly_typed_enumerations
>
> For example, the C# enum has a way to enable a "flags" mode, which will
> create successive powers of 2. This may not be a feature NumPy needs, but if
> people are finding it useful in C#, maybe it would be useful here too.

There's also a long, ongoing debate about how to do enums in Python -- e.g.:
http://www.python.org/dev/peps/pep-0354/
http://pypi.python.org/pypi/enum/
http://pypi.python.org/pypi/enum_meta/
http://pypi.python.org/pypi/flufl.enum/
http://pypi.python.org/pypi/lazr.enum/
http://pypi.python.org/pypi/pyutilib.enum/
http://pypi.python.org/pypi/coding/
http://stackoverflow.com/questions/36932/whats-the-best-way-to-implement-an-enum-in-python
I guess Guido likes flufl.enum:
http://mail.python.org/pipermail/python-ideas/2011-July/010909.html

BUT, I'm not sure any of this is relevant at all. "Enums" are a
programming language feature that are, first and foremost, about
injecting names into your code's namespace. What I'm hoping to see is
a dtype for holding categorical data, similar to an R "factor"
http://stat.ethz.ch/R-manual/R-devel/library/base/html/factor.html
https://svn.r-project.org/R/trunk/src/library/base/R/factor.R (NB:
This is GPL code if anyone is paranoid about contamination, but also
the most complete API description available)
or an HDF5 "enum"
http://www.hdfgroup.org/HDF5/doc/H5.user/Datatypes.html#Datatypes_Enum
I believe pandas has some functionality along these lines too, though
I can't find it in the online docs -- hopefully Wes will fill us in.

These are basically objects that act for most purposes like string
arrays, but in which all strings are required to come from a finite,
specified list. This list acts like some metadata attached to the
array; its order may or may not be significant. And they're
implemented internally as integer arrays.

I'm not sure what it would even mean to treat this kind of data as
"flags", since you can't take the bitwise-or of two strings...

-- Nathaniel

Benjamin Root

Mar 15, 2012, 8:56:36 PM

to Discussion of Numerical Python

I guess my problem is that this isn't _quite_ like the enums I am familiar with (but not quite unlike them either). Should we call it "factor" to avoid confusion, or would too many people not know what that is, yet be drawn in by the name "enum"?

Just a thought.

Ben Root

Stéfan van der Walt

Mar 15, 2012, 9:02:23 PM

to Discussion of Numerical Python

On Thu, Mar 15, 2012 at 4:02 PM, Nathaniel Smith <n...@pobox.com> wrote:
> I'm not sure what it would even mean to treat this kind of data as
> "flags", since you can't take the bitwise-or of two strings...

This makes more sense outside of ndarrays, where you would do something like:

enum FLAG0 = 1, FLAG1 = 2, FLAG2 = 4
do_something(data, mode=FLAG0 | FLAG2)

The enum is therefore just a handle for its numerical value. While it
may not be that useful in an array, I think Mark was just pointing out
that there may be other similar use cases, such as enumerating from 0
to N-1, or in reverse from N-1 down to 0, or in steps of 2, or in
powers of 2, etc.
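
For reference, the flags pattern can be written with `enum.IntFlag` from the standard library of newer Python versions (3.6 and later). The class name `Mode` is made up for this sketch; the flag names follow the pseudocode above:

```python
from enum import IntFlag, auto

class Mode(IntFlag):
    FLAG0 = auto()  # 1
    FLAG1 = auto()  # 2
    FLAG2 = auto()  # 4

# Flags are combined with bitwise-or; each member is a distinct bit.
mode = Mode.FLAG0 | Mode.FLAG2

has_flag0 = Mode.FLAG0 in mode
has_flag1 = Mode.FLAG1 in mode
```

Bitwise-and of two distinct single-bit flags is zero, so and-ing is what a function like `do_something` would use to test a flag, not to combine flags.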

Stéfan

Nathaniel Smith

Mar 16, 2012, 11:07:25 AM

to Discussion of Numerical Python

On Mar 16, 2012 1:02 AM, "Stéfan van der Walt" <stefan @sun.ac.za> wrote:

>
> On Thu, Mar 15, 2012 at 4:02 PM, Nathaniel Smith <njs @pobox.com> wrote:
> > I'm not sure what it would even mean to treat this kind of data as
> > "flags", since you can't take the bitwise-or of two strings...
>
> This makes more sense outside of ndarrays, where you would do something like:
>
> enum FLAG0 = 1, FLAG1 = 2, FLAG2 = 4
> do_something(data, mode=FLAG0 | FLAG2)
>
> The enum is therefore just a handle for its numerical value. While it
> may not be that useful in an array, I think Mark was just pointing out
> that there may be other similar use cases, such as enumerating from 0
> to N-1, or in reverse from N-1 down to 0, or in steps of 2, or in
> powers of 2, etc.

Right, there may be. But are there? That's the question :-)

It looks like R doesn't support anything except 1, ..., N numbering. There's really no reason it would either, since in their design the underlying integer values are almost entirely hidden from users. You could get at them if you wanted, but I bet less than 1% of users are even aware that factors and integers have anything to do with each other. Factors are just documented to be a way to store an array of strings drawn from a limited ordered list. (The ordering is important for things like polynomial coding and treatment versus baseline coding.)

HDF5 supports arbitrary symbol<->integer mappings.

0, ..., N-1 coding makes the common problem of creating an indicator matrix very convenient:
ind = np.zeros((len(enum_a), len(enum_a.dtype.levels)), dtype=bool)
ind[np.arange(len(enum_a)), enum_a.view(dtype=np.int32)] = True
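
The same construction can be run today on a plain integer code array. In this sketch `codes` and `levels` are assumed stand-ins for the enum array's integer view and its list of levels:

```python
import numpy as np

levels = np.array(["a", "b", "c"])
codes = np.array([0, 2, 1, 0], dtype=np.int32)  # 0, ..., N-1 coding

# One row per observation, one column per level.
ind = np.zeros((len(codes), len(levels)), dtype=bool)
# Pair each row index with its code so exactly one entry per row is set.
ind[np.arange(len(codes)), codes] = True
```

Indexing with the row numbers alongside the codes matters: `ind[:, codes] = True` would instead fill every row of each column that occurs in `codes`.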

But we can't restrict ourselves to only this coding if we want compatibility with HDF5 or R (because R is 1-based). So I guess supporting arbitrary mappings is worth it - though I doubt this flexibility will be used much. I'm curious if anyone can think of a reason they'd use it besides interoperability.

Cheers,
- Nathaniel

Bryan Van de Ven

Mar 16, 2012, 12:26:01 PM

to Discussion of Numerical Python

Hi all,

I have spent some time thinking about things, and discussing them with
folks nearby. I actually got to wondering whether we really need new
dtypes for this. It seems like enumerated values or factor levels could
be cast as an annotation or metadata that could be attached to any
existing integral dtypes. It spells differently enough that I have put
up an alternate version that reflects this notion. I'd like to see what
folks think of this direction:

https://github.com/bryevdv/numpy/blob/enum/doc/neps/enum_alt.rst

So this would require adding machinery to existing dtypes to behave
properly when there is factor metadata present. Perhaps that is not an
acceptable trade-off, but it seems worth discussing.

I think a very similar approach could be used to add categorical ranges
to any numerical or string types (I think they are called "shingles" in R?)

Please let me know what you think.

Bryan

Nathaniel Smith

Mar 17, 2012, 1:11:13 PM

to Bryan Van de Ven, Discussion of Numerical Python

On Fri, Mar 16, 2012 at 4:26 PM, Bryan Van de Ven <bry...@continuum.io> wrote:
> Hi all,
>
> I have spent some time thinking about things, and discussing them with folks
> nearby. I actually got to wondering whether we really need new dtypes for
> this. It seems like enumerated values or factor levels could be cast as an
> annotation or metadata that could be attached to any existing integral
> dtypes. It spells differently enough that I have put up an alternate version
> that reflects this notion. I'd like to see what folks think of this
> direction:
>
> https://github.com/bryevdv/numpy/blob/enum/doc/neps/enum_alt.rst
>
> So this would require adding machinery to existing dtypes to behave properly
> when there is factor metadata present. Perhaps that is not an acceptable
> trade-off, but it seems worth discussing.

I took a look at this, but I think something was lost in the translation from your head to text :-). Your description here makes it sound like what's different about this proposal is that there are very different underlying mechanics, but the enum_alt file just seems to describe an alternative and more-or-less equivalent user-level API. If you hadn't told me, I would have assumed that it just created a new dtype, rather than modifying existing ones.

What mechanism are you thinking of? Or did I miss something?

> I think a very similar approach could be used to add categorical ranges to
> any numerical or string types (I think they are called "shingles" in R?)

A 'shingle' is a way of mapping (floating point) numbers into categories. However, they generally allow a single number to fall into multiple categories. So for example, you might take these data points:

1 2 3 4 5 6 7 8 9 10 11

And divide them into categories A, B, C like this:

1 2 3 4 5 6 7 8 9 10 11
AAAAAAAAAAAAA
    BBBBBBBBBBBBB
        CCCCCCCCCCCCCCC

Which is why they're called "shingles" :-)
http://www.floridadisaster.org/hrg/images/roofs/shingle_loose_tab_large.jpg
This can be a very convenient data structure for various sorts of visualizations, but I'm not sure how it would make sense to integrate it into basic numerical types.

R has a more basic function called 'cut' which takes a numerical array plus some specified breakpoints, and returns a factor array. But that's a simple utility function that doesn't need any special features in the underlying representation.
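
A rough NumPy analogue of that utility can be built on `np.digitize`. The breakpoints and level names below are assumptions chosen for illustration, and the interval-closure convention may differ from R's default (`np.digitize` has a `right` parameter to switch it):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype=float)
breaks = np.array([4.0, 8.0])          # interior breakpoints (illustrative)
levels = np.array(["A", "B", "C"])

# With the default right=False:
#   code 0 for x < 4, code 1 for 4 <= x < 8, code 2 for x >= 8
codes = np.digitize(x, breaks)
labels = levels[codes]
```

Unlike a shingle, each number lands in exactly one category here, so the result is an ordinary integer code array plus a list of levels.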

-- Nathaniel
