I have started working on a NEP for adding an enumerated type to NumPy.
It is on my GitHub:
https://github.com/bryevdv/numpy/blob/enum/doc/neps/enum.rst
It is still very rough, and incomplete in places. But I would like to
get feedback sooner rather than later in order to refine it. In
particular there are a few questions inline in the document that I would
like input on. Any comments, suggestions, questions, concerns, etc. are
very welcome.
Thanks,
Bryan
_______________________________________________
NumPy-Discussion mailing list
NumPy-Di...@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Hi Bryan,
That's excellent, an enumerated type would be very useful. From a
quick read, though, what I'd really like to see is some discussion of
the goals here -- like some example situations where you see these
being used, and the problems they're intended to solve? Because for
example, C "enums" are designed to solve a completely different
problem than something like an R "factor", and off the top of my head
I don't know how well either maps onto hdf5 enumerated types. Another
example is that I can't tell from the document what the motivation for
having both "open" and "closed" enums is?
(Also, general question: is there some technical advantage to being
able to represent more complicated dtypes as strings, that justifies
making up these mini-languages like "enum:uint16[A, B, C, D, E:128]"?
It can't be necessary for pickling or anything, right, since AFAICT
there's already no string representation for structured dtypes? It
just seems like it'd be simpler and more elegant to use some Python
syntax like 'dtype(Enum(["a", "b", "c"], storage=np.uint16))' instead
of writing a tiny one-off parser and wedging what's really a data
structure into a string, but I may be missing something.)
-- Nathaniel
On Sat, Mar 10, 2012 at 3:25 AM, Bryan Van de Ven <bry...@continuum.io> wrote:
> Hi all,
>
> I have started working on a NEP for adding an enumerated type to NumPy.
> It is on my GitHub:
>
> https://github.com/bryevdv/numpy/blob/enum/doc/neps/enum.rst
>
> It is still very rough, and incomplete in places. But I would like to
> get feedback sooner rather than later in order to refine it. In
> particular there are a few questions inline in the document that I would
> like input on. Any comments, suggestions, questions, concerns, etc. are
> very welcome.
"t = np.dtype('enum', map=(n,v))"
^ Is this supposed to be indicating 'this is an enum with values
ranging between n and v'? It could be a bit more clear.
Is it possible to partially define an enum? That is, give the maximum
and minimum values, and only some of the enumeration value:name
mappings?
For example, an enum where 0 means 'n/a', +n means 'Type A Object
#(n-1)' and -n means 'Type B Object #(abs(n) - 1)'. I just want to map
the non-scalar values, while having a way to avoid treating valid
scalar values (e.g. +64) as out-of-range.
Example of what I mean:
"t = np.dtype('enum[N_A:0]', range = (-127, 127))"
(defined values being printed as a string, undefined being printed as a number.)
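A minimal sketch of that behaviour in plain Python, to make the question concrete. This is not a proposed NumPy API: `PartialEnum`, `format`, and `value_range` are hypothetical names used only for illustration. Named values print as strings, unnamed in-range values print as numbers, and out-of-range values are rejected.

```python
class PartialEnum:
    """Hypothetical helper: an integer range in which only some values are named."""

    def __init__(self, names, value_range):
        self.low, self.high = value_range
        self.names = dict(names)  # value -> name, e.g. {0: "N_A"}
        for value in self.names:
            self._check(value)

    def _check(self, value):
        # Reject anything outside the declared closed interval.
        if not (self.low <= value <= self.high):
            raise ValueError(f"{value} is outside [{self.low}, {self.high}]")

    def format(self, value):
        """Return the name for a named value, else the number as a string."""
        self._check(value)
        return self.names.get(value, str(value))


t = PartialEnum({0: "N_A"}, value_range=(-127, 127))
labels = [t.format(v) for v in (0, 64, -3)]
print(labels)
```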
David
I'll have to think about this (a little brain dump here). I have many
use cases in pandas where this would be useful which are basically
direct translations of R's factor data type. Note that R always
coerces the levels (the unique values) AFAICT to string type. However,
mapping back to a well-dtyped array is important, too. So the
temptation might be to do something like this:
ndarray: dtype storage type (uint32 or something)
mapping : khash with type PyObject* -> uint32
Now, one problem with this is that you want the mapping + dtype to be
invertible (otherwise you're left doing some type inference). The way
that I implement the mapping is to restrict the labeling to be from 0
to N - 1 which makes things easier. If we decide that having an
explicit value mapping
The nice thing about this is that the same set of core algorithms can
be used to fix numpy.unique. For example you would like to be able to
do:
enum_arr = np.enum(arr)
(this seems like a reasonable API to me) and that is a direct
equivalent of R's factor function. You need to be able to pass an
explicit ordering when calling the enum/factor function. If not
specified, you should have an option to either sort or not-- for
example suppose you convert an array of 1 million integers to enum but
you don't particularly care about the uniques (which could be very
large, up to the size of the array) being ordered (no need to pay N
log N for large N).
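A rough sketch of the three cases (explicit ordering, sorted, unsorted), assuming plain NumPy and a Python dict standing in for khash. The function name `factorize` and its parameters are hypothetical, not an existing NumPy API.

```python
import numpy as np


def factorize(values, levels=None, sort=True):
    """Hypothetical helper: return (codes, levels) with codes in 0..N-1."""
    values = np.asarray(values)
    if levels is None:
        if sort:
            # np.unique sorts the uniques and returns integer codes directly.
            uniques, codes = np.unique(values, return_inverse=True)
            return codes.astype(np.int32), uniques.tolist()
        # Unsorted: one hash-table pass, levels in order of first appearance.
        table = {}
        codes = [table.setdefault(v, len(table)) for v in values.tolist()]
        return np.array(codes, dtype=np.int32), list(table)
    # Explicit ordering supplied by the caller.
    table = {level: i for i, level in enumerate(levels)}
    codes = [table[v] for v in values.tolist()]  # KeyError on unknown values
    return np.array(codes, dtype=np.int32), list(levels)


arr = ["b", "a", "b", "c"]
codes_sorted, levels_sorted = factorize(arr)
codes_seen, levels_seen = factorize(arr, sort=False)
codes_given, levels_given = factorize(arr, levels=["c", "b", "a"])
```

The unsorted branch avoids the sort entirely, which is the point when the number of uniques is close to the size of the array.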
One nice thing about khash is that it can be serialized fairly easily.
Have you looked much at how I use enum-like ideas in pandas? It would
be great if I could offload some of this data algorithmic work to
NumPy.
We will want the enum data type to integrate with text file readers--
if you "factorize as you go" you can drastically reduce the memory
usage of structured array (or pandas DataFrame) columns with
long-ish strings and relatively few unique values.
- Wes
On Mon, Mar 12, 2012 at 9:33 AM, Wes McKinney <wesm...@gmail.com> wrote:
>
> Now, one problem with this is that you want the mapping + dtype to be
> invertible (otherwise you're left doing some type inference). The way
> that I implement the mapping is to restrict the labeling to be from 0
> to N - 1 which makes things easier.
> If we decide that having an explicit value mapping
(...?)
You might want to finish whatever thought that was :)
In Sage, the matrix objects are mutable when constructed, and you can
set_immutable to make them immutable.
The way I look at that though is that it is part of the construction
phase of the object, you'd typically construct, fill it in, then
set_immutable (to finish construction), then use it.
set/frozenset is an example of the opposite, and a design I personally
like better (i.e., "frozen_dtype" :-)).
Dag
>
> It might be worth adding a section which briefly compares and contrasts
> the proposed functionality with enums in various programming languages.
> Here are two links I found to try and get an idea:
>
> MS on C# enum usage:
> http://msdn.microsoft.com/en-us/library/cc138362.aspx
> Wikipedia on C++ enum class:
> http://en.wikipedia.org/wiki/C%2B%2B11#Strongly_typed_enumerations
>
> For example, the C# enum has a way to enable a "flags" mode, which will
> create successive powers of 2. This may not be a feature NumPy needs,
> but if people are finding it useful in C#, maybe it would be useful here
> too.
>
> Cheers,
> Mark
>
>
> Thanks,
>
> Bryan
I haven't run into any...
Thinking about it, I'm not sure I have any use case for this type
being mutable. Maybe someone else can think of one? The first case
that came to mind was in reading a large text file, where you want to
(1) auto-create an enum, (2) use a pre-allocated array, and (3) don't
know ahead of time what the levels are:
a = np.empty(lines_in_file, dtype=np.dtype(Enum()))
for i, line in enumerate(f):
    field = line.split()[0]
    a.dtype.add_level(field)
    a[i] = field
a.dtype.seal()
But really this can be done just as easily and efficiently
without a mutable dtype:
a = np.empty(lines_in_file, dtype=np.int32)
intern_table = {}
next_level = 0
for i, line in enumerate(f):
    field = line.split()[0]
    val = intern_table.setdefault(field, next_level)
    if val == next_level:
        next_level += 1
    a[i] = val
a = a.view(dtype=np.dtype(Enum(map=intern_table)))
I notice that the HDF5 C library has a concept of open versus closed
enums, but I can't tell from the documentation at hand why this is; it
looks like it might just be a limitation of the implementation. (Like,
a workaround for C's lack of a standard mapping type, which makes it
inconvenient to pass all the mappings in to a single API call.)
> It might be worth adding a section which briefly compares and contrasts the
> proposed functionality with enums in various programming languages. Here are
> two links I found to try and get an idea:
>
> MS on C# enum usage:
> http://msdn.microsoft.com/en-us/library/cc138362.aspx
> Wikipedia on C++ enum class:
> http://en.wikipedia.org/wiki/C%2B%2B11#Strongly_typed_enumerations
>
> For example, the C# enum has a way to enable a "flags" mode, which will
> create successive powers of 2. This may not be a feature NumPy needs, but if
> people are finding it useful in C#, maybe it would be useful here too.
There's also a long, ongoing debate about how to do enums in Python -- e.g.:
http://www.python.org/dev/peps/pep-0354/
http://pypi.python.org/pypi/enum/
http://pypi.python.org/pypi/enum_meta/
http://pypi.python.org/pypi/flufl.enum/
http://pypi.python.org/pypi/lazr.enum/
http://pypi.python.org/pypi/pyutilib.enum/
http://pypi.python.org/pypi/coding/
http://stackoverflow.com/questions/36932/whats-the-best-way-to-implement-an-enum-in-python
I guess Guido likes flufl.enum:
http://mail.python.org/pipermail/python-ideas/2011-July/010909.html
BUT, I'm not sure any of this is relevant at all. "Enums" are a
programming language feature that are, first and foremost, about
injecting names into your code's namespace. What I'm hoping to see is
a dtype for holding categorical data, similar to an R "factor"
http://stat.ethz.ch/R-manual/R-devel/library/base/html/factor.html
https://svn.r-project.org/R/trunk/src/library/base/R/factor.R (NB:
This is GPL code if anyone is paranoid about contamination, but also
the most complete API description available)
or an HDF5 "enum"
http://www.hdfgroup.org/HDF5/doc/H5.user/Datatypes.html#Datatypes_Enum
I believe pandas has some functionality along these lines too, though
I can't find it in the online docs -- hopefully Wes will fill us in.
These are basically objects that act for most purposes like string
arrays, but in which all strings are required to come from a finite,
specified list. This list acts like some metadata attached to the
array; its order may or may not be significant. And they're
implemented internally as integer arrays.
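As an illustration only, the idea can be mimicked today with an integer array plus a separate level list. The names `levels`, `codes`, and `encode` are hypothetical; a real dtype would carry the list along as metadata instead of keeping it in a separate variable.

```python
import numpy as np

levels = ["low", "medium", "high"]              # the finite, specified list
codes = np.array([0, 2, 2, 1], dtype=np.uint8)  # internal integer storage

# Decoding is a lookup into the level list.
decoded = np.array(levels)[codes]

# Comparing against a string becomes an integer comparison.
is_high = codes == levels.index("high")


def encode(value):
    """Map a string to its code, rejecting strings outside the list."""
    try:
        return levels.index(value)
    except ValueError:
        raise ValueError(f"{value!r} is not one of {levels}") from None
```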
I'm not sure what it would even mean to treat this kind of data as
"flags", since you can't take the bitwise-or of two strings...
-- Nathaniel
This makes more sense outside of ndarrays, where you would do something like:
enum FLAG0 = 1, FLAG1 = 2, FLAG2 = 4
do_something(data, mode=FLAG0 | FLAG2)
The enum is therefore just a handle for its numerical value. While it
may not be that useful in an array, I think Mark was just pointing out
that there may be other similar use cases, such as enumerating from 0
to N-1, or in reverse from N-1 down to 0, or in steps of 2, or in
powers of 2, etc.
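For anyone who wants to try the flags idea in current Python, here is a small sketch using the standard library's `enum.IntFlag` (available since Python 3.6). The class `Mode` and the function `do_something` are made-up names for illustration. Flags are combined with bitwise-or; the bitwise-and of two distinct flags is zero.

```python
from enum import IntFlag


class Mode(IntFlag):
    FLAG0 = 1
    FLAG1 = 2
    FLAG2 = 4


def do_something(data, mode):
    """Hypothetical consumer: report which flags are set in `mode`."""
    return [flag.name for flag in Mode if flag & mode]


selected = do_something(None, mode=Mode.FLAG0 | Mode.FLAG2)
both = int(Mode.FLAG0 | Mode.FLAG2)     # bitwise-or combines flags
neither = int(Mode.FLAG0 & Mode.FLAG2)  # bitwise-and of distinct flags
```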
Stéfan
On Mar 16, 2012 1:02 AM, "Stéfan van der Walt" <stefan@sun.ac.za> wrote:
>
> On Thu, Mar 15, 2012 at 4:02 PM, Nathaniel Smith <njs@pobox.com> wrote:
> > I'm not sure what it would even mean to treat this kind of data as
> > "flags", since you can't take the bitwise-or of two strings...
>
> This makes more sense outside of ndarrays, where you would do something like:
>
> enum FLAG0 = 1, FLAG1 = 2, FLAG2 = 4
> do_something(data, mode=FLAG0 | FLAG2)
>
> The enum is therefore just a handle for its numerical value. While it
> may not be that useful in an array, I think Mark was just pointing out
> that there may be other similar use cases, such as enumerating from 0
> to N-1, or in reverse from N-1 down to 0, or in steps of 2, or in
> powers of 2, etc.
Right, there may be. But are there? That's the question :-)
It looks like R doesn't support anything except 1, ..., N numbering. There's really no reason it would either, since in their design the underlying integer values are almost entirely hidden from users. You could get at them if you wanted, but I bet less than 1% of users are even aware that factors and integers have anything to do with each other. Factors are just documented to be a way to store an array of strings drawn from a limited ordered list. (The ordering is important for things like polynomial coding and treatment versus baseline coding.)
HDF5 supports arbitrary symbol<->integer mappings.
0, ..., N-1 coding makes the common problem of creating an indicator matrix very convenient:
ind = np.zeros((len(enum_a), len(enum_a.dtype.levels)), dtype=bool)
ind[np.arange(len(enum_a)), enum_a.view(dtype=np.int32)] = True
But we can't restrict ourselves to only this coding if we want compatibility with HDF5 or R (because R is 1-based). So I guess supporting arbitrary mappings is worth it - though I doubt this flexibility will be used much. I'm curious if anyone can think of a reason they'd use it besides interoperability.
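A runnable sketch of the indicator-matrix construction with plain integer arrays, since the enum dtype does not exist yet. All variable names and the example values are made up. With an arbitrary mapping, the stored values first have to be translated to dense 0, ..., N-1 positions; here that is done with `np.searchsorted` against the sorted stored values.

```python
import numpy as np

# Dense 0..N-1 codes index the columns of the indicator matrix directly.
codes = np.array([0, 2, 1, 2], dtype=np.int32)
n_levels = 3
ind = np.zeros((len(codes), n_levels), dtype=bool)
ind[np.arange(len(codes)), codes] = True

# Arbitrary mapping: the stored integers are 10, 20 and 40 instead of 0, 1, 2.
stored = np.array([10, 40, 20, 40], dtype=np.int32)
level_values = np.array([10, 20, 40])          # sorted stored values
dense = np.searchsorted(level_values, stored)  # position of each stored value
ind_arbitrary = np.zeros((len(stored), len(level_values)), dtype=bool)
ind_arbitrary[np.arange(len(stored)), dense] = True
```

R's 1-based coding is the special case where the stored values are 1, ..., N, so the translation reduces to subtracting one.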
Cheers,
- Nathaniel
I have spent some time thinking about things, and discussing them with
folks nearby. I actually got to wondering whether we really need new
dtypes for this. It seems like enumerated values or factor levels could
be cast as an annotation or metadata that could be attached to any
existing integral dtype. It is spelled differently enough that I have put
up an alternate version that reflects this notion. I'd like to see what
folks think of this direction:
https://github.com/bryevdv/numpy/blob/enum/doc/neps/enum_alt.rst
So this would require adding machinery to existing dtypes to behave
properly when there is factor metadata present. Perhaps that is not an
acceptable trade-off, but it seems worth discussing.
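As a rough illustration of the "annotation" idea, `np.dtype` already accepts a `metadata` dict, so factor levels can be attached to an ordinary integer dtype. The key name `"levels"` is an arbitrary choice for this sketch, and nothing in NumPy interprets it; that interpretation is the machinery that would have to be added. Whether metadata survives arithmetic or casting is not guaranteed.

```python
import numpy as np

levels = ("red", "green", "blue")

# Attach the level list to a plain int8 dtype as metadata.
factor_dtype = np.dtype(np.int8, metadata={"levels": levels})

codes = np.array([0, 2, 1], dtype=factor_dtype)

# The array carries the levels with it through its dtype.
stored_levels = codes.dtype.metadata["levels"]
decoded = [stored_levels[c] for c in codes]
```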
I think a very similar approach could be used to add categorical ranges
to any numerical or string types (I think they are called "shingles" in R?)
Please let me know what you think.
Bryan