Converting from DataFrames to dicts


Glen Hertz

Feb 10, 2013, 4:08:22 AM
to julia...@googlegroups.com
Hi,

I'm iterating over each row of a dataframe and I can't find a method to get the values of a row (as a vector or a dict of colname => row_value).  Is there a method like colnames but for getting the values of a row (or set of rows)? 

Thanks,

Glen

Glen Hertz

Feb 10, 2013, 5:58:57 AM
to julia-stats
It is quite easy to do a dict comprehension on a given row index:

name2val_dict = {cn => df[row,cn] for cn in colnames(df)}

Glen
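
The comprehension above uses 2013-era syntax: the `{k => v for ...}` dict literal was later removed from the language, and `colnames` became `names` in DataFrames. A minimal sketch of the same idea in current Julia follows. To stay dependency-free it assumes a NamedTuple of column vectors as a stand-in for the DataFrame, and `row_dict` is a hypothetical helper name, not part of any package.

```julia
# Stand-in for a DataFrame: a NamedTuple of equal-length column vectors.
# (Assumption: this is NOT the DataFrames type, only a dependency-free substitute.)
table = (A = [1, 2, 3, 4], B = ["A", "B", "C", "D"])

# Hypothetical helper: build a Dict of column name => value for a single row.
# The generator is passed to Dict(...) instead of the old {...} literal.
row_dict(tbl, row) = Dict(name => tbl[name][row] for name in keys(tbl))

d = row_dict(table, 2)
println(d[:A], " ", d[:B])  # prints "2 B"
```

With a real DataFrame the same comprehension shape should carry over, iterating over `names(df)` and indexing with `df[row, name]`.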



John Myles White

Feb 10, 2013, 2:26:10 PM
to julia...@googlegroups.com
We should really add this ability. It's related to the generic issue of converting DataFrames to Dicts that was recently raised. I'll mock up something, then let others take the lead at finalizing it.

I would prefer that rows become matrices rather than vectors to maintain the correspondence between DataFrame indexing and Array indexing.

 -- John

John Myles White

Feb 10, 2013, 2:42:19 PM
to julia...@googlegroups.com
I've added a draft function for doing this with the following behaviors:

using DataFrames
adf = DataFrame(quote A = 1:4; B = ["A", "B", "C", "D"] end)
dict(adf)
# => ["B"=>["A", "B", "C", "D"],"A"=>[1, 2, 3, 4]]

dict(adf[1, :])
# => ["B"=>["A"],"A"=>[1]]

dict(adf[1, :], true)
# => ["B"=>"A","A"=>1]

I'll let others debate what else is needed.

 -- John
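
The three behaviors of the draft (whole table, one row kept as one-element columns, one row flattened to scalars) can be sketched in current Julia as below. This is not the DataFrames implementation: the table is a NamedTuple of column vectors standing in for a DataFrame, and `select_rows` and `to_dict` are hypothetical names chosen for illustration.

```julia
# Stand-in for a DataFrame: a NamedTuple of column vectors (assumption).
table = (A = [1, 2, 3, 4], B = ["A", "B", "C", "D"])

# Hypothetical helper: select rows while keeping every column a vector,
# analogous to adf[1, :] returning a one-row table in the draft.
select_rows(tbl, rows) = map(col -> col[rows], tbl)

# Hypothetical converter mirroring the draft's dict(adf) and dict(adf, true).
function to_dict(tbl, flatten::Bool = false)
    if flatten
        # Flattening only makes sense when every column holds exactly one value.
        all(col -> length(col) == 1, values(tbl)) ||
            throw(ArgumentError("flatten requires a single-row table"))
        return Dict(String(name) => first(tbl[name]) for name in keys(tbl))
    end
    return Dict(String(name) => tbl[name] for name in keys(tbl))
end

whole  = to_dict(table)                           # column name => full column
one    = to_dict(select_rows(table, 1:1))         # column name => 1-element vector
scalar = to_dict(select_rows(table, 1:1), true)   # column name => scalar
```

Keeping the unflattened row as one-element vectors matches the preference stated earlier in the thread that row selection should not change the container shape.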


Tom Short

Feb 10, 2013, 3:31:36 PM
to julia...@googlegroups.com
Overall, I like it. Now that `dict` is no longer in Base, do we want
to use `Dict` instead?

John Myles White

Feb 10, 2013, 5:14:54 PM
to julia...@googlegroups.com
I would prefer dict. Eventually, I think we'll want all of the following:

vec()
matrix()
dict()

-- John

Tom Short

Feb 10, 2013, 5:32:39 PM
to julia...@googlegroups.com

The uppercase/lowercase for constructors/converters still seems muddled in Julia, but I thought there was more of a move to uppercase.

Because dict isn't in Base, we run into another problem--if another package defines dict then there's a conflict, and the user may have to preface the function with the appropriate module name.

I like the idea of your other converters. I can take a crack at that.

John Myles White

Feb 10, 2013, 5:46:52 PM
to julia...@googlegroups.com
The concern about multiple packages trying to use common names is a serious problem. For me, it is a strong argument for Base providing a much larger set of primitives that have almost no definition, but exist solely so that all packages can override those methods for their types without having to import other packages.

The other converters already exist: they just have different names.

 -- John

Tom Short

Feb 10, 2013, 6:13:04 PM
to julia...@googlegroups.com

Having Base define a lot of common names is a good idea.

Patrick O'Leary

Feb 10, 2013, 6:27:40 PM
to julia...@googlegroups.com
Except keeping short, common names out of Base so they can be used flexibly is also considered a good idea. There appears to be something deeply wrong with Julia's module system--the optional module glue question points at it as well--but I lack the insight to articulate quite what it is, much less how one would fix it.

Stefan Karpinski

Feb 10, 2013, 6:50:50 PM
to julia-stats
On Sun, Feb 10, 2013 at 1:27 PM, Patrick O'Leary <patrick...@gmail.com> wrote:
> Except keeping short, common names out of Base so they can be used flexibly is also considered a good idea. There appears to be something deeply wrong with Julia's module system--the optional module glue question points at it as well--but I lack the insight to articulate quite what it is, much less how one would fix it.

Can you elaborate more on what you see as the symptoms of wrongness?

Tom Short

Feb 10, 2013, 7:09:56 PM
to julia...@googlegroups.com
I sort-of agree with Patrick. Before we had modules, methods from
different packages rarely interfered with each other because of
multiple dispatch. Now with modules, it's great that a module can have
methods that are local, and you don't have to worry about interference
from methods from other modules. But, that comes at the expense of a
new type of interference: overlap of common names. DataFrames exports
many methods that use common names, like `cut`, `by`, `in`, `index`,
`matrix`, `range`, and `with`. If another package comes along and
defines and exports their own methods of the same name, the user can
no longer just type the method name and expect it to work. They'll
have to preface the function with the module name. See issue #1737 for
some discussion of this.

Having Base export a large number of common names is a kludgy way
around this. It'll work for many scenarios, but it'll also make
importall more dangerous in that a function you thought was local to
your module is no longer local. I like the idea of some sort of
`shared` keyword that's an alternative to `export` that puts the
method definition in a common pool. As Stefan noted in issue #1737,
that wasn't likely to happen. I didn't understand the reason behind
it, though.
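
The name-overlap problem described above can be reproduced with a small sketch in current Julia. `Frames` and `Shapes` are hypothetical stand-ins for two independently developed packages that both export `cut`; neither refers to a real package.

```julia
module Frames            # hypothetical package 1
export cut
cut(x::Vector{Int}) = "Frames.cut on a vector"
end

module Shapes            # hypothetical package 2
export cut
cut(s::String) = "Shapes.cut on a string"
end

using .Frames, .Shapes

# The two `cut` functions are unrelated objects, so multiple dispatch cannot
# merge them even though their signatures never overlap. The unqualified name
# is ambiguous; in the Julia versions I'm aware of, using it is an error.
unqualified_failed = try
    cut("abc")
    false
catch
    true
end

# Prefixing the module name always works.
a = Frames.cut([1, 2, 3])
b = Shapes.cut("abc")
```

Note that the conflict arises from the exports alone: the argument types would have been enough for dispatch to pick the right method had both definitions belonged to one generic function.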

Patrick O'Leary

Feb 10, 2013, 7:28:32 PM
to julia...@googlegroups.com

There's something that doesn't quite line up between the way we've set up namespaces and the nature of multiple dispatch. Fundamentally, the multiple dispatch model wants every method name to mean something semantically unique; this demands a flat namespace. Having to leave a bunch of common names in Base set aside to accomplish this (who decides what's "common"?) doesn't sound right to me.

Another odd corner I keep coming across is macro access to unexported names. I write a macro in a module:

module Mod
export @mac
macro mac(xpr)
    :(unexported_func($xpr))
end

unexported_func(xpr) = ...
end

This is (of course!) an error, since the caller of the macro can't see unexported_func. But lexically, the call is sufficiently qualified. (I'm aware the workaround is to use a fully-qualified call.)

Optional dependencies/module glue is another one. Julia is dynamic, but you can't late-bind types; types referred to by signatures must be defined at the time of method definition, even though the method isn't compiled until it is used. (Python's try/import/except pattern might be useful here, though.)

As I said, I have this odd gut feeling that something is wrong, but it's hard to explain why. So perhaps that feeling will just pass.

John Myles White

Feb 10, 2013, 7:31:24 PM
to julia...@googlegroups.com
I think we're finally starting to close in on the core problem: how multiple dispatch (which lets us re-use a small set of function names by distinguishing functions by type signature) interacts with modules (which let us re-use a small set of function names by restricting their scope). I've been debating this internally for some time, because it's clear that the Distributions module is going to bear the brunt of these problems once we start adding more support for statistical modeling to Julia.

To give a sense of my concerns, the following common names are something I'd like to see supported by every single statistical model defined in Julia:

* simulate()
* predict()
* cost()

Essentially, I'd like to see multiple dispatch exploited to define a consistent interface (in the formal OO sense) to all statistical models. This is not a trivial issue because that interface would need to be defined in some module, so that other modules could import the function names and override them. Because those names are not defined in Base, they would have to be defined by something like a Stats module. But that's not a great solution to the larger problem we're starting to run up against.

One potential solution: a new method for a generic function should not require importing that name. Imports should only be required if a new definition conflicts with the old one. This is the same issue Tom and I have harped on in the past because it's so common in DataFrames: the only time when importing from Base is interesting is when you're not supplementing an existing generic function, but are redefining it.

-- John
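
The "interface defined in some module" arrangement mentioned above can be sketched as follows in current Julia. All module and type names (`StatsInterface`, `LinearModels`, `ConstantModels`, `Line`, `Constant`) are hypothetical; only `predict` comes from the list in the message.

```julia
module StatsInterface    # hypothetical shared module that owns the name
export predict
function predict end     # a generic function with no methods yet
end

module LinearModels      # hypothetical model package 1
import ..StatsInterface: predict   # import the name in order to extend it
struct Line
    slope::Float64
    intercept::Float64
end
predict(m::Line, x) = m.slope * x + m.intercept
end

module ConstantModels    # hypothetical model package 2
import ..StatsInterface
struct Constant
    value::Float64
end
# Qualifying the function name is an alternative to importing it.
StatsInterface.predict(m::Constant, x) = m.value
end

using .StatsInterface

y1 = predict(LinearModels.Line(2.0, 1.0), 3.0)    # 2.0 * 3.0 + 1.0 = 7.0
y2 = predict(ConstantModels.Constant(5.0), 3.0)   # 5.0
```

Both model modules depend on the shared module but not on each other, which is the property the message asks for; the cost is that the shared module has to exist and be agreed on first.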

John Myles White

Feb 10, 2013, 7:51:00 PM
to julia...@googlegroups.com
It seems like Patrick and I had almost identical thoughts at the same time. On further reflection, I think the following two cases get at the core of my problems with the module system:

=== Case 1 ===

I want to extend show to work on my new type, aptly named MyNewType. I therefore import show from Base and then define show(MyNewType), which is handled cleanly by multiple dispatch. There is no conflict and nothing I have done overrides the behavior of Base, it simply extends it.

This type of function definition isn't problematic because all modules effectively depend upon Base. There's no cost to importing show from Base. Personally, I think this sort of extension shouldn't require any explicit importing.

=== Case 2 ===

I define a function called new_show(Type1) in Module1 and then define another function new_show(Type2) in Module2. This is where things really break. To make things work, we need to make one of the Modules have higher "priority" than the other, so that Module2 will extend Module1 or Module1 will extend Module2. Otherwise, Module1 will delete Module2 when they are used simultaneously (or vice versa). This deletion happens even though Module1 does not override Module2 and Module2 does not override Module1.

In other words, our current module system implicitly imposes a hierarchy on all modules. This is going to be terrible for parallel development of modules as the number of modules grows.

 -- John
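
Case 1 can be sketched in current Julia as below. `MyPackage` is a hypothetical module name and the `value` field is an assumption added so the type has something to print. In current Julia, qualifying the function name as `Base.show` is one way to add the method without a separate `import` statement.

```julia
module MyPackage         # hypothetical module
export MyNewType

struct MyNewType
    value::Int           # assumed field, for illustration only
end

# Adds a method to Base's generic `show`; no existing method is replaced.
Base.show(io::IO, x::MyNewType) = print(io, "MyNewType(value = ", x.value, ")")
end

using .MyPackage

shown = repr(MyNewType(3))   # "MyNewType(value = 3)"
other = repr(1)              # "1", existing behavior is untouched
```

Case 2 differs only in that no common owner like Base exists for `new_show`, so each module ends up creating its own unrelated function of that name.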

Jeff Bezanson

Feb 10, 2013, 9:03:01 PM
to julia...@googlegroups.com
>
> module Mod
> export @mac
> macro mac(xpr)
> :(unexported_func($xpr))
> end
>
> unexported_func(xpr) = ...
> end
>
> This is (of course!) an error, since the caller of the macro can't see

Not true; this works. The macroexpander makes sure non-esc()'d stuff
is looked up in the macro definition environment:

julia> module Mod
       export @mac
       macro mac(xpr)
           :(unexported_func($xpr))
       end

       unexported_func(xpr) = 1 + xpr
       end

julia> using Mod

julia> @mac 3
4

julia> unexported_func
ERROR: unexported_func not defined

Patrick O'Leary

Feb 10, 2013, 9:10:13 PM
to julia...@googlegroups.com
Since when? I just went through what I thought was equivalent to this on a build that was up to date as of yesterday, and ran into that problem. Interesting. I'm not sure what I did differently.

Patrick O'Leary

Feb 10, 2013, 9:14:21 PM
to julia...@googlegroups.com
Okay, found it. It came up when working interactively--I was separating expansion via a function and eval, rather than working through the macroexpander. Withdrawn.
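
The difference between the two paths (going through the macro expander versus building the expression with a function and calling eval) can be sketched in current Julia. `build_expr` is a hypothetical helper added here to stand in for the "expansion via a function" step; the rest follows the module from the earlier messages.

```julia
module Mod
export @mac, build_expr

macro mac(xpr)
    :(unexported_func($xpr))
end

# Hypothetical helper: returns the same expression as a plain value,
# so the macro expander never sees it and no hygiene is applied.
build_expr(xpr) = :(unexported_func($xpr))

unexported_func(xpr) = 1 + xpr
end

using .Mod

# Through the macro expander: unexported_func is resolved inside Mod.
via_macro = @mac 3

# Through a function plus eval: the expression is evaluated in the calling
# module, where unexported_func is not visible.
eval_in_caller_failed = try
    eval(build_expr(3))
    false
catch err
    err isa UndefVarError
end

# Evaluating the same expression inside Mod works again.
via_eval_in_mod = Core.eval(Mod, build_expr(3))
```

So the expression itself is identical in both paths; what changes is the module in which the name `unexported_func` is looked up.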

Jeff Bezanson

Feb 10, 2013, 9:27:13 PM
to julia...@googlegroups.com
This is a good insight; there do seem to be two kinds of names. Some
are so generic and commonly needed that there is effectively just one
(abstract) meaning. Operators, as well as things like show(), are the
examples. Everybody is happy to share the same +. Although in theory
you could define your own +(Int,Int), nothing like this has happened
so far and one wouldn't expect it to.

We put a bandaid on this by importing all Base.Operators by default. I
hesitated for a long time to make that change, even though it is very
easy to do, because I couldn't (and still can't) shake the feeling
that this is, well, just a bandaid.

Tom's "shared name pool" concept is interesting; basically a way to
have common names as in Base without them actually being in Base. That
does indeed seem to solve some of the problems here.