Please make conversion to DataFrame for nested data structures more predictable/documented

mush...@gmail.com

Nov 22, 2016, 9:45:45 AM
to PyData
Hello,
I have been working on topics that involve a lot of juggling with data structures, and Pandas has been great for that.
That is, once you figure out how to feed your data into it so that it outputs a well-formed DataFrame.

This is especially the case when you want to handle nested data structures.

For example, if you want to convert a list of lists, you can get quite different results:

l=list([[1,2],[1,2,3],[1,2,3,4]])

p=pd.DataFrame(l)

p
Out[5]:
   0  1    2    3
0  1  2  NaN  NaN
1  1  2  3.0  NaN
2  1  2  3.0  4.0

s=pd.Series(l)

s
Out[7]:
0          [1, 2]
1       [1, 2, 3]
2    [1, 2, 3, 4]
dtype: object

ps=pd.DataFrame(s)

ps
Out[9]:
              0
0        [1, 2]
1     [1, 2, 3]
2  [1, 2, 3, 4]

pc=pd.DataFrame(l, columns=["list"])
> raises AssertionError: 1 columns passed, passed data had 4 columns

There is no simple, direct way to build ps.

On the other hand, when working with a dict of lists:

d = {}
for i, seq in enumerate(l):
    d[i] = seq

d
Out[16]: {0: [1, 2], 1: [1, 2, 3], 2: [1, 2, 3, 4]}

p1=pd.DataFrame(d)
> raises ValueError: arrays must all be same length

s1=pd.Series(d)

s1
Out[19]:
0          [1, 2]
1       [1, 2, 3]
2    [1, 2, 3, 4]
dtype: object

ps1=pd.DataFrame(s1) #still works fine.

ps1
Out[21]:
              0
0        [1, 2]
1     [1, 2, 3]
2  [1, 2, 3, 4]

So I take it that to build ps from a list of lists or a dict of lists, you have to go through a Series.
On the other hand, obtaining p1 from a dict is anything but simple:

p1=pd.DataFrame(d, columns=["columns"])
> raises ValueError: arrays must all be same length



To obtain p1 from d, you have to go through the following, which is fairly convoluted:
p1=pd.DataFrame([l for l in d.values()], index=d.keys())

p1
Out[25]:
   0  1    2    3
0  1  2  NaN  NaN
1  1  2  3.0  NaN
2  1  2  3.0  4.0

BUT as you can see, it messes with the data for some reason, turning some values into floats:
p1.dtypes

Out[26]:
0      int64
1      int64
2    float64
3    float64
dtype: object




Couldn't this be made simpler, either by documenting it better (and sorry, I can't help much there, since I didn't build the conversion engine, but I would be glad to try),

or by adding keywords to control this behaviour?

I would expect it to be as controllable as the to_dict() method, whose behaviour can be tuned fairly well...
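
For instance (rebuilding the p from above), to_dict() already lets me pick the output shape through its orient keyword, if I'm not mistaken:

import pandas as pd

p = pd.DataFrame([[1, 2], [1, 2, 3], [1, 2, 3, 4]])  # same p as above

p.to_dict()                  # default: {column -> {index -> value}}
p.to_dict(orient='list')     # {column -> [values]}
p.to_dict(orient='records')  # [{column -> value}, ...], one dict per row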


I don't mean to be disrespectful about your tremendous and awesome work; I just want to point out something that seems essential to me: a data structure designed to take in data from other structures should itself be easy to set up. For that, keywords seem like the best option to me (or maybe a list of dimensions, with wildcards for specific behaviours such as: align on the shortest/longest row, split some nested structures, and so on).


In case I have missed something in the documentation, maybe I did not understand some crucial point or some possible use of the parameters, so please don't hesitate to point it out to me ;)


Goyo

Nov 24, 2016, 9:37:32 AM
to PyData
On Tuesday, November 22, 2016 at 15:45:45 (UTC+1), mush...@gmail.com wrote:
> Hello,
> I have been working on topics that involve a lot of juggling with data structures, and Pandas has been great for that.
> That is, once you figure out how to feed your data into it so that it outputs a well-formed DataFrame.
>
> This is especially the case when you want to handle nested data structures.

I do not think it is that hard. If you use a nested object to create a DataFrame, Pandas thinks that you want several columns. If you want only one column, you have to be explicit, and going through Series looks like the most obvious way:

pd.DataFrame(pd.Series(l))

              0
0        [1, 2]
1     [1, 2, 3]
2  [1, 2, 3, 4]

There might be a keyword or a classmethod for this but it would not save you any keystrokes and would be less readable:

pd.DataFrame(l, input1d=True)
pd.DataFrame.from_input1d(l)

For dictionaries, keys are interpreted as column labels and values as column data by default. But you can change it:

pd.DataFrame.from_dict(d, orient='index')

   0  1    2    3
0  1  2  NaN  NaN
1  1  2  3.0  NaN
2  1  2  3.0  4.0

Columns 2 and 3 are of float type because there is no NaN for integer types. This is a long-standing concern but it seems difficult to address.
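
A minimal illustration of what happens (the dtypes shown are what I would expect with default settings):

import pandas as pd

pd.Series([1, 2, 3]).dtype     # int64
pd.Series([1, 2, None]).dtype  # float64: the missing value forces the column to float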

Best regards
Goyo

mush...@gmail.com

Nov 25, 2016, 4:32:35 AM
to PyData
Thanks for your answer. Actually I am less concerned about keystrokes than about processing time: having to build a Series before building the DataFrame could be costly for massive amounts of data and many such columns. Also, for values of the same object type the behaviour is not consistent: for a dict it is taken for granted that the keys become the index, while for a list of lists the index is auto-generated. Dict values could be treated as a list, yet when those values are themselves lists the behaviour differs: for a dict it crashes, for a list of lists it works like a charm. Only pd.Series handles the two the same way.
Thanks for the info about NaN. Does this mean that even if I force the data type with dtype=int, it will still give NaN?
Thanks

Goyo

Nov 25, 2016, 7:11:30 AM
to PyData
On Friday, November 25, 2016 at 10:32:35 (UTC+1), mush...@gmail.com wrote:
> Thanks for your answer. Actually I am less concerned about keystrokes than about processing time: having to build a Series before building the DataFrame could be costly for massive amounts of data and many such columns.

You don't use lists to store massive amounts of data, do you? If you do then the main CPU overhead won't be the instantiation of a temporary Series. Do you have a realistic use case where this is an actual issue?
 
> Also, for values of the same object type the behaviour is not consistent: for a dict it is taken for granted that the keys become the index, while for a list of lists the index is auto-generated. Dict values could be treated as a list, yet when those values are themselves lists the behaviour differs: for a dict it crashes, for a list of lists it works like a charm. Only pd.Series handles the two the same way.

Sorry, I do not understand this. Dictionaries and lists are different things; dictionaries have keys and they are interpreted as index or column labels in a well documented way IIRC. If you mean this should work:

pd.DataFrame(d.values(), index=d.keys())

take into account that dict_values is not a sequence type (it does not have __getitem__) so no, it is not list-like. You can use list(d.values()) or DataFrame.from_dict(d, orient='index') instead.
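
Both alternatives should give the same frame here, something like:

import pandas as pd

d = {0: [1, 2], 1: [1, 2, 3], 2: [1, 2, 3, 4]}

# Materialise the values into a real list first...
pd.DataFrame(list(d.values()), index=d.keys())

# ...or let from_dict treat the keys as row labels directly.
pd.DataFrame.from_dict(d, orient='index')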
 
> Thanks for the info about NaN. Does this mean that even if I force the data type with dtype=int, it will still give NaN?

Did you try? Attempting to cast NaN to int will likely raise ValueError.
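
Something like this should show it (the exact error message depends on the pandas version):

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan])
s.astype(int)  # raises ValueError ("Cannot convert NA to integer" or similar)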

Regards
Goyo

mush...@gmail.com

Dec 30, 2016, 9:02:17 AM
to PyData
Hello, sorry it took me so long to come back, I was quite busy.


> You don't use lists to store massive amounts of data, do you? If you do then the main CPU overhead won't be the instantiation of a temporary Series. Do you have a realistic use case where this is an actual issue?
 
For the use case: whenever you work with JSON-embedded data and end up with a dictionary of lists that you want to convert to a pandas DataFrame, this can be a problem.
It is even more of a problem when your data is not already in JSON form (which pandas can read directly) but simply comes as a dictionary of data lists like this one.
It starts to matter once you have enough lists, each long enough (say, a data set for a sequence study from the OEIS).
For a list of lists I don't have a use case, but 1) you can't presume one doesn't exist, and 2) that doesn't mean pandas shouldn't have a proper way to handle it, since it is so easy to feed one in.
Actually, rather than having an algorithm decide on its own whether or not to split a sequence stored as a value, I think it would be better to have a parameter that says split or not, and lays out the columns of the resulting DataFrame accordingly. That would be simpler, more flexible and more predictable ;) Something along the lines of the sketch below is what I have in mind.
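
(Only a sketch: nested_to_frame and the split keyword are made-up names, not an existing pandas API.)

import pandas as pd

def nested_to_frame(data, split=True):
    # data  : a list of lists, or a dict mapping keys to lists
    # split : True  -> each inner element gets its own column (short rows padded with NaN)
    #         False -> each inner list is kept whole in a single object column
    if isinstance(data, dict):
        keys, values = list(data.keys()), list(data.values())
    else:
        keys, values = range(len(data)), list(data)
    if split:
        return pd.DataFrame(values, index=keys)
    return pd.DataFrame({"value": pd.Series(values, index=keys)})

# nested_to_frame(l)              -> like p  (columns 0..3, NaN padding)
# nested_to_frame(l, split=False) -> like ps (a single column holding the lists)
# nested_to_frame(d)              -> like p1, with the dict keys as the index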


> Sorry, I do not understand this. Dictionaries and lists are different things; dictionaries have keys and they are interpreted as index or column labels in a well documented way IIRC. If you mean this should work:
>
> pd.DataFrame(d.values(), index=d.keys())
>
> take into account that dict_values is not a sequence type (it does not have __getitem__) so no, it is not list-like. You can use list(d.values()) or DataFrame.from_dict(d, orient='index') instead.

Well, even without __getitem__, dict_values is iterable, so the two do share some common properties. Both points are easy to check:
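
d = {0: [1, 2], 1: [1, 2, 3], 2: [1, 2, 3, 4]}

list(d.values())   # iterating works: [[1, 2], [1, 2, 3], [1, 2, 3, 4]]
# d.values()[0]    # indexing does not: TypeError: 'dict_values' object is not subscriptable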

mush...@gmail.com

Dec 30, 2016, 9:03:10 AM
to PyData
Thanks, best regards.