Slicing DataFrames resets indexes

Robert Smith

Aug 23, 2015, 4:32:26 PM
to julia-users
Very silly question. In Julia 0.3, I noticed that after slicing a DataFrame, the resulting object doesn't keep its original row indexes. For example:

julia> using DataFrames

julia> df = DataFrame(x=1:100)

julia> df[20:50,:]
31x1 DataFrame
| Row | x  |
|-----|----|
| 1   | 20 |
| 2   | 21 |
| 3   | 22 |
| 4   | 23 |
| 5   | 24 |
| 6   | 25 |
| 7   | 26 |
| 8   | 27 |
⋮
| 23  | 42 |
| 24  | 43 |
| 25  | 44 |
| 26  | 45 |
| 27  | 46 |
| 28  | 47 |
| 29  | 48 |
| 30  | 49 |
| 31  | 50 |

In the output above, the DataFrame's index starts at 1 instead of 20. Is there a way to keep the same index as the original DataFrame? (This is the default behavior in pandas and R.)

Thanks.

Tom Short

Aug 23, 2015, 9:57:21 PM
to julia...@googlegroups.com

DataFrames don't have real row indexes. If you want a row identifier, you can add it as an extra column.
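
A minimal sketch of this suggestion, assuming a recent DataFrames.jl release (the `df.id` property syntax postdates this thread, and the column name `id` is only an illustrative choice):

```julia
using DataFrames

df = DataFrame(x = 101:200)
df.id = collect(1:nrow(df))   # explicit row identifier stored as an ordinary column (name is hypothetical)

sub = df[20:50, :]            # the displayed row numbers restart at 1
first_id = sub.id[1]          # the identifier travels with the row

# Use the identifier to find the same row in the original table.
row_again = df[df.id .== first_id, :]
```

Because the identifier is just data, it survives slicing, sorting, and filtering, whereas the displayed row number only reflects the current position.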

Robert Smith

Aug 24, 2015, 12:23:59 AM
to julia-users
Unfortunately, when it comes to displaying the columns in a front-end, there are actually two real columns. For example, this is the output using IJulia:

In [18]: kc.get_iopub_msg(timeout=1)
Out[18]:
{'buffers': [],
 'content': {u'data': {u'text/html': u'<table class="data-frame"><tr><th></th><th>x</th></tr><tr><th>1</th><td>1</td></tr><tr><th>2</th><td>2</td></tr><tr><th>3</th><td>3</td></tr><tr><th>4</th><td>4</td></tr><tr><th>5</th><td>5</td></tr><tr><th>6</th><td>6</td></tr><tr><th>7</th><td>7</td></tr><tr><th>8</th><td>8</td></tr><tr><th>9</th><td>9</td></tr><tr><th>10</th><td>10</td></tr><tr><th>11</th><td>11</td></tr><tr><th>12</th><td>12</td></tr><tr><th>13</th><td>13</td></tr><tr><th>14</th><td>14</td></tr><tr><th>15</th><td>15</td></tr><tr><th>16</th><td>16</td></tr><tr><th>17</th><td>17</td></tr><tr><th>18</th><td>18</td></tr><tr><th>19</th><td>19</td></tr><tr><th>20</th><td>20</td></tr><tr><th>21</th><td>21</td></tr><tr><th>22</th><td>22</td></tr><tr><th>23</th><td>23</td></tr><tr><th>24</th><td>24</td></tr><tr><th>25</th><td>25</td></tr><tr><th>26</th><td>26</td></tr><tr><th>27</th><td>27</td></tr><tr><th>28</th><td>28</td></tr><tr><th>29</th><td>29</td></tr><tr><th>30</th><td>30</td></tr><tr><th>&vellip;</th><td>&vellip;</td></tr></table>',
   u'text/plain': u'100x1 DataFrame\n| Row | x   |\n|-----|-----|\n| 1   | 1   |\n| 2   | 2   |\n| 3   | 3   |\n| 4   | 4   |\n| 5   | 5   |\n| 6   | 6   |\n| 7   | 7   |\n| 8   | 8   |\n| 9   | 9   |\n| 10  | 10  |\n| 11  | 11  |\n\u22ee\n| 89  | 89  |\n| 90  | 90  |\n| 91  | 91  |\n| 92  | 92  |\n| 93  | 93  |\n| 94  | 94  |\n| 95  | 95  |\n| 96  | 96  |\n| 97  | 97  |\n| 98  | 98  |\n| 99  | 99  |\n| 100 | 100 |'},
  u'execution_count': 2,
  u'metadata': {}},
 'header': {u'msg_id': u'12813207-38a2-4829-8b46-54dad551de04',
  u'msg_type': 'execute_result',
  u'session': u'8a7e3638-2b6a-4dd4-b834-6359b2a0a7dd',
  u'username': u'user',
  'version': '5.0'},
 'metadata': {},
 'msg_id': u'12813207-38a2-4829-8b46-54dad551de04',
 'msg_type': 'execute_result',
 'parent_header': {u'date': datetime.datetime(2015, 8, 23, 23, 5, 18, 16122),
  u'msg_id': u'1539d729-069b-4f27-90a2-3a3b7245e563',
  u'msg_type': u'execute_request',
  u'session': u'8a7e3638-2b6a-4dd4-b834-6359b2a0a7dd',
  u'username': u'user'}}

So even if I include an additional index, I'm going to get three columns because I can't get rid of the first one. Is there some way to prevent printing that column?

Milan Bouchet-Valat

Aug 24, 2015, 4:44:11 AM
to julia...@googlegroups.com
Maybe it should not be printed by default? Showing line numbers can
indeed give the false impression that they are stable row identifiers.
Better to leave the first column for an actual key added by the user.

An intermediate solution would be to show row numbers by default, but
allow the user to mark one of the columns as being the key. If a key is
present, it would replace the row numbers in the display.


Regards

Robert Smith

Aug 24, 2015, 11:46:07 AM
to julia-users
Well, the printed index is useful but leads to some confusion because it doesn't behave as a real index. It would be much better to have a real index. I believe there will be real indexes to increase speed, but Julia developers don't want to depend on them. Is that a correct assessment right now? It has been a while since those discussions took place.

Another alternative would be to print the index by default and let the user turn it off without requiring a key.

Sisyphuss

Aug 24, 2015, 11:59:57 AM
to julia-users
In my opinion, when you did `df[20:50,:]`, you constructed a new DataFrame, which no longer had anything to do with the original `df`. So you cannot expect it to know its original position in `df`. And when you print it, the row number (not an index) is generated on the fly, just to let you know which value is on which line of the output.

I think what is needed here is just a formatted `print()` for DataFrame that can toggle the row-number column.

Robert Smith

Aug 24, 2015, 12:28:30 PM
to julia-users
> In my opinion, when you did `df[20:50,:]`, you constructed a new DataFrame, which no longer had anything to do with the original `df`. So you cannot expect it to know its original position in `df`. And when you print it, the row number (not an index) is generated on the fly, just to let you know which value is on which line of the output.
>
> I think what is needed here is just a formatted `print()` for DataFrame that can toggle the row-number column.

When I do `df[20:50,:]`, the result is a subset of the original DataFrame `df`, so I should expect to keep track of the index in `df`. It doesn't take much to see why that is useful (an index often refers to a person, geographical entity, discrete period of time, etc).

Zheng Wendell

Aug 24, 2015, 12:33:00 PM
to julia...@googlegroups.com
If the index is so important to you, you should have included one (ID number, city name, date, etc.) as a column, instead of relying on the "row number" that is generated on the fly for the output.

Matt Bauman

Aug 24, 2015, 12:45:45 PM
to julia-users
If the row index is intrinsically connected to the other columns and meaningful in relation to them, you should add it as a separate column. All arrays in Julia start indexing at 1, and a data frame displays its "row" indices consistently with how it is indexed. Indexing arrays with Real numbers any differently is asking for trouble.

I'm definitely sympathetic to your request, and including meaningful axis information as an intrinsic part of an array is something I've thought a lot about. But if you'd like to work with data frames, these "axes" must be explicitly included in the table as columns (in the wide format). I've played around with allowing indexing with the values of the axis in a different package (AxisArrays), but it requires homogeneous arrays and is definitely not ready for general use yet.

Matt

Sisyphuss

Aug 24, 2015, 1:03:02 PM
to julia-users
By the way, `df[20:50,:]` is not a subset of `df`, but a copy of a subset of `df`.
Since `df` doesn't have any column whose name is "row", how could you expect the "row" to be an index?
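
The copy-versus-subset distinction can be sketched with plain Base arrays, which follow the same indexing convention. This is only an illustration with made-up variable names, not a statement about DataFrames internals:

```julia
a = collect(101:110)

copied = a[6:10]         # indexing with a range allocates a new, independent array
window = view(a, 6:10)   # a view keeps referring to the parent array

a[6] = 0                 # mutate the parent after slicing

copied_first = copied[1]             # still the old value: the copy is detached
window_first = window[1]             # reflects the change made to the parent
origin = parentindices(window)       # a view remembers where it came from
```

Both `copied` and `window` are indexed from 1; only the view retains a link to positions 6 through 10 of the parent, and that link is available through `parentindices` rather than through the indices themselves.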

Robert Smith

Aug 24, 2015, 1:06:59 PM
to julia-users
Here we have a couple of different issues. The first one is that printing a fake index is a bit confusing, because we can't do anything with that index and it doesn't behave as one. In my particular use case, I want to do something with the HTML representation of a DataFrame, but the fake index becomes a real index in the front-end. Obviously, the best option is to have a real index. Assuming that is not possible, having a toggle to say that I don't want to print the index would also be an acceptable solution.

The other issue that came up was: what do you do when you need a real index? The answer to that is straightforward: you should create a real index. Keep in mind that even after you have created a real index, you still have the fake one, and that doesn't look great. However, I'm more interested in the first issue.

Robert Smith

Aug 24, 2015, 1:14:24 PM
to julia-users
> By the way, `df[20:50,:]` is not a subset of `df`, but a copy of a subset of `df`.
> Since `df` doesn't have any column whose name is "row", how could you expect the "row" to be an index?

Because when you get the DataFrame back, there is a column called "Row" that looks like an index. As a user, if it looks like an index, you will try to use it accordingly. And that is indeed the way it works in R and pandas. You have to separate the way things work internally from the way the user will perceive certain features.

Cedric St-Jean

Aug 24, 2015, 9:26:45 PM
to julia-users
There was some discussion about indexes on GitHub, but it didn't really get anywhere. I'm also not comfortable with the decision not to have indexes. It makes the dataframes asymmetric, whereas pandas' and R's are "matrices with named axes".

Cédric

Robert Smith

Aug 24, 2015, 9:52:17 PM
to julia-users
> There was some discussion about indexes on GitHub, but it didn't really get anywhere. I'm also not comfortable with the decision not to have indexes. It makes the dataframes asymmetric, whereas pandas' and R's are "matrices with named axes".

Thank you. I'm not sure if that is the most recent conversation about it, but it has been a while, so maybe we should ask what the current consensus on this issue is. However, I can see that this is tricky. If there is a design decision not to rely on indexes for basic functionality, that's okay. But in the meantime, a fake/printed index looks good in general but provides poor usability.

Cedric St-Jean

Aug 25, 2015, 1:55:44 AM
to julia-users
I think that to the extent that they don't want a "real" index (and again, I also question that decision), printing the row number makes sense, since that's how you'll access the rows. If I have an array and I select half of its rows, the new array is still indexed 1:n, so they're following the same principle.

It's misleading if you come from an R/Python background, but otherwise I can see that it has its own consistency.
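
A small Base-only sketch of that principle, with hypothetical names: selecting rows from a matrix yields an array indexed from 1, so the original row numbers have to be kept separately if they matter.

```julia
A = collect(reshape(1:20, 10, 2))   # 10x2 matrix; first column holds 1:10

keep = findall(iseven, A[:, 1])     # original row numbers of the rows we select
B = A[keep, :]                      # the selected rows form a new array

new_rows = axes(B, 1)               # the new array is indexed from 1 again
original_row = keep[3]              # original position of B's third row
```

Here `keep` plays the role that an explicit identifier column plays in a data frame: it is the only place where the original positions are recorded.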

Robert Smith

Aug 25, 2015, 3:14:32 PM
to julia-users
> I think that to the extent that they don't want a "real" index (and again, I also question that decision), printing the row number makes sense, since that's how you'll access the rows. If I have an array and I select half of its rows, the new array is still indexed 1:n, so they're following the same principle.
>
> It's misleading if you come from an R/Python background, but otherwise I can see that it has its own consistency.

Sure. I can see it has some consistency, but it also has limitations. Of course, the solution is to create a real index, but the result looks confusing (again, this might only be an issue for people with an R or Python background). For example, here the printed index is not very useful:

julia> DataFrame(index=1:100, y=2*(1:100))[50:70,:]
21x2 DataFrame
| Row | index | y   |
|-----|-------|-----|
| 1   | 50    | 100 |
| 2   | 51    | 102 |
| 3   | 52    | 104 |
| 4   | 53    | 106 |
| 5   | 54    | 108 |
| 6   | 55    | 110 |
| 7   | 56    | 112 |
⋮
| 14  | 63    | 126 |
| 15  | 64    | 128 |
| 16  | 65    | 130 |
| 17  | 66    | 132 |
| 18  | 67    | 134 |
| 19  | 68    | 136 |
| 20  | 69    | 138 |
| 21  | 70    | 140 |

Robert Smith

Aug 25, 2015, 6:23:13 PM
to julia-users
Cedric:

It looks like the decision can be changed if there are enough interested contributors. This is the GitHub issue: https://github.com/JuliaStats/DataFrames.jl/issues/864