Imputing Data into a Series with a Time Series as the second level of the MultiIndex

17 views
Skip to first unread message

Kent Maxwell

unread,
Jun 20, 2018, 8:30:25 PM6/20/18
to PyData
Hello,

I am trying to find a solution to what I believe should be an easy problem.  Unfortunately, I can't seem to find the answer online and I am really struggling.

Consider the following code:

import pandas as pd
from pandas import Timestamp


demo_data = {('AAAA', Timestamp('2014-12-31 00:00:00', freq='A-DEC')): 4.0,
 ('AAAA', Timestamp('2015-12-31 00:00:00', freq='A-DEC')): 1.0,
 ('AAAA', Timestamp('2016-12-31 00:00:00', freq='A-DEC')): 2.0,
 ('AAAA', Timestamp('2017-12-31 00:00:00', freq='A-DEC')): 12.0,
 ('BBBB', Timestamp('2015-12-31 00:00:00', freq='A-DEC')): 4.0,
 ('BBBB', Timestamp('2016-12-31 00:00:00', freq='A-DEC')): 3.0,
 ('CCCC', Timestamp('2012-12-31 00:00:00', freq='A-DEC')): 7.0,
 ('CCCC', Timestamp('2013-12-31 00:00:00', freq='A-DEC')): 4.0,
 ('CCCC', Timestamp('2014-12-31 00:00:00', freq='A-DEC')): 4.0,
 ('CCCC', Timestamp('2015-12-31 00:00:00', freq='A-DEC')): 21.0,
 ('CCCC', Timestamp('2016-12-31 00:00:00', freq='A-DEC')): 16.0}

demo = pd.Series(demo_data)

The result is a pandas series that looks like this:

AAAA  2014-12-31     4.0
      2015-12-31     1.0
      2016-12-31     2.0
      2017-12-31    12.0
BBBB  2015-12-31     4.0
      2016-12-31     3.0
CCCC  2012-12-31     7.0
      2013-12-31     4.0
      2014-12-31     4.0
      2015-12-31    21.0
      2016-12-31    16.0

I want to impute a 0 for each missing year from 2012 to 2018 for each category like AAAA or BBBB so the final result looks like this:

AAAA  2012-12-31     0.0
      2013-12-31     0.0
      2014-12-31     4.0
      2015-12-31     1.0
      2016-12-31     2.0
      2017-12-31    12.0
      2018-12-31     0.0
BBBB  2012-12-31     0.0
      2013-12-31     0.0
      2014-12-31     0.0
      2015-12-31     4.0
      2016-12-31     3.0
      2017-12-31     0.0
      2018-12-31     0.0
CCCC  2012-12-31     7.0
      2013-12-31     4.0
      2014-12-31     4.0
      2015-12-31    21.0
      2016-12-31    16.0
      2017-12-31     0.0
      2018-12-31     0.0

I tried creating an datetimeindex and reindexing the second level, but it did not work.


idx = pd.date_range(start='1-1-2012', periods=6, freq='1Y')
demo.reindex(idx, level=1, fill_value=0)

Any assistance would be greatly appreciated, especially if it would be easy to re-use it do the same work at month level.

Thanks,

Kent

Paul Hobson

unread,
Jun 21, 2018, 1:08:28 PM6/21/18
to pyd...@googlegroups.com
I'm sure there's a more clever, high-level way to do this. But stick to pretty brute-force chained operations like this:

dates = pd.date_range(start='2011-12', end='2018-12', freq='A-DEC')
full = (
    demo.unstack(level=0)     # cross-tab the data
        .reindex(dates)       # fill out the complete index
        .fillna(0)            # replace missing values with zero
        .stack()              # uncross-tab
        .swaplevel()          # put the index levels back in the right order
        .sort_index(level=0)  # sort things to be clumped by e.g., AAAA
)

Hope that helps,
-Paul

--
You received this message because you are subscribed to the Google Groups "PyData" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pydata+un...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Tom Augspurger

unread,
Jun 21, 2018, 1:26:26 PM6/21/18
to pyd...@googlegroups.com
I would recommend reindexing with the desired final index.

```
In [18]: demo.reindex(pd.MultiIndex.from_product([demo.index.levels[0], idx]), fill_value=0)
Out[18]:

AAAA  2012-12-31     0.0
      2013-12-31     0.0
      2014-12-31     4.0
      2015-12-31     1.0
      2016-12-31     2.0
      2017-12-31    12.0
BBBB  2012-12-31     0.0
      2013-12-31     0.0
      2014-12-31     0.0
      2015-12-31     4.0
      2016-12-31     3.0
      2017-12-31     0.0
CCCC  2012-12-31     7.0
      2013-12-31     4.0
      2014-12-31     4.0
      2015-12-31    21.0
      2016-12-31    16.0
      2017-12-31     0.0
dtype: float64
```

I'm not sure that the level argument is functioning properly. https://github.com/pandas-dev/pandas/issues/7895 seems related.

To unsubscribe from this group and stop receiving emails from it, send an email to pydata+unsubscribe@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "PyData" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pydata+unsubscribe@googlegroups.com.

Joris Van den Bossche

unread,
Jun 21, 2018, 5:45:09 PM6/21/18
to PyData
Yes, Tom's workaround is a good one, but just to note that your initial attempt (demo.reindex(idx, level=1, fill_value=0)) should IMO work*.

Some related issues about this:
- recent bug report that reindex on MultiIndex is not filling missing values: https://github.com/pandas-dev/pandas/issues/21147
- issue on unclear documentation of the level keyword for reindex: https://github.com/pandas-dev/pandas/issues/15590

* although there are some questions about what to do with remaining levels (in case you are not reindexing on the last level)

Joris
Reply all
Reply to author
Forward
0 new messages