Message from discussion
Pandas row binning
Received: by 10.236.117.66 with SMTP id i42mr4464564yhh.8.1349374886012;
Thu, 04 Oct 2012 11:21:26 -0700 (PDT)
X-BeenThere: pydata@googlegroups.com
Received: by 10.236.154.232 with SMTP id h68ls4743011yhk.2.gmail; Thu, 04 Oct
2012 11:21:25 -0700 (PDT)
Received: by 10.236.35.52 with SMTP id t40mr4504887yha.31.1349374885071;
Thu, 04 Oct 2012 11:21:25 -0700 (PDT)
Received: by 10.236.35.52 with SMTP id t40mr4504886yha.31.1349374885056;
Thu, 04 Oct 2012 11:21:25 -0700 (PDT)
Return-Path: <alva...@gmail.com>
Received: from mail-gg0-f176.google.com (mail-gg0-f176.google.com [209.85.161.176])
by gmr-mx.google.com with ESMTPS id y43si1789098yhi.2.2012.10.04.11.21.24
(version=TLSv1/SSLv3 cipher=OTHER);
Thu, 04 Oct 2012 11:21:24 -0700 (PDT)
Received-SPF: pass (google.com: domain of alva...@gmail.com designates 209.85.161.176 as permitted sender) client-ip=209.85.161.176;
Authentication-Results: gmr-mx.google.com; spf=pass (google.com: domain of alva...@gmail.com designates 209.85.161.176 as permitted sender) smtp.mail=alva...@gmail.com; dkim=pass header...@gmail.com
Received: by mail-gg0-f176.google.com with SMTP id e5so202071ggk.35
for <pydata@googlegroups.com>; Thu, 04 Oct 2012 11:21:24 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
d=gmail.com; s=20120113;
h=mime-version:sender:in-reply-to:references:from:date
:x-google-sender-auth:message-id:subject:to:content-type
:content-transfer-encoding;
bh=hqAowPBVBc5z0TNorOmzYIbcOrkXCAbGxa+LShXbt60=;
b=RRfw7AyxQnthe1KAqKs+R//uXkdJfF1ktPs7iz8jzecnkNUtRXbMSeZisIOZILzg/E
6jIVGVmpbL082grJg/NOp3JuFqXif3WQG5XvtQZrI6bMZ1MpnlWe5OTtStKwBGfA+7rv
FdxbecBJ2e89Np/qC1nJD6OQXAx9mSHO4Za5InoI+zWEwv1GECmL3IazY4kFrW44EUh2
S6+WrLFuNtR4tRZ2Lf1/nlO94+FDGBlVO4nvMxpzAMyfcey4c2dtCW1cc6hruGLSWqf8
78P7s5NvuXA5+X8cFbF4XknDKjtVfR6RgxXxZITQ+9OPi6n7HtM5pPTZodZ2mUf9SV/a
B+cw==
Received: by 10.236.124.44 with SMTP id w32mr5944861yhh.76.1349374884511; Thu,
04 Oct 2012 11:21:24 -0700 (PDT)
MIME-Version: 1.0
Sender: alva...@gmail.com
Received: by 10.101.204.18 with HTTP; Thu, 4 Oct 2012 11:21:04 -0700 (PDT)
In-Reply-To: <CAMHV+dCtkTuV3mOOA3U9H=5swePE1pZR3AU-rk7tkOtTR67...@mail.gmail.com>
References: <CAMHV+dCPnVLN9a5c=OCL3h-cYMJ3W8m7tvL8CNZCSwRAw80...@mail.gmail.com>
<CAKS7gT6mmEnDTHay703K5jLsFXW4J-kkhuCTS9njcp88OiT...@mail.gmail.com>
<CAMHV+dDHboWEkxE_-LFJ7OhW6=UBfaiyYYUxaKG+4eH8tc9...@mail.gmail.com>
<CAMHV+dBRE0CvvoAuf+x9r96Ghh9vZL0UAc1Xo-5eU-2LEpA...@mail.gmail.com>
<CAKS7gT66taZitVwE_dFmP9LHmBuraAnaFmBZtCSPcgy_4iy...@mail.gmail.com>
<CAMHV+dDR8gtODJZodLd4EGafOeQTYPO2axXYKg1N2BgALLi...@mail.gmail.com>
<CAFFY=MFs-v_+c=m-H4VcCK0kaJq9ACSMF_YOZ5a_HZPKS89...@mail.gmail.com> <CAMHV+dCtkTuV3mOOA3U9H=5swePE1pZR3AU-rk7tkOtTR67...@mail.gmail.com>
From: Alvaro Tejero Cantero <alv...@minin.es>
Date: Thu, 4 Oct 2012 19:21:04 +0100
Message-ID: <CAFFY=MHu0VzCVgSsy89R08c=5gfQ6LPD86KEJDb7V-NP4Si...@mail.gmail.com>
Subject: Re: [pydata] Pandas row binning
To: pydata@googlegroups.com
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
I actually tried it before writing, on Wouter's example, and this was
the result:
In [5]: df_resampled =3D df.groupby(lambda x:x/5.).mean()
In [6]: df_resampled
Out[6]:
<class 'pandas.core.frame.DataFrame'>
Index: 100 entries, 0.0 to 19.8
Columns: 100 entries, 0 to 99
dtypes: float64(100)
note index contains 100 entries, not 20. This was on 0.8.1 right now
and on 0.9.rcsomething on the morning.
-=C3=A1.
On 4 October 2012 19:12, Adam Hughes <hughesada...@gmail.com> wrote:
>
>
> On Thu, Oct 4, 2012 at 4:15 AM, Alvaro Tejero Cantero <alv...@minin.es>
> wrote:
>>
>> just a pedantic note (hammering on Wouter's remark "for Python 2.7")
>> to make it more robust: use the integer division (// instead of /) so
>> that the grouping lambda survives the transition to Python 3 (or a
>> from __future__ import division statement at the top of the module)
>>
>
> So, in my application, I actually am passing floats into the divisor (aka=
x
> / 4.55), and letting pandas automatically figure out the closest integer =
by
> which to split the table. It seems like pandas will convert to an intege=
r
> internally, regardless of if one passes a float or integer into the divis=
or.
>
>>
>>
>> -=C3=A1.
>>
>>
>> On 3 October 2012 20:19, Adam Hughes <hughesada...@gmail.com> wrote:
>> > Ah this is perfect thanks a lot!
>> >
>> > On Oct 3, 2012 3:02 AM, "Wouter Overmeire" <loda...@gmail.com> wrote:
>> >>
>> >>
>> >>
>> >> 2012/10/3 Adam Hughes <hughesada...@gmail.com>
>> >>>
>> >>> This seems like a common operation, eg row binning and column binnin=
g.
>> >>> I
>> >>> actually was surprised to find I couldn't turn up much about this in
>> >>> numpy
>> >>> searches which always go back to histogramming. This is not ideal i=
f
>> >>> you
>> >>> need to preserve ordering (aka, each column of the data frame would
>> >>> need its
>> >>> own, non-sequential set of bins bins), so I'm going to write a quick
>> >>> program
>> >>> to do this. If anyone is interested, I'll post it here later.
>> >>>
>> >>> The operations would simply first reshape the array, and then do a
>> >>> vectorized average. For example, if I had 100 rows by 100 columns a=
nd
>> >>> wanted to bin the rows by 5, I merely reshape into 20 separate 5*100
>> >>> arrays:
>> >>>
>> >>> 100 x 100 ---> 20 (5x100)
>> >>>
>> >>> Then I do a vectorized average on each of the 5x100 arrays and pop
>> >>> into
>> >>> the dataframe. This should be the fastest way to do it, no? Will t=
ry
>> >>> to be
>> >>> clever on handling non-perfect division.
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Tue, Oct 2, 2012 at 6:08 PM, Adam Hughes <hughesada...@gmail.com>
>> >>> wrote:
>> >>>>
>> >>>> Thanks Wouter. So this operation is only valid for TimeSeries?
>> >>>>
>> >>>>
>> >>>> On Tue, Oct 2, 2012 at 7:13 AM, Wouter Overmeire <loda...@gmail.com=
>
>> >>>> wrote:
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> 2012/10/2 Adam Hughes <hughesada...@gmail.com>
>> >>>>>>
>> >>>>>> Hi guys,
>> >>>>>>
>> >>>>>> I am interested in doing row binning in an efficient manner.
>> >>>>>>
>> >>>>>> I have a dataframe, say 100 rows by 20 columns. I'd like to redu=
ce
>> >>>>>> this to a 10 by 20 dataframe, where the new row values are the
>> >>>>>> average, sum
>> >>>>>> or other mapped operation that the user passes in. I'm having
>> >>>>>> trouble
>> >>>>>> finding the best way to do this, without resorting to row-by-row
>> >>>>>> iterations.
>> >>>>>>
>> >>>>>> I've been trying to follow the API here, but I know that it is no=
t
>> >>>>>> completely up to date. How would you guys make a function that
>> >>>>>> takes in a
>> >>>>>> bin number and does this type of operation?
>> >>>>>>
>> >>>>>> I thought the resample method might be what I wanted, but don't
>> >>>>>> understand the "rule" argument.
>> >>>>>>
>> >>>>>> PS, i'm avoiding histograms because my arrays are intentionally o=
ut
>> >>>>>> of
>> >>>>>> order, and I need to preserve this.
>> >>>>>>
>> >>>>>> Thanks.
>> >>>>>>
>> >>>>>> --
>> >>>>>>
>> >>>>>>
>> >>>>>
>> >>>>>
>> >>>>> The "rule" argument indicates the bin size.
>> >>>>> For example, daily bins -- 'D', 12 hour bins -- '12H', or monthly
>> >>>>> bins
>> >>>>> -- 'M', etc.
>> >>>>>
>> >>>>> In [73]: index =3D pandas.date_range('2012-10-10', periods=3D100,
>> >>>>> freq=3D'H')
>> >>>>> # 100 hourly spaced timestamps
>> >>>>>
>> >>>>> In [74]: import random
>> >>>>>
>> >>>>> In [75]: index =3D random.sample(index, 100) # index is unsorted n=
ow
>> >>>>>
>> >>>>> In [76]: df =3D pandas.DataFrame(np.random.randn(100,4), index,
>> >>>>> list('ABCD'))
>> >>>>>
>> >>>>> In [77]: df.resample(rule=3D'D', how=3D'mean')
>> >>>>> Out[77]:
>> >>>>> A B C D
>> >>>>> 2012-10-10 0.190634 -0.010331 -0.533944 2.019264
>> >>>>> 2012-10-11 -0.042690 -0.025202 0.340528 -0.337808
>> >>>>> 2012-10-12 0.046622 0.136284 -0.292848 0.261602
>> >>>>> 2012-10-13 0.283887 -0.084499 -0.468463 0.104838
>> >>>>> 2012-10-14 -0.343621 -0.031466 -0.166999 0.215456
>> >>>>> 2012-10-15 -0.702906 1.069791 0.464696 -0.049558
>> >>>>>
>> >>>>> In [78]: df.resample(rule=3D'12H', how=3D'mean')
>> >>>>> Out[78]:
>> >>>>> A B C D
>> >>>>> 2012-10-10 00:00:00 0.190634 -0.010331 -0.533944 2.019264
>> >>>>> 2012-10-10 12:00:00 0.033740 0.386743 0.553311 -0.144823
>> >>>>> 2012-10-11 00:00:00 -0.119121 -0.437148 0.127744 -0.530792
>> >>>>> 2012-10-11 12:00:00 -0.150271 0.404516 0.034835 0.475245
>> >>>>> 2012-10-12 00:00:00 0.243515 -0.131948 -0.620531 0.047958
>> >>>>> 2012-10-12 12:00:00 0.153480 -0.061286 -0.546608 -0.074879
>> >>>>> 2012-10-13 00:00:00 0.414294 -0.107713 -0.390319 0.284555
>> >>>>> 2012-10-13 12:00:00 -0.188024 -0.298300 0.105004 0.405880
>> >>>>> 2012-10-14 00:00:00 -0.499219 0.235367 -0.439002 0.025032
>> >>>>> 2012-10-14 12:00:00 -0.702906 1.069791 0.464696 -0.049558
>> >>>>>
>> >>>>> In [79]: df.resample(rule=3D'M', how=3D'mean')
>> >>>>> Out[79]:
>> >>>>> A B C D
>> >>>>> 2012-10-31 -0.032573 0.030818 -0.132466 0.077287
>> >>>>>
>> >>>>> --
>> >>>>>
>> >>>>>
>> >>>>
>> >>>>
>> >>>
>> >>> --
>> >>>
>> >>>
>> >>
>> >>
>> >>
>> >> Note that groupby can be used to do the "reshaping", from this its ea=
sy
>> >> to
>> >> compute the means and get a new frame.
>> >>
>> >> In the example below (python 2.7) i just reshape by "each 5 consecuti=
ve
>> >> rows is a group" over which the mean is taken. You can use whatever
>> >> groupby
>> >> function to do the reshaping.
>> >>
>> >> In [34]: df =3D pandas.DataFrame(np.random.randn(100,100))
>> >>
>> >> In [35]: df_resampled =3D df.groupby(lambda x:x/5).mean()
>> >>
>> >> In [36]: df_resampled
>> >> Out[36]:
>> >> <class 'pandas.core.frame.DataFrame'>
>> >> Int64Index: 20 entries, 0 to 19
>> >> Columns: 100 entries, 0 to 99
>> >> dtypes: float64(100)
>> >>
>> >> --
>> >>
>> >>
>> >
>> > --
>> >
>> >
>>
>> --
>>
>>
>
> --
>
>