[pandas] Binning Datetime


João Pedro

Sep 17, 2020, 10:35:11 AM
to PyData
Hello all,

I'm working with a dataframe containing the columns job id, start, end and ncpus.

job id  start                end                  ncpus
1       2020-01-01 11:12:17  2020-01-01 13:24:35  1

The idea is to reshape this dataframe in something like

period               job id  duration         ncpus  cpu_hours
2020-01-01 11:00:00  1       0 days 00:47:43  1      0.79
2020-01-01 12:00:00  1       0 days 01:00:00  1      1.00
2020-01-01 13:00:00  1       0 days 00:24:35  1      0.40

The problem is that in order to do this, I need to apply a for loop (either by grouping on "job id" or with df.iterrows, since each job is one row), expand the hours between start and end to get the "period" column, and then calculate the duration.

For a list of 30,000 jobs this takes a long time, ~30 s, and 30,000 is not even a dent in all the data.
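
The per-row version looks roughly like this (simplified sketch; df holds the columns from the example above):

import pandas as pd

rows = []
for _, job in df.iterrows():  # one row per job
    # hourly bins touched by this job
    periods = pd.date_range(job.start.floor("1H"), job.end.floor("1H"), freq="1H")
    for period in periods:
        # overlap of the job with this one-hour bin
        duration = min(job.end, period + pd.Timedelta("1H")) - max(job.start, period)
        rows.append({"period": period,
                     "job id": job["job id"],
                     "duration": duration,
                     "ncpus": job.ncpus,
                     "cpu_hours": duration.total_seconds() / 3600 * job.ncpus})
result = pd.DataFrame(rows)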

Do you have any recommendations on how to perform it efficiently?





Josh Friedlander

Sep 17, 2020, 12:23:40 PM
to pyd...@googlegroups.com
I can't see a way to do it without a groupby, but you should post it on StackOverflow. There are some Pandas ninjas there.


Pietro Battiston

Sep 17, 2020, 1:59:42 PM
to pyd...@googlegroups.com
I think there's a way - not necessarily very nice.

You should, for each row, extract the following parts:
1) the time from "start" to the next hour
2) what the next hour is
3) the time to "end" from the previous hour
4) what the previous hour is
5) the number of hours fully covered.

Each of these is relatively easy to get with vectorized operations.

Now, you want to sum 1) grouping by 2), and sum 3) grouping by 4),
then add the two results together.

Finally, for i = 1 to n, where n is the maximum duration (in hours) of
a single task, you want to count rows where 5) is >= i, shift the
result by (i-1) hours, and sum, grouping by 2).

Add these together, multiply by 1 hour, add the result of the previous
part, and you should be done.

(I bet there's a way of avoiding even the loop from 1 to n)
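
A rough sketch of steps 1) to 5) with vectorized operations (assuming hourly bins and the column names from the original frame; untested):

import pandas as pd

freq = "1H"                     # assumed hourly bin width
one_hour = pd.Timedelta(freq)

# 1) + 2): time from "start" to the next full hour, and which hour that is
next_hour = df["start"].dt.ceil(freq)
start_part = (next_hour - df["start"]) / one_hour    # fraction of an hour

# 3) + 4): time from the previous full hour to "end", and which hour that is
prev_hour = df["end"].dt.floor(freq)
end_part = (df["end"] - prev_hour) / one_hour        # fraction of an hour

# 5): number of hours fully covered in between
full_hours = (prev_hour - next_hour) / one_hour

# sum 1) grouping by 2), sum 3) grouping by 4), add the two results
partial = (start_part.groupby(next_hour).sum()
           .add(end_part.groupby(prev_hour).sum(), fill_value=0))

The loop over the fully covered hours is the remaining piece.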

Pietro



On Thu, 17/09/2020 at 19.23 +0300, Josh Friedlander wrote:

Josh Friedlander

Sep 17, 2020, 3:38:55 PM
to pyd...@googlegroups.com
(I came up with a groupby solution which I originally thought was too slow, but it seems faster than the one posted on the Reddit thread. My test data is 50,000 rows.)

import datetime
import random
import pandas as pd

start = [datetime.datetime(2020, 1, 1) + random.random() * datetime.timedelta(days=1) for _ in range(50000)]
end = [x + random.randint(10, 500) * datetime.timedelta(minutes=1) for x in start]
data = {
    "job id": range(50000),
    "start": start,
    "end": end,
    "ncpus": [random.randint(1, 4) for _ in range(50000)]
}

df = pd.DataFrame(data)
df.start = pd.to_datetime(df.start)
df.end = pd.to_datetime(df.end)


def get_bins(a):
    # each group holds exactly one job row: unpack it
    job_id, start_time, end_time, ncpus = a.iloc[0].tolist()
    # one bin per clock hour the job touches
    rng = pd.date_range(start_time.floor('h'), end_time.floor('h'), freq='h')
    # duration inside each bin: clip the bin edges to [start_time, end_time]
    tmp = pd.DataFrame(
        pd.Series(rng + 1 * rng.freq, index=rng).clip(upper=end_time) -
        pd.Series(rng, index=rng).clip(lower=start_time),
        columns=['duration'])
    tmp['ncpus'] = ncpus
    tmp['job id'] = job_id
    tmp['cpu_hours'] = tmp.duration.dt.seconds / 3600 * ncpus
    return tmp[['job id', 'duration', 'ncpus', 'cpu_hours']].copy()

df.groupby('job id').apply(get_bins)


João Pedro

Sep 18, 2020, 5:10:58 AM
to PyData
df = pd.DataFrame(data)
df.start = pd.to_datetime(df.start)
df.end = pd.to_datetime(df.end)

freq = "1H"  # hourly bin width (assumed)
next_hour = df.start.dt.ceil(freq)
past_hour = df.end.dt.floor(freq)

df["next_hour"] = next_hour
df["past_hour"] = past_hour
df["time_from_start_to_next_hour"] = next_hour - df.start 
df["time_from_past_hour_to_end"] = df.end - df.past_hour

total_start = df.groupby(["next_hour"])["time_from_start_to_next_hour"].sum()
total_end = df.groupby(["past_hour"])["time_from_past_hour_to_end"].sum()

df["duration"] = df.end - df.start
df["fully_covered_hours"] = df.duration - df.time_from_start_to_next_hour - df.time_from_past_hour_to_end


But I got stuck on your loop. I understood the for i in range(1, int(np.ceil((df.duration / pd.Timedelta(freq)).max()))), but the inner part of the for loop is not so clear to me.

I check the jobs whose fully_covered_hours is at least the evaluated duration i (df.fully_covered_hours / pd.Timedelta(freq) >= i). But the shift and grouping part lost me :(

Pietro Battiston

unread,
Sep 18, 2020, 7:12:06 AM9/18/20
to pyd...@googlegroups.com
On Fri, 18/09/2020 at 02.10 -0700, João Pedro wrote:
> But I got stuck on your loop. I understood the
> for i in range(1, int(np.ceil((df.duration / pd.Timedelta(freq)).max()))),
> but the inner part of the for loop is not so clear to me.
>
> I check the jobs whose fully_covered_hours is at least the evaluated
> duration i (df.fully_covered_hours / pd.Timedelta(freq) >= i).

Exactly. This will tell you the number of jobs that still "matter" for
a time slot happening i-1 hours after their start. Hence, it should
then be sufficient to count how often this happens.

For instance:

fully_covered = (df.fully_covered_hours / pd.Timedelta(freq) >= 3)

flags the processes that still fully cover hour H+2 - but these will
still refer (via "next_hour") to hour H. So this is the information you
need, just "off" by 2 hours. You can now produce (within the loop) a
column "actual_hour" obtained as df.next_hour + 2 hours (in general:
+ (i-1) hours).

If you now do

fully_covered.groupby(df['actual_hour']).sum()

you should get the correct assignment

hour -> number of fully covering processes

You can sum this information across i, and then add the total to the
sum of "partially covered hours".

Pietro

João Pedro

Sep 18, 2020, 10:02:24 AM
to PyData
Yes, this works very well! Unfortunately there is still a slow piece: the add bit. For each max_duration step (i) in this huge dataframe (3M lines) the add operation takes ~1 s. Overall it's pretty good: it does the total in 400 seconds, which is much less than before. In the end there will be filters to narrow the window being selected, so I believe it's good. Thanks to everyone participating!


The results of %time (one measurement per %time line in the loop below):

300000 rows:  Wall time: 965 µs | Wall time: 5.36 ms | Wall time: 2.65 ms | Wall time: 153 ms
3000000 rows: Wall time: 2.11 ms | Wall time: 58.1 ms | Wall time: 20.3 ms | Wall time: 1.36 s

actual_hours = None
max_duration = int(np.ceil((df.duration / pd.Timedelta(freq)).max()))
for i in range(1, max_duration + 1):  # i = 1 .. n, inclusive
    # jobs that still fully cover the i-th hour slot after their start
    fully_covered = (df["fully_covered_hours"] / pd.Timedelta(freq) >= i)
    %time df["fully_covered_count"] = fully_covered
    # that slot is (i-1) hours after the first full hour (next_hour)
    %time df["actual_hour"] = df.next_hour + pd.Timedelta(freq) * (i - 1)
    %time to_sum = df[fully_covered].groupby(["actual_hour", "job id"])["fully_covered_count"].sum()
    if actual_hours is None:
        actual_hours = to_sum
    else:
        %time actual_hours = actual_hours.add(to_sum, fill_value=0)

alex Skelton

Sep 24, 2020, 2:17:42 PM
to PyData
This should work pretty quickly, I think. Let me know how it goes for you. I used the setup from someone up above.

It takes a while just to create the dataframe, but once it's created this example of 3M rows processes in ~7 seconds. This will vary depending on the distribution of the job durations. When I changed the job durations to be evenly distributed between 10 minutes and 2 days, it took ~30 seconds. Not sure what your distribution looks like.


import datetime
import random
import pandas as pd

n = 3000000

start = [datetime.datetime(2020, 1, 1) + random.random() * datetime.timedelta(days=1) for _ in range(n)]
end = [x + random.randint(10, 500) * datetime.timedelta(minutes=1) for x in start]
data = {
    "job id": range(n),
    "start": start,
    "end": end,
    "ncpus": [random.randint(1, 4) for _ in range(n)]
}

df = pd.DataFrame(data)
df['start'] = pd.to_datetime(df.start)
df['end'] = pd.to_datetime(df.end)
df['start_hour'] = df['start'].dt.floor('H')

results = []
to_process = df.copy()
rows = len(to_process.index)

while rows > 0:
    print(rows)
    # each pass handles the first (remaining) hour bin of every job
    to_process['end_hour'] = to_process['start_hour'] + pd.DateOffset(hours=1)
    # overlap of the job with the current one-hour bin
    to_process['duration'] = to_process[['end', 'end_hour']].min(axis=1) - to_process[['start', 'start_hour']].max(axis=1)
    # cpu_hours as a float (hours * ncpus), matching the requested output
    to_process['cpu_hours'] = to_process['duration'].dt.total_seconds() / 3600 * to_process['ncpus']
    results.append(to_process[['start_hour', 'job id', 'duration', 'ncpus', 'cpu_hours']])
    # keep only the jobs that extend past this bin, then move on to the next hour
    to_process = to_process.loc[to_process['end'] > to_process['end_hour']].copy()
    to_process['start_hour'] = to_process['end_hour']
    rows = len(to_process.index)

result_df = pd.concat(results)

Thanks,
Alex
