groupby_bins


Edward Byers

Sep 27, 2016, 3:56:34 AM
to xarray
I'm just trying out the Dataset.groupby_bins feature but am having a little trouble with the syntax. I'm also not entirely sure that it will do what I need, so, if you don't mind, I will outline what I am trying to do.

I have one gridded dataset (with only one DataArray) that is the impact variable of interest (e.g. temperature) and another similar dataset that will be used as a function of the impact (e.g. population).

  1. First I want to sort dataset 1 into bins, preserving the indices of this data
  2. Next, for each bin, I want to apply a function, which in this simple case is simply sum()
The result of this example is sum of population impacted by different temperature changes.

So my dataset (1, temp) looks like this:

qp
<xarray.Dataset>
Dimensions:       (lat: 360, lon: 720)
Coordinates:
    pcts          int32 90
  * lat           (lat) float32 89.75 89.25 88.75 88.25 87.75 87.25 86.75 ...
  * lon           (lon) float32 -179.75 -179.25 -178.75 -178.25 -177.75 ...
Data variables:
    EM            (lat, lon) float64 nan nan nan nan nan nan nan nan nan nan ...



So when I do groupby_bins()...:
bin_data = qp.groupby_bins('EM',[-100,-50,0,50,100],labels=['vgood','good','bad','vbad'])
bin_data.groups

my output is a huge long list of what I think may be indices, similar to the output of:
bin_data.group_indices
>>..
  51755,
  51757,
  51974,
  51981,
  51982,
  51986,
  ...]}

This output isn't the same as on other occasions when I have used groupby_bins, where I think I got a one-line statement showing the bin boundaries.

I am not sure if this is correct, and/or how to proceed...

MANY thanks



Edward

Sep 27, 2016, 5:36:12 AM
to xarray
Just to clarify my thinking on the second stage of this process.

For each bin of datapoints, I want to:
  1. get the indices (or a boolean mask) of all the points in that bin
  2. apply that to the population dataset so that I can sum the values relating only to those indices.
Thanks
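
The two steps above can be sketched for a single bin with a boolean mask. This is only an illustration under assumptions: the names temp and pop, the tiny 2x3 grid, and all values are made up, and both arrays are assumed to share the same lat/lon grid.

```python
import numpy as np
import xarray as xr

# Hypothetical toy grids standing in for the temperature and population data.
coords = {"lat": [10.0, 20.0], "lon": [0.0, 1.0, 2.0]}
temp = xr.DataArray(
    [[-75.0, -25.0, 25.0], [75.0, np.nan, -25.0]],
    dims=("lat", "lon"),
    coords=coords,
)
pop = xr.DataArray([[1, 2, 3], [4, 5, 6]], dims=("lat", "lon"), coords=coords)

# Step 1: boolean mask of the cells whose temperature lies in the bin (lo, hi].
# Comparisons with NaN are False, so NaN cells never end up in a bin.
lo, hi = -50, 0
in_bin = (temp > lo) & (temp <= hi)

# Step 2: keep population only where the mask is True, then sum.
# where() turns the other cells into NaN, which sum() skips by default.
pop_in_bin = pop.where(in_bin).sum().item()
print(pop_in_bin)
```

Repeating this for every pair of bin edges gives one total per bin, which is what groupby_bins does in a single call.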

Ryan Abernathey

Sep 27, 2016, 8:51:07 AM
to xar...@googlegroups.com
Edward,

You should not have to access the .group_indices attribute in order to perform the operation you describe.

What is the "population dataset"? In the Dataset example above, you only appear to have one data variable ("EM"). Assuming you had another variable called "population", you could do the following

qp_binned = qp.groupby_bins('EM',[-100,-50,0,50,100],labels=['vgood','good','bad','vbad']).sum()
qp_binned['population'].plot()

Hopefully this clarifies how groupby_bins was intended to be used.
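
As a rough, self-contained sketch of the call above: the Dataset below is invented for illustration (a 2x3 grid with made-up values), and it assumes that "population" sits in the same Dataset as "EM", on the same grid.

```python
import numpy as np
import xarray as xr

# Hypothetical Dataset holding both variables on one lat/lon grid.
qp = xr.Dataset(
    {
        "EM": (("lat", "lon"), [[-75.0, -25.0, 25.0], [75.0, np.nan, -25.0]]),
        "population": (("lat", "lon"), [[1, 2, 3], [4, 5, 6]]),
    },
    coords={"lat": [10.0, 20.0], "lon": [0.0, 1.0, 2.0]},
)

# Group every variable by the bin that "EM" falls into, then sum per bin.
# Cells where "EM" is NaN belong to no bin and are left out.
qp_binned = qp.groupby_bins(
    "EM", [-100, -50, 0, 50, 100], labels=["vgood", "good", "bad", "vbad"]
).sum()

print(qp_binned["population"].values)
```

The result has one entry per bin along a new dimension named after the grouped variable ("EM_bins" here).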

Cheers,
Ryan


--
You received this message because you are subscribed to the Google Groups "xarray" group.
To unsubscribe from this group and stop receiving emails from it, send an email to xarray+unsubscribe@googlegroups.com.
To post to this group, send email to xar...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/xarray/57ed4013-c720-45cc-82ea-f2ed5c34dab1%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Message has been deleted

Stephan Hoyer

Sep 28, 2016, 12:06:26 PM
to xarray
You can pass a DataArray as the group to groupby/groupby_bins.

So try something like:
pop.p2010.groupby_bins(temp.EnsembleMean, bins).sum()
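
A minimal sketch of this pattern, assuming two separate Datasets on the same grid. The variable names p2010 and EnsembleMean follow the thread, but the grid and all values are made up.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-ins for the two gridded datasets.
coords = {"lat": [10.0, 20.0], "lon": [0.0, 1.0, 2.0]}
temp = xr.Dataset(
    {"EnsembleMean": (("lat", "lon"), [[-75.0, -25.0, 25.0], [75.0, np.nan, -25.0]])},
    coords=coords,
)
pop = xr.Dataset(
    {"p2010": (("lat", "lon"), [[1, 2, 3], [4, 5, 6]])},
    coords=coords,
)

bins = [-100, -50, 0, 50, 100]

# The group is a DataArray from the *other* dataset: each population cell is
# assigned to the bin of the temperature at the same lat/lon, then summed.
pop_by_bin = pop.p2010.groupby_bins(temp.EnsembleMean, bins).sum()
print(pop_by_bin)
```

The bins are labelled by their intervals, (-100, -50], (-50, 0] and so on, closed on the right by default.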

On Wed, Sep 28, 2016 at 9:02 AM, Edward <eam...@gmail.com> wrote:
Thanks Ryan

I think I am almost there.

The 2nd population dataset is gridded population, as integers. The intention being: what is the sum of population experiencing temperature value changes between the bins of [-100,-50,0,50,100]?

I realise that in my original question I meant to say that we are summing the values in the population dataset, split by the bins... not summing the values in the bins.


So
  1. Bin the temperature changes in qp.
  2. Get the indices for each bin, and select the relevant values in the population dataset, and sum.

What I ended up with is:

import numpy as np

# Temp dataset:
bins = [-100, -50, 0, 50, 100]
tempb = temp.groupby_bins('EnsembleMean', bins)

# Population dataset:
pv = pop.p2010.values

# Array of zeros, one entry per bin (there are len(bins) - 1 bins)
pvsum = np.zeros(len(bins) - 1)

# Now loop through the bins in tempb and get the indices
for binn in range(len(bins) - 1):
    tempbi = tempb.group_indices[binn]
    # Grab the values from pv using the indices tempbi and sum
    pvsum[binn] = np.nansum(pv.take(tempbi))







Edward

Sep 28, 2016, 12:10:24 PM
to xarray

Thanks Ryan

I think I am pretty much there.

The population dataset is gridded population, as integers. The intention being: what is the sum of population experiencing temperature value changes between the bins of [-100,-50,0,50,100]?

I realise that in my original question I meant to say that we are summing the values in the population dataset, split by the bins... not summing the values in the bins of the temp dataset.


So
  1. Bin the temperature changes in temp.
  2. Get the indices for each bin, and select the relevant values in the population dataset (pv), and sum.

What I ended up with is:

import numpy as np

# Temp dataset:
bins = [-100, -50, 0, 50, 100]
tempb = temp.groupby_bins('EM', bins)

# Population dataset:
pv = pop.p2010.values

# Array of zeros, one entry per bin (there are len(bins) - 1 bins)
pvsum = np.zeros(len(bins) - 1)

# Now loop through the bins in tempb and get the indices
for binn in range(len(bins) - 1):
    tempbi = tempb.group_indices[binn]
    # Grab the values from pv using the indices tempbi and sum
    pvsum[binn] = np.nansum(pv.take(tempbi))



This gives the array pvsum, with the total population affected by each of the temperature changes specified in the bins.

I'm sure there are more efficient ways to do what I did above, probably directly through xarray, but I didn't quite get there!

xarray was pretty much my entry into Python, so I am still learning some elementary methods in pandas and numpy

Thanks again


Edward

Sep 28, 2016, 12:12:10 PM
to xarray
Thanks Stephan

That also looks like it could work! Much more elegant!

Edward

Sep 28, 2016, 12:25:39 PM
to xarray
[Image: "amazing" meme]


On Wednesday, 28 September 2016 18:06:26 UTC+2, Stephan Hoyer wrote:

Edward

Sep 28, 2016, 7:04:11 PM
to xarray

I think I'm running into cases where bins are missing from the result, as opposed to, say, being filled with 0 or NaN.

Is there any way to easily identify which bins were skipped or, preferably, to have them represented with 0s?
Thanks
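
One possible workaround, sketched here under assumptions rather than taken from the thread: count the cells per bin with pandas to see which bins are empty, then reindex the per-bin sums onto the full set of intervals. The grid, the values and the helper names (counts, by_bin, filled) are all made up for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical toy data: no temperature falls in (-50, 0], so that bin is empty.
coords = {"lat": [10.0, 20.0], "lon": [0.0, 1.0, 2.0]}
temp = xr.DataArray(
    [[-75.0, 25.0, 25.0], [75.0, np.nan, 75.0]],
    dims=("lat", "lon"),
    coords=coords,
    name="EnsembleMean",
)
pop = xr.DataArray(
    [[1, 2, 3], [4, 5, 6]], dims=("lat", "lon"), coords=coords, name="p2010"
)
bins = [-100, -50, 0, 50, 100]

# Which bins are empty? pandas keeps every bin in the count, including zeros.
counts = pd.cut(temp.values.ravel(), bins).value_counts()
empty_bins = counts[counts == 0].index.tolist()

# Per-bin population sums from xarray; an empty bin may be absent here.
sums = pop.groupby_bins(temp, bins).sum()
bin_dim = sums.dims[0]  # the new "<group name>_bins" dimension
by_bin = pd.Series(sums.values, index=pd.Index(sums[bin_dim].values))

# Put the sums on the full list of intervals and fill the gaps with 0.
all_bins = pd.IntervalIndex.from_breaks(bins)  # right-closed, like the default bins
filled = by_bin.reindex(all_bins).fillna(0)
print(empty_bins)
print(filled)
```

Whether empty bins are dropped or kept as NaN may depend on the xarray version, which is why the sketch both reindexes and fills.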





On Wednesday, 28 September 2016 18:06:26 UTC+2, Stephan Hoyer wrote:

Ryan Abernathey

Sep 28, 2016, 8:10:49 PM
to xar...@googlegroups.com
Edward,

We are eager to help, but more information is needed regarding the details of your problem.

The easiest way for us to help with your question would be for you to come up with a "Minimum Working Example" that we can run on our own computers. Even better, you could file this example, along with a detailed explanation of the problem, as a GitHub issue.

Without a minimum working example, it is an inefficient use of time to guess what might be going on.

Best,
Ryan




Edward

Sep 29, 2016, 3:11:50 AM
to xarray
I wasn't sure if it was really an issue or if I am doing it wrong.

But I submitted an issue that is hopefully reproducible!
Thanks
