Faster sigma clipping routine

323 views
Skip to first unread message

Srikrishna Sekhar

unread,
Oct 1, 2015, 2:47:32 PM10/1/15
to astropy-dev
Hi,

I have a sigma clipping routine that runs ~ 50x faster than the one in astropy.stats (best case) and about 10% faster (worst case) from what I've tested it on so far.

My routine as of now doesn't used masked arrays, it throws away the outliers and only returns the "good" values. However I know that astropy.stats.sigma_clip uses masked arrays, and when I put masked arrays in my routine it looses the huge 50x margin and drops down to consistently running 10%-30% faster. I'm still trying to figure out why this is.

Is there any interest in me filing a pull request to change the exist sigma clipping routine? I'm happy posting the code here for review, comments and modification if required!

I feel like we should have an option to not use masked arrays for the potential speed benefits. The masked array issue can be gotten around in the following way I think:
Have a flag in the function (like domask=True) and only then used the masked version of the code, else throw away the outliers at every iteration. In most of my work I never want to keep the location of the outliers, although I can see that there are cases where that would be useful.

Thanks,
Srikrishna

michaels...@gmail.com

unread,
Oct 2, 2015, 12:27:33 AM10/2/15
to astropy-dev
Hey Srikrishna,

just out of interest if you don't use masked arrays how do you handle 2D or 2D inputs with an axis argument? But it sounds fascinating, I was struggling a few times with astropy.stats.sigma_clip long runtimes before.

Greetings,
Michael

Sri Krishna

unread,
Oct 2, 2015, 2:18:17 AM10/2/15
to astro...@googlegroups.com
Hi,

My routine is a lot less versatile than sigma_clip! I'm not sure if the speed improvements will stay if I add in all the functionality, that's something I will have to look into. Alternatively, I was thinking of writing the core sigma clipping portion (the for/while loop) in cython, and calling the cython function from a thin python wrapper that handles the masks/sigma values/axes etc. Also the routine seems to be helped by some caching issue, because if I use a really large input (I suspect larger than the cache in CPU) then my routine runs only 10% or so faster than sigma_clip, but that could only be a local issue depending on cache size.

But a plain 2D input (with no axis argument) will just be flattened before sigma clipping.

The code is posted below (the commented lines will make use of masked arrays):

import numpy as np

def sigmaclip(dat, nsigma = 3.0, niter=30):   
   # Get the standard deviation and mean   
   std   = np.sqrt(np.var(dat, ddof=1))   
   mean  = np.mean(dat)   
   
   nn = 0   
   ulim     = mean + nsigma * std   
   llim     = mean - nsigma * std   
   newdat   = dat[np.logical_and((llim < dat),(dat < ulim))]    
   #newdat   = np.where(np.logical_and((llim < dat),(dat < ulim)), dat, 0)   
   
   while nn < niter:   
      dat   = newdat   
      std   = np.sqrt(np.var(dat, ddof=1))   
      mean  = np.mean(dat)   
   
      ulim = mean + nsigma * std   
      llim = mean - nsigma * std   
   
      nn = nn + 1   
      newdat = dat[np.logical_and((llim < dat),(dat < ulim))]    
      #newdat = np.where(np.logical_and((llim < dat),(dat < ulim)), dat, 0)   
   
      if np.array_equal(dat, newdat):   
         break  # No more outliers   
   
   #dat = np.ma.masked_values(dat, 0, copy=False)   
   return dat   




Srikrishna Sekhar


--
You received this message because you are subscribed to the Google Groups "astropy-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to astropy-dev...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Steve Crawford

unread,
Oct 2, 2015, 7:25:02 AM10/2/15
to astro...@googlegroups.com
Hi Srikrishna

Thank you for the contribution.   I think adding the option to not use masks arrays in order to gain substantial increases in speed would be something that could be useful.  We do not want to limit the flexibility of the current sigma_clip as masked arrays is a functionality already being used in other programs, but substantial speed gains would be worthwhile to include as a potential option and it does seem like a problem that others have already faced.   So I think it would be worthwhile for you to submit a PR with how the current sigma_clip could be changed to support faster performance and we can discuss it further on the PR and possible include some consistent benchmarking for future testing.

I will note there is a PR that should be merged soon that does improve the memory performance so that would be a good place to start.  Good luck and looking forward to further discussion of the PR. 

Cheers
Steve


Larry Bradley

unread,
Oct 2, 2015, 9:27:56 AM10/2/15
to astropy-dev, craw...@saao.ac.za
Hi Srikrishna,

As mentioned in astropy's `sigma_clip` docs, Scipy has a basic (no-frills) sigma clipping function (`scipy.stats.sigmaclip`).  Have you benchmarked Scipy's version against your code?

Thanks,
Larry

Srikrishna Sekhar

unread,
Oct 2, 2015, 12:24:38 PM10/2/15
to astro...@googlegroups.com
Hi,

My routine seems to match scipy.stats.sigmaclip, within a few percent either way on my benchmarks. Looking at the scip.stats.sigmaclip code, I'm using almost the exact same algorithm which explains the performance.

Right now I'm trying to add in astropy.stats.sigma_clip's functionality and see what happens to the performance. If there is still an improvement on the astropy function, I will submit a pull request. This way we won't have to pull in the scipy dependency for this function.


Srikrishna Sekhar

Erik Bray

unread,
Oct 2, 2015, 12:47:08 PM10/2/15
to astro...@googlegroups.com
On Fri, Oct 2, 2015 at 12:24 PM, Srikrishna Sekhar <kri...@gmail.com> wrote:
> Hi,
>
> My routine seems to match scipy.stats.sigmaclip, within a few percent either
> way on my benchmarks. Looking at the scip.stats.sigmaclip code, I'm using
> almost the exact same algorithm which explains the performance.
>
> Right now I'm trying to add in astropy.stats.sigma_clip's functionality and
> see what happens to the performance. If there is still an improvement on the
> astropy function, I will submit a pull request. This way we won't have to
> pull in the scipy dependency for this function.

To be honest I don't think we should be reinventing the wheel (where
there is no obvious benefit).

I've mentioned multiple times in the recent past (and this comment
isn't specifically directed at you Srikrishna) that we should make
scipy a hard requirement. The previous reason for not doing so was
installation difficulty. But that is much less a problem now than it
was 5 years ago, both due to improvements in packaging and wider use
of scientific Python distributions.

So if there's no tangible performance benefit to your sigma_clip
function over the one in scipy, I don't think we should be adding
additional code that needs to be maintained to Astropy. Instead, if
there are improvements that can be made to the scipy implementation I
would suggest making them directly in scipy.

That said, don't hesitate to make a PR and we can discuss further there.

Thanks,
Erik

Srikrishna Sekhar

unread,
Oct 2, 2015, 1:21:17 PM10/2/15
to astro...@googlegroups.com
I agree, if making scipy a hard dependency is not a problem we should totally go for that code. All that will require is some refactoring to use the scipy function when masked arrays aren't required.

Meanwhile I'll make a PR either later today or tomorrow and we can work out the rest of the details there.

Srikrishna Sekhar

unread,
Oct 11, 2015, 12:54:59 PM10/11/15
to astro...@googlegroups.com
Reply all
Reply to author
Forward
0 new messages