Running mean/median in Julia


Tomas Mikoviny

unread,
Jan 6, 2015, 11:37:15 AM
to julia...@googlegroups.com
Hi, 
I was just wondering if anyone knows if there is a package that implements fast running median/mean algorithms?

Thanks a lot...
 

Kevin Squire

Jan 6, 2015, 12:18:02 PM
to julia...@googlegroups.com
Hi Tomas,
I'm not aware of any (though they might exist). It might help if you gave a little more context--what kind of data are you working with?

Cheers,
   Kevin

Tamas Papp

Jan 6, 2015, 12:35:06 PM
to julia...@googlegroups.com
Hi Tomas,

I don't know of any packages, but a running mean is trivial to
implement. See e.g.
http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm
-- it doesn't get faster than this, and it is reasonably accurate.
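A minimal sketch of that online update, written in current Julia syntax (my code, not taken verbatim from the linked page):

```julia
# Streaming (cumulative) mean: O(1) memory, one pass.
# Note this is the expanding mean of everything seen so far; a fixed-width
# moving mean would additionally subtract the element leaving the window.
mutable struct RunningMean
    n::Int
    mean::Float64
end
RunningMean() = RunningMean(0, 0.0)

function update!(r::RunningMean, x::Real)
    r.n += 1
    r.mean += (x - r.mean) / r.n   # mean_new = mean_old + (x - mean_old)/n
    return r.mean
end
```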

Medians (and quantiles in general) are trickier, because you have to
keep track of the whole distribution or sacrifice accuracy. See

@inproceedings{greenwald2001space,
  title = {Space-efficient online computation of quantile summaries},
  author = {Greenwald, Michael and Khanna, Sanjeev},
  booktitle = {ACM SIGMOD Record},
  volume = {30},
  number = {2},
  pages = {58--66},
  year = {2001},
  organization = {ACM}
}

You might want to tell us more about your problem and then you could get
more specific advice. E.g. if the number of observations is much larger
than the number of distinct values, you could simply count them in a Dict.
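A rough sketch of that Dict-counting idea (the names and details are mine, not from the thread):

```julia
# Median when observations >> distinct values: count each value, then walk
# the sorted distinct values until half the observations are covered.
# Returns an element of x (no interpolation for even lengths), which is
# usually acceptable for this counting use case.
function counted_median(x::AbstractVector)
    counts = Dict{eltype(x),Int}()
    for v in x
        counts[v] = get(counts, v, 0) + 1
    end
    half = (length(x) + 1) / 2
    cum = 0
    for k in sort!(collect(keys(counts)))
        cum += counts[k]
        cum >= half && return k
    end
end
```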

Best,

Tamas


Tim Holy

Jan 6, 2015, 12:37:24 PM
to julia...@googlegroups.com
For running mean, cumsum gives you an easy approach, if you don't mind a
little floating-point error.
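For instance, such a cumsum-based moving mean might look like this (my sketch, not Tim's code):

```julia
# Moving mean of window w in O(n): with a leading zero in the cumulative
# sum, sum(x[i:i+w-1]) == c[i+w] - c[i].
function runmean(x::AbstractVector, w::Int)
    c = vcat(zero(eltype(x)), cumsum(x))
    (c[w+1:end] .- c[1:end-w]) ./ w
end
```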

--Tim


Tomas Mikoviny

Jan 6, 2015, 2:33:19 PM
to julia...@googlegroups.com
Hi Kevin, 

generally I'm trying to do baseline correction on mass spectra with ~300000 bins. I've tried several algorithms to evaluate the baseline, but the ones that work best use a running median and mean. I've just got the mean sorted out via the cumsum trick, which coincides with Tim's suggestion (I found some MATLAB discussion on that), although I'll check Tamas' suggestion too.

I'm stuck on a running median with reasonable performance, since the computer has to crunch the running median of a 200 x 300000 array within a couple of seconds (at most) to manage online analysis.

Tomas Mikoviny

Jan 6, 2015, 3:20:25 PM
to julia...@googlegroups.com
Indeed a very efficient approach... you've nailed it again, Tim.

thank you

Tomas Mikoviny

Jan 6, 2015, 3:32:12 PM
to julia...@googlegroups.com
Tamas, thanks for the quantile reference - I'm just immersed in that paper.

As I mentioned in my reply to Kevin, I have rather bulky mass spec data and performance is crucial. I've read about the approach for a limited number of distinct values, but unfortunately it is not very applicable in my case. Let's see what I learn from the Greenwald paper.

Tomas

Tamas Papp

Jan 6, 2015, 3:38:37 PM
to julia...@googlegroups.com
Hi Tomas,

A less sophisticated approach is to just bin the values using some
regular grid and count them, and then use the midpoint of the median bin
as an approximation.

Then

abs((median bin midpoint) - (true median)) <= (median bin width)/2

which may be good enough for your application.
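In code, that binned approximation could be sketched as follows (my names; a fixed, known value range is assumed):

```julia
# Approximate median on a regular grid: count values into nbins bins over
# [lo, hi], then return the midpoint of the bin holding the middle
# observation. The error is at most half a bin width.
function binned_median(x::AbstractVector, lo::Real, hi::Real, nbins::Int)
    counts = zeros(Int, nbins)
    width = (hi - lo) / nbins
    for v in x
        b = clamp(floor(Int, (v - lo) / width) + 1, 1, nbins)
        counts[b] += 1
    end
    half = (length(x) + 1) / 2
    cum = 0
    for b in 1:nbins
        cum += counts[b]
        cum >= half && return lo + (b - 0.5) * width
    end
end
```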

Best,

Tamas


Steven G. Johnson

Jan 6, 2015, 3:52:16 PM
to julia...@googlegroups.com


On Tuesday, January 6, 2015 12:37:24 PM UTC-5, Tim Holy wrote:
> For running mean, cumsum gives you an easy approach, if you don't mind a
> little floating-point error.

Yikes, I just noticed that cumsum is significantly less accurate than sum; basically, cumsum is no better than naive summation, whereas the intention was to get pairwise-summation accuracy. This is fixed in #9650.

Stefan Karpinski

Jan 6, 2015, 4:10:39 PM
to Julia Users
You may not need online methods at all. Sorting the rows of a 200 x 300000 matrix doesn't take very long on my laptop:

julia> X = randn(200,300000);

julia> @time X = sortrows(X);
elapsed time: 0.297998739 seconds (480053384 bytes allocated)


Tomas Mikoviny

Jan 6, 2015, 6:50:38 PM
to julia...@googlegroups.com
Hi Stefan, 

you are right, but the moment I try to do this for every subset along the array it gets inefficient.
Here is the simplest code for a running median of a one-dimensional array. Time consuming...

function runmed(input::Array, w::Int)
    L = length(input)
    output = zeros(input)
    subset = zeros(w+1)
    for i in 1:L-w
        subset = sub(input, i:i+w)
        output[i] = median!(subset)
    end
    return output
end

a = zeros(300000);

runmed(a, 200);

@time runmed(a, 200);
elapsed time: 1.460171594 seconds (43174976 bytes allocated)

Stefan Karpinski

Jan 7, 2015, 1:49:43 PM
to Julia Users
Try deleting the line where you initially assign to subset – it isn't necessary and causes that variable to change type over the course of your loop.
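With that line removed (and, on current Julia, `view`/`copyto!` in place of the old `sub`), the loop might read like this -- a sketch, not the thread's original 0.3 code:

```julia
using Statistics  # median! lives here in current Julia

# Running median with a reusable scratch buffer: median! sorts its argument
# in place, so copy each window into `buf` instead of mutating the input.
function runmed(input::AbstractVector, w::Int)
    L = length(input)
    output = zeros(L)
    buf = similar(input, w + 1)
    for i in 1:L-w
        copyto!(buf, view(input, i:i+w))
        output[i] = median!(buf)
    end
    return output
end
```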

Tomas Mikoviny

Jan 9, 2015, 7:33:02 PM
to julia...@googlegroups.com
yes, that was negligence on my part -- a type change is the first thing I usually check for...
however, it has no significant impact on performance...

I got caught up with some other things these last few days, but I'm continuing to play with online methods.

tm

Stefan Karpinski

Jan 9, 2015, 8:38:06 PM
to Julia Users
I assume you're on 0.3.x – this may simply be due to SubArrays kind of sucking for performance. The new ArrayView stuff is much better.