Running mean/median in Julia


Tomas Mikoviny

unread,
Jan 6, 2015, 11:37:15 AM
to julia...@googlegroups.com
Hi, 
I was just wondering if anyone knows if there is a package that implements fast running median/mean algorithms?

Thanks a lot...
 

Kevin Squire

Jan 6, 2015, 12:18:02 PM
to julia...@googlegroups.com
Hi Tomas,
I'm not aware of any (though they might exist). It might help if you gave a little more context--what kind of data are you working with?

Cheers,
   Kevin

Tamas Papp

Jan 6, 2015, 12:35:06 PM
to julia...@googlegroups.com
Hi Tomas,

I don't know of any packages, but a running mean is trivial to
implement. See e.g.
http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm
-- it doesn't get faster than this, and it is reasonably accurate.
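A minimal sketch of that online update, written in current Julia syntax (my code, not taken verbatim from the linked page):

```julia
# Streaming (cumulative) mean: O(1) memory, one pass.
# Note this is the expanding mean of everything seen so far; a fixed-width
# moving mean would additionally subtract the element leaving the window.
mutable struct RunningMean
    n::Int
    mean::Float64
end
RunningMean() = RunningMean(0, 0.0)

function update!(r::RunningMean, x::Real)
    r.n += 1
    r.mean += (x - r.mean) / r.n   # mean_new = mean_old + (x - mean_old)/n
    return r.mean
end
```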

Medians (and quantiles in general) are trickier, because you have to
keep track of the whole distribution or sacrifice accuracy. See

@inproceedings{greenwald2001space,
  title = {Space-efficient online computation of quantile summaries},
  author = {Greenwald, Michael and Khanna, Sanjeev},
  booktitle = {ACM SIGMOD Record},
  volume = {30},
  number = {2},
  pages = {58--66},
  year = {2001},
  organization = {ACM}
}

You might want to tell us more about your problem and then you could get
more specific advice. E.g. if the number of observations is much larger
than the number of distinct values, you could simply count them in a Dict.
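A rough sketch of that Dict-counting idea (the names and details are mine, not from the thread):

```julia
# Median when observations >> distinct values: count each value, then walk
# the sorted distinct values until half the observations are covered.
# Returns an element of x (no interpolation for even lengths), which is
# usually acceptable for this counting use case.
function counted_median(x::AbstractVector)
    counts = Dict{eltype(x),Int}()
    for v in x
        counts[v] = get(counts, v, 0) + 1
    end
    half = (length(x) + 1) / 2
    cum = 0
    for k in sort!(collect(keys(counts)))
        cum += counts[k]
        cum >= half && return k
    end
end
```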

Best,

Tamas


Tim Holy

Jan 6, 2015, 12:37:24 PM
to julia...@googlegroups.com
For running mean, cumsum gives you an easy approach, if you don't mind a
little floating-point error.
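For instance, such a cumsum-based moving mean might look like this (my sketch, not Tim's code):

```julia
# Moving mean of window w in O(n): with a leading zero in the cumulative
# sum, sum(x[i:i+w-1]) == c[i+w] - c[i].
function runmean(x::AbstractVector, w::Int)
    c = vcat(zero(eltype(x)), cumsum(x))
    (c[w+1:end] .- c[1:end-w]) ./ w
end
```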

--Tim


Tomas Mikoviny

Jan 6, 2015, 2:33:19 PM
to julia...@googlegroups.com
Hi Kevin, 

generally I'm trying to do baseline correction on mass spectra with ~300000 bins. I've tried several algorithms to evaluate the baseline, but the ones that work best use a running median and mean. I've just got the mean sorted out via the cumsum trick, which coincides with Tim's suggestion (I found some MATLAB discussion on that), although I'll check Tamas' suggestion too.

I'm stuck on a running median with reasonable performance, since the computer has to crunch the running median of a 200 x 300000 array within a couple of seconds (at most) to manage online analysis.

Tomas Mikoviny

Jan 6, 2015, 3:20:25 PM
to julia...@googlegroups.com
Indeed a very efficient approach... you've nailed it again, Tim.

thank you

Tomas Mikoviny

Jan 6, 2015, 3:32:12 PM
to julia...@googlegroups.com
Tamas, thanks for the quantile reference - I'm just immersed in that paper.

As I mentioned in my reply to Kevin, I have rather bulky mass spec data and performance is crucial. I've read about the approach for a limited number of distinct values, but unfortunately it is not very applicable in my case. Let's see what I learn from the Greenwald paper.

Tomas

Tamas Papp

Jan 6, 2015, 3:38:37 PM
to julia...@googlegroups.com
Hi Tomas,

A less sophisticated approach is to just bin the values using some
regular grid and count them, and then use the midpoint of the median bin
as an approximation.

Then

abs((median bin midpoint) - (true median)) <= (median bin width)/2

which may be good enough for your application.
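In code, that binned approximation could be sketched as follows (my names; a fixed, known value range is assumed):

```julia
# Approximate median on a regular grid: count values into nbins bins over
# [lo, hi], then return the midpoint of the bin holding the middle
# observation. The error is at most half a bin width.
function binned_median(x::AbstractVector, lo::Real, hi::Real, nbins::Int)
    counts = zeros(Int, nbins)
    width = (hi - lo) / nbins
    for v in x
        b = clamp(floor(Int, (v - lo) / width) + 1, 1, nbins)
        counts[b] += 1
    end
    half = (length(x) + 1) / 2
    cum = 0
    for b in 1:nbins
        cum += counts[b]
        cum >= half && return lo + (b - 0.5) * width
    end
end
```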

Best,

Tamas


Steven G. Johnson

Jan 6, 2015, 3:52:16 PM
to julia...@googlegroups.com


On Tuesday, January 6, 2015 12:37:24 PM UTC-5, Tim Holy wrote:
> For running mean, cumsum gives you an easy approach, if you don't mind a
> little floating-point error.

Yikes, I just noticed that cumsum is significantly less accurate than sum; basically, cumsum is no better than naive summation, whereas the intention was to get pairwise-summation accuracy. This is fixed in #9650.

Stefan Karpinski

Jan 6, 2015, 4:10:39 PM
to Julia Users
You may not need online methods at all. Sorting the rows of a 200 x 300000 matrix doesn't take very long on my laptop:

julia> X = randn(200,300000);

julia> @time X = sortrows(X);
elapsed time: 0.297998739 seconds (480053384 bytes allocated)


Tomas Mikoviny

Jan 6, 2015, 6:50:38 PM
to julia...@googlegroups.com
Hi Stefan, 

you are right, but the moment I try to do this for every subset along the array it gets inefficient.
Here is the simplest code for a running median of a one-dimensional array. Time consuming...

function runmed(input::Array, w::Int)
    L = length(input)
    output = zeros(input)
    subset = zeros(w+1)
    for i in 1:L-w
        subset = sub(input, i:i+w)
        output[i] = median!(subset)
    end
    return output
end

a = zeros(300000);

runmed(a, 200);

@time runmed(a, 200);
elapsed time: 1.460171594 seconds (43174976 bytes allocated)

Stefan Karpinski

Jan 7, 2015, 1:49:43 PM
to Julia Users
Try deleting the line where you initially assign to subset – it isn't necessary and causes that variable to change type over the course of your loop.
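With that line removed (and, on current Julia, `view`/`copyto!` in place of the old `sub`), the loop might read like this -- a sketch, not the thread's original 0.3 code:

```julia
using Statistics  # median! lives here in current Julia

# Running median with a reusable scratch buffer: median! sorts its argument
# in place, so copy each window into `buf` instead of mutating the input.
function runmed(input::AbstractVector, w::Int)
    L = length(input)
    output = zeros(L)
    buf = similar(input, w + 1)
    for i in 1:L-w
        copyto!(buf, view(input, i:i+w))
        output[i] = median!(buf)
    end
    return output
end
```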

Tomas Mikoviny

Jan 9, 2015, 7:33:02 PM
to julia...@googlegroups.com
yes, that was negligence on my part -- a type change is the first thing I usually check for...
however, it has no significant impact on performance...

I got caught up with some other things these last few days, but I'm continuing to play with online methods.

tm

Stefan Karpinski

Jan 9, 2015, 8:38:06 PM
to Julia Users
I assume you're on 0.3.x – this may simply be due to SubArrays kind of sucking for performance. The new ArrayView stuff is much better.