1719 views

Skip to first unread message

Oct 13, 2009, 11:22:55 AM10/13/09

to

I'm looking for code that will calculate the running median of a

sequence, efficiently. (I'm trying to subtract the running median from

a signal to correct for gradual drift).

sequence, efficiently. (I'm trying to subtract the running median from

a signal to correct for gradual drift).

My naive attempt (taking the median of a sliding window) is

unfortunately too slow as my sliding windows are quite large (~1k) and

so are my sequences (~50k). On my PC it takes about 18 seconds per

sequence. 17 of those seconds is spent in sorting the sliding windows.

I've googled around and it looks like there are some recent journal

articles on it, but no code. Any suggestions?

Thanks

Janto

Oct 13, 2009, 12:12:23 PM10/13/09

to

Janto Dreijer wrote:

If you aren't using numpy, try that. Showing your code might also be a good

idea...

Peter

Oct 13, 2009, 12:33:19 PM10/13/09

to

Janto Dreijer <jan...@gmail.com> writes:

> I'm looking for code that will calculate the running median of a

> sequence, efficiently. (I'm trying to subtract the running median from

> a signal to correct for gradual drift).

> I'm looking for code that will calculate the running median of a

> sequence, efficiently. (I'm trying to subtract the running median from

> a signal to correct for gradual drift).

Is that a known and valid technique of correcting for drift? Maybe

what you really want is a high-pass filter, to remove very low

frequency components (drift) from the signal. You can do that with

standard digital filters. Using something depending on ordering

within the samples (i.e. the median) seems weird to me, but I'm no

expert.

The obvious way to compute a running median involves a tree structure

so you can quickly insert and delete elements, and find the median.

That would be asymtotically O(n log n) but messy to implement.

Oct 13, 2009, 1:27:22 PM10/13/09

to Janto Dreijer, pytho...@python.org

one, add one" in the sorted list. And if that isn't fast enough,

consider some form of tree.

DaveA

Oct 13, 2009, 1:37:24 PM10/13/09

to pytho...@python.org

Oct 13, 2009, 2:29:30 PM10/13/09

to

On Oct 13, 8:22 am, Janto Dreijer <jan...@gmail.com> wrote:

> I'm looking for code that will calculate the running median of a

> sequence, efficiently. (I'm trying to subtract the running median from

> a signal to correct for gradual drift).

> ...> I'm looking for code that will calculate the running median of a

> sequence, efficiently. (I'm trying to subtract the running median from

> a signal to correct for gradual drift).

> Any suggestions?

For a reference try:

Comparison of algorithms for standard median filtering

Juhola, M. Katajainen, J. Raita, T.

Signal Processing, IEEE Transactions on

Jan. 1991, Volume: 39 , Issue: 1, page(s): 204 - 208

Abstract

An algorithm I have used comes from:

On computation of the running median

Astola, J.T. Campbell, T.G.

Acoustics, Speech and Signal Processing, IEEE Transactions on

Apr 1989, Volume: 37, Issue: 4, page(s): 572-574

This is a dual heap approach. No code though. There are some obvious

(yeah, right) errors in their pseudo-code.

The point of the dual heap algorithm (loosely put) is to reduce the

computation to slide the window 1 element to be proportional to 2

bubble sorts of log window size instead of a window size sort.

Good luck.

Dale B. Dalrymple

Oct 13, 2009, 2:57:33 PM10/13/09

to

On 13 Okt, 18:33, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:

> The obvious way to compute a running median involves a tree structure

> so you can quickly insert and delete elements, and find the median.

> That would be asymtotically O(n log n) but messy to implement.

QuickSelect will find the median in O(log n) time.

Oct 13, 2009, 3:59:47 PM10/13/09

to

sturlamolden <sturla...@yahoo.no> writes:

> > The obvious way to compute a running median involves a tree structure

> > so you can quickly insert and delete elements, and find the median.

> > That would be asymtotically O(n log n) but messy to implement.

>

> QuickSelect will find the median in O(log n) time.

> > The obvious way to compute a running median involves a tree structure

> > so you can quickly insert and delete elements, and find the median.

> > That would be asymtotically O(n log n) but messy to implement.

>

> QuickSelect will find the median in O(log n) time.

That makes no sense, you can't even examine the input in O(log n) time.

Anyway, the problem isn't to find the median of n elements. It's to

find n different medians of n different sets. You might be able to

get close to linear time though, depending on how the data acts.

Oct 13, 2009, 5:15:44 PM10/13/09

to

On Oct 13, 9:59 pm, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:

> sturlamolden <sturlamol...@yahoo.no> writes:

> > > The obvious way to compute a running median involves a tree structure

> > > so you can quickly insert and delete elements, and find the median.

> > > That would be asymtotically O(n log n) but messy to implement.

>

> > QuickSelect will find the median in O(log n) time.

>

> That makes no sense, you can't even examine the input in O(log n) time.

> sturlamolden <sturlamol...@yahoo.no> writes:

> > > The obvious way to compute a running median involves a tree structure

> > > so you can quickly insert and delete elements, and find the median.

> > > That would be asymtotically O(n log n) but messy to implement.

>

> > QuickSelect will find the median in O(log n) time.

>

> That makes no sense, you can't even examine the input in O(log n) time.

True. I think he means O(n).

I've tried wrapping the lesser-known nth_element function from the C++

STL (http://www.cppreference.com/wiki/stl/algorithm/nth_element).

Unfortunately it seems the conversion to std::vector<double> is quite

slow...I'll have a look again. Will probably have to rewrite my whole

median_filter function in C++ to avoid unnecessary conversions.

> Anyway, the problem isn't to find the median of n elements. It's to

> find n different medians of n different sets. You might be able to

> get close to linear time though, depending on how the data acts.

Yes, that's what I've been hoping for. If I want it reeeeaaally fast

I'll have to figure out how to do that.

Thanks

Janto

Oct 13, 2009, 5:23:31 PM10/13/09

to

Thanks. I'll have a look.

I found a PDF by Soumya D. Mohanty entitled "Efficient Algorithm for

computing a Running Median" (2003) by Googling. It has code snippets

at the end, but it's not going to be a simple cut-and-paste job. It

will take some work figuring out the missing parts.

Regards

Janto

Oct 13, 2009, 5:55:35 PM10/13/09

to

On Oct 13, 6:33 pm, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:

> Janto Dreijer <jan...@gmail.com> writes:

> > I'm looking for code that will calculate the running median of a

> > sequence, efficiently. (I'm trying to subtract the running median from

> > a signal to correct for gradual drift).

>

> Is that a known and valid technique of correcting for drift? Maybe

> what you really want is a high-pass filter, to remove very low

> frequency components (drift) from the signal. You can do that with

> standard digital filters. Using something depending on ordering

> within the samples (i.e. the median) seems weird to me, but I'm no

> expert.

> Janto Dreijer <jan...@gmail.com> writes:

> > I'm looking for code that will calculate the running median of a

> > sequence, efficiently. (I'm trying to subtract the running median from

> > a signal to correct for gradual drift).

>

> Is that a known and valid technique of correcting for drift? Maybe

> what you really want is a high-pass filter, to remove very low

> frequency components (drift) from the signal. You can do that with

> standard digital filters. Using something depending on ordering

> within the samples (i.e. the median) seems weird to me, but I'm no

> expert.

Well, I don't have a lot of theoretical reasoning behind wanting to

use a median filter. I have some experience applying it to images with

very good results, so that was the first thing I tried. It felt right

at the time and the results looked good. Also, most of the elements in

the sequence are supposed to be near zero, so the median is supposed

to give the amount of drift.

That said, you're probably right. I should have first tried to apply a

HPF.

> The obvious way to compute a running median involves a tree structure

> so you can quickly insert and delete elements, and find the median.

> That would be asymtotically O(n log n) but messy to implement.

Messy. Yes. I tried and failed :P

Thanks

Janto

Oct 13, 2009, 6:03:49 PM10/13/09

to

On Tue, 13 Oct 2009 12:59:47 -0700, Paul Rubin wrote:

> sturlamolden <sturla...@yahoo.no> writes:

>> > The obvious way to compute a running median involves a tree structure

>> > so you can quickly insert and delete elements, and find the median.

>> > That would be asymtotically O(n log n) but messy to implement.

>>

>> QuickSelect will find the median in O(log n) time.

>

> That makes no sense, you can't even examine the input in O(log n) time.

If that were a relevant argument, no algorithms would ever be described

as running in O(log n).

Obviously to run in O(log n) you must have already built the tree. If you

include the time required to build the tree, then O(n) is the lower-bound

(because you have to see each element to put it in the tree) but that

applies to any data structure. The point is, once you have that tree, you

can perform repeated calculations in O(log n) time (or so it has been

claimed).

For example, if you want the m running medians of m data points, you

might do the following:

find the median of element 1

find the median of element 1 and 2

find the median of element 1 through 3

find the median of element 1 through 4

...

find the median of element 1 through m

Each median calculation may take O(log n), where n varies from 1 to m,

giving a total order of:

log 1 + log 2 + log 3 + ... + log m

=> O(log m!)

steps, which is much better than:

1 + 2 + 3 + ... + m = 1/2*m*(m+1)

=> O(m**2)

> Anyway, the problem isn't to find the median of n elements. It's to

> find n different medians of n different sets. You might be able to get

> close to linear time though, depending on how the data acts.

But since the data sets aren't independent, a tree structure seems like

the way to go to me. Inserting and deleting elements should be fast, and

assuming the claim of calculating the median in O(log n) is correct,

that's quite fast too.

--

Steven

Oct 13, 2009, 6:04:56 PM10/13/09

to

Very nice! I assume you mean I can use it to quickly insert items into

the sliding window?

Thanks

Janto

Oct 13, 2009, 6:13:22 PM10/13/09

to

Janto Dreijer <jan...@gmail.com> writes:

> Well, I don't have a lot of theoretical reasoning behind wanting to

> use a median filter. I have some experience applying it to images with

> very good results, so that was the first thing I tried. It felt right

> at the time and the results looked good.

> Well, I don't have a lot of theoretical reasoning behind wanting to

> use a median filter. I have some experience applying it to images with

> very good results, so that was the first thing I tried. It felt right

> at the time and the results looked good.

If this is image data, which typically would have fairly low

resolution per pixel (say 8-12 bits), then maybe you could just use

bins to count how many times each value occurs in a window. That

would let you find the median of a window fairly quickly, and then

update it with each new sample by remembering the number of samples

above and below the current median, etc.

Oct 13, 2009, 6:22:44 PM10/13/09

to

Thanks, unfortunately it's not image data. It's electrocardiogram

sequences. So it's sequences of floats.

Janto

Oct 13, 2009, 7:20:06 PM10/13/09

to

On 14 Okt, 00:03, Steven D'Aprano

<ste...@REMOVE.THIS.cybersource.com.au> wrote:

<ste...@REMOVE.THIS.cybersource.com.au> wrote:

> Obviously to run in O(log n) you must have already built the tree.

You don't need a tree. Quickselect is a partial quicksort.

But my memory served me badly, quickselect is O(n).

Oct 13, 2009, 7:40:35 PM10/13/09

to

Oct 13, 2009, 8:30:53 PM10/13/09

to

I placed the test code and its output here:

http://bitbucket.org/janto/snippets/src/tip/running_median.py

I also have a version that uses numpy. On random data it seems to be

about twice as fast as the pure python one. It spends half the time

sorting and the other half converting the windows from python lists to

numpy arrays.

If the data is already sorted, the version using python's builtin sort

outperforms the numpy convert-and-sort by about 5 times. Strangely

satisfying :)

I'm hoping to find a trick to do better than scipy.signal.medfilt.

Looking at its code (PyArray_OrderFilterND in

http://svn.scipy.org/svn/scipy/trunk/scipy/signal/sigtoolsmodule.c)

there may be hope. It looks like they're sorting the entire window

instead of doing a QuickSelect (which I could do with STL's

nth_element).

Janto

Oct 14, 2009, 2:48:51 AM10/14/09

to pytho...@python.org

have to work through all the values.

- Hendrik

Oct 14, 2009, 10:53:00 AM10/14/09

to

Janto Dreijer wrote:

> On Oct 13, 6:12 pm, Peter Otten <__pete...@web.de> wrote:

>> Janto Dreijer wrote:

>> > I'm looking for code that will calculate the running median of a

>> > sequence, efficiently. (I'm trying to subtract the running median from

>> > a signal to correct for gradual drift).

>>

>> > My naive attempt (taking the median of a sliding window) is

>> > unfortunately too slow as my sliding windows are quite large (~1k) and

>> > so are my sequences (~50k). On my PC it takes about 18 seconds per

>> > sequence. 17 of those seconds is spent in sorting the sliding windows.

>>

>> > I've googled around and it looks like there are some recent journal

>> > articles on it, but no code. Any suggestions?

>>

>> If you aren't using numpy, try that. Showing your code might also be a

>> good idea...

>>

>> Peter

>

> I placed the test code and its output here:

> http://bitbucket.org/janto/snippets/src/tip/running_median.py

That gives me something to tinker ;)

> I also have a version that uses numpy. On random data it seems to be

> about twice as fast as the pure python one. It spends half the time

> sorting and the other half converting the windows from python lists to

> numpy arrays.

> If the data is already sorted, the version using python's builtin sort

> outperforms the numpy convert-and-sort by about 5 times. Strangely

> satisfying :)

I was thinking of using as many of numpy's bulk operations as possible:

def running_median_numpy(seq):

data = array(seq, dtype=float)

result = []

for i in xrange(1, window_size):

window = data[:i]

result.append(median(window))

for i in xrange(len(data)-window_size+1):

window = data[i:i+window_size]

result.append(median(window))

return result

But it didn't help as much as I had hoped.

The fastest I came up with tries hard to keep the data sorted:

def running_median_insort(seq):

seq = iter(seq)

d = deque()

s = []

result = []

for item in islice(seq, window_size):

d.append(item)

insort(s, item)

result.append(s[len(d)//2])

m = window_size // 2

for item in seq:

old = d.popleft()

d.append(item)

del s[bisect_left(s, old)]

insort(s, item)

result.append(s[m])

return result

Some numbers:

10.197 seconds for running_median_scipy_medfilt

25.043 seconds for running_median_python

13.040 seconds for running_median_python_msort

14.280 seconds for running_median_python_scipy_median

4.024 seconds for running_median_numpy

0.221 seconds for running_median_insort

What would be an acceptable performance, by the way?

Peter

PS: code not tested for correctness

Oct 14, 2009, 5:54:10 PM10/14/09

to

On Oct 14, 4:53 pm, Peter Otten <__pete...@web.de> wrote:

> Some numbers:

>

> 10.197 seconds for running_median_scipy_medfilt

> 25.043 seconds for running_median_python

> 13.040 seconds for running_median_python_msort

> 14.280 seconds for running_median_python_scipy_median

> 4.024 seconds for running_median_numpy

> 0.221 seconds for running_median_insort

>

> What would be an acceptable performance, by the way?

>

> Some numbers:

>

> 10.197 seconds for running_median_scipy_medfilt

> 25.043 seconds for running_median_python

> 13.040 seconds for running_median_python_msort

> 14.280 seconds for running_median_python_scipy_median

> 4.024 seconds for running_median_numpy

> 0.221 seconds for running_median_insort

>

> What would be an acceptable performance, by the way?

>

That's great!

Well, the faster it works, the better. It means I can process more

data before getting frustrated. So if you have a faster version I'd

like to see it :)

Thankyou!

Janto

Oct 14, 2009, 6:53:24 PM10/14/09

to Python

I'm afraid I can't help any further. Going from your post, I thought a

quicker list implementation might be useful, but beyond that I have no

knowledge to share.

Who said ignorance is bliss? *hangs head*

~Ethan~

Oct 15, 2009, 6:41:40 AM10/15/09

to

[Janto Dreijer]

> I found a PDF by Soumya D. Mohanty entitled "Efficient Algorithm for

> computing a Running Median" (2003) by Googling. It has code snippets

> at the end, but it's not going to be a simple cut-and-paste job. It

> will take some work figuring out the missing parts.

> I found a PDF by Soumya D. Mohanty entitled "Efficient Algorithm for

> computing a Running Median" (2003) by Googling. It has code snippets

> at the end, but it's not going to be a simple cut-and-paste job. It

> will take some work figuring out the missing parts.

See http://code.activestate.com/recipes/576930/ for working Python

code

adapted from that paper.

Raymond

Oct 15, 2009, 3:33:21 PM10/15/09

to

On Oct 15, 12:41 pm, Raymond Hettinger <pyt...@rcn.com> wrote:

> [Janto Dreijer]

>

> > I found a PDF by Soumya D. Mohanty entitled "Efficient Algorithm for

> > computing a Running Median" (2003) by Googling. It has code snippets

> > at the end, but it's not going to be a simple cut-and-paste job. It

> > will take some work figuring out the missing parts.

>

> Seehttp://code.activestate.com/recipes/576930/for working Python> [Janto Dreijer]

>

> > I found a PDF by Soumya D. Mohanty entitled "Efficient Algorithm for

> > computing a Running Median" (2003) by Googling. It has code snippets

> > at the end, but it's not going to be a simple cut-and-paste job. It

> > will take some work figuring out the missing parts.

>

> code

> adapted from that paper.

>

> Raymond

Wow! Thanks. I'll have a look.

Janto

Oct 19, 2009, 10:38:39 AM10/19/09

to

Folks,

approximate medians -- would you settle for 49 - 51 % ? --

open up new possibilities, and there's quite a lot of work on that,

for huuuge datasets.

approximate medians -- would you settle for 49 - 51 % ? --

open up new possibilities, and there's quite a lot of work on that,

for huuuge datasets.

A class of problems:

from a data stream X1 X2 ... you want, every so often,

a histogram / quantile summary / distribution estimator such that

H approximates the distribution of X, so

median of H(t) ~ median of some of X.

Some parameters of this class:

NH, size of H: 100 => H(p) ~ pth percentile of X

A, how often you want H

window, drop or age old data

runtime, memory of course

accuracy -- in theory, in practice

A histogram viewer in which you can change buckets, colors, zoom,

in REALTIME sounds tantalizing -- fast loop >> fast algorithm + slow

loop.

Anyone know of one off-the -shelf, python or anything else ?

Refs:

Manku et al., Approximate medians in one pass with limited memory,

1998, 10p

under http://infolab.stanford.edu/~manku/papers.html

nice tree pictures

they optimize mem (NH * Nbuf) not runtime, and don't window

Suri +, Quantiles on Streams, 2006, 5p, http://www.cs.ucsb.edu/~suri/psdir/ency.pdf

~ 20 refs zzz

cheers

-- denis

Oct 20, 2009, 5:28:21 AM10/20/09

to

Yet another link: http://en.wikipedia.org/wiki/Median_filter

--> Perreault + Hebert, Median Filtering in Constant Time,

nice 6-page paper + 400 lines well-documented C code:

http://nomis80.org/ctmf.html

(for 2d images, dropping to 1d must be possible :)

--> Perreault + Hebert, Median Filtering in Constant Time,

nice 6-page paper + 400 lines well-documented C code:

http://nomis80.org/ctmf.html

(for 2d images, dropping to 1d must be possible :)

They're also in opencv, which I haven't tried --

http://www.intel.com/technology/computing/opencv

http://code.google.com/p/ctypes-opencv

http://code.google.com/p/opencv-cython

cheers

-- denis

Oct 26, 2009, 11:28:05 AM10/26/09

to

Based on Perreault + Hebert, here's python code for a rather different

algorithm:

- for quantized e.g. 8-bit data

- runs faster for wider windows / slowly changing medians

- no heaps, no trees: the only import is numpy, and even that's not

essential

algorithm:

- for quantized e.g. 8-bit data

- runs faster for wider windows / slowly changing medians

- no heaps, no trees: the only import is numpy, and even that's not

essential

http://stackoverflow.com/questions/1309263/rolling-median-algorithm-in-c

cheers

-- denis

Reply all

Reply to author

Forward

0 new messages

Search

Clear search

Close search

Google apps

Main menu