sparse dwt?

78 views
Skip to first unread message

Nathaniel Smith

unread,
Oct 24, 2009, 11:26:10 PM10/24/09
to pywav...@googlegroups.com
I have cause to run wavedec() on very sparse vectors (e.g.,
1000-element vectors with 4 non-zero entries). This may seem a bit
silly, but really it makes sense in context of the larger algorithm...
anyway, right now the rest of my code is using sparse matrix codes,
just expanding to dense vectors for the wavelet transform, and the
wavelet transform ends up being about 70% of the total running time.
(And the total running time here is in hours, so I'm not just worrying
about a 1 second slowdown :-).)

It seems to me that the convolution underlying dwt should be very
possible to optimize for sparse vectors -- presumably represented in
CSC/CSR-style form, like (total_length, nonzero_idxs, nonzero_values).
Does implementing this in pywt seem plausible?

-- Nathaniel

Nathaniel Smith

unread,
Nov 2, 2009, 3:11:26 AM11/2/09
to pywav...@googlegroups.com
Let me revise my question.

I now have some cython code to implement the core sparse downsampling
convolution, and need to figure out what to do with it. Ideally I'd
port to C and contribute it back to pywt, with an efficient interface
for running the dwt on columns of CSC matrix (which might involve some
scipy.sparse or at least numpy entanglement in the pywt API). Is this
something that would be desired upstream?

-- Nathaniel

Nathaniel Smith

unread,
Nov 17, 2009, 5:57:28 PM11/17/09
to pywav...@googlegroups.com, fi...@wasilewski.it
Anyone...?

Nathaniel Smith

unread,
Dec 10, 2009, 6:16:26 PM12/10/09
to pywav...@googlegroups.com, fi...@wasilewski.it
Ping.

On Tue, Nov 17, 2009 at 2:57 PM, Nathaniel Smith

Nathaniel Smith

unread,
Mar 17, 2010, 6:43:00 PM3/17/10
to pywav...@googlegroups.com, fi...@wasilewski.it
Hello! If the group is waking up again, I'd still like to know about
this, actually :-). (Specifically, if pywt upstream would be
interested in this functionality and what constraints you'd have in
terms of implementation and API.)

On Thu, Dec 10, 2009 at 7:16 PM, Nathaniel Smith

Filip Wasilewski

unread,
Mar 19, 2010, 4:58:21 PM3/19/10
to PyWavelets
Hi Nathaniel,

thank you very much for your patience and determination!:)

On Mar 17, 11:43 pm, Nathaniel Smith <njsm...@cogsci.ucsd.edu> wrote:
> Hello! If the group is waking up again, I'd still like to know about
> this, actually :-). (Specifically, if pywt upstream would be
> interested in this functionality and what constraints you'd have in
> terms of implementation and API.)

I have never worked with sparse data sets but I think it could make a
great addition to the project. Can you elaborate a bit more on what
kind of data are you working with so I can get a better grasp of the
problem?

I haven't thought about this extensively, but I guess that for an
initial release it could be incorporated as a subpackage of
specialized routines. This way we won't get distracted by the
necessity of implementing more sophisticated features like signal
extension for border distortion handling (if it counts at all with
these datasets) and focus on establishing the core functionality.

There are no official constrains for the API and implementation as
long as it is easy to use and build and does not require lots of
external dependencies.

I also pay attention to maintaining a numeric compatibility with
Matlab Wavelet Toolbox (I tried several other packages and got
frustrated when each of them was giving a bit different results).

> >>> I now have some cython code to implement the core sparse downsampling
> >>> convolution, and need to figure out what to do with it. Ideally I'd
> >>> port to C and contribute it back to pywt, with an efficient interface
> >>> for running the dwt on columns of CSC matrix (which might involve some
> >>> scipy.sparse or at least numpy entanglement in the pywt API). Is this
> >>> something that would be desired upstream?

Cython implementation is fine, as well as using Numpy api directly (I
started this package when there were a couple numeric packages on the
market, but now I think I would choose the simplicity of Numpy and
Cython integration over the array interface).
I'm however not convinced about SciPy. It has sparse datatypes, but if
we decide to use it, it should be an optional dependency that enables
the sparse transforms feature. Otherwise it will be an overkill to
require it as a dependency for the whole project.

> >>>> I have cause to run wavedec() on very sparse vectors (e.g.,
> >>>> 1000-element vectors with 4 non-zero entries). This may seem a bit
> >>>> silly, but really it makes sense in context of the larger algorithm...
> >>>> anyway, right now the rest of my code is using sparse matrix codes,
> >>>> just expanding to dense vectors for the wavelet transform, and the
> >>>> wavelet transform ends up being about 70% of the total running time.
> >>>> (And the total running time here is in hours, so I'm not just worrying
> >>>> about a 1 second slowdown :-).)

Out of curiosity, have you considered running dwt only on these small
non-zero pieces and then combining them into a sparse output vector?

Btw. I'm going to migrate the code to either bitbucket or github
within a couple days so it should be easier to branch and work on new
features.

Thanks,
Filip Wasilewski
http://filipwasilewski.pl/

Nathaniel Smith

unread,
Mar 26, 2010, 2:15:24 AM3/26/10
to pywav...@googlegroups.com
On Fri, Mar 19, 2010 at 1:58 PM, Filip Wasilewski
<filipwa...@gmail.com> wrote:
> On Mar 17, 11:43 pm, Nathaniel Smith <njsm...@cogsci.ucsd.edu> wrote:
>> Hello! If the group is waking up again, I'd still like to know about
>> this, actually :-). (Specifically, if pywt upstream would be
>> interested in this functionality and what constraints you'd have in
>> terms of implementation and API.)
>
> I have never worked with sparse data sets but I think it could make a
> great addition to the project. Can you elaborate a bit more on what
> kind of data are you working with so I can get a better grasp of the
> problem?

I have a long timeseries recording, and I'm trying to least-squares
regress it against a set of sparse predictors (things like
"such-and-such event did/didn't happen 150 milliseconds ago", which
are usually zero). So it's a standard least squares setup:
Xb = Y
where X is a huge sparse matrix, and Y is my timeseries.

Now where wavelets come in is that the errors in Y are autocorrelated,
which will mess up the standard least squares statistical theory
(which assumes independent errors), but it turns out that the form of
the autocorrelation is such that a properly chosen wavelet
decomposition will de-correlate them. So I can create a transformed
problem:
WXb = WY
where W is the wavelet decomposition operator, which has the same
solution but is now properly behaved statistically.

So basically what I need is an efficient way to wavedec the columns of
X, which is large and sparse, and collect the results back into
another sparse matrix.

> I haven't thought about this extensively, but I guess that for an
> initial release it could be incorporated as a subpackage of
> specialized routines. This way we won't get distracted by the
> necessity of implementing more sophisticated features like signal
> extension for border distortion handling (if it counts at all with
> these datasets) and focus on establishing the core functionality.
>
> There are no official constrains for the API and implementation as
> long as it is easy to use and build and does not require lots of
> external dependencies.
>
> I also pay attention to maintaining a numeric compatibility with
> Matlab Wavelet Toolbox (I tried several other packages and got
> frustrated when each of them was giving a bit different results).

Of course the result of wavelet operations on a sparse data structure
should be numerically the same as on a dense one; it's just a
different way of storing the same numbers.

Border distortion handling does matter somewhat for my data sets,
though I can see how it might not be worth the hassle to implement all
the modes up front.

The one place where the API might be contentious is that right now,
wavedec takes a vector and returns a list of vectors; for full
efficiency I need a version that takes a sparse matrix and returns a
sparse matrix (acting column-wise, and then concatenating). I'm not
sure how generally useful that is -- it wouldn't be the most elegant
API I've seen -- but it's valuable to avoid all these big
re-allocations and copies.

>> >>> I now have some cython code to implement the core sparse downsampling
>> >>> convolution, and need to figure out what to do with it. Ideally I'd
>> >>> port to C and contribute it back to pywt, with an efficient interface
>> >>> for running the dwt on columns of CSC matrix (which might involve some
>> >>> scipy.sparse or at least numpy entanglement in the pywt API). Is this
>> >>> something that would be desired upstream?
>
> Cython implementation is fine, as well as using Numpy api directly (I
> started this package when there were a couple numeric packages on the
> market, but now I think I would choose the simplicity of Numpy and
> Cython integration over the array interface).
> I'm however not convinced about SciPy. It has sparse datatypes, but if
> we decide to use it, it should be an optional dependency that enables
> the sparse transforms feature. Otherwise it will be an overkill to
> require it as a dependency for the whole project.

I don't think avoiding a dependency on SciPy should be difficult --
there are a number of ways to avoid that. If you peek under the
covers, SciPy sparse matrices (in, say, CSC format) are just a wrapper
around a few Numpy arrays in a standard format. So we could even
support scipy.sparse without importing scipy -- though it could become
a bit convoluted if we wanted to, say, return such a matrix, or do
automatic conversion from one of the sparse types to a different one.
Or the conditional import dance ('try: import scipy.sparse except
ImportError: ...') is easy too.

>> >>>> I have cause to run wavedec() on very sparse vectors (e.g.,
>> >>>> 1000-element vectors with 4 non-zero entries). This may seem a bit
>> >>>> silly, but really it makes sense in context of the larger algorithm...
>> >>>> anyway, right now the rest of my code is using sparse matrix codes,
>> >>>> just expanding to dense vectors for the wavelet transform, and the
>> >>>> wavelet transform ends up being about 70% of the total running time.
>> >>>> (And the total running time here is in hours, so I'm not just worrying
>> >>>> about a 1 second slowdown :-).)
>
> Out of curiosity, have you considered running dwt only on these small
> non-zero pieces and then combining them into a sparse output vector?

Not really. As the decomposition progresses, you get overlaps, etc.;
the code would probably be more complicated than the Cython code I
have now, and certainly much slower :-).

> Btw. I'm going to migrate the code to either bitbucket or github
> within a couple days so it should be easier to branch and work on new
> features.

Excellent. Any progress?

-- Nathaniel

Reply all
Reply to author
Forward
0 new messages