Hi all,
Slides and notes from yesterday's talk by Kyunghyun Cho are online. Our next talk will be on Wednesday, November 18, at the usual time and place: 4pm, CEPSR 7LE4. Zhuo will be talking about a novel LSTM-based embedding technique that allows clustering methods to be used for source separation and speech enhancement. An abstract follows; please distribute it to anyone you think would be interested. See you then!
Deep clustering: Discriminative Embeddings for Segmentation and Separation
Zhuo Chen
4pm, Wednesday, November 18th
CEPSR 7LE4
We address the problem of "cocktail-party" source separation in a deep learning framework we call deep clustering. Previous deep network approaches to separation have shown promising performance in scenarios with a fixed number of sources, each belonging to a distinct signal class, such as speech and noise. For arbitrary numbers and classes of sources, however, such "class-based" methods are not suitable. Instead, we train a deep network to assign contrastive embedding vectors to each time-frequency region of the spectrogram, implicitly predicting the target spectrogram's segmentation labels from the input mixtures. This yields a deep-network analogue of spectral clustering, in that the embeddings form a low-rank approximation to an ideal pairwise affinity matrix, while running much faster. The objective is also equivalent to that of k-means, with the segmentation defining the cluster assignments of the embeddings. At test time, a clustering step "decodes" the segmentation implicit in the embeddings by optimizing with respect to the unknown assignments. Preliminary experiments on single-channel mixtures of speech from multiple speakers show that a speaker-independent model trained on two-speaker mixtures can improve signal quality for mixtures of held-out speakers by an average of 6 dB. More dramatically, the same model, although trained only on two-speaker mixtures, performs surprisingly well on three-speaker mixtures.
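
For those who want a concrete picture of the objective before the talk, here is a minimal NumPy sketch (my own illustration, not the speaker's code; the function name, variable names, and shapes are assumptions) of the affinity-based loss described above, |VV^T - YY^T|_F^2, expanded so the full N x N affinity matrices are never formed:

    import numpy as np

    def deep_clustering_loss(V, Y):
        """Sketch of the loss |VV^T - YY^T|_F^2 (hypothetical helper).

        V : (N, D) embeddings, one per time-frequency bin
            (N = time frames x frequency bins, D = embedding dim).
        Y : (N, C) one-hot source-assignment labels from the ideal
            segmentation of the training mixture.
        """
        # Expanding the Frobenius norm avoids forming any N x N matrix:
        # |VV^T - YY^T|_F^2 = |V^T V|_F^2 - 2|V^T Y|_F^2 + |Y^T Y|_F^2
        return (np.linalg.norm(V.T @ V, "fro") ** 2
                - 2 * np.linalg.norm(V.T @ Y, "fro") ** 2
                + np.linalg.norm(Y.T @ Y, "fro") ** 2)

    # At test time the segmentation is "decoded" by clustering the
    # embeddings, e.g. with k-means for an assumed number of sources:
    #   from sklearn.cluster import KMeans
    #   labels = KMeans(n_clusters=2).fit_predict(V)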