AI and audio discussion


Sarah Plovnick

Nov 17, 2021, 2:00:50 PM
to Community Announcements

Hi all,

I'm a PhD candidate in ethnomusicology at UC Berkeley, and I have a research question I'm hoping you can help with. My research focuses on sound communication between Taiwan and China, and especially on how sound gets around censorship mechanisms. I'm trying to understand whether there are any technological reasons why audio communication might be more difficult to censor than visual communication. Since AI is central to many censorship tools, I am especially interested in the unique challenges of using sound data with AI.

My understanding thus far is that there were certain developments around 2010 that prompted the use of GPUs for AI and led to huge breakthroughs in AI applications in industry. My question is whether the switch to GPUs also led to a greater focus on visual data, because the physical architecture of GPUs lends itself better to visual than to audio data. Thoughts on this topic? Is there a visual bias in AI research? If so, is this bias technological, or cultural? What are some of the unique challenges of using AI technologies with sound data?

Looking forward to hearing your candid feedback, and thanks to Justin Salamon for pointing me toward this listserv.

Best,

Sarah

Eric Nichols

Nov 17, 2021, 2:21:41 PM
to Sarah Plovnick, Community Announcements
2 quick ideas to prompt discussion:

1) One natural way to represent audio data is the spectrogram, which of course is a 2D "picture" of audio over time. Some of the same ML techniques used in image analysis (Convolutional Neural Nets, etc.) can actually be applied to spectrograms, so progress in vision understanding helps audio understanding as well. For instance, see the results of the DCASE audio understanding competition; I think you'll find several CNN-type systems there. (A small code sketch of this spectrogram-as-image idea appears after point 2 below.)

2) When multiple audio events take place at once, they can interfere with each other. In image processing you might have one object move in front of another and completely hide the object behind it. In audio, however, two sounds might overlap -- the visual analogy would be a translucent object passing in front of another one: a blend of both objects is visible, which is confusing to process. This problem of overlapping sound events is perhaps more difficult and more specific to audio processing than to vision (multiple microphones and signal processing can help separate out some audio events, but that's a more sophisticated setup, perhaps analogous to having binocular images to analyze instead of single images). A second sketch below illustrates this additive overlap.
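To make point 1 concrete, here is a minimal sketch, assuming librosa and PyTorch are installed; the file path, mel settings, and 10-class output layer are placeholder choices for illustration, not any particular DCASE system. It treats the mel spectrogram as a one-channel image and passes it through an image-style 2D CNN:

```python
import librosa
import numpy as np
import torch
import torch.nn as nn

# Load audio (the path is a placeholder) and compute a mel spectrogram:
# a 2D array with frequency on one axis and time on the other.
y, sr = librosa.load("example.wav", sr=22050, mono=True)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
mel_db = librosa.power_to_db(mel, ref=np.max)               # shape: (64, n_frames)

# Treat the spectrogram exactly like a one-channel grayscale image.
x = torch.tensor(mel_db, dtype=torch.float32)[None, None]   # (batch, channel, freq, time)

# A toy image-style 2D CNN, applied unchanged to the audio "picture".
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 10),          # e.g. 10 hypothetical sound-event classes
)
logits = model(x)
print(logits.shape)             # torch.Size([1, 10])
```

The point is only that nothing in the network "knows" it is looking at audio: the spectrogram goes in exactly as a grayscale photo would.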
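And a tiny illustration of point 2 (the 440 Hz and 660 Hz tones are arbitrary choices for the example): simultaneous sources sum rather than occlude, so both remain present in the mixture and a recognizer has to disentangle them.

```python
import numpy as np

sr = 16000
t = np.arange(sr) / sr                    # one second of samples

# Two simultaneous "events": a 440 Hz tone and a 660 Hz tone.
s1 = 0.5 * np.sin(2 * np.pi * 440 * t)
s2 = 0.5 * np.sin(2 * np.pi * 660 * t)

# Unlike a foreground object hiding a background one, overlapping sounds
# simply add: the mixture contains energy from both sources at once.
mixture = s1 + s2

# Both tones are still present in the spectrum of the mixture.
spectrum = np.abs(np.fft.rfft(mixture))
freqs = np.fft.rfftfreq(len(mixture), d=1 / sr)
top_two = np.sort(freqs[np.argsort(spectrum)[-2:]])
print(top_two)                            # [440. 660.]
```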

Best,
Eric



Polina Proutskova

Nov 18, 2021, 8:06:54 PM
to Sarah Plovnick, Community Announcements
Hi Sarah and all,

GPUs enabled parallel processing and, with that, working with much larger amounts of data than was previously possible. At this scale, the ‘old’ technique of neural networks improved hugely in performance. I don’t think there is a bias towards visual processing; rather, it was the first area where a lot of progress was made. Now you see other areas catching up, for example language processing. Music is one of numerous signal processing applications that deal with time series. The most significant difference between visual processing and music is that relationships in pictures are mostly local, whereas in a musical piece non-local temporal relationships are essential. Visual approaches have been applied to music because spectrograms are the most common representation of musical signals, and they are in essence pictures. At this year’s ISMIR conference, though, large language models were shown to be more successful than the visual approaches for many musical tasks.
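To illustrate that local-versus-non-local contrast in code (a sketch with made-up sizes, not a model from the conference): a small 2D convolution only ever combines neighbouring time-frequency bins, whereas a self-attention layer of the kind used in language-model-style architectures lets every time frame interact directly with every other frame.

```python
import torch
import torch.nn as nn

# A fake batch of spectrogram frames: 1 clip, 500 time frames, 64 mel bands.
frames = torch.randn(1, 500, 64)

# Image-style view: a 3x3 convolution only combines neighbouring
# time-frequency bins, so each output depends on a small local patch.
conv = nn.Conv2d(1, 8, kernel_size=3, padding=1)
local_features = conv(frames.unsqueeze(1))        # (1, 8, 500, 64)

# Sequence-style view: self-attention lets frame 10 interact directly with
# frame 400 -- relationships are non-local in time.
attention = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
global_features, attention_weights = attention(frames, frames, frames)

print(local_features.shape)       # torch.Size([1, 8, 500, 64])
print(attention_weights.shape)    # torch.Size([1, 500, 500]): every frame attends to every frame
```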

In terms of unique challenges, one thing we have noticed in MIR is that what works for speech does not work for singing. For example, models that separate speakers cannot separate choir singers; speech-to-text models fail to transcribe sung lyrics; and so on. Apart from the obvious facts that in singing pitch is more varied and the vowels are longer, we don’t quite know why this is the case. One reason for the disparity is that, whereas there is a huge amount of speech data, there is very little singing data, and almost no labeled singing available for research.

All the best,
Polina


Jonathan Reus

Nov 20, 2021, 11:00:30 AM
to Eric Nichols, Sarah Plovnick, Community Announcements
This is such an interesting question. Thank you Sarah. :-)


On Wed, Nov 17, 2021 at 7:21 PM Eric Nichols <epni...@gmail.com> wrote:
> 2 quick ideas to prompt discussion:
>
> 1) One natural way to represent audio data is the spectrogram, which of course is a 2D "picture" of audio over time. Some of the same ML techniques used in image analysis (Convolutional Neural Nets, etc) can actually be applied to spectrograms. So progress in vision understanding helps audio understanding as well. For instance, see the results of the DCASE audio understanding competition; I think you'll find several CNN-type systems there.

That's a good point. And if we think even more generally, there's no real reason why CNNs have to be thought of as visual processors, other than that they are so often used to recognize patterns in data with high 2-dimensional correlation (e.g. images). At the most general level, CNNs can be thought of as a parameter-sharing technique that reduces the number of training parameters by exploiting the fact that there are important higher-level spatial relationships in the data. But this space could be 1-dimensional (I believe WaveNet does this?) ... or N-dimensional ...
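For example, here is a quick sketch of the 1-dimensional case (sizes are arbitrary, and this shows only the dilated-convolution idea associated with WaveNet-style models, not the actual WaveNet architecture): the same small kernel is shared along time over the raw waveform, and increasing dilation widens the receptive field without adding parameters.

```python
import torch
import torch.nn as nn

# One second of raw waveform at 16 kHz: (batch, channels, samples).
waveform = torch.randn(1, 1, 16000)

# A stack of dilated 1D convolutions: the same small kernel is reused at every
# time step, and growing dilation widens the receptive field exponentially
# without adding parameters.
channels = 16
layers = [nn.Conv1d(1, channels, kernel_size=2, dilation=1)]
for dilation in (2, 4, 8, 16):
    layers += [nn.ReLU(), nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)]
net = nn.Sequential(*layers)

out = net(waveform)
print(out.shape)   # torch.Size([1, 16, 15969]): each output sample "sees" 32 input samples
```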

However, I think the bigger cultural/social factor here is that the paradigm of CNNs + GPUs + lots of data was shown to be so incredibly good at image recognition in the early 2010s that it has since carried a certain aura of the state of the art, one that continues to be explored well beyond image processing. That is paired with the reality (both historical and present) that image/vision-related computing gets the lion's share of R&D funding compared to auditory computing.

Historically, CNNs (like the perceptron) were inspired by theories of the visual cortex in mammals, so there is certainly also a bit of structural bias baked in towards emphasizing the visual mode, but this bias is also what makes them so good at what they do in terms of creating hierarchical representations. This leads me to wonder whether there are any examples of ANN architectures inspired by auditory neural structures rather than visual ones?

xx
Jon






--
................................
Jonathan Reus
