Non-music audio and robustness


cybernytrix

Jun 4, 2011, 3:05:25 PM
to Acoustid
I'm running a few experiments on the applicability of fingerprints
to non-music audio: how robust they are when the audio is
transcoded, resampled, or captured via a microphone. My test data
is as follows:

1. I recorded some programs in MythTV - these are television programs
like sports, animal planet, news etc.
2. This was cut into 1min clips and transcoded to .mp4 with AAC for
audio
3. These clips were submitted to a local acoustid-server for
fingerprinting and indexing
4. Different versions of each clip were then created. For example,
some were cut down to 5-second clips and transcoded again to .mp3.
Other 5-second clips were played back and recorded through a
microphone to .wav.
5. These test clips were then sent to the server for
identification.
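As a rough illustration of steps 2 and 4, the clip extraction and transcoding can be driven by constructing ffmpeg command lines. The flags below are an assumption about how such clips might be produced, not a record of the commands actually used:

```python
# Sketch of building an ffmpeg invocation that cuts a short clip and
# re-encodes the audio as MP3 (the flags are assumed, not the poster's).

def clip_to_mp3(src, dst, start=0, duration=5):
    """Return an ffmpeg argv that cuts a short audio clip to MP3."""
    return [
        "ffmpeg", "-i", src,
        "-ss", str(start),     # seek to the clip start (seconds)
        "-t", str(duration),   # keep only `duration` seconds
        "-vn",                 # drop the video stream, keep audio only
        "-c:a", "libmp3lame",  # re-encode the audio stream as MP3
        dst,
    ]

cmd = clip_to_mp3("reference.mp4", "test_clip.mp3")
print(" ".join(cmd))
```

The same shape works for the .wav re-capture variant by swapping the codec arguments.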

Since I'm still in the process of getting the server to correctly
index the clips (step #3), I'd like to know what the odds are that
this will work. Is the Chromaprint fingerprint robust enough for
this purpose? Can it handle non-music audio as I'm trying to use
it?

Thanks for any thoughts, suggestions or other advice!

Lukáš Lalinský

Jun 4, 2011, 3:25:14 PM
to acou...@googlegroups.com
On Sat, Jun 4, 2011 at 9:05 PM, cybernytrix <cyber...@gmail.com> wrote:
> Since I'm still in the process of getting the server to correctly
> index the clips (step #3), I'd like to know what the odds are that
> this will work. Is the Chromaprint fingerprint robust enough for
> this purpose? Can it handle non-music audio as I'm trying to use
> it?

The main idea will not work, by design. Even though the
fingerprints can be used for matching audio excerpts, the Acoustid
server application is not designed to do that and will not be able
to find such fingerprints. You would have to write a different
application.

Another thing is that the fingerprint algorithm is optimized for
musical content. It might work on speech to some degree, but I
haven't tested that, and I don't believe the results will be very
good because of speech's limited frequency range.

Additionally, the algorithm is not designed to handle the kind of
additional noise you get when recording something using a phone/laptop
microphone. The training set used for the filter selection included
only transcoded samples, not samples with any external noise.

Lukas

cybernytrix

Jun 4, 2011, 4:47:48 PM
to Acoustid
Any ideas on how I can solve these problems easily? :) Is there
any other open source software that can handle this? How difficult
would it be to modify Chromaprint to consider additional audio
features so that it works for speech? For example, is it easy to
include the number of zero crossings as a feature? Would it help?

Thanks for any help and pointers!

On Jun 4, 12:25 pm, Lukáš Lalinský <lalin...@gmail.com> wrote:

Lukáš Lalinský

Jun 5, 2011, 5:37:26 AM
to acou...@googlegroups.com
2011/6/4 cybernytrix <cyber...@gmail.com>:

> Any ideas on how I can solve these problems easily? :) Is there
> any other open source software that can handle this? How difficult
> would it be to modify Chromaprint to consider additional audio
> features so that it works for speech? For example, is it easy to
> include the number of zero crossings as a feature? Would it help?

To be honest, I don't know. :) I knew next to nothing about this
kind of audio analysis before I started working on this project,
and I still can't claim that I know that much. I have a list of
open source audio fingerprinting projects here, but none of them
specializes in speech matching:

https://github.com/lalinsky/acoustid-index/wiki/Links

Chromaprint is designed more as a framework than a specific
fingerprint implementation. If you read the source code, you will
notice that there are a couple of "standalone" modules that can be
configured and connected together to extract different audio features.
Especially if you go through the history, you will find several
configurations that I evaluated in the past. So yes, I believe that
implementing a different fingerprint algorithm using the Chromaprint
source code should be fairly easy, but I personally don't know
much about speech audio features.

Lukas

cybernytrix

Jun 25, 2011, 4:22:07 AM
to Acoustid
Hi Lukas,
Thanks for your reply. I was off collecting some data so that I
could run some tests on Acoustid. I have collected a bunch of
reference audio samples (1 min each) and then created a noisy
version of each by recording the first 15 sec via a mic.

I also modified ACOUSTID_MAX_BIT_ERROR to 4 in acoustid_compare.c
but did not get anything better. What are some of the other
possible application changes you mentioned that need to be made?
Can I trade off some precision for better recall?
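The threshold experiment above can be illustrated with a toy sketch, in Python rather than the actual C code, of what a per-item bit-error limit controls. The fingerprint items here are invented 8-bit values (real Chromaprint items are 32-bit): raising the limit lets more corrupted items count as matches, which is exactly a precision-for-recall trade.

```python
# Toy model of a per-item bit-error threshold on fingerprint items.

def popcount(x):
    """Number of set bits in x."""
    return bin(x).count("1")

def matching_items(fp_a, fp_b, max_bit_error):
    """Count aligned items whose XOR differs by at most max_bit_error bits."""
    return sum(1 for a, b in zip(fp_a, fp_b)
               if popcount(a ^ b) <= max_bit_error)

reference = [0b10110010, 0b01101100, 0b11110000]
noisy     = [0b10110011, 0b01001100, 0b11100101]  # 1, 1 and 3 bits flipped

print(matching_items(reference, noisy, 2))  # strict threshold -> 2 items
print(matching_items(reference, noisy, 4))  # looser threshold -> 3 items
```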
I am also playing around with the OpenFP and Echonest code. OpenFP
is pretty impressive, but Echonest is really subpar.
I am hoping that with some simple changes, Acoustid will beat
OpenFP as well. Please let me know which files I need to take a
look at.

Thanks!
Ashwin

On Jun 5, 2:37 am, Lukáš Lalinský <lalin...@gmail.com> wrote:
> 2011/6/4 cybernytrix <cybernyt...@gmail.com>:

Lukáš Lalinský

Jun 25, 2011, 4:51:56 AM
to acou...@googlegroups.com
2011/6/25 cybernytrix <cyber...@gmail.com>:

> Hi Lukas,
> Thanks for your reply. I was off to collect some data so that I can
> run some tests on Acoustid. I have collected a bunch of reference
> audio samples (1min each) and then I created a noisy version of the
> same - recorded first 15sec via mic.
>
> I also modified ACOUSTID_MAX_BIT_ERROR to 4 in acoustid_compare.c
> but did not get anything better. What are some of the other
> possible application changes you mentioned that need to be made?
> Can I trade off some precision for better recall?

I'm afraid there is no simple change you can make to improve this.
The algorithm was designed with the intention of identifying
unmodified files with minimal hardware resources. At each point in
time, it looks at almost two seconds of audio data, and it looks
at the whole frequency spectrum, not just peaks. That makes it
very hard for the algorithm to identify noisy audio, but very
efficient at identifying unmodified audio, because most
fingerprint items are unique.
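The trade-off described above can be sketched in a few lines of Python (the item values are invented for illustration): exact lookup of fingerprint items is fast, but a single flipped bit makes the index miss, while a tolerant Hamming scan still recovers the match at much higher cost.

```python
# Made-up 32-bit items mapped to track ids; the real Acoustid index is
# far larger, but the failure mode under noise is the same.
index = {0x9A3F10C2: "track-1", 0x517BBE04: "track-1", 0x22D90F81: "track-2"}

clean_item = 0x517BBE04
noisy_item = clean_item ^ 0x2      # a single bit flipped by noise

print(index.get(clean_item))       # exact lookup hits: track-1
print(index.get(noisy_item))       # exact lookup misses: None

def hamming(a, b):
    return bin(a ^ b).count("1")

# A (much slower) tolerant scan still recovers the match:
best = min(index, key=lambda item: hamming(item, noisy_item))
print(index[best])                 # track-1
```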

> I am also playing around with the OpenFP and Echonest code.
> OpenFP is pretty impressive, but Echonest is really subpar.

I'm surprised, because the Echoprint algorithm is modeled after Shazam
and it was designed specifically for this situation. I haven't
actually read the OpenFP code, so I don't know what they are using.

> I am hoping that with some simple changes, Acoustid will beat
> OpenFP as well. Please let me know which files I need to take a
> look at.

See above, I'm afraid it won't be that easy.

Lukas

cybernytrix

Jun 25, 2011, 2:16:46 PM
to Acoustid
My brief summary of OpenFP and Echonest:

OpenFP:
1. The server side is very basic. It uses k-means to cluster; when
a query is made, it searches for points within the closest
centroid. The problem is that online updates are not implemented,
so if you need to add a new fingerprint, you have to restart and
re-index everything!
2. On the FP side, the spectrum is divided up into Bark bands, and
low-pass IIR filtering is done within the bands. There is also a
flag (--no-loudness) that controls whether loudness is taken into
account. Using loudness seems to give good results, though I have
not tested it thoroughly.
3. Fingerprints are compared using Hamming distance. It also uses
MFCCs (mel-frequency cepstrum); I'm not sure what is happening
there.
4. Matching is a hack: the client sends the server the location of
the .afp file to match, and the server then reads the file. The
fingerprint data is not sent to the server directly.
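The k-means retrieval described in point 1 can be sketched like this, assuming Hamming distance as the metric (this is not OpenFP's actual code; the 8-bit vectors and names are made up): the query is assigned to its nearest centroid, and only fingerprints stored under that centroid are compared in full.

```python
# Toy nearest-centroid retrieval over invented binary fingerprints.

def hamming(a, b):
    return bin(a ^ b).count("1")

centroids = [0b11110000, 0b00001111]
buckets = {
    0b11110000: [(0b11110010, "clip-A"), (0b11100000, "clip-B")],
    0b00001111: [(0b00011111, "clip-C")],
}

def search(query):
    # Assign the query to its nearest centroid...
    nearest = min(centroids, key=lambda c: hamming(c, query))
    # ...and compare in full only within that cluster.
    _, name = min(buckets[nearest], key=lambda e: hamming(e[0], query))
    return name

print(search(0b11110011))  # lands in the first cluster -> clip-A
```

This shape also shows why online updates are awkward: adding a fingerprint can invalidate the centroids, forcing a full re-clustering.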

Echonest:
1. The server side uses a combination of Tokyo Tyrant and Solr.
This is totally unnecessary (IMHO). In addition, they have a local
mode that simply stores all FPs in a Python dictionary and
persists it via pickle.
2. I'm not sure about the core algorithms, but one thing I noticed
is that the client simply sends a list of fingerprint hashes to
Solr, which basically treats these as a bunch of words and returns
the closest match. There is no Hamming distance, etc. This struck
me as odd (but then I'm no expert!). I guess it makes things very
scalable, as you are not searching for FPs Hamming-close to the
query FP; instead you look up exact matches, which can be better
optimized with hash tables.
3. Once the results are returned from Solr, a lot of processing
happens in the fp.py code to figure out approximate matches.
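The "hashes as words" matching described in point 2 can be sketched like this (the data is invented, and real Solr scoring is more involved): each track is a bag of hash tokens, and the score is simply how many query tokens occur exactly, with no Hamming distance anywhere.

```python
# Each track is reduced to a set of hash tokens; matching is plain
# exact-token overlap, just as a text search engine would score words.
tracks = {
    "news-clip":   {0x1A2B, 0x3C4D, 0x5E6F, 0x7081},
    "sports-clip": {0x1A2B, 0x9AAB, 0xBCCD, 0xDEEF},
}

# Three query tokens survive the noise; one is garbage.
query = {0x1A2B, 0x3C4D, 0x5E6F, 0xFFFF}

scores = {name: len(hashes & query) for name, hashes in tracks.items()}
print(max(scores, key=scores.get))  # news-clip (3 exact hits vs 1)
```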

At this point, I'm thinking of using the OpenFP code to generate
the fingerprints and your Postgres backend for matching. In theory
this should work, since both OpenFP and Acoustid fingerprints are
compared using Hamming distance, unlike Echonest's. This would
also overcome the limitations of openfp_server.
If you are interested in the audio data, let me know and I can
share it privately with you.

Thanks!
Ashwin

On Jun 25, 1:51 am, Lukáš Lalinský <lalin...@gmail.com> wrote:
> 2011/6/25 cybernytrix <cybernyt...@gmail.com>:

Lukáš Lalinský

Jun 25, 2011, 2:52:30 PM
to acou...@googlegroups.com
2011/6/25 cybernytrix <cyber...@gmail.com>:

> Echonest:
> 1. The server side uses a combination of Tokyo Tyrant and Solr.
> This is totally unnecessary (IMHO). In addition, they have a local
> mode that simply stores all FPs in a Python dictionary and
> persists it via pickle.
> 2. I'm not sure about the core algorithms, but one thing I noticed
> is that the client simply sends a list of fingerprint hashes to
> Solr, which basically treats these as a bunch of words and returns
> the closest match. There is no Hamming distance, etc. This struck
> me as odd (but then I'm no expert!). I guess it makes things very
> scalable, as you are not searching for FPs Hamming-close to the
> query FP; instead you look up exact matches, which can be better
> optimized with hash tables.
> 3. Once the results are returned from Solr, a lot of processing
> happens in the fp.py code to figure out approximate matches.

This is a pretty standard setup for any hashed fingerprints. You
have an index where you look for exact matches on atomic parts of
the fingerprint. Once you get these matches, you retrieve the full
fingerprints and compare them against the query. Acoustid does the
same; the only difference is how the full fingerprints are
compared.
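The two-stage setup described above can be sketched as follows (all data is invented; Acoustid's real index and comparison are more elaborate): an inverted index from individual hash items to track ids yields candidates, then a full fingerprint comparison scores them.

```python
from collections import defaultdict

# Invented full fingerprints, one list of hash items per track.
fingerprints = {
    "track-1": [10, 20, 30, 40],
    "track-2": [11, 21, 31, 41],
}

# Stage 1: inverted index from each atomic item to the tracks holding it.
index = defaultdict(set)
for track, items in fingerprints.items():
    for item in items:
        index[item].add(track)

def search(query):
    # Candidates are tracks sharing at least one exact item with the query.
    candidates = set().union(*(index.get(i, set()) for i in query))
    # Stage 2: retrieve the full fingerprints and score them.
    def score(track):
        return len(set(fingerprints[track]) & set(query))
    return max(candidates, key=score)

print(search([10, 21, 30]))  # track-1 (2 shared items beat track-2's 1)
```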

Regarding the algorithm, the Echoprint fingerprints are basically
sequences of timestamped hashes (20 bits if I remember correctly). The
hashes describe pairs of peaks on the spectrogram. You can read about
the basic idea in the old Shazam paper ("An Industrial-Strength Audio
Search Algorithm").
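A toy version of the peak-pair scheme described above (the bit widths below are invented for illustration, not Echoprint's actual 20-bit layout): each hash packs two spectrogram peak frequencies and their time delta, and is stored with the first peak's timestamp.

```python
# Shazam-style pair hashing over invented spectrogram peaks.

def pair_hash(f1, f2, dt):
    """Pack two 6-bit frequency bins and a 6-bit time delta into one hash."""
    return (f1 & 0x3F) << 12 | (f2 & 0x3F) << 6 | (dt & 0x3F)

# peaks: (time, frequency_bin)
peaks = [(0, 17), (3, 42), (7, 17)]

# Hash each peak against the next one, keeping the first peak's timestamp.
hashes = [
    (t1, pair_hash(f1, f2, t2 - t1))
    for (t1, f1), (t2, f2) in zip(peaks, peaks[1:])
]
print(hashes)
```

Because each hash depends only on a local pair of peaks, a short noisy excerpt still produces many hashes identical to the reference, which is what makes this family robust to microphone capture.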

> At this point, I'm thinking of using the OpenFP code to generate
> the fingerprints and your Postgres backend for matching. In theory
> this should work, since both OpenFP and Acoustid fingerprints are
> compared using Hamming distance, unlike Echonest's. This would
> also overcome the limitations of openfp_server.

I don't think you can easily do this. The OpenFP fingerprints are
a mix of various approaches. The hashes that you can search on are
generated using the Philips Robust Hashing algorithm, which is
very similar in structure to the Acoustid fingerprints, but then
you have the MFCCs, which you have to compare differently.
Theoretically, I think the Echoprint implementation should work
best for you, but I haven't tried it in practice.

Lukas
