question about using chromaprint to identify the same tracks

2,766 views
Skip to first unread message

thomasvs

unread,
Aug 8, 2011, 11:58:09 AM8/8/11
to Acoustid
Hi,

I just recently discovered chromaprint and acustid and would like to
use it in an application I'm writing.

My goal is to be able to identify different encodings of the same
audio track, preferably without doing any online lookups.

In my naive understanding, I would just be able to generate
chromaprint fingerprints for all my tracks, and then compare them
directly for matches.

However, I ran a quick test with a single track I have in .flac,
converting it to .wav (for comparison), .ogg and .mp3

Only the flac and wav checksums match, but the ogg and mp3 ones are
different, and I can't see how the fingerprints relate or could be
used to identify the same audio content.

Am I missing something obvious ? I'll paste the output I got.

$ chromaprint-fpcalc-0.4-linux/fpcalc 01\ -\ Light\ Aircraft\ on\
Fire.*
FILE=01 - Light Aircraft on Fire.flac
DURATION=137
FINGERPRINT=AQABz5MkLZmYRDj-Iz9uGUUP8bnA9McPPW7Qw1UrIUuzHD-
qSpeRatyh_kRewx02KldQyin-4vLh45qQKwmrIakutPuE73gf4YMdJqiOv7g-
dMePXIM_aA9Rq4fZI0TFz7B8oZ-
JmGgPB6Xn4h5x6XAe6AoPfWiO3tj54zzePCCb5bjw4xea_LhUCTlhscdPmLIDwjv-
YnqJs2gmo1VyXP-QeD_y5MMl7egH8-jDSMGLH6j-4XnxI8dJ-
IR2LegXNCqD5zgalUE3c4KuoymF3hrew3-hL7j0olJ3-
C_64ywwkzjxaTcyNHGLU4eP_ugNK1yQhyn04gt-5MV5uMZZpJQG-
NCqMMehHc6XGSeRG354_IhPPDx0Hu4kH9kfGO2SC_2JbTp-XDLRJ8d1nENuCieDVw-0vvB-
VDqg5cUPSx9640frIEw5vMfkH41YfAX3OGhjD3YyXIyO6caHJI3kI1SX4xJugiGzCjejHVeU4z3yb9CQljr64rrBq0KuiKNwaD-
aH8qJ30UIx8dnHP6OazziDsm25rh87EfJ7KjFaMST4Db-
DFkS59BwEbmiBJdBz8SzB70HPyMwPXho4_nRFw53-
LTx40fyD3nEHa9UTCf-4U7xpEdeQ1uO9Adv1CKO_yDFOcjNQSO-wD-m84JMI7yOJjh6fPA-
JKH2EPnBzJGP6_iORxP2Izlr5A9xHc2ZoHeRHNHxHTye3ah24dGOPIlxHbqK8Hh48MlR6oPPo_q24Md0fDFC79CRM1HCHD9cURSeEI3GHn3REu7KozJDXEco6_CVoz9-
vDy0vKhG5Dz-Dz3RdESPJqQe3MuO-wh_iWie4-
GH89Ay0uhTXF0PGeewF58sHN224_sh6ccnQ8-ROjz6gfs1_MZvYdRxD8139Dn-I4-
VI_mIOw8aqjHKo1b0w_nw6GiyDqeG3OhdaImX4D2eT0CjK0a7DHlwpHUcPEMbCc0f4gzeXELzo9qYD_pRNM-
EvMODHO-I0Qz04ifCZ-pxcji-4xR8BlcQHnpy5D2a7Eb5oD9842ge5ISO6-
g0iC9uAq6K__BDfDLUHzn2CP4n-DUeaXB0nMLp-GhO48ePa0f45IF4C3ngrDKesHAf-
IWvo0S-
Q1cUkQj1DW8wbkSfDDeYxiJ8VJKYD2ey4_RETGc4_EguETnhkhidB_8OSxHRJwuHR0cuIlnAXMeLzwe9JSEu9EcOB4wgkkrhhIHCMEEgEEYIQQUCWCknGAFCGcMQAUEYCZhEBlhBqBFKKQUAIgowAIEjkgDGBDAEMcIFQAQAJYCgADgmDDJAGEMBEAwoZZhySCihgCDMMWeAMUIgAIxBUDBAkBACMAMEYgBAAQBhQhigFBBCIGIIMEgIIpgRiiGBAEMGEQoQQQwAgYATRglgDIDCCGKEQQg9QJhgRgNAmKAAGiCUAkoJQYQRQCNgEABEQYIMYxQoZAkgiggigBCgjCcMEE4goowTSgEFFDpEECYQIAgBJwwSEAgGGOUCQiEYU0IIo4gQTBggBFBAESAUEAIwYBBQwEAAjDGKASAEAkgJwwAwQiHPBAMMGAUAAkIAYgRgQhGkBECQIcCEQgQphIRSRAgChKPIMICUE8AAIRAyQXGgIADAGGEIAgIIIwQAiBkoCEICIaAAccQRIYAFQCggCCeIMACAgFY4IoQISAE

FILE=01 - Light Aircraft on Fire.mp3
DURATION=137
FINGERPRINT=AQABz5OkZEkkJhF-
_Mgf9DKO_hB_of3BQzeFG03VSsjSLMd_VJXiI9W4Q_2MvCeewc6FbXIa9PhDuDd-5EnED6ouIew-4cc5HL4U9BVe_EMtPD9yDR8ah4eOhjnC4zw65wkanzjCBs5Rei7uEZd0iMfHHBc0kmhGBjsv_Me1PCCbIc_x47wR5rtwX8IVpKJ2_IS5OnjAHu6L6cdFlPLRUDruf8ijH8l1XFJ04UNztEo1fMWNo_qHF7lwnIR4otdiaD-06MEVAo0Yopsz6EfTBX2O9yq0_EVPXCrK7nCOnj1OHDOJE58O_gh1mIs-
GH2O0yhJpI-GZ4U-
fHiPnIeDXGLgTBpgaF2PExocZzlO5HiWwz3iEw8PnYc7ychfUEe7XOh1YhZsG5dM9MlxHT_yC5eK94FWvvB-
VMehHb4JS0dPIcedo-nRVTomO0cjFk_BUXzQPh1cBQ-jY7JrfIga-
khm5ngjXCZK5gLnasez48wRf4NGpPXRa_gN9hVyRTGFQ_vh55geQr6L8HB8fMbhH9dYxN2hbT1yEfuPkjdqZdGIJ8V5_B-
yJM7RQxeRa8I_PBMJMg_6evBJHNPxMjW-HtXhCj7xfzhyfUioKNtxSdi-
CP2H27iaI18NbfmR_rgM3ujx6yDFGbkZoRx0wsd0etBthDt8ofpxHv3wD9E2K0RSHswcHcd79FI0TA-
PnDWSE_eJZmSKXkeOJPhy8HhuVKfw9LiOXCp0HueEsD145UapeIJ_VD--
Y_rxHaEj6Mi1HmdawRUtPGHQlEfXCv1gXjxqZSE-
HaGOR3lg_ujx8tDQZEuGE_m8oz_cLTgcUg_6Zcd7hOpJNI_x8MF5aDltlEtxdT1knNgZfDp-1Ntx_ZCUH58MPbBDI_wMPh--7vhTzLjR_GTQh_iP_DmSU8UdOWhVNI9RK_rhiBweqTA7VKaGG-
ELLXmCMzueF0fzo9OkIE9wNH8QWgweJfBPnDF-NOqhyf1wHvAp5F3wIMc7vIPCVfiJkMt-
vMeP7ziFF64y_AifQC5C7Wj2oD98HH4QntBxoeuhH44JuCr-4yHcQy6PuMeew7_gi_g0OJJxCvcP55nRH8d7hG7yBOKPPIezCg9bnMcLXzpK5IMenQiv45nCCPuIqslwg2kswkcliTnOZCdOT8R0Bu9XJFdC5IRLYnSOfzGuGI21cPiR6yK0IMuFH8_BawqF_0J_BAUAINAyIQQURiAriAACGIGMUEwJBRRwginJAQBHASCNdwgYgAgBXACKDFIAMAUAkwAIIgAABApgFGGEYCYAIAAAJIDwyDFukAWWAiAQcU4opAASRChOCPPOSAcQEEYAIRBRCgghCDNMMACgAIpxIgQQQiAClCBAASEQIUAoZQAAjAGgCEAEIAEMEkwpARQA1AhAgAEIoWeIJM4YgQxRAgEkmQGAQACcEoQgAQCzQgGHDEFGGUGQYEAQABQQwkAAiFOYOOBAEgIYIoJQRBVAGDKAKGCAIQgQxQzQQDAB0CJEMQIEUIIBIZQQDAEBKSBCMIMUMAwAY4wSzAAhhAFIMUKMUYgJAYBRgAAAECCCWEKEEiIBBBEwSDAnITIICaUIEwQIIqhAhjFKEBACUQoEB0IgAIwUhAEHEDMCAWCEgYQAJJARgFngCALGAGCBEQIIIggiDAhhhRAOCgEMAAAB

FILE=01 - Light Aircraft on Fire.ogg
DURATION=136
FINGERPRINT=AQABz0skacnEJAKPI_-DW0ahHX4uMO2OQ48b9HDVSsjSLMd7VOVlpBp3qD-
R1zAzbM4VlHKKv7gO37gm5ErCakiqC-0-4dnxCO9ghwmq4y_-
odvxI9fgD9pD1Oph9ghR8TMsH1XnBUEbH85Rei7uEZcOMQ-
u8NDRPOiNncf5480DslmOXPiRX0KT41IlnEcqFjcJU6kDwjv-Ynol_EUziWiVHP-
ROPqPPPlwSTv6wTz6MFLw4gcqOh9-
HPlxEnYI7Qv6BY3KDA9ONGfQzRx0oSmF3hreQ_yLL7j0olJ3-
C_64ywwkzjxaTcyNHGLU4eP_ihnWF2Qhyle6AuuI2dwHq5xMkgpAT60KsxxaIfzZcZJ5CJ8Hj_iEw8PnUfaSR7BH0a75EJ_YpuOH5dM9MlxHT9yXTiLqg90vvCP6oCWF-
_R8MONH-GDhuHwHpN_NGLxFdzjoI092MlwMTp6YxyiRvKRTF2OSzhNMM8q3Iym4oqC90R-
Q0Na6ugZXJfBq0Ku6BQuaD-
aH8pDnHYRwvHxGYe_4xqPuDuSbQ0uY39QMjtqMRrxJLiN7hkyxTk0XESuSLgs0DPx7EHvwc8ITG-
O03h-9IXDHb6N__iP5Ec-7rhUTCf-4U7xpEfuGhqP9McNxiKO_wIp7sjNQePxBb6O6b0gtwivowmOHh_8IQn7EPnBzJGPT8ePPlKE6TyS88jtENfRnAl6F8kR_fjAH894VDuqaEf45LjUQleF8Hiog092lPrg86i-
Ef8xHV-
M0Dt05EyU5rgPVxSFJ0SjsUe7okfTMTzOCp8Qyjp85eiPHy8PLa9RjTiR_0NPNB3RowmpB_eyDjfC-
CT6OHj44Ty0jDT6FFfXQ8aJncEnC0e37fh-
SPqO19CTI3Vp9AXnXMNv_Mao49TQfEef4z_yK0dyjrjzoKEaozxqRT8cc3ikosk6nBpyo3ehJV5yvMfzAY2uGO0y5MHhOg7yDG0kNH-
IM7iP5jyqjTn0H0XzTMg7PMjxjhjNQC9-
InymHieH4zvOBD5xBeGhJ0feo8lulA_6w_eA5kFO6LiOTjvEHCfgqsYPP8QnQ_2RH3t0-
Bd8RsQjwdFxCqfjozmNHz-uHeGTB-
It5IGzynjCwn1wXDpK5Dv0KCIRitzwBuNG9Mlwg2kswkcliflwJsd5EdMZDj-
SS0ROuCRG58G_w1JE9MnC4dGRi9CCLNfx4vNBb0mIC_2RAwMIEFQKZ8BhghggjBCCCsSYUk4oooExjBENhJGASaSQEYQaoxRSABAFAAMOOOEIcwAQxAhTAjACAFACCAqAY8IYJgygRDABFFMOCScUEIQ55gwwRggElEBMMECQEAIwIwRiAEABCBPCAKWAEAIRYARRAiCBhDCCMoQEAgwZRCgiiAgggBFGCSAMAFAYQYhACD1AmGEaAMIogAYIpQBATAgijAKAISAMAsApSJBhgAIFLQEAKEAEAEAIUEYVBggkHAHCCKSAAgodQIhihhEAkBOGCAUEA8wALiAUggEliGCACMGEAUIAhggQCghglAIGAmCMUQwAIQRASAhghEKCCQYYMAoAAIAQgAgmFCBKCYCgQoAJCRBSCCChFBGCAOEoMgwg5QQwQAiEgOJAMccAMMIQxIASRhghAEDMQEEQEggBZZkjDglgARAQEEE4QYQBAAS0whEhREAK

FILE=01 - Light Aircraft on Fire.wav
DURATION=137
FINGERPRINT=AQABz5MkLZmYRDj-Iz9uGUUP8bnA9McPPW7Qw1UrIUuzHD-
qSpeRatyh_kRewx02KldQyin-4vLh45qQKwmrIakutPuE73gf4YMdJqiOv7g-
dMePXIM_aA9Rq4fZI0TFz7B8oZ-
JmGgPB6Xn4h5x6XAe6AoPfWiO3tj54zzePCCb5bjw4xea_LhUCTlhscdPmLIDwjv-
YnqJs2gmo1VyXP-QeD_y5MMl7egH8-jDSMGLH6j-4XnxI8dJ-
IR2LegXNCqD5zgalUE3c4KuoymF3hrew3-hL7j0olJ3-
C_64ywwkzjxaTcyNHGLU4eP_ugNK1yQhyn04gt-5MV5uMZZpJQG-
NCqMMehHc6XGSeRG354_IhPPDx0Hu4kH9kfGO2SC_2JbTp-XDLRJ8d1nENuCieDVw-0vvB-
VDqg5cUPSx9640frIEw5vMfkH41YfAX3OGhjD3YyXIyO6caHJI3kI1SX4xJugiGzCjejHVeU4z3yb9CQljr64rrBq0KuiKNwaD-
aH8qJ30UIx8dnHP6OazziDsm25rh87EfJ7KjFaMST4Db-
DFkS59BwEbmiBJdBz8SzB70HPyMwPXho4_nRFw53-
LTx40fyD3nEHa9UTCf-4U7xpEdeQ1uO9Adv1CKO_yDFOcjNQSO-wD-m84JMI7yOJjh6fPA-
JKH2EPnBzJGP6_iORxP2Izlr5A9xHc2ZoHeRHNHxHTye3ah24dGOPIlxHbqK8Hh48MlR6oPPo_q24Md0fDFC79CRM1HCHD9cURSeEI3GHn3REu7KozJDXEco6_CVoz9-
vDy0vKhG5Dz-Dz3RdESPJqQe3MuO-wh_iWie4-
GH89Ay0uhTXF0PGeewF58sHN224_sh6ccnQ8-ROjz6gfs1_MZvYdRxD8139Dn-I4-
VI_mIOw8aqjHKo1b0w_nw6GiyDqeG3OhdaImX4D2eT0CjK0a7DHlwpHUcPEMbCc0f4gzeXELzo9qYD_pRNM-
EvMODHO-I0Qz04ifCZ-pxcji-4xR8BlcQHnpy5D2a7Eb5oD9842ge5ISO6-
g0iC9uAq6K__BDfDLUHzn2CP4n-DUeaXB0nMLp-GhO48ePa0f45IF4C3ngrDKesHAf-
IWvo0S-
Q1cUkQj1DW8wbkSfDDeYxiJ8VJKYD2ey4_RETGc4_EguETnhkhidB_8OSxHRJwuHR0cuIlnAXMeLzwe9JSEu9EcOB4wgkkrhhIHCMEEgEEYIQQUCWCknGAFCGcMQAUEYCZhEBlhBqBFKKQUAIgowAIEjkgDGBDAEMcIFQAQAJYCgADgmDDJAGEMBEAwoZZhySCihgCDMMWeAMUIgAIxBUDBAkBACMAMEYgBAAQBhQhigFBBCIGIIMEgIIpgRiiGBAEMGEQoQQQwAgYATRglgDIDCCGKEQQg9QJhgRgNAmKAAGiCUAkoJQYQRQCNgEABEQYIMYxQoZAkgiggigBCgjCcMEE4goowTSgEFFDpEECYQIAgBJwwSEAgGGOUCQiEYU0IIo4gQTBggBFBAESAUEAIwYBBQwEAAjDGKASAEAkgJwwAwQiHPBAMMGAUAAkIAYgRgQhGkBECQIcCEQgQphIRSRAgChKPIMICUE8AAIRAyQXGgIADAGGEIAgIIIwQAiBkoCEICIaAAccQRIYAFQCggCCeIMACAgFY4IoQISAE


Lukáš Lalinský

unread,
Aug 8, 2011, 1:02:44 PM8/8/11
to acou...@googlegroups.com
On Mon, Aug 8, 2011 at 5:58 PM, thomasvs <svsa...@gmail.com> wrote:
> In my naive understanding, I would just be able to generate
> chromaprint fingerprints for all my tracks, and then compare them
> directly for matches.
>
> However, I ran a quick test with a single track I have in .flac,
> converting it to .wav (for comparison), .ogg and .mp3
>
> Only the flac and wav checksums match, but the ogg and mp3 ones are
> different, and I can't see how the fingerprints relate or could be
> used to identify the same audio content.
>
> Am I missing something obvious ? I'll paste the output I got.

Yep, the assumption that you can directly compare the output is
incorrect. The output you get from fpcalc by default is a compressed
version of the fingerprints, so even small changes will produce very
different results. If you run it with the -raw argument, it will give
you the fingerprints as lists of integers. That's the data you want to
compare. Depending on the bitrate, some of them will not match
exactly, but in general there will be very little bit differences in
them. If you are indeed looking for the same recording, you don't have
deal with different alignment, and can compare the fingerprints by
calculating the bit differences between them. Here is a simple Python
script that does that: https://gist.github.com/1132166

It will produce output like this for the log file you attached:

0 1 0.967062634989
0 2 0.991225701944
0 3 1.0
1 2 0.965307775378
1 3 0.967062634989
2 3 0.991225701944

Where do you set the threshold on what's "the same recording" depends
on the accuracy you wan't to get.

For something like Acoustid I need to match the fingerprints
differently, as I can't assume they both have exactly the same
starting point, so the first step is to align them first and then take
the bit-error-rate score. Here is the code that I'm going to use in
the next release of the Acoustid server:
https://github.com/lalinsky/acoustid-server/blob/master/postgresql/acoustid_compare.c#L114

Another, more simple, way of comparing the fingerprints is to just
count how many common integers do they have. They are 32-bit integers,
but you usually don't need the full 32-bits so you can strip some bits
from it and count how many common integers do they have if you convert
them to e.g. 20-bit integers (always strip the lower bits, as they are
less accurate).

Lukas

Thomas Vander Stichele

unread,
Aug 8, 2011, 1:25:28 PM8/8/11
to acou...@googlegroups.com
Hi,

> Yep, the assumption that you can directly compare the output is
> incorrect. The output you get from fpcalc by default is a compressed
> version of the fingerprints, so even small changes will produce very
> different results. If you run it with the -raw argument, it will give
> you the fingerprints as lists of integers. That's the data you want to
> compare. Depending on the bitrate, some of them will not match
> exactly, but in general there will be very little bit differences in
> them. If you are indeed looking for the same recording, you don't have
> deal with different alignment, and can compare the fingerprints by
> calculating the bit differences between them. Here is a simple Python
> script that does that: https://gist.github.com/1132166

Thanks for the detailed explanation!

I think I follow but will need to look into your reply and the code a
bit more.

The gist of my understanding is that for my use case, my naive approach
of doing an O(1) lookup to find duplicate tracks is not going to work,
and is now going to have to be an O(n) run over all tracks to compare.

That's probably going to be expensive; my test database has about 50K
tracks.

Is this comparison issue specific to chromaprint ? Do all fingerprinting
algorithms work such that fingerprints cannot be compared directly ?

Are there ways to help the comparison process; for example by reducing
the information in a fingerprint further to the point where it generates
false positives which would serve as a list of prints to compare to ?

Thomas

Lukáš Lalinský

unread,
Aug 8, 2011, 1:51:29 PM8/8/11
to acou...@googlegroups.com
On Mon, Aug 8, 2011 at 7:25 PM, Thomas Vander Stichele
<svsa...@gmail.com> wrote:
> The gist of my understanding is that for my use case, my naive approach
> of doing an O(1) lookup to find duplicate tracks is not going to work,
> and is now going to have to be an O(n) run over all tracks to compare.

If you use the naive way of searching for tracks, then yes, it's O(n).

> That's probably going to be expensive; my test database has about 50K
> tracks.

I can search in 7M fingerprints in about 1ms, so there are definitely
ways to work around that problem. :)

> Is this comparison issue specific to chromaprint ? Do all fingerprinting
> algorithms work such that fingerprints cannot be compared directly ?

Yes, all fingerprint algorithms I know work like that. You get raw
data that you have to index, use the index to get possible matches and
then compare the fingerprints to get the final score.

> Are there ways to help the comparison process; for example by reducing
> the information in a fingerprint further to the point where it generates
> false positives which would serve as a list of prints to compare to ?

I'm afraid you are just going to get many false negatives that way.
The key idea to remove the O(n) complexity is to build the index.
Let's say you have fingerprints like this:

A = [1,2,3,4]
B = [5,2,3,4]
C = [7,8,9,10]

A and B are obviously the same track, C is different from them. You
build an inverted index that will look like this:

{
1: [A],
2: [A, B],
3: [A, B],
4: [A, B],
7: [C],
8: [C],
9: [C],
10: [C],
}

When looking for a duplicate of B, simply iterate over B and search in
the index. You will get A and B as possible matches. You can ignore B,
so you only have to run the comparison algorithm against A. That's the
basic idea, how you implement the inverted index depends on the
application.

Lukas

Thomas Vander Stichele

unread,
Aug 9, 2011, 4:23:04 AM8/9/11
to acou...@googlegroups.com
Hi Lukáš,


I can search in 7M fingerprints in about 1ms, so there are definitely
ways to work around that problem. :)


That's really fast.  Is that using the postgresql extension you are writing ?


When looking for a duplicate of B, simply iterate over B and search in
the index. You will get A and B as possible matches. You can ignore B,
so you only have to run the comparison algorithm against A. That's the
basic idea, how you implement the inverted index depends on the
application.

(Your index example was missing an entry for 5 -> [B} right ?)

So, given that the fingerprints consist of 32 bit integers, you're saying to build a reverse index where each 32 bit integer in all fingerprints is an index to the fingerprints that contain that specific integer ?
And if I follow your blog post correctly, a fingerprint contains about 8.5 32 bit ints per second, so about 500 per minute ?

So the problem roughly becomes, for 500 reverse index lookups, and for all results from those, calculate how many of the 500 fingerprints they have in common ?

And then I guess of course one still needs to actually compare the fingerprints directly, because the reverse index ignores the order of the integers in the fingerprints ?

Let me know if I said something stupid here, I'm new to fingerprinting.

By the way, my application is built on CouchDB, so I'll probably have to figure out either how to build this reverse index using map/reduce (which may be straightforward but probably really slow) or directly as a specific view server.

Thanks
Thomas

Lukáš Lalinský

unread,
Aug 9, 2011, 5:02:35 AM8/9/11
to acou...@googlegroups.com
On Tue, Aug 9, 2011 at 10:23 AM, Thomas Vander Stichele
<svsa...@gmail.com> wrote:
> Hi Lukáš,
>
>> I can search in 7M fingerprints in about 1ms, so there are definitely
>> ways to work around that problem. :)
>
> That's really fast.  Is that using the postgresql extension you are writing?

It's using the custom index data structure, which is not based on
PostgreSQL: https://github.com/lalinsky/acoustid-index

The 1ms was just a show off. :) It only happens in the ideal case
right after building the index where I have only one huge segment. In
this case the search reduces to a single binary search. It gets slower
as I add multiple segments while updating the index, but it's still
pretty fast (around 5ms on a cached, but non-optimized index).

>> When looking for a duplicate of B, simply iterate over B and search in
>> the index. You will get A and B as possible matches. You can ignore B,
>> so you only have to run the comparison algorithm against A. That's the
>> basic idea, how you implement the inverted index depends on the
>> application.
>
> (Your index example was missing an entry for 5 -> [B} right ?)

Correct.

> So, given that the fingerprints consist of 32 bit integers, you're saying to
> build a reverse index where each 32 bit integer in all fingerprints is an
> index to the fingerprints that contain that specific integer ?

Correct. If you want to get good results also for very poorly encoded
files, I'd suggest to index only the highest 28 bits.

> And if I follow your blog post correctly, a fingerprint contains about 8.5
> 32 bit ints per second, so about 500 per minute ?

That's about right.

> So the problem roughly becomes, for 500 reverse index lookups, and for all
> results from those, calculate how many of the 500 fingerprints they have in
> common ?

Yes, but if you are planning to do the full compare, you don't even
need to index one minute of the tracks. Index only something like 100
integers per track, but should be enough to get you the right
candidates for comparing.

> And then I guess of course one still needs to actually compare the
> fingerprints directly, because the reverse index ignores the order of the
> integers in the fingerprints ?

That depends, there are not many collisions in the integers. If you
put large parts of the fingerprints to the index, it would be very
uncommon that most of them match but they are in the wrong order.
Indexing the full tracks is probably not very useful in your case, so
I'd index only small parts of the fingerprints and then filter the
candidates by running the full compare.

Lukas

tu...@brandaffinity.net

unread,
Mar 27, 2014, 12:09:21 PM3/27/14
to acou...@googlegroups.com
I'm using your sample compare file to find a 7 second audio within a 300 second audio but i don't seem to be able to get a positive accuracy within the area of the audio snippet.  My current logic compares the sub audio to the master at each index trying to find the highest positive accuracy, but positive ratings are not in the correct area of the sub audio.

Lukáš Lalinský

unread,
Mar 27, 2014, 12:14:04 PM3/27/14
to Acoustid
Chromaprint is not designed to work on such short audio samples. You do not have enough context to identify parts of audio there. You would need to modify the Chromaprint algorithm settings to generate many more data per second to be able to do this.

Lukas



--
You received this message because you are subscribed to the Google Groups "AcoustID" group.
To unsubscribe from this group and stop receiving emails from it, send an email to acoustid+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Tuan Nguyen

unread,
Mar 27, 2014, 1:32:04 PM3/27/14
to acou...@googlegroups.com
With the current algorithm what's the minimum audio duration needed to do a compare?

Tuan
You received this message because you are subscribed to a topic in the Google Groups "AcoustID" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/acoustid/C3EHIkZVpZI/unsubscribe.
To unsubscribe from this group and all its topics, send an email to acoustid+u...@googlegroups.com.

tu...@brandaffinity.net

unread,
Mar 29, 2014, 11:54:48 AM3/29/14
to acou...@googlegroups.com
I changed the fingerprinter.cpp values below to increase the datapoints but the matching algorithm example still does not find a match in the correct area.  I compared your match algorithm in your sever code and it's similar to your php example.  Am i not modifying the correct settings?

static const int SAMPLE_RATE = 22050;
static const int FRAME_SIZE = 128;

Lukáš Lalinský

unread,
Mar 31, 2014, 6:52:15 AM3/31/14
to Acoustid
I'm sorry, but I don't remember exactly what needs to be changed. The problem with any changes to the audio processing is that the hashing algorithm has been trained on a specific configuration and I don't know how it would work if you change things.

I'd probably leave SAMPLE_RATE unchanged. FRAME_SIZE and OVERLAP are probably the only two constants that need to be changed. If it works, you should be getting larger fingerprints.

But first of all, you should check what are the fingerprints you are comparing. How large are they? Do they have at least some common bits?

You are pretty much on your own regarding these modifications as I have designed and tested the code only for a specific purpose, so I can't directly help you with modifying the algorithm. I can only make theoretical suggestions.

Lukas

christo...@gmail.com

unread,
Oct 15, 2014, 6:01:27 AM10/15/14
to acou...@googlegroups.com
Dear all,
  I'm just reading this (old) thread. My need is recent..
I needed to clean up a (not so large) set of audio chunks. 
The solution I used was: first, compute fingerprints with fpcalc, and then compare fingerprints according to indications you gave here.

To perform the second step I wrote the attached code.
This is simply a piece of glue arround your function match_fingerprints2()


On Monday, August 8, 2011 7:02:44 PM UTC+2, Lukáš Lalinský wrote:
 Here is the code that I'm going to use in

the next release of the Acoustid server:
https://github.com/lalinsky/acoustid-server/blob/master/postgresql/acoustid_compare.c#L114


This code could be improved (read compact fingerprint, compute a clusterization on a set of fingerprints, etc.)
However this simple tool was fine for me.

That's why I'd like to share this code.
Other people may also need some simple solution like this.
Maybe this kind of small tools could be provided within the libchromaprint-tools package, side to fpcalc.

Many thanks
Regards
Christophe


 
fpcomp.c

Lukáš Lalinský

unread,
Oct 15, 2014, 6:55:41 AM10/15/14
to Acoustid
On Wed, Oct 15, 2014 at 12:01 PM, <christo...@gmail.com> wrote:
Maybe this kind of small tools could be provided within the libchromaprint-tools package, side to fpcalc.

When I started with chromaprint, I wanted to avoid having any code for comparing the fingerprints in the package, because I was not sure myself on what is the best way to do that. The code in acoustid-server is not ideal either. It doesn't real well with some edge cases.

But now I think it would be useful to extend the chromaprint API to allow a few common use cases:

 - Comparing full audio files (what the match_fingerprints2 function does)
 - Locating a short snippet in a longer fingerprint, or the general case of locating all matching ranges in two fingerprints

The code would have to be flexible enough to be used from PostgreSQL extention (i.e. no run-time malloc). Having the code in chromaprint would also make it easier to have some test suite and improve the comparison functions. And then the fpcmp utility would use these new functions.

Lukas

cyph...@gmail.com

unread,
Jun 2, 2020, 10:49:58 PM6/2/20
to AcoustID
Here is a fixed version of fpcomp.c. Changes:

- Parse fingerprints without FILE
- Use latest from pg_acoustid (off-by-one error fix)
- palloc0 should zero memory
- fix errmsg, was not forwarding varargs correctly
fpcomp.c
Reply all
Reply to author
Forward
0 new messages