Yep, the assumption that you can directly compare the output is
incorrect. The output you get from fpcalc by default is a compressed
version of the fingerprints, so even small changes will produce very
different results. If you run it with the -raw argument, it will give
you the fingerprints as lists of integers. That's the data you want to
compare. Depending on the bitrate, some of them will not match
exactly, but in general there will be very few bit differences between
them. If you are indeed looking for the same recording, you don't have
to deal with different alignment, and you can compare the fingerprints by
calculating the bit differences between them. Here is a simple Python
script that does that: https://gist.github.com/1132166
It will produce output like this for the log file you attached:
0 1 0.967062634989
0 2 0.991225701944
0 3 1.0
1 2 0.965307775378
1 3 0.967062634989
2 3 0.991225701944
Where you set the threshold for what counts as "the same recording"
depends on the accuracy you want to get.
For something like Acoustid I need to match the fingerprints
differently, as I can't assume they both have exactly the same
starting point, so the first step is to align them and then take
the bit-error-rate score. Here is the code that I'm going to use in
the next release of the Acoustid server:
https://github.com/lalinsky/acoustid-server/blob/master/postgresql/acoustid_compare.c#L114
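The alignment step can be sketched as trying a range of offsets and
keeping the best score over the overlap. This is a simplified
illustration only; the actual server code is more involved (for one
thing, it needs a minimum overlap so that tiny overlaps don't produce
spuriously high scores):

```python
def popcount(x):
    """Number of set bits in an integer."""
    return bin(x).count("1")

def best_aligned_score(fp_a, fp_b, max_offset=80):
    """Shift fp_b against fp_a within +/- max_offset positions and
    keep the best 1 - bit-error-rate score over the overlap."""
    best = 0.0
    for offset in range(-max_offset, max_offset + 1):
        if offset >= 0:
            a, b = fp_a[offset:], fp_b
        else:
            a, b = fp_a, fp_b[-offset:]
        pairs = list(zip(a, b))
        if not pairs:
            continue
        errors = sum(popcount(x ^ y) for x, y in pairs)
        best = max(best, 1.0 - errors / (len(pairs) * 32.0))
    return best
```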
Another, simpler, way of comparing the fingerprints is to just
count how many integers they have in common. They are 32-bit integers,
but you usually don't need all 32 bits, so you can strip some bits
and count how many integers they have in common after converting
them to e.g. 20-bit integers (always strip the lower bits, as they are
less accurate).
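That simpler approach might look like this (a sketch; the 20-bit width
is from the suggestion above, the function name is my own):

```python
from collections import Counter

def common_count(fp_a, fp_b, bits=20):
    """Count integers two fingerprints share after keeping only the
    top `bits` bits of each 32-bit value (the low bits are the least
    accurate, so they are stripped)."""
    shift = 32 - bits
    ca = Counter(x >> shift for x in fp_a)
    cb = Counter(x >> shift for x in fp_b)
    # Multiset intersection: sum of the per-value minimum counts.
    return sum((ca & cb).values())
```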
Lukas
> Yep, the assumption that you can directly compare the output is
> incorrect. The output you get from fpcalc by default is a compressed
> version of the fingerprints, so even small changes will produce very
> different results. If you run it with the -raw argument, it will give
> you the fingerprints as lists of integers. That's the data you want to
> compare. Depending on the bitrate, some of them will not match
> exactly, but in general there will be very few bit differences between
> them. If you are indeed looking for the same recording, you don't have
> to deal with different alignment, and you can compare the fingerprints by
> calculating the bit differences between them. Here is a simple Python
> script that does that: https://gist.github.com/1132166
Thanks for the detailed explanation!
I think I follow but will need to look into your reply and the code a
bit more.
The gist of my understanding is that for my use case, my naive approach
of doing an O(1) lookup to find duplicate tracks is not going to work,
and will instead have to be an O(n) pass over all tracks to compare.
That's probably going to be expensive; my test database has about 50K
tracks.
Is this comparison issue specific to Chromaprint? Do all fingerprinting
algorithms work such that fingerprints cannot be compared directly?
Are there ways to help the comparison process, for example by reducing
the information in a fingerprint further, to the point where it generates
false positives that would serve as a list of prints to compare against?
Thomas
If you use the naive way of searching for tracks, then yes, it's O(n).
> That's probably going to be expensive; my test database has about 50K
> tracks.
I can search in 7M fingerprints in about 1ms, so there are definitely
ways to work around that problem. :)
> Is this comparison issue specific to chromaprint ? Do all fingerprinting
> algorithms work such that fingerprints cannot be compared directly ?
Yes, all fingerprint algorithms I know of work like that. You get raw
data that you have to index, use the index to get possible matches, and
then compare the fingerprints to get the final score.
> Are there ways to help the comparison process; for example by reducing
> the information in a fingerprint further to the point where it generates
> false positives which would serve as a list of prints to compare to ?
I'm afraid you are just going to get many false negatives that way.
The key idea to remove the O(n) complexity is to build the index.
Let's say you have fingerprints like this:
A = [1,2,3,4]
B = [5,2,3,4]
C = [7,8,9,10]
A and B are obviously the same track; C is different from them. You
build an inverted index that will look like this:
{
  1: [A],
  2: [A, B],
  3: [A, B],
  4: [A, B],
  7: [C],
  8: [C],
  9: [C],
  10: [C],
}
When looking for a duplicate of B, simply iterate over B and search in
the index. You will get A and B as possible matches. You can ignore B,
so you only have to run the comparison algorithm against A. That's the
basic idea, how you implement the inverted index depends on the
application.
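The idea above can be sketched in a few lines of Python (the names are
illustrative, not from any actual implementation):

```python
from collections import defaultdict

fingerprints = {
    "A": [1, 2, 3, 4],
    "B": [5, 2, 3, 4],
    "C": [7, 8, 9, 10],
}

# Build the inverted index: integer -> set of tracks containing it.
index = defaultdict(set)
for track, fp in fingerprints.items():
    for value in fp:
        index[value].add(track)

def candidates(track):
    """All other tracks sharing at least one integer with `track`."""
    found = set()
    for value in fingerprints[track]:
        found |= index[value]
    found.discard(track)
    return found

print(candidates("B"))  # {'A'} -- only A needs the full comparison
```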
Lukas
> I can search in 7M fingerprints in about 1ms, so there are definitely
> ways to work around that problem. :)
> When looking for a duplicate of B, simply iterate over B and search in
> the index. You will get A and B as possible matches. You can ignore B,
> so you only have to run the comparison algorithm against A. That's the
> basic idea, how you implement the inverted index depends on the
> application.
It's using the custom index data structure, which is not based on
PostgreSQL: https://github.com/lalinsky/acoustid-index
The 1ms was just me showing off. :) It only happens in the ideal case
right after building the index where I have only one huge segment. In
this case the search reduces to a single binary search. It gets slower
as I add multiple segments while updating the index, but it's still
pretty fast (around 5ms on a cached, but non-optimized index).
>> When looking for a duplicate of B, simply iterate over B and search in
>> the index. You will get A and B as possible matches. You can ignore B,
>> so you only have to run the comparison algorithm against A. That's the
>> basic idea, how you implement the inverted index depends on the
>> application.
>
> (Your index example was missing an entry for 5 -> [B], right?)
Correct.
> So, given that the fingerprints consist of 32 bit integers, you're saying to
> build a reverse index where each 32 bit integer in all fingerprints is an
> index to the fingerprints that contain that specific integer ?
Correct. If you want to get good results also for very poorly encoded
files, I'd suggest indexing only the highest 28 bits.
> And if I follow your blog post correctly, a fingerprint contains about 8.5
> 32-bit ints per second, so about 500 per minute?
That's about right.
> So the problem roughly becomes, for 500 reverse index lookups, and for all
> results from those, calculate how many of the 500 fingerprints they have in
> common?
Yes, but if you are planning to do the full compare, you don't even
need to index a whole minute of each track. Indexing only something
like 100 integers per track should be enough to get you the right
candidates for comparison.
> And then I guess of course one still needs to actually compare the
> fingerprints directly, because the reverse index ignores the order of the
> integers in the fingerprints?
That depends; there are not many collisions in the integers. If you
put large parts of the fingerprints into the index, it would be very
uncommon for most of them to match but in the wrong order.
Indexing the full tracks is probably not very useful in your case, so
I'd index only small parts of the fingerprints and then filter the
candidates by running the full compare.
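Putting the whole pipeline together, it might look something like this
(a sketch under the assumptions discussed above: index only a short
prefix of each fingerprint on its top 28 bits, then confirm candidates
with a full bit-error comparison; the 0.9 threshold is an arbitrary
placeholder):

```python
from collections import defaultdict

def bit_similarity(fp_a, fp_b):
    """1.0 minus the bit-error rate over the overlapping part."""
    pairs = list(zip(fp_a, fp_b))
    errors = sum(bin(a ^ b).count("1") for a, b in pairs)
    return 1.0 - errors / (len(pairs) * 32.0)

def build_index(fingerprints, prefix=100, bits=28):
    """Index only the first `prefix` integers of each fingerprint,
    keyed on their highest `bits` bits."""
    index = defaultdict(set)
    for track, fp in fingerprints.items():
        for value in fp[:prefix]:
            index[value >> (32 - bits)].add(track)
    return index

def duplicates(track, fingerprints, index,
               prefix=100, bits=28, threshold=0.9):
    """Filter candidates through the index, then confirm each one
    with the full comparison."""
    fp = fingerprints[track]
    found = set()
    for value in fp[:prefix]:
        found |= index[value >> (32 - bits)]
    found.discard(track)
    return [t for t in found
            if bit_similarity(fp, fingerprints[t]) >= threshold]

fps = {
    "A": [0xAAAAAAAA, 0xBBBBBBBB],
    "B": [0xAAAAAAAA, 0xBBBBBBB0],  # near-duplicate of A
    "C": [0x11111111, 0x22222222],
}
print(duplicates("A", fps, build_index(fps)))  # ['B']
```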
Lukas
--
You received this message because you are subscribed to the Google Groups "AcoustID" group.
> Here is the code that I'm going to use in the next release of the Acoustid server:
> https://github.com/lalinsky/acoustid-server/blob/master/postgresql/acoustid_compare.c#L114
Maybe these kinds of small tools could be provided within the
libchromaprint-tools package, alongside fpcalc.