patch for duplicate filtering


Wade Menard

Feb 11, 2009, 3:04:10 PM
to Mirage - Automatic Playlist Generation
This patch does several things and fixes issue 9 (Exclude tracks that
are too similar to the seed tracks) and issue 22 (A track removed
manually should not be re-added by mirage) and various other things.

Thanks to Dominik's recent Scms changes (well, recent since I wrote a
patch for this back in 2007), using distance calculations for duplicate
detection has a very low risk of false positives.

I'm looking for feedback on this so I'll explain my methodology.
Apologies if this is disjointed as I built this as a draft email over
several days.

I've made a small change to Mirage/Mir.cs to allow SimilarTracks() to
accept a "distance ceiling" float that it can filter against. It keeps
the ignored tracks (with a distance under this ceiling) in a
Dictionary<int, float>.
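
As a rough illustration of the distance-ceiling idea (not the actual C#
change to Mirage/Mir.cs), the filtering could look something like the
sketch below. The function name, the `distances` input, the playlist
length and the ceiling value of 0.2 are all assumptions made up for the
example; only the two small distances come from the debug output in this
thread.

```python
def similar_tracks(seed_id, distances, dupe_ceiling=0.0, length=5):
    """Return (playlist, ignored) for a single seed track.

    distances: dict of track id -> distance from the seed. This is a
    hypothetical stand-in for the distances Mir computes internally.
    """
    ignored = {}      # track id -> distance, for tracks under the ceiling
    candidates = []
    for track_id, dist in distances.items():
        if track_id == seed_id:
            continue
        # A ceiling of 0 means "no duplicate filtering".
        if dupe_ceiling > 0 and dist < dupe_ceiling:
            ignored[track_id] = dist
            continue
        candidates.append((dist, track_id))
    candidates.sort()  # closest tracks first
    playlist = [track_id for _, track_id in candidates[:length]]
    return playlist, ignored


# Track ids 101 and 102 and their distances are invented for the example.
distances = {7918: 0.0, 7886: 0.05957413, 8010: 0.1150455, 101: 3.2, 102: 1.7}
playlist, ignored = similar_tracks(7918, distances, dupe_ceiling=0.2)
```

The caller gets the playlist and the ignored tracks back separately, so
it can log or exclude the near-duplicates without Mir needing to know
about Banshee.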

The Banshee extension now passes this ceiling, and when the first seed
track has duplicates you will see output like this if running with
--debug:

[Debug 00:57:50.064] Mirage - Considering [7886] "always" on "Wish You The Best" a duplicate of [7918] "always" on "Perfect Crime" (distance: 0.05957413)
[Debug 00:57:50.065] Mirage - Considering [8010] "always" on "always" a duplicate of [7918] "always" on "Perfect Crime" (distance: 0.1150455)

The extension gets this information by checking Mir's IgnoreList
Dictionary; the ignored tracks are already gone from the automatic
playlist. The tracks are also added to the skipped Dictionary and will
be excluded in future iterations.

Unfortunately you soon realize that similar songs can have duplicates
themselves. Instead of having Mir do this analysis all at once (and
taking 10+ seconds to return the first playlist), my solution was to go
ahead and use this initial playlist and spawn a DuplicateFilter helper
thread that seeds the next tracks into Mir one by one and sees whether
Mir finds any duplicates (or, more specifically, tracks with distances
under the specified ceiling). This happens in the background, and any
new duplicates will print the message above and be added to the exclude
list for future iterations. Unfortunately I can't figure out a proper
way to remove them from the active playlist, as that appears to require
enhancements to the API in Banshee itself. However, the duplicates will
never actually get played, since 60% into the first song the new
playlist iteration will exclude the duplicates and they will disappear.
It is an annoying visual bug though.
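
The background pass described above might be sketched roughly as
follows. This is an illustration only, not the patch's DuplicateFilter:
the constructor arguments, the `find_near_duplicates` callback and the
`NEAR` lookup table with its ids and distances are hypothetical
stand-ins for calling Mir with a single seed and reading back its ignore
list.

```python
import threading


class DuplicateFilter(threading.Thread):
    """Check each playlist track for near-duplicates in the background."""

    def __init__(self, playlist, find_near_duplicates, excluded, lock):
        super().__init__(daemon=True)
        self.playlist = list(playlist)          # private copy of the track ids
        self.find_near_duplicates = find_near_duplicates
        self.excluded = excluded                # shared dict: track id -> distance
        self.lock = lock                        # guards access to `excluded`

    def run(self):
        for track_id in self.playlist:
            # Hypothetical callback: returns {dupe id: distance} for one seed.
            found = self.find_near_duplicates(track_id)
            with self.lock:
                for dupe_id, dist in found.items():
                    if dupe_id not in self.excluded:
                        self.excluded[dupe_id] = dist


# Made-up data: which tracks fall under the ceiling for each seed.
NEAR = {11: {12: 0.04}, 13: {}, 14: {15: 0.09, 16: 0.11}}

excluded = {}
lock = threading.Lock()
worker = DuplicateFilter([11, 13, 14], lambda t: NEAR.get(t, {}), excluded, lock)
worker.start()
worker.join()   # a real caller would not block; joined here to inspect the result
```

The first playlist is returned immediately and the exclude list fills in
afterwards, which is the trade-off described above: a fast first result
at the cost of duplicates briefly showing in the active playlist.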

This isn't complete just yet, but if Bertrand or Dominik approve of
this route I'd like to tweak it some more and add GUI preferences for
it. I've tried to keep the separation between Banshee.Mirage and
Mirage, and I'd like to build on it and offer a "duplicate finder"
source view that will cluster distances. I don't really have a use for
that but it seems to be a frequent request and I could use the
experience.

http://ezri.org/dupefilter.patch


-Wade "kurros" Menard



Bertrand Lorentz

Feb 12, 2009, 4:21:07 PM
to mirag...@googlegroups.com
On Wed, 2009-02-11 at 12:04 -0800, Wade Menard wrote:
> This patch does several things and fixes issue 9 (Exclude tracks that
> are too similar to the seed tracks) and issue 22 (A track removed
> manually should not be re-added by mirage) and various other things.
<snip>
> http://ezri.org/dupefilter.patch

Thanks for working on this !

I have a few questions and comments about your code :

- You've changed Db.cs to use Mono.Data.Sqlite instead of
Mono.Data.SqliteClient, probably because you had to in order to build
mirage with banshee trunk. I just committed some changes to the build
system so that Mirage doesn't use the banshee references.
We might also switch to Mono.Data.Sqlite but I'd rather do it
separately.

- Is there a specific reason why you're making a copy of the
Mir.IgnoredList almost every time you're accessing it ?

- I think you're trying to do 2 different things with
Mir.SimilarTracks :
a) get a list of tracks similar to the seeds, excluding tracks that
are too similar to the seeds
b) detect if 2 tracks are very similar to each other
It would be better to add a new method to the Mir class for doing b)

- UpdateIgnoreList might also benefit from being split in 2 methods

Please also attach your patch to issue 9 on the issue tracker
(http://code.google.com/p/banshee-unofficial-plugins/issues/detail?id=9)

Cheers,

--
Bertrand Lorentz <bertrand...@gmail.com>
> http://flickr.com/photos/bl8/ <


Wade Menard

Feb 12, 2009, 6:26:53 PM
to Mirage - Automatic Playlist Generation
On Thu, 12 Feb 2009 22:21 +0100, "Bertrand Lorentz"
<bertrand...@gmail.com> wrote:
> On Wed, 2009-02-11 at 12:04 -0800, Wade Menard wrote:
> I have a few questions and comments about your code :
>
> - You've changed Db.cs to use Mono.Data.Sqlite instead of
> Mono.Data.SqliteClient, probably because you had to in order to build
> mirage with banshee trunk. I just committed some changes to the build
> system so that Mirage doesn't use the banshee references.
> We might also switch to Mono.Data.Sqlite but I'd rather do it
> separately.

Excellent. Thank you.

>
> - Is there a specific reason why you're making a copy of the
> Mir.IgnoredList almost every time you're accessing it ?
>

I ran into some race issues where the DuplicateFilter thread would
clear Mir's ignore list before UpdateIgnoreList could finish with it.
DuplicateFilter ships the list to UpdateIgnoreList and then clears it,
which would end up clearing the shallow copy sent to UpdateIgnoreList.
I have made some changes to the order of operations since then, so this
might not be an issue anymore. I will do some more testing tonight.
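
To illustrate the aliasing problem and one possible alternative to
copying on every access, here is a small sketch. The `IgnoreList` class
and its `add`/`drain` methods are hypothetical and are not part of the
patch or of Mir. The idea is to hand the accumulated entries to the
consumer and start a fresh dictionary under a lock, so a later clear
cannot touch what the consumer is still reading.

```python
import threading

# The problem: passing a reference and then clearing empties both views.
shared = {7886: 0.05957413}
alias = shared          # same object, not a copy
shared.clear()          # alias is now empty as well


class IgnoreList:
    """Hypothetical thread-safe holder for ignored tracks."""

    def __init__(self):
        self._lock = threading.Lock()
        self._items = {}

    def add(self, track_id, distance):
        with self._lock:
            self._items[track_id] = distance

    def drain(self):
        """Return everything collected so far and start a fresh dict."""
        with self._lock:
            batch = self._items
            self._items = {}
            return batch


ignore_list = IgnoreList()
ignore_list.add(7886, 0.05957413)
ignore_list.add(8010, 0.1150455)
batch = ignore_list.drain()      # the consumer now owns this dict
ignore_list.add(9000, 0.01)      # made-up later entry; does not affect `batch`
remaining = ignore_list.drain()
```

Because `drain` swaps in a new dictionary instead of clearing the old
one, the consumer never sees its batch emptied underneath it, and no
per-access copy is needed.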

I'm running UpdateIgnoreList after each iteration because I want the
"Considering x to be a duplicate" message to easily know what the seed
track was for the iteration it's checking. Originally I just ran it
once after iterating over the tracks, but I think the message isn't as
useful that way. I know I kind of created a convoluted way to do this,
but I was trying to keep the changes to Mir minimal.

> - I think you're trying to do 2 different things with
> Mir.SimilarTracks :
> a) get a list of tracks similar to the seeds, excluding tracks that
> are too similar to the seeds
> b) detect if 2 tracks are very similar to each other
> It would be better to add a new method to the Mir class for doing b)
>

Well, the flow is like this:

Banshee> asks Mir for a first playlist using the seed(s)
Mir> if dupeceiling is passed, skips and records any tracks whose
distance is under it
Banshee> has first playlist with no dupes of seed (not likely to have
dupes if more than one seed track was used)
Banshee> checks Mir's IgnoredList to see which ones were skipped on
first iteration and adds them to its exclude list
Banshee> starts a DuplicateFilter thread to do a SimilarTracks on each
subsequent track in the playlist to check for duplicates
Banshee> checks Mir's IgnoredList after each track is checked in
DuplicateFilter to add to exclude list.

Mir is basically doing the same thing as always (especially if no
dupeceiling is passed or it's 0), but can be used by Banshee in this
way by checking the ignore list after each SimilarTracks.

> - UpdateIgnoreList might also benefit from being split in 2 methods

What chunk(s) did you see as a candidate for this? I was thinking of
wrapping the !skipped.Exists check and making an AddSkipped method to
consolidate that (it's done in a few places)
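
A helper along the lines of the AddSkipped idea could be as small as the
sketch below. The name `add_skipped`, its signature and the boolean
return value are assumptions for illustration; the actual `skipped`
collection in the patch may be shaped differently.

```python
def add_skipped(skipped, track_id, distance):
    """Record track_id once; return True only if it was newly added."""
    if track_id in skipped:
        return False
    skipped[track_id] = distance
    return True


skipped = {}
first = add_skipped(skipped, 7886, 0.05957413)    # newly added
second = add_skipped(skipped, 7886, 0.05957413)   # already present
```

Returning whether the track was new lets the caller print the
"Considering ... a duplicate" message only once per track.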

>
> Please also attach your patch to issue 9 on the issue tracker
> (http://code.google.com/p/banshee-unofficial-plugins/issues/detail?id=9)
>

I'll regenerate the patch and attach it tonight.

Thank you for your feedback.