Duplicates


Jens

Mar 18, 2010, 12:06:35 PM
to ResourceSpace, jenin...@croptrust.org
What is the current SOP for finding duplicates in ResourceSpace? I've
been reading through the group posts, but there doesn't seem to be any
way to find duplicates en masse. Obviously I am aware that it is
possible to find out whether there is a duplicate of a single image by
searching for the original filename, but it would be much easier if
this could be done for the whole system at once. Is there any way to
quickly scan for duplicates so that all of them can be culled in one
fell swoop?

Cheers

Tom Gleason

Mar 18, 2010, 12:40:21 PM
to resour...@googlegroups.com
There is some code that enables duplicate checking based on a checksum
of the initial file contents, but it is disabled by default and
isn't easily turned on.

You would have to set $file_checksums=true;
then run pages/tools/update_checksums.php

Then run a search for !duplicates, or uncomment the link that is
commented out in pages/team/team_resource.php:

<li><a href="../search.php?search=<?php echo urlencode("!duplicates")?>"><?php echo $lang["viewduplicates"]?></a></li>

This check has worked well for me for finding duplicate files that are
exactly the same, regardless of filename (note it will stop working if
embedded metadata differs from file to file).
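
The idea behind the check is roughly this (a sketch only, not ResourceSpace's actual code - the digest and the exact number of bytes are assumptions):

// Checksum only the initial file contents, so byte-identical files
// match regardless of filename.
function sketch_file_checksum($path)
{
    $fh = fopen($path, 'rb');
    $data = fread($fh, 50000); // "some initial file contents"
    fclose($fh);
    return md5($data);
}

Resources that end up with the same checksum are what the !duplicates search returns.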

I've also been experimenting with visual comparison tools, which
wouldn't care about things like metadata or file format but would
instead find duplicates based on the visual similarity of the
previews. None of that has been developed far enough to release.

Tom


--
Tom Gleason, PHP Developer
DBA Impressive Design

Exploring ResourceSpace at:
http://resourcespace.blogspot.com

Amit Vernekar

Jul 2, 2013, 9:14:33 AM
to resour...@googlegroups.com, jenin...@croptrust.org
Well, you can sort by color, and then it will show the duplicates side by side.

Jeff Harmon

Jul 3, 2013, 6:55:05 PM
to resour...@googlegroups.com, resour...@googlegroups.com, jenin...@croptrust.org
You have to turn on checksums in the config, then run (and schedule as a cron job) the script that calculates the checksums. Then searching for

!duplicates

shows all duplicates.

More details will become apparent as you peruse config_default.
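
For example, a crontab entry along these lines keeps the checksums up to date (the install path here is an assumption - adjust it for your server):

# recalculate checksums nightly at 02:00
0 2 * * * php /var/www/resourcespace/pages/tools/update_checksums.php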

Jeff

--
Jeff Harmon
Chief Executive Officer
Colorhythm LLC

Main Office:  +1 415-399-9921
Mobile:  +1 510-710-9590


Tom Gleason

Jul 3, 2013, 7:01:14 PM
to ResourceSpace
In config:
$file_checksums = true;

then run
pages/tools/update_checksums.php

This only needs to be run once if you haven't generated checksums yet,
so there is no need for a cron job if you set
$file_checksums_offline = false; (which is not the default, for some reason)

I personally haven't experienced any major lag when checksums are
created at upload time (since the default is $file_checksums_50k =
true, which only checksums a portion of the file), but if it is slow,
then $file_checksums_offline = true; and a cron job would work better.
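
Putting that together, the relevant config lines are (see config_default.php for the authoritative descriptions):

$file_checksums = true;           // enable checksum generation
$file_checksums_offline = false;  // calculate at upload time, no cron needed
$file_checksums_50k = true;       // default: only checksum a portion of the file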
--
Tom Gleason

Colorhythm LLC
http://www.colorhythm.com

Main Office: +1 415-399-9921
Fax: +1 253-399-9928
Mobile: +1 347-537-8465

tgle...@colorhythm.com

Jeremy Neech

Feb 27, 2015, 8:58:40 AM
to resour...@googlegroups.com
Hi Tom

This is a very old thread, and I was wondering whether any of the tools for visual image matching and search have been developed further for possible integration into ResourceSpace?

We use it for a variety of clients, and a visual match search would come in very handy.

thanks

Jeremy

Jeff Nova

Feb 27, 2015, 10:18:31 AM
to resour...@googlegroups.com
We've solved this technically but haven't yet been able to make it a priority and develop it due to competing, funded efforts. Is this something you could contribute toward funding?

Best,
Jeff

--
Jeff Nova
Chief Executive Officer
Colorhythm LLC

Main Office:  +1 415-399-9921
Mobile:  +1 510-710-9590


Robert Damrau

Feb 27, 2015, 2:07:42 PM
to resour...@googlegroups.com
I just started writing a plugin to find similar images using this PHP implementation, which looks very promising. My first tests indicate it's pretty fast; maybe have a look at it!

Robert

Roger Howard

Mar 3, 2015, 5:32:56 PM
to resour...@googlegroups.com

On Feb 27, 2015, at 11:07 AM, Robert Damrau <robert...@gmail.com> wrote:

> I just started writing a plugin to find similar images using this PHP implementation, which looks very promising. My first tests indicate it's pretty fast; maybe have a look at it!

I’ve used both pHash (which this library is based on) and blockhash (a more recent variant: http://blockhash.io/).

These both work great for detecting duplicates that differ only by scale and slight tonal variation - e.g. for matching sRGB thumbnails to the original Adobe RGB TIFF. The major challenge is that you have to calculate the Hamming distance between your sample and all existing hashes, so the search cost (CPU, memory) scales linearly with the size of your collection - in other words, twice as many images means the search takes (roughly) twice as long. For really big libraries this gets prohibitive…

There are ways to optimize this, depending on the use case. If you just want to link assets in the database together, you can precalculate/compare in a batch; but if you’re letting users upload an arbitrary image to do Search by Example (a la TinEye.com or Google Image Search) you may run into serious scaling issues.

If you’re implementing Search by Example, I recommend pre-filtering as many images as possible out of the set you’ll compare against - for instance, calculate the aspect ratio of the example image, and don’t bother checking against images whose aspect ratios are too far off… this substantially reduces the set of images you need to compare against.
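
A rough sketch of that pre-filter in PHP (the data layout - an array of resource id => aspect ratio plus 64-bit integer hash - and the 5% tolerance and distance threshold are illustrative assumptions, not a finished implementation):

// Search by example: skip candidates whose aspect ratio is too far off,
// then rank the survivors by Hamming distance between their hashes.
function find_similar($exampleHash, $exampleAspect, array $candidates, $maxDist = 10)
{
    $matches = array();
    foreach ($candidates as $id => $c) {
        // Cheap pre-filter: allow ~5% deviation in aspect ratio.
        if (abs($c['aspect'] - $exampleAspect) / $exampleAspect > 0.05) {
            continue;
        }
        // Hamming distance = number of bits that differ between the hashes.
        $dist = substr_count(decbin($exampleHash ^ $c['hash']), '1');
        if ($dist <= $maxDist) {
            $matches[$id] = $dist;
        }
    }
    asort($matches); // closest matches first
    return $matches;
}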

ImgSeek is orders of magnitude faster and more robust, and not difficult to integrate with (it has a simple JSON web service interface)… but it’s a crusty old project that’s not been maintained much over the years so I don’t recommend it much anymore.

Would be happy to contribute to a project that wants to build similarity tooling into RS… I’ve done this so many times in my career I know a lot of the pitfalls to avoid, and optimizations that can be had.

Robert Damrau

Mar 5, 2015, 5:42:19 AM
to resour...@googlegroups.com
Hi Roger,

Thanks very much for your input! I just started a plugin to find similar images in ResourceSpace (meaning you have one image for which you want to find similar ones). I have no programming experience, so ResourceSpace's procedural PHP is approachable for me.

Like you said, the library works well and the results are pretty good, but I have only tested it on a small image library.
So far, calculating the image hashes is no problem; it can be done on upload and/or as a cron job for existing resources.
Calculating the Hamming distance against all resources in acceptable time is the challenge, like you said. I did some measurements on my local machine: processing time scales linearly at ~0.002s per image (2009 MacBook Pro, C2D 2.5GHz), which would be 200s for 100K images...
I just tested the PHP GMP extension, which has a built-in function for calculating the Hamming distance (gmp_hamdist), and it looks like it is 30-50% faster than the imagehash implementation.
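
For reference, the GMP version is essentially a one-liner (this assumes the hashes are stored as hex strings, and requires the php-gmp extension):

// Hamming distance between two hex-encoded hashes via GMP.
$dist = gmp_hamdist(gmp_init($hashA, 16), gmp_init($hashB, 16));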

If I find the time, I will properly set up a GitHub project for the plugin and hope you'll have a look at it.

Jon Bergh

Mar 5, 2015, 9:07:08 AM
to resour...@googlegroups.com
With any of these tools, would there be a way to leverage the creation of a temp image using a common spec (e.g. 150px, highest-quality JPG, Adobe RGB (1998), etc.), get the hash for that image, and store it in a table?

Then you just create your new temp image and compare.

Or am I reading this wrong, and that's what this already is?!? It sounded to me like it was calculating the hash for every image every time... Regardless, either I'm now on the same page as you all or it's a darn good idea!

Thanks.
-jon

Jeff Nova

Mar 5, 2015, 10:08:58 AM
to resour...@googlegroups.com
That's how perceptual hashes already work - they reduce the complexity of the image and then compute metrics on that reduced form. Perhaps making a simple image upstream would speed up the hash creation, but that's not where the performance issue really shows itself.

The scaling of the comparison has always been the problem with this. I have a solution that works ridiculously fast but unfortunately it cannot be open sourced, by direction of its author. 

PHash Pro does offer a cloud service named Cumulix that addresses the issue as well, I believe. Again, closed source and a web service actually.

You could use Fred Weinhaus's phashconvert and phashcompare scripts as a basis for a home grown solution. The scaling issue would remain.

Best,
Jeff

--
Jeff Nova
Chief Executive Officer
Colorhythm LLC

Main Office:  +1 415-399-9921
Mobile:  +1 510-710-9590

Roger Howard

Mar 5, 2015, 1:01:19 PM
to resour...@googlegroups.com
Yeah, as Jeff points out, while the actual hashing of the image is computationally intense, it pales next to the hash comparisons. There are two main scenarios, and both are brutal:

- Search by example… upload an image, then find all similar images. This is a linear O(n) process - double the image set you compare against and you double the computation time, because for each example image you have to perform a comparison against each target image… however, that’s nothing compared to -
- Find duplicates/similars - if you want to analyze your catalog to, e.g., find and relate similar images, or to weed out derivatives/duplicates, this happens in essentially quadratic time - one comparison per image pair, so a 100K-image catalog means roughly five billion pairs… the amount of work scales very quickly with the size of your catalog

Naturally you can cache results so you really only have to do the comparison once per image pair, but it’s still a massive computational challenge. I mentioned there are many optimizations, but it’s still a difficult problem. And this doesn’t even begin to address sub-image comparisons - identifying one image as having been a crop from another.

Commercial products like Idee’s (TinEye.com) engine invariably use a heavily tuned process - lots and lots of little optimizations they’ve learned over time (I mentioned obvious optimizations like not comparing images with radically different aspect ratios, but that’s just one of many) - plus they have algorithms that are vastly more sophisticated. These are very hard problems, with a lot at stake for those who get it right.

I mentioned ImgSeek - the closest I’ve seen to an affordable (open source) product that’s ready to integrate - however it’s getting very long in the tooth and needs a lot of maintenance work right now. Last time I touched it I wasn’t able to get it to deploy on a current Linux box without a lot of trouble. 

This is an area I keep a constant eye on, as content-based indexing can solve a lot of problems on my projects - but in practice I only ever end up using it offline… for instance, running a large batch job to do one-off comparisons to weed out dupes is practical, but realtime applications (like search by example) not so much. Unless you have a budget, in which case I cannot say enough good things about the uncanny, remarkable engine from the Idee guys.

Robert Damrau

Mar 5, 2015, 1:10:05 PM
to resour...@googlegroups.com
I just realized that MySQL's BIT_COUNT on the XOR of two hashes calculates the Hamming distance. I tested it, and it is nearly 90% faster than doing it in PHP (on my test machine, a 2009 MacBook Pro, C2D 2.5GHz, it now took 0.0028s for 100 images). I can see that it is still a problem for very big catalogs, but for my purpose I think it will do.
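
Given a PDO connection $pdo, the query looks roughly like this (the table and column names are made up for the example; it assumes the hashes are stored in a BIGINT UNSIGNED column):

// XOR the stored hash against the target hash, then let MySQL count
// the differing bits instead of doing it in PHP.
$stmt = $pdo->prepare(
    'SELECT resource, BIT_COUNT(phash ^ :target) AS distance
       FROM resource_phash
     HAVING distance <= :maxdist
     ORDER BY distance'
);
$stmt->execute(array('target' => $targetHash, 'maxdist' => 10));
$similar = $stmt->fetchAll(PDO::FETCH_ASSOC);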

Roger Howard

Mar 5, 2015, 1:16:40 PM
to resour...@googlegroups.com
That’s an excellent optimization - please post to the list if/when you do decide to work on a plugin in earnest - I’d be happy to help, the hard work is largely done and wrapping it in a plugin isn’t difficult.

Cheers -R

Jeff Nova

Mar 5, 2015, 2:18:49 PM
to resour...@googlegroups.com
It really has to be about pre-zoning images so you can comfortably exclude comparisons ahead of time. It's also about using the shortest hash possible.

Jeff



--
Jeff Nova
Chief Executive Officer
Colorhythm LLC

Main Office:  +1 415-399-9921
Mobile:  +1 510-710-9590

Robert Damrau

Mar 16, 2015, 10:07:38 AM
to resour...@googlegroups.com
I don't have much time at the moment, but here is what I started: https://github.com/winkelement/rs_imagehash. Help is welcome (don't be harsh, I'm not a programmer at all).

Robert