Digital photograph deduplication?

kristenyt

Dec 1, 2014, 12:11:19 PM

to digital-...@googlegroups.com

Hello all,

I'm working on a large collection of born-digital photographs, and I'm coming across quite a bit of duplication. Does anyone have any recommendations for user-friendly tools for batch deduplication of digital images? In this case, the image files (all JPEG) will have different file names and possibly different dates of creation, but they'd have the same pixels. Ideally, I'd also want to identify near duplicates (e.g., the same image but resized or cropped).

Thanks in advance for any suggestions!
Kristen

Paul Wheatley

Dec 1, 2014, 1:14:38 PM

to digital-...@googlegroups.com

Hi Kristen,

There are quite a few options out there for identifying exact duplicates, and it's quite easy to do with a simple hashing tool:

http://coptr.digipres.org/Category:Fixity

Or there are some dedicated de-dupe tools listed here:

http://coptr.digipres.org/Category:De-Duplication

For the images that aren't exact duplicates, you should try Matchbox, which the SCAPE Project designed for just the use case you are describing.

If you have good and/or bad experiences with a tool, or find anything else of value, please add to COPTR to help others:

http://coptr.digipres.org/

Cheers

Paul

______________

Paul Wheatley Consulting Limited

Digital Preservation Services

@prwheatley

http://bit.ly/paulrobertwheatley

http://openpreservation.org/knowledge/blogs/author/paul/

Jackson, Andy

Dec 2, 2014, 7:44:10 AM

to digital-...@googlegroups.com

Hi Kristen,

Although it’s not quite the question you are asking, you might find the links and information presented here helpful: http://qanda.digipres.org/58/what-techniques-there-detecting-similar-images-large-scale

For example: http://www.imgseek.net/

HTH,

Andy

--

Dr Andrew N Jackson

Web Archiving Technical Lead

The British Library

Tel: 01937 546602

Mobile: 07765 897948

Web: www.webarchive.org.uk

Twitter: @UKWebArchive

Carol Kussmann

Dec 2, 2014, 11:39:43 AM

to digital-...@googlegroups.com

I have used Duplicate File Finder. The free version uses some sort of hash to determine matches, so the file names can be different; I don't know which hashing method the free version uses. (The paid version uses SHA-1.) If you search for it, you will find two programs that have the same name and are very similar; the one linked below is the one I have used. The program also lets you delete the duplicates, which comes in handy because you don't have to go and find them yourself. (I've only used the free version so far.)

Guide with screenshots and link to program site: https://docs.google.com/a/umn.edu/file/d/0B8MvBJV_5_s5Sk5iYzJJSkpMNlU/edit

Program page: http://www.ashisoft.com/

- Carol

kristenyt

Dec 2, 2014, 1:46:05 PM

to digital-...@googlegroups.com

Thanks so much, everyone! That gives me some good options to try. I really appreciate your help!

Simon Spero

Dec 2, 2014, 5:26:59 PM

to digital-...@googlegroups.com

At a more abstract level, there are several cases to consider:

I. Technical approaches.

1. Programs that hash the whole file.

a) Conceptually different files with different byte content will not match (true negative).
b) Changes to file-internal metadata will cause a mismatch (false negative).
c) Changes only to file-external metadata (filenames, filesystem timestamps, etc.) will still match (true positive).
d) Files that are conceptually different but have identical bytes will match (false positive).
e) Files that are conceptually the same, but which have been transformed in a way that changes the byte contents, will not match (false negative).

Performance is dominated by disk bandwidth.
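
As a minimal sketch of whole-file hashing (an illustration only, not how any tool named in this thread works): the Python below groups files under a directory by the SHA-256 digest of their bytes. The function names and the choice of SHA-256 are assumptions made for the example. Files are first bucketed by size so that only files with a size twin are read at all.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_exact_duplicates(root):
    """Map digest -> sorted paths, keeping only digests shared by 2+ files."""
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    groups = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a file with a unique size has no byte-identical twin
        for path in paths:
            groups[file_digest(path)].append(path)
    return {d: sorted(p) for d, p in groups.items() if len(p) > 1}
```

File names and filesystem timestamps never enter the digest, which is why case (c) matches, while any change inside the file, including embedded metadata, produces a different digest (cases b and e).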

2. Programs that approximately match whole files

Programs like sdhash may be able to match files that differ in internal metadata if the contents are otherwise unaltered (case b).

Performance is typically dominated by disk bandwidth, though CPU time can be a factor in some configurations.
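
For JPEGs specifically, case (b) can also be handled with a much simpler technique than the similarity digests sdhash computes: hash the file while skipping the segments that usually carry metadata. The sketch below is that simpler technique, not sdhash. It assumes a well-formed baseline JPEG and treats every APPn (0xFFE0 to 0xFFEF) and COM (0xFFFE) segment as ignorable, which is an assumption made for the example; some APPn segments (ICC profiles, Adobe colour information) do affect how the pixels are rendered.

```python
import hashlib
import struct


def jpeg_content_digest(data):
    """SHA-256 of a JPEG byte string, ignoring APPn and COM segments."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    h = hashlib.sha256()
    pos = 2  # first segment starts right after SOI
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected a marker at offset %d" % pos)
        marker = data[pos + 1]
        if marker == 0xDA:
            # SOS: from here on it is compressed scan data, so hash the rest.
            h.update(data[pos:])
            return h.hexdigest()
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        is_metadata = 0xE0 <= marker <= 0xEF or marker == 0xFE
        if not is_metadata:
            h.update(data[pos:pos + 2 + length])
        pos += 2 + length
    raise ValueError("no SOS marker found")
```

Two copies of a photograph whose EXIF block or comment was edited would then share a digest, while any re-encoding, resize, or crop would still produce a different one.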

3. Programs that are EXIF aware.

If EXIF metadata is preserved across transformations, photographs can be grouped using just that metadata. The resulting groups may include transformed images that might not be duplicate enough.
This approach will also fail to detect images whose metadata has not been preserved.

The performance of this approach is dominated by disk seek time.
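
A rough sketch of EXIF-based grouping follows. It assumes the metadata has already been extracted into one dictionary per file by some other tool, and that make, model, and capture time together identify a shot. The tag names are standard EXIF tags, but the choice of key and the function name are assumptions for the example.

```python
from collections import defaultdict

# Standard EXIF tag names; treating them as an identity key is an assumption.
KEY_TAGS = ("Make", "Model", "DateTimeOriginal", "SubSecTimeOriginal")


def group_by_exif(records):
    """Group (path, tags) pairs by key tags.

    Returns (candidates, unmatched): candidates is a list of path lists
    that share a key; unmatched lists files with no capture time at all.
    """
    groups = defaultdict(list)
    unmatched = []
    for path, tags in records:
        if not tags.get("DateTimeOriginal"):
            unmatched.append(path)  # metadata stripped or never present
            continue
        key = tuple(tags.get(tag, "") for tag in KEY_TAGS)
        groups[key].append(path)
    candidates = [sorted(paths) for paths in groups.values() if len(paths) > 1]
    return candidates, unmatched
```

Each candidate group still needs review: a resized or heavily edited derivative that kept its EXIF block lands in the same group as the original, and the unmatched list is exactly the set of files this approach cannot say anything about.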

4. Programs that are OS/application software aware
(I'm not counting image metadata in this category, even where it is extracted by the OS).

If thumbnail images are generated by the camera, the OS or a commonly used application, there may be a relationship between the name/location of a thumbnail image, and the image for which it is a thumbnail.

For example, Linux systems may store thumbnails using names generated from a hash of the original file's location; some cameras generate a thumbnail file with the same basename as the raw file. iPhoto also generates a lot of thumbnail files in predictable places.

This approach is likely to be dominated by disk seek time.
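
To illustrate the Linux case: under the freedesktop.org thumbnail convention, the thumbnail file is named after the MD5 hash of the original file's URI and stored as a PNG in a per-size directory of the user's cache. The sketch below computes where such a thumbnail would be expected. The function name is hypothetical, and Python's URI escaping may differ from what a given desktop environment produces for unusual file names, so treat a miss as inconclusive.

```python
import hashlib
import os
from pathlib import Path


def thumbnail_candidates(image_path, cache_home=None):
    """Return the expected freedesktop thumbnail paths for image_path."""
    if cache_home is None:
        cache_home = os.environ.get("XDG_CACHE_HOME") or os.path.join(
            os.path.expanduser("~"), ".cache"
        )
    # The thumbnail name is the MD5 of the file:// URI, not of the file contents.
    uri = Path(os.path.abspath(image_path)).as_uri()
    name = hashlib.md5(uri.encode("utf-8")).hexdigest() + ".png"
    return [
        os.path.join(cache_home, "thumbnails", size, name)
        for size in ("normal", "large")
    ]
```

Because the hash is one-way, going from a thumbnail back to its original means computing these names for every candidate image and looking them up, which is where the seek-heavy access pattern comes from.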

5. Programs that are image content aware.

Programs that are image content aware can detect images that have been transformed in various ways. They may work by extracting characteristic features from the image, or they may use simpler approaches, such as normalizing the images and looking for common tiles.

There are tradeoffs between recall and precision, and this is an active area of research.
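
One of the simplest content-aware techniques is an "average hash": shrink the image to a tiny grayscale grid, record which cells are brighter than the mean, and compare the resulting bit patterns by Hamming distance. The sketch below is only meant to show the idea and is not what Matchbox or imgSeek do. It assumes the image has already been decoded into a 2D list of grayscale values by some other library, and the function names are made up for the example.

```python
def average_hash(pixels, size=8):
    """Return a size*size-bit integer fingerprint of a grayscale image.

    pixels is a list of rows, each a list of brightness values.
    """
    height, width = len(pixels), len(pixels[0])
    if height < size or width < size:
        raise ValueError("image is smaller than the hash grid")
    cells = []
    for gy in range(size):
        y0, y1 = gy * height // size, (gy + 1) * height // size
        for gx in range(size):
            x0, x1 = gx * width // size, (gx + 1) * width // size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))  # mean brightness of the cell
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

A small distance suggests a resized or recompressed copy; the threshold is the recall/precision dial mentioned above. This kind of hash tolerates resizing well but not cropping or rotation, which need feature-based methods.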

ImageMagick can compare two files and optionally look for sub-images. This can be fast for just a few images, but it does not scale well and will not match rotated or radically distorted images.

Matchbox uses a variety of techniques and is designed for bulk matching. There are likely to be a number of false positives that require review, especially if the collection contains a lot of bursts, where the photos are almost identical (moving the selector from Einzelfeuer to Feuerstoß).

If you have a good graphics card or a Xeon Phi accelerator, Matchbox should be able to take advantage of it. Make sure you have the latest OpenCV SDK and up-to-date CUDA or OpenCL drivers. You may want to try the latest 3.0 beta, especially if you are not using NVIDIA hardware.

---

II. Identity, near equivalence, and work clusters.

One tricky issue is dealing with different concepts of what it means to be a duplicate.

Can an image be transformed in some way and still be considered, in some sense, "the same" as another image?
Cropped, resized, or processed images might be different Expressions of a Work. Format conversion might bring about different Manifestations. [Since there are Illini here I am putting "same" in quotes.]

From a preservation standpoint, there would seem to be little advantage in preserving an object that can be exactly and algorithmically regenerated from another object. For example:

If you have a lossless TIFF and a lossy JPEG, you may only need the lossless form. This has been argued over before on this list, with good arguments made on both sides.

If you have a camera raw file and the processing settings from DxO, you can regenerate the processed images (as long as you preserve the software and operating system).

Simon

kristenyt

Dec 3, 2014, 11:13:10 AM

to digital-...@googlegroups.com

Simon, your explanation is magnificent -- it really helps to see it broken down that way. Thank you so much!

On Monday, December 1, 2014 12:11:19 PM UTC-5, kristenyt wrote:
