Hi Kristen,
Although it’s not quite the question you are asking, you might find the links and information presented here helpful: http://qanda.digipres.org/58/what-techniques-there-detecting-similar-images-large-scale
For example: http://www.imgseek.net/
HTH,
Andy
--
Dr Andrew N Jackson
Web Archiving Technical Lead
The British Library
Tel: 01937 546602
Mobile: 07765 897948
Twitter: @UKWebArchive
--
At a more abstract level, there are several cases to consider:
I. Technical approaches.
1. Programs that hash the whole file.
a) Conceptually different files with different byte content will not match (true negative).
b) Changes to a file's internal metadata will cause a mismatch (false negative).
c) Changes only to file-external metadata (filenames, filesystem timestamps, etc.) will still match (true positive).
d) Files that are conceptually different but have identical bytes will match (false positive).
e) Files that are conceptually the same, but which have been transformed in a way that changes the byte contents, will not match (false negative).
Performance is dominated by disk bandwidth.
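For illustration, a minimal Python sketch of this approach, grouping files by a SHA-256 hash of their full contents (the directory argument and the choice of hash are placeholders, not a recommendation):

    import hashlib
    from collections import defaultdict
    from pathlib import Path

    def find_exact_duplicates(root):
        """Group files under `root` by a SHA-256 hash of their full byte content."""
        groups = defaultdict(list)
        for path in Path(root).rglob("*"):
            if not path.is_file():
                continue
            digest = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
                    digest.update(chunk)
            groups[digest.hexdigest()].append(path)
        # Only groups with more than one member are byte-identical duplicates.
        return {d: paths for d, paths in groups.items() if len(paths) > 1}

Every byte of every file has to be read, which is why disk bandwidth dominates.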
2. Programs that approximately match whole files.
Programs like sdhash may be able to match files that differ in internal metadata if the contents are otherwise unaltered (case 1b).
Performance is typically dominated by disk bandwidth, though CPU time can be a factor in some configurations.
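I haven't scripted sdhash from Python, so as a rough sketch here is the same idea using ssdeep's Python bindings (a different fuzzy-hashing tool, used here only as a stand-in; the filenames are placeholders):

    import ssdeep  # pip install ssdeep; wraps the libfuzzy fuzzy-hashing library

    h1 = ssdeep.hash_from_file("IMG_0001.jpg")
    h2 = ssdeep.hash_from_file("IMG_0001_edited.jpg")

    # compare() returns a similarity score from 0 (no match) to 100 (identical).
    score = ssdeep.compare(h1, h2)
    print(score)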
3. Programs that are EXIF aware.
If EXIF metadata is preserved across transformations, photographs can be grouped using just that metadata. The resulting groups may include transformed images that are not similar enough to count as duplicates.
This approach will also fail to match images whose metadata has not been preserved.
The performance of this approach is dominated by disk seek time.
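As a sketch of that idea, assuming Pillow is available and that grouping on camera make/model plus timestamp is "duplicate enough" for your purposes (the tag choice and the .jpg pattern are assumptions):

    from collections import defaultdict
    from pathlib import Path
    from PIL import Image

    # EXIF tag IDs in the base IFD: 271 = Make, 272 = Model, 306 = DateTime.
    def group_by_exif(root):
        groups = defaultdict(list)
        for path in Path(root).rglob("*.jpg"):
            try:
                exif = Image.open(path).getexif()
            except OSError:
                continue  # unreadable, or not an image Pillow recognizes
            key = (exif.get(271), exif.get(272), exif.get(306))
            if any(key):  # skip files with no usable metadata (see the caveat above)
                groups[key].append(path)
        return groups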
4. Programs that are OS/application software aware.
(I'm not counting image metadata in this category, even where it is extracted by the OS).
If thumbnail images are generated by the camera, the OS, or a commonly used application, there may be a relationship between the name/location of a thumbnail image and the image for which it is a thumbnail.
For example, Linux systems may store thumbnails under names generated from a hash of the original file's location; some cameras generate a thumbnail file with the same basename as the raw file. iPhoto also generates many thumbnail files in predictable places.
This approach is likely to be dominated by disk seek time.
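For the Linux case, the freedesktop.org thumbnail convention names the thumbnail after an MD5 hash of the file's URI; here is a sketch (cache locations and the exact URI encoding vary between distributions and desktop environments, so treat the details as assumptions):

    import hashlib
    from pathlib import Path

    def freedesktop_thumbnail_path(image_path, size="large"):
        """Predict where a freedesktop.org-style thumbnail for image_path would live."""
        uri = Path(image_path).resolve().as_uri()  # e.g. file:///home/me/pic.jpg
        name = hashlib.md5(uri.encode("utf-8")).hexdigest() + ".png"
        return Path.home() / ".cache" / "thumbnails" / size / name

Checking whether such a file exists is a metadata lookup rather than a bulk read, which is why seek time dominates.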
5. Programs that are image content aware.
Programs that are image content aware can detect images that have been transformed in various ways. They may work by extracting characteristic features from the image, or they may use simpler approaches, such as normalizing the images and looking for common tiles.
There are tradeoffs between recall and precision, and this is an active area of research.
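One common feature-extraction shortcut is a perceptual hash; here is a sketch using the Python imagehash library (the filenames and the threshold of 8 bits are arbitrary assumptions, not recommendations):

    import imagehash
    from PIL import Image

    h1 = imagehash.phash(Image.open("scan_a.tif"))
    h2 = imagehash.phash(Image.open("scan_a_resized.jpg"))

    # Hashes are compared by Hamming distance; small distances suggest near-duplicates.
    if h1 - h2 <= 8:
        print("probable near-duplicate")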
ImageMagick can compare two files and optionally look for sub-images; this can be fast for just a few images, but it does not scale well and will not match rotated or radically distorted images.
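For a handful of files, shelling out to ImageMagick's compare tool is the simplest route (the metric choice and filenames are just examples):

    import subprocess

    # compare writes the metric to stderr and exits non-zero when the images differ,
    # so a non-zero exit status is not treated as an error here.
    result = subprocess.run(
        ["compare", "-metric", "RMSE", "a.png", "b.png", "null:"],
        capture_output=True, text=True,
    )
    print(result.stderr.strip())

    # Sub-image search (slow): compare -subimage-search large.png crop.png result.png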
Matchbox uses a variety of techniques and is designed for bulk matching; there are likely to be a number of false positives that require review, especially if there are a lot of bursts in the collection, where the photos are almost identical (switching the selector from single shot to burst fire).
If you have a good graphics card or a Xeon Phi accelerator, Matchbox should be able to take advantage of it; make sure you have the latest OpenCV SDK and up-to-date CUDA or OpenCL drivers. You may want to try the latest OpenCV 3.0 beta, especially if you are not using NVIDIA hardware.
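This isn't Matchbox itself, but as a sketch of the kind of OpenCV feature matching it builds on (the keypoint detector choice and distance threshold are assumptions):

    import cv2

    def orb_match_count(path_a, path_b, max_distance=40):
        """Count ORB keypoint matches between two images; tolerant of rotation and scaling."""
        img_a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
        img_b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)
        orb = cv2.ORB_create()
        _, des_a = orb.detectAndCompute(img_a, None)
        _, des_b = orb.detectAndCompute(img_b, None)
        if des_a is None or des_b is None:
            return 0  # no usable keypoints in one of the images
        matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
        matches = matcher.match(des_a, des_b)
        return sum(1 for m in matches if m.distance < max_distance)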
---
II. Identity, near equivalence, and work clusters.
One tricky issue is dealing with different concepts of what it means to be a duplicate.
Can an image be transformed in some way, yet still be considered, in some sense, to be "the same" as another image?
Cropped, resized, or processed images might be different Expressions of a Work. Format conversion might bring about different Manifestations. [Since there are Illini here I am putting "same" in quotes.]
From a preservation standpoint, there would seem to be little advantage in preserving an object which can be exactly and algorithmically regenerated from another object. For example:
If you have a lossless TIFF and a lossy JPEG, you may only need the lossless form. (This has been argued over before on this list, with good arguments made on both sides.)
If you have a camera raw file and the processing settings from DxO, you can regenerate the processed images (as long as you preserve the software and operating system).
Simon