List Duplicate Finder


Epicuro Kishore

Aug 5, 2024, 11:53:19 AM
to dholophsoper
Find duplicate lines of text in a list. The duplicate lines are returned in a separate list, ordered by when each line was determined to be a duplicate. Each entry in the list should be separated by a newline.
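The behaviour described above (report each duplicate once, in the order it was first detected) can be sketched in Python; this is a hedged illustration, not the tool's actual implementation:

```python
def find_duplicates(lines):
    """Return lines that appear more than once, ordered by when each
    line was first determined to be a duplicate."""
    seen = set()       # every line encountered so far
    reported = set()   # duplicates already reported, so each appears once
    duplicates = []
    for line in lines:
        if line in seen and line not in reported:
            duplicates.append(line)
            reported.add(line)
        seen.add(line)
    return duplicates

text = "apple\nbanana\napple\ncherry\nbanana\napple"
print("\n".join(find_duplicates(text.split("\n"))))  # apple, then banana
```

"apple" is reported first because its second occurrence comes before "banana"'s second occurrence, matching the detection-order requirement.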

Type or copy-paste items into the List field. dCode detects whether the list is a list of terms (one per line) or a list of words (with a separator) and finds redundant items repeated more than once in order to de-duplicate them.


Be careful to take into account all writing variants of the same element: ignoring accents and diacritics allows words like item and ítem to be treated as the same word twice. Likewise, ignoring uppercase and lowercase allows item and ITEM to be detected as duplicates of the same word and deleted.
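Such normalization (strip diacritics, then fold case) can be sketched with Python's standard library; this is one common approach, not necessarily what the tool itself does:

```python
import unicodedata

def normalize(word):
    """Normalize a word so accent and case variants compare equal."""
    # NFKD decomposition splits accented characters into base + combining mark
    decomposed = unicodedata.normalize("NFKD", word)
    # drop the combining marks (the accents/diacritics)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive lower() suited for comparisons
    return stripped.casefold()

print(normalize("ítem"), normalize("ITEM"))  # item item
```

Comparing `normalize(a) == normalize(b)` then treats "item", "ítem", and "ITEM" as the same word.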


This free, online JavaScript tool eliminates duplicates and lists the distinct values in any web browser. Every line that is repeated is counted and displayed. This form is secure because your list of values is processed by JavaScript on your device only.


New: You can hide or show the counts column. You can also see all lines in the results, or just the lines with duplicates. Lines with duplicates are those that occur two (2) or more times. The TAB output format option allows you to directly copy and paste the results into Excel or LibreOffice Calc in spreadsheet format. Click the Copy Results to Clipboard button, then press Ctrl+V in your spreadsheet. (You do not have to use the Data > Text to Columns button.)
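A counts column with TAB-separated output (so rows paste straight into a spreadsheet) can be sketched like this; the option names mirror the tool's description but the code is illustrative only:

```python
from collections import Counter

def duplicate_report(lines, only_duplicates=True, show_counts=True):
    """Count each line; emit TAB-separated rows suitable for pasting
    into Excel or LibreOffice Calc."""
    counts = Counter(lines)  # Counter preserves first-seen order
    rows = []
    for line, n in counts.items():
        if only_duplicates and n < 2:
            continue  # skip lines that occur only once
        rows.append(f"{line}\t{n}" if show_counts else line)
    return "\n".join(rows)

print(duplicate_report(["a", "b", "a", "c", "b", "a"]))
```

With the defaults this prints `a<TAB>3` and `b<TAB>2`; "c" is omitted because it occurs only once.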


Use the Clear button to clear the text input. Use the Restore button to restore the last cleared text input. (For your security, the data is saved in a JavaScript variable only, and will be cleared as soon as you leave the page.)


That would be what you use to determine a duplicate. I use tags like Artist and Title, but you could use the filename. I then use the sort capability of the tool, in my case MS Excel, to sort the list in Artist/Title order.


There are various functions that you can use, somewhat like the scripting in MP3Tag, that allow you to compare two rows in a list to see if the fields in each are a match. If they are, you can flag the duplicate row (record) for deletion.
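The sort-then-compare-adjacent-rows idea can be sketched in Python (a hedged stand-in for the Excel formulas the poster describes; the tuple layout is my assumption):

```python
def flag_duplicate_rows(rows):
    """rows: list of (artist, title, filename) tuples.
    Sort by artist/title, then compare each row with the previous one,
    flagging the later row when both fields match (case-insensitively)."""
    ordered = sorted(rows, key=lambda r: (r[0].casefold(), r[1].casefold()))
    flagged = []
    for prev, cur in zip(ordered, ordered[1:]):
        if (prev[0].casefold() == cur[0].casefold()
                and prev[1].casefold() == cur[1].casefold()):
            flagged.append(cur)  # flag this record for deletion
    return flagged

rows = [
    ("Beatles", "Yesterday", "y1.mp3"),
    ("beatles", "yesterday", "y2.mp3"),
    ("Queen", "Bohemian Rhapsody", "b.mp3"),
]
print(flag_duplicate_rows(rows))  # the second Yesterday record
```

Sorting first guarantees that matching records sit next to each other, so a single pass over adjacent pairs finds them, just like comparing row N with row N-1 in a sorted spreadsheet.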


I would still like to add that a strict comparison between two lines in an Excel table shows only exact matches, leaving out all those files with spelling mistakes in any of the compared data fields. In that case, duplicates would be left in the collection.


Or, the other way round: the Excel method could also indicate duplicates that are not duplicates at all - e.g. if the tag data is incomplete and misses vital information. In that case, files would be deleted that really should be kept and have their tag data updated so that they can be distinguished from the other files.
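The spelling-mistake problem above is why fuzzy matching is sometimes used instead of strict equality. A minimal sketch with Python's `difflib` (the 0.85 threshold is an arbitrary assumption to tune, and any threshold trades missed duplicates against false positives):

```python
from difflib import SequenceMatcher

def fuzzy_match(a, b, threshold=0.85):
    """True if two strings are similar enough to be suspected duplicates,
    tolerating minor spelling mistakes. Not a guarantee either way."""
    ratio = SequenceMatcher(None, a.casefold(), b.casefold()).ratio()
    return ratio >= threshold

print(fuzzy_match("Yesterday", "Yesterdy"))  # True: one dropped letter
print(fuzzy_match("Yesterday", "Let It Be"))  # False: different titles
```

This catches near-matches that strict comparison misses, but, as the posts above note, it can also flag non-duplicates, so flagged pairs still deserve a human look before deletion.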


I agree with ohrenkino. There is no perfect way to find all true duplicates within a list of imperfect data. That being said, preparation and standardization of the tags in all your files will go a long way toward making it possible to do a reasonable job, or at least get you a list. That is what I do, and the attached file describes the steps I go through to do that. It works for me but may not be enough for others. If you can use it, great; if not, I tried. Good luck to all.

How to find Duplicates with MP3Tag and MS Excel.zip (1.4 MB)


Same question again:

why use MP3Tag for this? Again: it's a really cool tool, but it's not designed for doing that.

There are many specialized tools for that, doing fuzzy search and audio comparison. (scroll up)


What I was surprised to see in all the comments/replies about the issue of 'duplicate files' (surely an issue that all of us can relate to, certainly in my case) is that no one seemed to suggest other such specialised tools/apps.


Certainly you can do it 'visually' by merging dirs and sorting by name. One advantage that MP3Tag has is that it displays all the relevant tag values, e.g. size, length, bit rate, etc., to make a visual comparison more accurate.


Beyond Compare 4 is a sync app, and is a wonderful tool for comparing files in different directories (it has an often very useful option to list files on both sides without regard to the sub-dir structure, plus many other useful features such as filters). This is an app I use almost every day. BTW, I find the way it displays the dirs/files side by side FAR easier to understand/check than the many other such tools that use a vertical list.


Recently I finally forked out (US$30) for Audio Dedup by MindGems - and it gets the most amazing results. It found all these dups with quite different names/artists that were actually the same, and it will display all the tag field values for closer inspection and decisions about what to keep and what to delete. But be aware: I have unexpectedly found some groups of false positives, i.e. dups that were not dups, so some caution is advised before just using Delete All - moving (instead of deleting) into a Deleted folder is often good advice. This is the BEST app I have found for all sorts of audio files; I would have to check again whether it does MP4 too.


And surprisingly, I found that another MindGems app I already had but was not using - Fast Duplicate File Finder (FDFF) - actually does a really great job on audio (and photos) as well as normal files (with the same caveat as above).


And finally, a specialised dup app that can be used for files of all sorts and has options for images, audio, and video, and that I used constantly for both audio (until I got Audio Dedup) and images (and 'normal' files), is AllDup - a wonderful all-purpose dup app, and free if I remember correctly. But now maybe I will use the previous two apps in preference.


There are a number of deduplication utilities that compare audio fingerprints across various bitrates and filetypes, along with other complex ways of comparing media (such as ignoring metadata and comparing only the audio/video stream). Based on the size and complexity of such programs, it doesn't seem an ideal fit for a metadata editor.


AI can do a lot of things, but be aware, it is not "the answer to life the universe and everything".

AI often fails, and explains wrong results with wrong facts.

Right now, nobody knows how to fix that.


Another problem is that an AI network just does what it was trained for - there is no intelligence. That may cause plausible but fatally wrong results if something unexpected happens, such as unexpected input data where any human would say, "hey, there's something strange here".



I totally love how good this thing is at finding similar songs without ID3 or any titles in file names!

Finding similar ID3 tags is very good too - it doesn't do just partial matching as even if words in the ID3 are rearranged it still recognizes that the title is similar.

My audio library has never been so organized

I got the PRO version - it is worth every cent and has so many features.


I've seen "duplicate" scenarios before, even a bunch of Mac apps that help to find and/or get rid of duplicate files on the hard disk in order to gain some space. That is not my case this time, though, and by the way, this is NOT a weird case. I'm sure many of us have found ourselves in a similar situation every once in a while.


I can go through the process of checking each file, one by one, and once I find its copy in folder (2), delete it from folder (1). That way I will end up with a tiny little folder (1) containing only the few files whose copy couldn't be found in folder (2).


30,000 is a lot of files to have in one directory. From what you said I expected it to be large, which is why I opted for a find in the shell rather than the Finder, but even then there are limitations.


(with a corresponding 'end repeat' at the end). This will break the list of 30,000 files into (hopefully) more manageable chunks based on the first character - you can change that to break on any parameters you like if you need to.
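The chunking idea (split 30,000 files into groups keyed on the first character) can be sketched in Python; a hedged illustration, not the poster's AppleScript:

```python
from collections import defaultdict

def chunk_by_first_char(filenames):
    """Group filenames into chunks keyed on their first character,
    so a huge flat listing becomes several smaller, manageable lists."""
    chunks = defaultdict(list)
    for name in filenames:
        key = name[0].upper() if name else ""  # case-insensitive bucket
        chunks[key].append(name)
    return dict(chunks)

files = ["apple.mp3", "Avocado.mp3", "banana.mp3"]
print(chunk_by_first_char(files))
```

Each chunk can then be processed independently, which keeps any single pass over the data small; the grouping key could just as easily be the first two characters, the extension, or any other parameter.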


As always with these kinds of questions, though, the devil is in the details. In this case, what - specifically - constitutes a match? Are files with the same name considered a match? What if one is newer/older/larger/smaller than the other? Are they still considered a match? Which one should you keep? The newest? The biggest?


The only action should be to delete (or not) the file in folder (1), not to choose which one to keep by comparing them. And I believe a match should be considered when two things match: the name (including the extension) and the size. Can you write the script? I just don't know how to code.


The following script (minimally tested!) should do what you want. Copy the script into a new Script Editor document and run it. (The delete command is actually a misnomer... it only moves the files to the trash, so there's still a chance of recovery.)


It starts off by prompting for two folders - the first should be the one that contains the flat directory that you want to clean up (folder1). The second should be the one with the hierarchical directories you want to search in (folder2).


Then it uses the shell command find to get a listing of all the files in folder2. I do this because the Finder is notoriously slow at traversing large directory trees, so even though using the Finder would be simpler, it would be much slower.


Here I use another shell trick. I first get the size of the current file. I then perform another find to look for a file that has the same size as the current file. If I get back an empty list, I know the file sizes differ, so I leave the file alone; but if the sizes match, I know it's safe to delete the file.


I know this may sound convoluted, but for a large directory tree, with a large number of files in folder1, it would be cumbersome/unwieldy to perform a full depth traversal of folder2 for every file, so I first cache the list of file names and just do a secondary search for those that have matching filenames.
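The cache-then-match strategy described above can be sketched in Python (a hedged stand-in for the original AppleScript/shell approach; this version only reports matches and leaves moving files to the trash to the caller):

```python
import os

def find_safe_to_delete(folder1, folder2):
    """Return paths in flat folder1 whose (name, size) pair also appears
    somewhere under folder2's directory tree."""
    # Cache (name, size) pairs from folder2 once, instead of re-walking
    # the whole tree for every one of folder1's files.
    cache = set()
    for dirpath, _dirs, files in os.walk(folder2):
        for name in files:
            cache.add((name, os.path.getsize(os.path.join(dirpath, name))))

    matches = []
    for name in sorted(os.listdir(folder1)):
        path = os.path.join(folder1, name)
        if os.path.isfile(path) and (name, os.path.getsize(path)) in cache:
            matches.append(path)  # a copy exists in folder2: candidate
    return matches
```

With the cache, the expensive tree traversal of folder2 happens exactly once, and each of folder1's files costs only a set lookup - the same reasoning the post gives for avoiding a full-depth traversal per file.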


Thank you very, VERY much for your kindness and the time invested in such altruistic help. I hope this little script helps not only me but other people as well. I'm running it as I write; according to my calculations it will take about a week to finish (since I hear the little trash sound every time it deletes a file), but it is obviously much better than doing it manually.
