Finding duplicate file names and modifying them based on elements of the path

larry....@gmail.com

Jul 18, 2012, 6:20:51 PM

I have an interesting problem I'm trying to solve. I have a solution
almost working, but it's super ugly, and I know there has to be a
better, cleaner way to do it.

I have a list of path names that have this form:

/dir0/dir1/dir2/dir3/dir4/dir5/dir6/file

I need to find all the file names (basenames) in the list that are
duplicates, and for each one that is a dup, prepend dir4 to the
filename as long as the dir4/file pair is unique. If there are
multiple dir4/files in the list, then I also need to add a sequence
number based on the sorted value of dir5 (which is a date in ddMONyy
format).

For example, if my list contains:

/dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/file3
/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file1
/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2
/dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/file1
/dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/file3

Then I want to end up with:

/dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/qwer_01_file3
/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/abcd_file1
/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2
/dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/xyz_file1
/dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/qwer_00_file3

My solution involves multiple maps and multiple iterations through the
data. How would you folks do this?

Paul Rubin

Jul 18, 2012, 6:49:14 PM

"Larry....@gmail.com" <larry....@gmail.com> writes:
> I have an interesting problem I'm trying to solve. I have a solution
> almost working, but it's super ugly, and know there has to be a
> better, cleaner way to do it. ...
>
> My solution involves multiple maps and multiple iterations through the
> data. How would you folks do this?

You could post your code and ask for suggestions how to improve it.
There are a lot of not-so-natural constraints in that problem, so it
stands to reason that the code will be a bit messy. The whole
specification seems like an antipattern though. You should just give a
sensible encoding for the filename regardless of whether other fields
are duplicated or not. You also don't seem to address the case where
basename, dir4, and dir5 are all duplicated.

The approach I'd take for the spec as you wrote it is:

1. Sort the list on the (basename, dir4, dir5) triple, saving original
location (numeric index) of each item
2. Use itertools.groupby to group together duplicate basenames.
3. Within the groups, use groupby again to gather duplicate dir4's,
4. Within -those- groups, group by dir5 and assign sequence numbers in
groups where there's more than one file
5. Unsort to get the rewritten items back into the original order.

Actual code is left as an exercise.
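
A minimal sketch of the five steps above, not code posted in the thread. It assumes every path has exactly the form shown in the original question, that dir5 parses with "%d%b%y" under an English locale, and it simplifies step 4 by numbering the entries of a repeated dir4 in date order. The names rename_duplicates and sort_key are hypothetical.

```python
from datetime import datetime
from itertools import groupby

paths = [
    "/dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/file3",
    "/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file1",
    "/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2",
    "/dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/file1",
    "/dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/file3",
]

def sort_key(item):
    """Key for an (index, path) pair: (basename, dir4, parsed dir5 date)."""
    parts = item[1].split("/")
    return parts[-1], parts[5], datetime.strptime(parts[6], "%d%b%y")

def rename_duplicates(paths):
    result = list(paths)
    # Step 1: sort, remembering each path's original position.
    ordered = sorted(enumerate(paths), key=sort_key)
    # Step 2: group by basename.
    for name, name_group in groupby(ordered, key=lambda item: sort_key(item)[0]):
        name_items = list(name_group)  # materialise so len() works
        if len(name_items) == 1:
            continue  # unique basename, leave it alone
        # Step 3: group by dir4 within one basename.
        for dir4, dir4_group in groupby(name_items, key=lambda item: sort_key(item)[1]):
            dir4_items = list(dir4_group)
            # Step 4 (simplified): items are already in date order,
            # so enumerate() supplies the sequence number.
            for seq, (index, path) in enumerate(dir4_items):
                head = path.rsplit("/", 1)[0]
                if len(dir4_items) > 1:
                    new_name = "{0}_{1:02d}_{2}".format(dir4, seq, name)
                else:
                    new_name = "{0}_{1}".format(dir4, name)
                # Step 5: write back at the original position.
                result[index] = head + "/" + new_name
    return result

renamed = rename_duplicates(paths)
```

Because each group is turned into a list before it is inspected, the group sizes are known without a separate counting pass.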

Simon Cropper

Jul 18, 2012, 8:36:27 PM

to pytho...@python.org

Hi Larry,

I am making the assumption that you intend to collapse the directory
tree and store each file in the same directory, otherwise I can't think
of why you need to do this.

If this is the case, then I would...

1. import all the files into an array
2. parse path to extract fourth-level directory name and base name.
3. reiterate through the array
   3.1 check if base filename exists in recipient directory
   3.2 if not, copy to recipient directory
   3.3 if present, append the directory path then save
   3.4 create log of success or failure

Personally, I would not have some files with abcd_file1 and others as
file2 because if it is important enough to store a file in a separate
directory you should also note where file2 came from as well. When
looking at your results at a later date you are going to have to open
file2 (which I presume must record where it relates to) to figure out
where it came from. If it is in the name it is easier to review.

In short, consistency is the name of the game; if you are going to do it
for some then do it for all; and finally it will be easier for others
later to work out what you have done.

--
Cheers Simon


larry....@gmail.com

Jul 19, 2012, 2:54:56 PM

On Jul 18, 6:36 pm, Simon Cropper
<simoncrop...@fossworkflowguides.com> wrote:

Hi Simon, thanks for the reply. It's not quite this - what I am doing
is creating a zip file with relative path names, and if there are
duplicate files, the parts of the path that are not carried over
need to get prepended to the file names to make them unique.

>
> If this is the case, then I would...
>
> 1. import all the files into an array
> 2. parse path to extract fourth-level directory name and base name.
> 3. reiterate through the array
>    3.1 check if base filename exists in recipient directory
>    3.2 if not, copy to recipient directory
>    3.3 if present, append the directory path then save
>    3.4 create log of success or failure
>
> Personally, I would not have some files with abcd_file1 and others as
> file2 because if it is important enough to store a file in a separate
> directory you should also note where file2 came from as well. When
> looking at your results at a later date you are going to have to open
> file2 (which I presume must record where it relates to) to figure out
> where it came from. If it is in the name it is easier to review.
>
> In short, consistency is the name of the game; if you are going to do it
> for some then do it for all; and finally it will be easier for others
> later to work out what you have done.

Yeah, I know, but this is for a client, and this is what they want.

larry....@gmail.com

Jul 19, 2012, 3:00:46 PM

On Jul 18, 4:49 pm, Paul Rubin <no.em...@nospam.invalid> wrote:

I replied to this before, but I don't see it, so if this is a duplicate,
sorry.

Thanks for the reply Paul. I had not heard of itertools. It sounds
like just what I need for this. But I am having 1 issue - how do you
know how many items are in each group? Without knowing that I have to
either make 2 passes through the data, or else work on the previous
item (when I'm in an iteration after the first then I know I have
dups). But that very quickly gets crazy with trying to keep the
previous values.

Prasad, Ramit

Jul 19, 2012, 3:02:37 PM

to pytho...@python.org

> > I am making the assumption that you intend to collapse the directory
> > tree and store each file in the same directory, otherwise I can't think
> > of why you need to do this.
>
> Hi Simon, thanks for the reply. It's not quite this - what I am doing
> is creating a zip file with relative path names, and if there are
> duplicate files, the parts of the path that are not carried over
> need to get prepended to the file names to make them unique.

Depending on the file system of the client, you can hit file name
length limits. I would think it would be better to just create
the full structure in the zip.

Just something to keep in mind, especially if you see funky behavior.

Ramit


larry....@gmail.com

Jul 19, 2012, 3:06:20 PM

On Jul 19, 1:02 pm, "Prasad, Ramit" <ramit.pra...@jpmorgan.com> wrote:
> > > I am making the assumption that you intend to collapse the directory
> > > tree and store each file in the same directory, otherwise I can't think
> > > of why you need to do this.
>
> > Hi Simon, thanks for the reply. It's not quite this - what I am doing
> > is creating a zip file with relative path names, and if there are
> > duplicate files, the parts of the path that are not carried over
> > need to get prepended to the file names to make them unique.
>
> Depending on the file system of the client, you can hit file name
> length limits. I would think it would be better to just create
> the full structure in the zip.
>
> Just something to keep in mind, especially if you see funky behavior.

Thanks, but it's not what the client wants.

larry....@gmail.com

Jul 19, 2012, 2:52:09 PM

On Jul 18, 4:49 pm, Paul Rubin <no.em...@nospam.invalid> wrote:

Thanks very much for the reply Paul. I did not know about itertools.
This seems like it will be perfect for me. But I'm having 1 issue, how
do I know how many of a given basename (and similarly how many
basename/dir4s) there are? I don't know that I have to modify a file
until I've passed it, so I have to do all kinds of contortions to save
the previous one, and deal with the last one after I fall out of the
loop, and it's getting very nasty.

reports_list is the list sorted on basename, dir4, dir5 (tool is dir4,
file_date is dir5):

for file, file_group in groupby(reports_list, lambda x: x[0]):
    # if file is unique in file_group do nothing, but how can I tell
    # if file is unique?
    for tool, tool_group in groupby(file_group, lambda x: x[1]):
        # if tool is unique for file, change file to tool_file, but
        # how can I tell if tool is unique for file?
        for file_date, file_date_group in groupby(tool_group, lambda x: x[2]):
            pass

You can't do a len on the iterator that is returned from groupby, and
I've tried to do something with imap or defaultdict, but I'm not
getting anywhere. I guess I can just make 2 passes through the data,
the first time getting counts. Or am I missing something about how
groupby works?

Thanks!
-larry

Paul Rubin

Jul 19, 2012, 3:43:03 PM

"Larry....@gmail.com" <larry....@gmail.com> writes:
> Thanks for the reply Paul. I had not heard of itertools. It sounds
> like just what I need for this. But I am having 1 issue - how do you
> know how many items are in each group?

Simplest is:

for key, group in groupby(xs, lambda x: (x[-1], x[4], x[5])):
    gs = list(group)  # convert iterator to a list
    n = len(gs)       # this is the number of elements

there is some theoretical inelegance in that it requires each group to
fit in memory, but you weren't really going to have billions of files
with the same basename.

If you're not used to iterators and itertools, note there are some
subtleties to using groupby to iterate over files, because an iterator
actually has state. It bumps a pointer and maybe consumes some input
every time you advance it. In a situation like the above, you've got
some nested iterators (the groupby iterator generating groups, and the
individual group iterators that come out of the groupby) that wrap the
same file handle, so bad confusion can result if you advance both
iterators without being careful (one can consume file input that you
thought would go to another).

This isn't as bad as it sounds once you get used to it, but it can be
a source of frustration at first.
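
A small illustration of that subtlety, using a made-up list of words rather than files. It is only a sketch of how group iterators share state with the groupby iterator that produced them: a group that is read after groupby has moved on comes back empty.

```python
from itertools import groupby

words = ["apple", "avocado", "banana", "blueberry", "cherry"]  # sorted by first letter

# Storing the group iterators and reading them later does not work:
# advancing the groupby iterator invalidates the earlier groups.
stale = [(key, group) for key, group in groupby(words, key=lambda w: w[0])]
first_key, first_group = stale[0]
first_items_late = list(first_group)  # empty: groupby has already moved past "a"

# Reading each group before moving on to the next one works.
fresh = [(key, list(group)) for key, group in groupby(words, key=lambda w: w[0])]
```

The safe habit is to consume (or copy into a list) each group inside the loop body, before the outer loop advances.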

BTW, if you just want to count the elements of an iterator (while
consuming it),

n = sum(1 for x in xs)

counts the elements of xs without having to expand it into an in-memory
list.
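
For instance, with a throwaway iterator (the values are made up for illustration), counting this way leaves nothing behind to iterate over afterwards:

```python
xs = iter(["a", "b", "c"])

n = sum(1 for _ in xs)  # walks the iterator once, adding 1 per element
leftover = list(xs)     # the iterator is now exhausted
```

So this form fits when only the count is needed; if the elements are needed as well, building a list and calling len() on it is the simpler route.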

Itertools really makes Python feel a lot more expressive and clean,
despite little kinks like the above.

Paul Rubin

Jul 19, 2012, 3:56:32 PM

"Larry....@gmail.com" <larry....@gmail.com> writes:
> You can't do a len on the iterator that is returned from groupby, and
> I've tried to do something with imap or defaultdict, but I'm not
> getting anywhere. I guess I can just make 2 passes through the data,
> the first time getting counts. Or am I missing something about how
> groupby works?

I posted another reply to your other message, which reached me earlier.
If you're still stuck, post again, though I probably won't be able to
reply til tomorrow or the next day.

MRAB

Jul 19, 2012, 5:32:46 PM

to pytho...@python.org

Here's another solution, not using itertools:

from collections import defaultdict
from os.path import basename, dirname
from time import strftime, strptime

# Starting with the original paths

paths = [
    "/dir0/dir1/dir2/dir3/qwer/09Jan12/dir6/file3",
    "/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file1",
    "/dir0/dir1/dir2/dir3/abcd/08Jan12/dir6/file2",
    "/dir0/dir1/dir2/dir3/xyz/08Jan12/dir6/file1",
    "/dir0/dir1/dir2/dir3/qwer/07Jan12/dir6/file3",
]

def make_dir5_key(path):
    date = strptime(path.split("/")[6], "%d%b%y")
    return strftime("%y%b%d", date)

# Collect the paths into a dict keyed by the basename

files = defaultdict(list)
for path in paths:
    files[basename(path)].append(path)

# Process a list of paths if there's more than one entry

renaming = []

for name, entries in files.items():
    if len(entries) > 1:
        # Collect the paths in each subgroup into a dict keyed by dir4

        subgroup = defaultdict(list)
        for path in entries:
            subgroup[path.split("/")[5]].append(path)

        for dir4, subentries in subgroup.items():
            # Sort the subentries by dir5 (date)
            subentries.sort(key=make_dir5_key)

            if len(subentries) > 1:
                for index, path in enumerate(subentries):
                    renaming.append((path,
                        "{}/{}_{:02}_{}".format(dirname(path), dir4, index, name)))
            else:
                path = subentries[0]
                renaming.append((path,
                    "{}/{}_{}".format(dirname(path), dir4, name)))
    else:
        path = entries[0]

for old_path, new_path in renaming:
    print("Rename {!r} to {!r}".format(old_path, new_path))

larry....@gmail.com

Jul 19, 2012, 8:58:13 PM

On Jul 19, 1:56 pm, Paul Rubin <no.em...@nospam.invalid> wrote:

I really appreciate the offer, but I'm going to go with MRAB's
solution. It works, and I understand it ;-)

larry....@gmail.com

Jul 19, 2012, 9:01:26 PM

On Jul 19, 3:32 pm, MRAB <pyt...@mrabarnett.plus.com> wrote:

Thanks a million MRAB. I really like this solution. It's very
understandable and it works! I had never seen .format before. I had to
add the index of the positional args to them to make it work.
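
For readers who run into the same thing: automatic field numbering ("{}") in str.format arrived in Python 2.7 and 3.1, so on Python 2.6 each field needs an explicit index. That this was the cause here is an assumption, since the post does not name the Python version. A short comparison, with made-up values:

```python
dir4, index, name = "qwer", 0, "file3"

# Auto-numbered fields (Python 2.7+ / 3.1+).
auto = "{}_{:02}_{}".format(dir4, index, name)

# Explicitly numbered fields, which also work on Python 2.6.
explicit = "{0}_{1:02}_{2}".format(dir4, index, name)
```

Both forms produce the same string on any interpreter that accepts them.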

larry....@gmail.com

Jul 19, 2012, 9:01:36 PM

On Jul 19, 1:43 pm, Paul Rubin <no.em...@nospam.invalid> wrote:

> "Larry.Mart...@gmail.com" <larry.mart...@gmail.com> writes:
> > Thanks for the reply Paul. I had not heard of itertools. It sounds
> > like just what I need for this. But I am having 1 issue - how do you
> > know how many items are in each group?
>
> Simplest is:
>
> for key, group in groupby(xs, lambda x: (x[-1], x[4], x[5])):
>     gs = list(group)  # convert iterator to a list
>     n = len(gs)       # this is the number of elements
>
> there is some theoretical inelegance in that it requires each group to
> fit in memory, but you weren't really going to have billions of files
> with the same basename.
>
> If you're not used to iterators and itertools, note there are some
> subtleties to using groupby to iterate over files, because an iterator
> actually has state. It bumps a pointer and maybe consumes some input
> every time you advance it. In a situation like the above, you've got
> some nested iterators (the groupby iterator generating groups, and the
> individual group iterators that come out of the groupby) that wrap the
> same file handle, so bad confusion can result if you advance both
> iterators without being careful (one can consume file input that you
> thought would go to another).

It seems that if you do a list(group) you have consumed the list. This
screwed me up for a while, and seems very counter-intuitive.

larry....@gmail.com

Jul 19, 2012, 11:07:23 PM

On Jul 19, 7:01 pm, "Larry.Mart...@gmail.com"

Also, in make_dir5_key the format specifier for strftime should be
%y%m%d so they sort properly.
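
A quick sketch of why the month name breaks the ordering, using made-up dates and assuming an English locale for %b. With "%y%b%d" the key compares month names alphabetically, so February sorts before January; with "%y%m%d" the key compares month numbers.

```python
from time import strftime, strptime

dates = ["08Jan12", "01Feb12", "15Dec11"]

def key_with_month_name(d):
    # e.g. "08Jan12" -> "12Jan08": months compare alphabetically
    return strftime("%y%b%d", strptime(d, "%d%b%y"))

def key_with_month_number(d):
    # e.g. "08Jan12" -> "120108": months compare numerically
    return strftime("%y%m%d", strptime(d, "%d%b%y"))

wrong = sorted(dates, key=key_with_month_name)
right = sorted(dates, key=key_with_month_number)
```

Here wrong puts 01Feb12 ahead of 08Jan12, while right is in true chronological order.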

Peter Otten

Jul 20, 2012, 3:35:02 AM

to pytho...@python.org

Larry....@gmail.com wrote:

> It seems that if you do a list(group) you have consumed the list. This
> screwed me up for a while, and seems very counter-intuitive.

Many itertools functions work that way. It allows you to iterate over the
items even if there is more data than fits into memory.
If you need to keep all items and are sure that your computer can cope with
them at once you can always throw in a

group = list(group)
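
A short sketch of what that looks like in a loop, with made-up (basename, dir4) rows that are already sorted by basename. The first list() call drains the group iterator, a second call on the same iterator finds nothing, and the list itself can be measured and reused freely.

```python
from itertools import groupby

rows = [("file1", "abcd"), ("file1", "xyz"), ("file2", "abcd")]

counts = {}
second_reads = []
for name, group in groupby(rows, key=lambda row: row[0]):
    items = list(group)               # consumes the group iterator once
    second_reads.append(list(group))  # a second read finds nothing left
    counts[name] = len(items)         # the list can be measured and reused
```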

Paul Rubin

Jul 20, 2012, 3:51:21 AM

"Larry....@gmail.com" <larry....@gmail.com> writes:
> It seems that if you do a list(group) you have consumed the list. This
> screwed me up for a while, and seems very counter-intuitive.

Yes, that is correct, you have to carefully watch where the stuff in the
iterators is getting consumed, including when there are nested
iterators. That's what I was mentioning earlier--it got me confused at
first, but I use that style all the time now and it is pretty natural.

Paul Rudin

Jul 20, 2012, 4:37:42 AM

"Larry....@gmail.com" <larry....@gmail.com> writes:

> It seems that if you do a list(group) you have consumed the list. This
> screwed me up for a while, and seems very counter-intuitive.

You've consumed the *group* which is an iterator, in order to construct
a list from its elements. Sorry if this is excessively nit-picking, but
it generally helps to keep these things very clear in your own mind.

MRAB

Jul 20, 2012, 11:45:06 AM

to pytho...@python.org

On 20/07/2012 04:07, Larry....@gmail.com wrote:
[snip]

> Also, in make_dir5_key the format specifier for strftime should be
> %y%m%d so they sort properly.
>

Correct. I realised that only some time later, after I'd turned off my
computer for the night. :-(