
os.walk restart


Keir Vaughan-taylor

Mar 17, 2010, 6:04:14 PM
I am traversing a large set of directories using

for root, dirs, files in os.walk(basedir):
    run program

Because the directory set is huge, the traversal takes days to
complete. Sometimes the run crashes because of a programming error.
As each directory is processed, its name is written to a file.
I want to be able to restart the walk from the directory where it
crashed.

Is this possible?

Steven D'Aprano

Mar 17, 2010, 6:42:35 PM
On Wed, 17 Mar 2010 15:04:14 -0700, Keir Vaughan-taylor wrote:

> I am traversing a large set of directories using
>
> for root, dirs, files in os.walk(basedir):
>     run program
>
> Because the directory set is huge, the traversal takes days to
> complete. Sometimes the run crashes because of a programming error.
> As each directory is processed, its name is written to a file.

What, a proper, honest-to-goodness core dump?

Or do you mean an exception?


> I want to be able to restart the walk from the directory where it
> crashed.
>
> Is this possible?


Quick and dirty with no error-checking:


# Untested
last_visited = open("last_visited.txt", 'r').read()
for root, dirs, files in os.walk(last_visited or basedir):
    open("last_visited.txt", 'w').write(root)
    run program


--
Steven

Gabriel Genellina

Mar 17, 2010, 7:09:28 PM
On Wed, 17 Mar 2010 19:04:14 -0300, Keir Vaughan-taylor <kei...@gmail.com>
wrote:

If the 'dirs' list were guaranteed to be sorted, you could remove at each
level all previous directories already traversed. But it's not :(

Perhaps a better approach would be to collect, up front, all the
directories to be processed and write them to a text file; these are the
pending directories. Then read the pending file and process every directory
in it. If the process aborts for any reason, manually delete the lines
already processed and restart.

If you use a database instead of a text file, and mark entries as "done"
after processing, you can avoid that last manual step and the whole
process can keep running automatically. In some cases you may want to
choose the starting point at random.
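
A minimal sketch of the database variant using sqlite3 (the pending.db
file name, table layout, and process() stub are inventions for this
example, not anything from the thread):

import os
import sqlite3

db = sqlite3.connect('pending.db')
db.execute('CREATE TABLE IF NOT EXISTS pending'
           ' (path TEXT PRIMARY KEY, done INTEGER DEFAULT 0)')

def process(path):
    pass  # stand-in for the real per-directory work

def collect(basedir):
    # Pass 1, run once: record every directory to be processed.
    for root, dirs, files in os.walk(basedir):
        db.execute('INSERT OR IGNORE INTO pending (path) VALUES (?)',
                   (root,))
    db.commit()

def run():
    # Restartable pass 2: handle whatever is not yet marked done.
    todo = db.execute('SELECT path FROM pending WHERE done = 0').fetchall()
    for (path,) in todo:
        process(path)
        db.execute('UPDATE pending SET done = 1 WHERE path = ?', (path,))
        db.commit()  # one commit per directory, so a crash loses nothing

fetchall() snapshots the pending list first, so the loop is not reading
from a live cursor while it updates the same table.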

--
Gabriel Genellina

alex23

Mar 17, 2010, 11:08:48 PM
Steven D'Aprano <ste...@REMOVE.THIS.cybersource.com.au> wrote:
> # Untested
> last_visited = open("last_visited.txt", 'r').read()
> for root, dirs, files in os.walk(last_visited or basedir):
>      open("last_visited.txt", 'w').write(root)
>      run program

Wouldn't this only walk the directory the exception occurred in and not
the remaining unwalked dirs from basedir?

Something like this should work:

import os
basedir = '.'

walked = open('walked.txt','r').read().split()
unwalked = ((r,d,f) for r,d,f in os.walk(basedir) if r not in walked)

for root, dirs, files in unwalked:
    # do something
    print root
    walked.append(root)

open('walked.txt','w').write('\n'.join(walked))

MRAB

Mar 17, 2010, 11:34:46 PM

I would write my own walker which sorts the directory entries it finds
before walking them, and which can skip entries until it gets to the
desired starting path. For example, if I want to start at "/foo/bar",
skip the entries in the root directory whose names sort before "foo",
and the entries in the subdirectory "/foo" whose names sort before
"bar".

Steve Howell

Mar 17, 2010, 11:49:58 PM

I assume it's the operation that you are doing on each file that is
expensive, not the walk itself.

If that's the case, then you might be able to get away with just
leaving some kind of breadcrumbs whenever you've successfully
processed a directory or a file, so you can quickly short-circuit
entire directories or files on the next run, without having to
implement any kind of complicated start-where-I-left-off algorithm.

The breadcrumbs could be hidden files in the file system, an easily
indexable list of files that you persist, etc.

What are you doing that takes so long?

Also, I can understand why the operations on the files themselves
might crash, but can't you catch an exception and keep on chugging?

Another option, if you do not do some kind of pruning on the fly, is
to persist the list of files that you need to process up front to a
file, or a database, and persist the index of the last successfully
processed file, so that you can restart as needed from where you left
off.
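
A small sketch of the breadcrumb variant (the '.processed' marker name
and the process() stub are made up for illustration):

import os
import traceback

def process(root, files):
    pass  # stand-in for the real, expensive per-directory work

def process_tree(basedir):
    for root, dirs, files in os.walk(basedir):
        marker = os.path.join(root, '.processed')
        if os.path.exists(marker):
            continue  # finished on an earlier run
        try:
            process(root, files)
        except Exception:
            traceback.print_exc()  # log it and keep on chugging
            continue               # no marker, so it gets revisited
        open(marker, 'w').close()  # breadcrumb written only on success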

Steven D'Aprano

Mar 18, 2010, 12:04:46 AM
On Wed, 17 Mar 2010 20:08:48 -0700, alex23 wrote:

> Steven D'Aprano <ste...@REMOVE.THIS.cybersource.com.au> wrote:
>> # Untested
>> last_visited = open("last_visited.txt", 'r').read()
>> for root, dirs, files in os.walk(last_visited or basedir):
>>     open("last_visited.txt", 'w').write(root)
>>     run program
>
> Wouldn't this only walk the directory the exception occurred in and not
> the remaining unwalked dirs from basedir?

Only if you have some sort of branching file hierarchy with multiple sub-
directories in the one directory, instead of a nice simple linear chain
of directories a/b/c/d/.../z as nature intended.

*wink*

Yes, good catch. I said it was untested. You might be able to save the
parent of the current directory, and restart from there instead.
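
Building on the earlier snippet, that tweak might look like this
(untested, still imperfect for deeply branching trees; basedir is
assumed defined as before):

import os

last_visited = open("last_visited.txt", 'r').read()
# Resume from the parent of the crash point, so unvisited siblings
# of the crashed directory are walked again.
start = os.path.dirname(last_visited) or basedir
for root, dirs, files in os.walk(start):
    open("last_visited.txt", 'w').write(root)
    # run program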


--
Steven

Tim Chase

Mar 18, 2010, 6:25:19 AM
Steve Howell wrote:
> If that's the case, then you might be able to get away with just
> leaving some kind of breadcrumbs whenever you've successfully
> processed a directory or a file,

Unless you're indexing a read-only device (whether hardware
read-only like a CD, or permission-wise read-only like a network
share or a non-priv user walking system directories)...

> Also, I can understand why the operations on the files themselves
> might crash, but can't you catch an exception and keep on chugging?

I also wondered about this one; perhaps log the directory in which
the exception happened, to revisit later. :)

-tkc


Piet van Oostrum

Mar 31, 2010, 8:50:51 AM
You have no guarantee that on the next run the directories will be
visited in the same order as in the first run (this could depend on the
filesystem). So then remembering a last directory won't do it. You could
write each completed directory name to a file, and then on the second
run check whether a directory is in that list and skip the program run
for these.

Something like this (symbolically):

lastrun = map(string.strip, logfile.readlines())
newlog = ... open logfile in append mode ...

for root, dirs, files in os.walk(basedir):
    if root not in lastrun:
        run program
        newlog.write(root + '\n')
        newlog.flush()

--
Piet van Oostrum <pi...@vanoostrum.org>
WWW: http://pietvanoostrum.com/
PGP key: [8DAE142BE17999C4]
Now Fair Trade home wares at http://www.zylja.com
