
Script to strip illegal characters from files and directories?


somebody

May 9, 2008, 9:23:56 PM
These are illegal characters in a fat32 file system:

/ : ; * ? " < > |

So when I attempt to copy files which contain any of these characters to
my USB thumb drive which is a fat32 file system, it fails. I don't want
to format the thumb drive as anything else. Is there a script which will
traverse directories and files and strip these characters? I've searched
Google Groups and the web and can't seem to find anything! I don't want to
reinvent the wheel, but it looks as though I might have to.

-Thanks

Marcel Bruinsma

May 9, 2008, 10:42:45 PM
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

In article <Ebednc0mIMexZLnV...@comcast.com>,
somebody wrote:

> These are illegal characters in a fat32 file system:
>
> / : ; * ? " < > |
>
> So when I attempt to copy files which contain any of these
> characters to my USB thumb drive which is a fat32 file system,
> it fails. I don't want to format the thumb drive as anything
> else. Is there a script which will traverse directories and
> files and strip these characters?

Maybe a simple perl script will do?

#!/usr/bin/perl -w
($M = $0) =~ s!.*/!!;
$fp = '/usr/bin/find';
@fa = qw(DIR -print0);
DIR: for $dir (@ARGV) {
    $fa[0] = $dir;
    unless (open FIND,'-|',$fp,@fa) {
        warn "$M: $fp $dir: $!\n";
        next DIR;
    }
    {   local $/ = "\0";
        while (<FIND>) {
            chop; $o = $_; $m = 0;
            $m = 1 if s!:![colon]!g;
            $m = 1 if s!;![semic]!g;
            $m = 1 if s!"![quotm]!g;
            $m = 1 if s!\*![aster]!g;
            $m = 1 if s!\?![quest]!g;
            $m = 1 if s!\|![vertln]!g;
            $m = 1 if s!<![lessthan]!g;
            $m = 1 if s!>![greathan]!g;
            next unless $m;
            # next if rename $o,$_;
            # warn "$M: $o => $_: $!\n";
            print "$o => $_\n";
        }
    }
    close FIND;
}

The solidus is used as a separator in path names,
hence it can never be part of any file name.

Adjust the substitutions to your taste and invoke
with one or more directory paths. Check that it
works as you expect first.


Regards,
Marcel
- --
begin-base64 600 #e-m-a-i-l-a-d-d-r-e-s-s#.#b-z-2#
QlpoOTFBWSZTWTiXI4oAAArdgAAQQGAABRACLqeeACAAQMhTR6gyepmUKNGQ
NGmRpZ8Dyyh3LF5CnmNIL7sUj05rAVK7EJKTWxp2a/4u5IpwoSBxLkcU
====

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)

iD8DBQFIJQum+Gl3NVTic0gRAp/TAKCnQLassJcP9k3RXH1LwxGJGvh6GQCfbIoF
pqXpy8GhaVVS4GqEtrxiJrM=
=2APt
-----END PGP SIGNATURE-----

Dave B

May 10, 2008, 5:45:49 AM

/ is illegal on unix too, so you can ignore that.
Note that newlines and backslashes are legal in unix filenames, but since
you don't include them in the list above, I assume you have no filenames
with newlines (the script does handle backslashes though).

If you have bash, you can do something like this:

#!/bin/bash

# replace strange characters with underscores

change_name() {
    newname=${1//[:;*?<>|\"\\]/_}

    if [ "$1" != "$newname" ]; then
        mv -- "$1" "$newname"
    fi
}

scan() {
    local i
    cd "$1"
    for i in *; do
        if [ -d "$i" ]; then
            scan "$i"
            change_name "$i"
        else
            change_name "$i"
        fi
    done
    cd ..
}

shopt -s nullglob
scan /yourdir
exit 0


"yourdir" is the starting point for the directory hierarchy where your files
are located.

--
D.

Janis Papanagnou

May 10, 2008, 5:56:56 AM

No need to handle / in filenames; that's an illegal character in Unix
filenames. With ksh93 or bash you may want to try...

find . | while read -r f ; do mv -i "${f}" "${f//[:;*?\"<>|]}" ; done

where the characters are removed (as you seem to like) or try

find . | while read -r f ; do mv -i "${f}" "${f//[:;*?\"<>|]/_}" ; done

to replace the characters by an _ (which I think is better).

Put an echo in front of the mv command first to see whether you get the
desired output, then run it as depicted.
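
A minimal sketch of that echo-first advice as a reusable helper. The names
safe_mv and DRY_RUN are hypothetical (they appear nowhere in the one-liners
above), and the output wording is only an illustration:

```sh
#!/bin/sh
# Hypothetical helper: with DRY_RUN=1 (the default here) it only prints
# the rename it would perform; with DRY_RUN=0 it really calls mv.
DRY_RUN=${DRY_RUN:-1}

safe_mv() {
    # $1 = old name, $2 = new name
    [ "$1" = "$2" ] && return 0      # name unchanged, nothing to do
    if [ "$DRY_RUN" = 1 ]; then
        printf 'would rename: %s -> %s\n' "$1" "$2"
    else
        mv -i -- "$1" "$2"
    fi
}
```

In the loops above, the mv call would then become
safe_mv "$f" "${f//[:;*?\"<>|]/_}", and the script would be rerun with
DRY_RUN=0 once the printed list looks right.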

Janis

>
> -Thanks
>

Dave B

May 10, 2008, 6:13:49 AM
On Saturday 10 May 2008 11:56, Janis Papanagnou wrote:

> No need to handle / in filenames; that's an illegal character in Unix
> filenames. With ksh93 or bash you may want to try...
>
> find . | while read -r f ; do mv -i "${f}" "${f//[:;*?\"<>|]}" ; done
>
> where the characters are removed (as you seem to like) or try
>
> find . | while read -r f ; do mv -i "${f}" "${f//[:;*?\"<>|]/_}" ; done
>
> to replace the characters by an _ (which I think is better).

The script will try to do "mv . ." first, which of course will fail.
You should at least check that the new name differs from the old name, and
probably use "--" to indicate the end of the options to mv.

Furthermore, if a directory with strange characters is encountered first
(and find by default outputs directories before their contents), then
renaming the files inside the directory will fail.

If the structure is as follows:

dir<>foo
|
+------file1**?
\------file:2:bar

Then "dir<>foo" will be renamed first, and subsequent attempts to
rename './dir<>foo/file1**?' and './dir<>foo/file:2:bar' to something else
will fail, since directory 'dir<>foo' does not exist anymore.
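
The ordering problem can be observed on a throwaway copy of that tree. This
is only a sketch; the temporary directory and the variable names are
assumptions made for the illustration:

```sh
#!/bin/sh
# Rebuild the example tree in a scratch directory.
tmp=$(mktemp -d) || exit 1
mkdir "$tmp/dir<>foo"
: > "$tmp/dir<>foo/file1**?"
: > "$tmp/dir<>foo/file:2:bar"

# Default traversal: the directory is listed before its contents.
first_default=$(cd "$tmp" && find './dir<>foo' | head -n 1)

# With -depth: the contents are listed first, the directory last.
last_depth=$(cd "$tmp" && find './dir<>foo' -depth | tail -n 1)

echo "first entry by default: $first_default"
echo "last entry with -depth: $last_depth"
rm -rf "$tmp"
```

Renaming in the second order leaves every path valid until the moment it is
processed, because a directory is only touched after everything inside it.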

--
D.

Janis Papanagnou

May 10, 2008, 7:34:17 AM
Dave B wrote:
> On Saturday 10 May 2008 11:56, Janis Papanagnou wrote:
>
>
>>No need to handle / in filenames; that's an illegal character in Unix
>>filenames. With ksh93 or bash you may want to try...
>>
>> find . | while read -r f ; do mv -i "${f}" "${f//[:;*?\"<>|]}" ; done
>>
>>where the characters are removed (as you seem to like) or try
>>
>> find . | while read -r f ; do mv -i "${f}" "${f//[:;*?\"<>|]/_}" ; done
>>
>>to replace the characters by an _ (which I think is better).
>
>
> The script will try to do "mv . ." first, which of course will fail.

Yes, but what is the problem; that an error message is displayed?

> You should at least check that the new name differs from the old name, and
> probably use "--" to indicate the end of the options to mv.

The first point is not necessary; you just prevent the message, again.
The second point is valid if you have filenames starting with a dash.

> Furthermore, if a directory with strange characters is encountered first
> (and find by default output directories first), then renaming the files
> inside the directory will fail.

Right. Good point. It will be necessary to use the find option -depth.

Janis

Dave B

May 10, 2008, 7:46:06 AM
On Saturday 10 May 2008 13:34, Janis Papanagnou wrote:

>> The script will try to do "mv . ." first, which of course will fail.
>
> Yes, but what is the problem; that an error message is displayed?

No, that a useless operation is performed. I agree, however, that this
might not be a problem, but then UUOC & co. are not a problem either.
Another minor issue is that the script may be (removing the -i option)
executed in a cron job, and if it outputs something cron will send an email
even if it does its job correctly. To avoid this, stderr should be
redirected, but then you will also lose real and informative error
messages.

>> You should at least check that the new name differs from the old name,
>> and probably use "--" to indicate the end of the options to mv.
>
> The first point is not necessary; you just prevent the message, again.
> The second point is valid if you have filenames starting with a dash.

Which, in a scenario like the one described by the OP, with filenames
containing all sorts of characters (perhaps music or video files), is quite
likely (imho, of course).

>> Furthermore, if a directory with strange characters is encountered first
>> (and find by default output directories first), then renaming the files
>> inside the directory will fail.
>
> Right. Good point. It will be necessary to use the find option -depth.

And (with bash at least) filenames with trailing spaces will not be handled
correctly if you don't use IFS= for the read (I agree that this is *really*
quite unlikely).
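
A small sketch of the difference, using a made-up filename with a trailing
space (the name and the variable names are assumptions for the example):

```sh
#!/bin/sh
# A name with a trailing space, fed to read on stdin.
name='track 01 '

# Default IFS: read strips leading and trailing whitespace.
plain=$(printf '%s\n' "$name" | { read -r f; printf '[%s]' "$f"; })

# Empty IFS for the read only: the line is kept verbatim.
exact=$(printf '%s\n' "$name" | { IFS= read -r f; printf '[%s]' "$f"; })

echo "read -r:      $plain"
echo "IFS= read -r: $exact"
```

The brackets make the lost space visible: only the second form passes the
real filename on to mv.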

--
D.

Dave B

May 10, 2008, 8:02:09 AM
On Saturday 10 May 2008 11:45, Dave B wrote:

> cd "$1"

Better make this

cd -- "$1"

to correctly handle directory names starting with dash.

--
D.

Mark Hobley

May 10, 2008, 8:08:03 AM
somebody <so...@body.com> wrote:
> These are illegal characters in a fat32 file system:
>
> / : ; * ? " < > |
>
> So when I attempt to copy files which contain any of these characters to
> my USB thumb drive which is a fat32 file system, it fails. I don't want
> to format the thumb drive as anything else. Is there a script which will
> traverse directories and files and strip these characters?

I transfer files to fat32 based systems sometimes, and I rarely
encounter files with the above characters in them. If I do, I just
rename the file.

Where are these files with the above characters coming from?

One solution to your problem would be to tar the files into an
archive, and then transfer the archive to the thumb drive.

Mark.

--
Mark Hobley,
393 Quinton Road West,
Quinton, BIRMINGHAM.
B32 1QE.

Janis Papanagnou

May 10, 2008, 8:15:32 AM
Dave B wrote:
> On Saturday 10 May 2008 13:34, Janis Papanagnou wrote:
>
>>>The script will try to do "mv . ." first, which of course will fail.
>>
>>Yes, but what is the problem; that an error message is displayed?
>
> No, that an useless operation is performed.

The performance advantage of testing is that the test operation
is likely built into the shell; but it's also an operation.

> I agree, however, that this
> might not be a problem, but then UUOC & co. are not a problem either.
> Another minor issue is that the script may be (removing the -i option)

I'd *never* remove the -i if I intend to change filenames where
name clashes are likely to appear. (It's less likely, IME, if one
replaces strange characters e.g. by _ instead of removing them.)

> executed in a cron job, and if it outputs something cron will send an email

I cannot see that requirement with the OP's USB-drive scenario.

> even if it does its job correctly. To avoid this, stderr should be
> redirected, but then you will also lose real and informative error
> messages.

<OT> The point with cron's mail is to *not* lose information. </OT>

>>>You should at least check that the new name differs from the old name,
>>>and probably use "--" to indicate the end of the options to mv.
>>
>>The first point is not necessary; you just prevent the message, again.
>>The second point is valid if you have filenames starting with a dash.
>
> Which, in a scenario like the one described by the OP, with filenames
> containing all sorts of characters (perhaps music or video files), is quite
> likely (imho, of course).

Who knows. IME, a filename starting with a dash will quickly be
recognized in a Unix environment after it has (likely accidentally)
been created. Mind that you cannot perform Unix commands reasonably
with such filenames without special handling.

Anyway, it doesn't hurt to generally (for some values of generally)
use '--', as you suggest here.

>>>Furthermore, if a directory with strange characters is encountered first
>>>(and find by default output directories first), then renaming the files
>>>inside the directory will fail.
>>
>>Right. Good point. It will be necessary to use the find option -depth.
>
> And (with bash at least) filenames with trailing spaces will be not handled
> correctly if you don't use IFS= for the read (I agree that this is *really*
> quite unlikely).

Yes, that's the typical (for c.u.s) paranoia. My experience is that
such names are either created by accident (a mistyped command that
generates a lot of empty files with arbitrary names, e.g.) or by a
malicious user that tries to exploit the unaware Unix admin. Anyway,
I don't think this scenario matches here, and, frankly, I am tired
of demonstrating solutions against such exploits in quite standard
situations. One can - and I've often done that - extend a _single
line_ of commands to two pages of code, just to make it bulletproof.
To give you another example; in the program you posted upthread you
neglected to check whether the cd command succeeds, and there are also
the unnecessary bash'isms, like local and shopt, and...

...you know.

Janis

Janis Papanagnou

May 10, 2008, 8:16:39 AM
Dave B wrote:
> On Saturday 10 May 2008 11:45, Dave B wrote:
>
>
>> cd "$1"
>
>
> Better make this
>
> cd -- "$1"

cd -- "$1" || exit 1

>
> to correctly handle directory names starting with dash.
>

etc.

Janis Papanagnou

May 10, 2008, 8:24:57 AM
Mark Hobley wrote:
> somebody <so...@body.com> wrote:
>
>>These are illegal characters in a fat32 file system:
>>
>> / : ; * ? " < > |
>>
>>So when I attempt to copy files which contain any of these characters to
>>my USB thumb drive which is a fat32 file system, it fails. I don't want
>>to format the thumb drive as anything else. Is there a script which will
>>traverse directories and files and strip these characters?
>
>
> I transfer files to fat32 based systems sometimes, and I rarely
> encounter files with the above characters in them. If I do, I just
> rename the file.

The question was how to rename them automatically, which makes
sense if there are *a lot* of files to rename.

>
> Where are these files with the above characters coming from?

Nowadays the meaning of "name" seems to be different from what it was
in former times; applications create all sorts of junk names. One
problem is that some operating systems don't restrict the character
set of filenames sufficiently; the beloved Unixes are a prominent
example.

>
> One solution to your problem would be to tar the files into an
> archive, and then transfer the archive to the thumb drive.

And this is, indeed, the preferred solution if the USB device
is used just to carry the data from one machine to another one.
(Don't know whether that matches the OP's requirements, though.)
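
A rough sketch of the archive approach. The scratch directories below stand
in for the real source tree and the mounted thumb drive, and the names
music and music.tar are assumptions for the example:

```sh
#!/bin/sh
# Scratch directories standing in for the source tree and the drive.
src=$(mktemp -d) || exit 1
usb=$(mktemp -d) || exit 1
mkdir "$src/music"
: > "$src/music/what?:really.mp3"

# Only the archive's own name has to be legal on FAT32; the member
# names are stored inside the archive's contents.
tar -cf "$usb/music.tar" -C "$src" music

# List the members to confirm the original names are recorded.
members=$(tar -tf "$usb/music.tar")
printf '%s\n' "$members"
rm -rf "$src" "$usb"
```

Note that a single file on FAT32 cannot reach 4 GiB, so a large collection
would have to be split into several archives.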

Janis

>
> Mark.
>

John W. Krahn

May 10, 2008, 8:50:54 AM
Marcel Bruinsma wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> In article <Ebednc0mIMexZLnV...@comcast.com>,
> somebody wrote:
>
>> These are illegal characters in a fat32 file system:
>>
>> / : ; * ? " < > |
>>
>> So when I attempt to copy files which contain any of these
>> characters to my USB thumb drive which is a fat32 file system,
>> it fails. I don't want to format the thumb drive as anything
>> else. Is there a script which will traverse directories and
>> files and strip these characters?
>
> Maybe a simple perl script will do?
>
> #!/usr/bin/perl -w
> ($M = $0) =~ s!.*/!!;
> $fp = '/usr/bin/find';

Or use the File::Find module.

> @fa = qw(DIR -print0);
> DIR: for $dir (@ARGV) {
> $fa[0] = $dir;
> unless (open FIND,'-|',$fp,@fa) {
> warn "$M: $fp $dir: $!\n";
> next DIR;

$fa = '-print0';
for $dir ( @ARGV ) {
    unless ( open FIND, '-|', $fp, $dir, $fa ) {
        warn "$M: $fp $dir: $!\n";
        next;

> }

%trans = (
':' => '[colon]',
';' => '[semic]',
'"' => '[quotm]',
'*' => '[aster]',
'?' => '[quest]',
'|' => '[vertln]',
'<' => '[lessthan]',
'>' => '[greathan]',
);

> { local $/ = "\0";
> while (<FIND>) {
> chop; $o = $_; $m = 0;

chomp; $o = $_; $m = 0;

> $m = 1 if s!:![colon]!g;
> $m = 1 if s!;![semic]!g;
> $m = 1 if s!"![quotm]!g;
> $m = 1 if s!\*![aster]!g;
> $m = 1 if s!\?![quest]!g;
> $m = 1 if s!\|![vertln]!g;
> $m = 1 if s!<![lessthan]!g;
> $m = 1 if s!>![greathan]!g;
> next unless $m;

next unless s/([:;"*?|<>])/$trans{$1}/g;

> # next if rename $o,$_;
> # warn "$M: $o => $_: $!\n";
> print "$o => $_\n";
> }
> }
> close FIND;

close FIND
or warn $! ? "Error closing $fp pipe: $!"
: "Exit status $? from $fp";

> }


John
--
Perl isn't a toolbox, but a small machine shop where you
can special-order certain sorts of tools at low cost and
in short order. -- Larry Wall

somebody

May 10, 2008, 9:27:25 AM
On Sat, 10 May 2008 12:08:03 +0000, Mark Hobley wrote:

> somebody <so...@body.com> wrote:
>> These are illegal characters in a fat32 file system:
>>
>> / : ; * ? " < > |
>>
>> So when I attempt to copy files which contain any of these characters to
>> my USB thumb drive which is a fat32 file system, it fails. I don't want
>> to format the thumb drive as anything else. Is there a script which will
>> traverse directories and files and strip these characters?
>
> I transfer files to fat32 based systems sometimes, and I rarely
> encounter files with the above characters in them. If I do, I just
> rename the file.
>
> Where are these files with the above characters coming from?

These are files from an MP3 player used on Windows for about 3 years.

> One solution to your problem would be to tar the files into an
> archive, and then transfer the archive to the thumb drive.

Tarring will not work -- These characters are not allowed on a fat32 file
system -- Period.
> Mark.

Marcel Bruinsma

May 10, 2008, 10:30:29 AM
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

In article <OSgVj.1771$KB3.165@edtnps91>,
John W. Krahn wrote:

>> $fp = '/usr/bin/find';
>
> Or use the File::Find module.

Excellent idea; as are the other improvements.
Second attempt: (should also solve the design
flaw pointed out by Dave B)

#!/usr/bin/perl -w
($M = $0) =~ s!.*/!!;

> %trans = (


> ':' => '[colon]',
> ';' => '[semic]',
> '"' => '[quotm]',
> '*' => '[aster]',
> '?' => '[quest]',
> '|' => '[vertln]',
> '<' => '[lessthan]',
> '>' => '[greathan]',
> );

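sub chname {
    # File::Find sets $_ to the basename and chdirs into its directory,
    # so only the last path component is substituted and renamed.
    if (($n = $_) =~
        s!([:;"*?|<>])!$trans{$1}!g) {
        # warn "$M: $_ => $n: $!\n"
        #     unless rename $_,$n;
        print "$_ => $n\n";
    }
}
use File::Find;
# finddepth needs a code reference, hence \&chname
finddepth(\&chname, @ARGV);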


Regards,
Marcel

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)

iD8DBQFIJbGF+Gl3NVTic0gRAgAUAJ0QqFKmxLZ+xwbGRRC+WvChmqp/vACfVwYT
04sSapM15IGG3M710qf4uMM=
=8lJh
-----END PGP SIGNATURE-----

Dave B

May 10, 2008, 12:13:09 PM
On Saturday 10 May 2008 14:15, Janis Papanagnou wrote:

>>>Yes, but what is the problem; that an error message is displayed?
>>
>> No, that an useless operation is performed.
>
> The performance advantage of testing is that the test operation
> is likely builtin into the shell; but it's also an operation.

But it's not useless, since it avoids spawning a useless additional
process.



>> even if it does its job correctly. To avoid this, stderr should be
>> redirected, but then you will also lose real and informative error
>> messages.
>
> <OT> The point with cron's mail is to *not* lose information. </OT>

With your original script, if one has a directory hierarchy of, say 5000
files, and only 20 or 30 of them need to be renamed, you get ~200K of
*utterly useless* information only for the "mv: foo and foo are the same
file" messages. If there is a single line detailing an I/O error (or some
other real error) buried somewhere inside that huge amount of data, you are
most certainly going to miss it, regardless of whether that information is
sent to you by email or on screen. The cron story was just an example.

More generally, I don't see the point in cluttering stdout with useless
information (where "useless" here means "not related to the main job of the
program"), especially if that can easily be avoided (efficiency matters
aside) by adding a few keystrokes of code. But this is just my opinion, of
course. I have no problem in agreeing to disagree.

>> And (with bash at least) filenames with trailing spaces will be not
>> handled correctly if you don't use IFS= for the read (I agree that this
>> is *really* quite unlikely).
>
> Yes, that's the typical (for c.u.s) paranoia. My experience is that
> such names are either created by accident (a mistyped command that
> generates a lot of empty files with arbitrary names, e.g.) or by a
> malicious user that tries to exploit the unaware Unix admin. Anyway,
> I don't think this scenario matches here, and, frankly, I am tired
> of demonstrating solutions against such exploits in quite standard
> situations. One can - and I've often done that - extend a _single
> line_ of commands to two pages code, just to make it bulletprove.

While I mostly agree with you on this, it must be said that sometimes the
solution is just to add a few characters or lines more in the script.
The "--" case is one example. I mean, making things reasonably safe (for
some value of "reasonably") does not always mean expanding it to two pages. If
the necessary additions are cheap and cost just a few keystrokes, my
opinion is that they can and should be done. Again, this is my opinion
only.

> To give you another example; in the program you posted upthread you
> missed to check whether the cd command succeeds, and there are also
> the unnecessary bash'isms, like local and shopt, and...

Well, I said from the start that it runs under bash, so having said that I
deem it not only permissible but highly desirable to use as many bash-specific
features as possible if these come in handy, since we're already using bash
anyway. The shopt is to fold the case of empty directories into the general
one, something that would require more code with standard methods. But
still (ok, this is the last time I say that), that is only my opinion.

--
D.

Stephane CHAZELAS

May 10, 2008, 12:29:47 PM
2008-05-10, 13:34(+02), Janis Papanagnou:
[...]

>>> find . | while read -r f ; do mv -i "${f}" "${f//[:;*?\"<>|]}" ; done
[...]

>> Furthermore, if a directory with strange characters is encountered first
>> (and find by default output directories first), then renaming the files
>> inside the directory will fail.
>
> Right. Good point. It will be necessary to use the find option -depth.
[...]

Necessary, but not sufficient.

mv 'foo:bar/baz' 'foobar/baz'

won't work, you want the substitution to be done only on the
basename of $f.

To read a line verbatim, it's IFS= read -r f.

find output cannot be post-processed reliably; the right way is
to use -exec sh -c '<code>' sh {} \;
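
A minimal sketch of how those points might be combined in plain sh. The
function name fix_tree and the choice of "_" as the replacement are
assumptions, and names containing newlines are not handled:

```sh
#!/bin/sh
# Hypothetical helper: rename bottom-up, changing only the last path
# component of each entry whose name contains an unwanted character.
fix_tree() {
    find "$1" -depth -name '*[;:*?"<>|]*' -exec sh -c '
        # Split the path so the parent directories are left untouched.
        case $1 in
            */*) dir=${1%/*} ;;
            *)   dir=. ;;
        esac
        base=${1##*/}
        # Replace each unwanted character in the basename with "_".
        new=$(printf "%s\n" "$base" | LC_ALL=C sed "s/[;:*?\"<>|]/_/g")
        [ "$base" = "$new" ] || mv -i -- "$1" "$dir/$new"
    ' sh {} \;
}
```

It would be called as fix_tree /path/to/music. Because of -depth, the
contents of 'foo:bar' are renamed while that directory still has its old
name, and the directory itself is renamed afterwards.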

With zsh:

autoload zmv # usually in ~/.zshrc
unwanted='[:;*?\"<>|]'
zmv -n -Q "(**/)(*$~unwanted*)(D)" '$1${2//$~unwanted/}'
(remove -n when happy)

-Q enables glob qualifiers in the pattern, so that (D) can include hidden files.

--
Stéphane

Bill Marcum

May 10, 2008, 5:56:46 PM
Any character is allowed in the contents of a file.

Janis

May 11, 2008, 8:35:15 PM
On 10 Mai, 19:13, Dave B <da...@addr.invalid> wrote:
> On Saturday 10 May 2008 14:15, Janis Papanagnou wrote:
>
> While I mostly agree with you on this, it must be said that sometimes the
> solution is just to add a few characters or lines more in the script.

Indeed.

> The "--" case is one example. I mean, making things reasonably safe (for
> some value of "reasonably") not always means expanding it to two pages. If

Not always, but it happens if you strive for a perfect solution;
especially in shell

> the necessary additions are cheap and cost just a few keystrokes, my
> opinion is that they can and should be done. Again, this is my opinion
> only.

And I agree.

> > To give you another example; in the program you posted upthread you
> > missed to check whether the cd command succeeds, and there are also
> > the unnecessary bash'isms, like local and shopt, and...
>
> Well, I said from the start that it runs under bash, so having said that I
> deem not only permissible but highly desirable to use as many bash-specific
> features as possible if these come handy, since we're already using bash

I may have missed that the OP (or "we") were using bash. (If you are
thinking of the variable substitution that's (while non standard) no
bash'ism.)

> anyway. The shopt is to fold the case of empty directories into the general
> one, something that would require more code with standard methods. But

But read what you've written above. Where do you draw the line?

Anyway.. - the bash'ism comment was intended to be understood in the
context of "we can spend a lot of effort making a script bulletproof"
(or universally applicable - ...whatever that can be).

(Valid critiques have been acknowledged.)

Janis

Kenny McCormack

May 14, 2008, 9:18:29 AM
In article <slrng2c6gu.7...@lark.localnet>,
Bill Marcum <marcu...@bellsouth.net> wrote:
...

>> Tarring will not work -- These characters are not allowed on a fat32 file
>> system -- Period.
>>> Mark.
>>
>Any character [*] is allowed in the contents of a file.

A beautiful comment. Elegant in its simplicity and relevance to the
question at hand. I myself was thinking of posting something similar.

However, it will go right over the OP's head. Something about tuning
the message to the audience...

[*] Yes, even a forward slash (/).
