Making safe file names

Andrew Berg

May 7, 2013, 3:58:59 PM

to comp.lang.python

Currently, I keep Last.fm artist data caches to avoid unnecessary API calls and have been naming the files using the artist name. However,
artist names can have characters that are not allowed in file names for most file systems (e.g., C/A/T has forward slashes). Are there any
recommended strategies for naming such files while avoiding conflicts (I wouldn't want to run into problems for an artist named C-A-T or
CAT, for example)? I'd like to make the files easily identifiable, and there really are no limits on what characters can be in an artist name.
--
CPython 3.3.1 | Windows NT 6.2.9200 / FreeBSD 9.1

Terry Jan Reedy

May 7, 2013, 6:01:34 PM

to pytho...@python.org

Sounds like you want something like the html escape or urlencode
functions, which serve the same purpose of encoding special chars.
Rather than invent a new transformation, you could use the same scheme
used for html entities. (Sorry, I forget the details.) It is possible
that one of the functions would work for you as is, or with little
modification.

Terry

Fábio Santos

May 7, 2013, 6:18:40 PM

to Andrew Berg, comp.lang.python

I suggest Base64. b64encode
(http://docs.python.org/2/library/base64.html#base64.b64encode) and
b64decode take an argument which allows you to eliminate the pesky "/"
character. It's reversible and simple.
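
A minimal sketch of the Base64 idea, assuming Python 3 and UTF-8 as the byte encoding (both are assumptions, as are the helper names). `base64.urlsafe_b64encode` is the standard-library shortcut for the `altchars` argument mentioned above: it substitutes '-' and '_' for '+' and '/'.

```python
import base64

def name_to_filename(artist):
    """Encode an artist name as URL-safe Base64 ('-' and '_' replace '+' and '/')."""
    return base64.urlsafe_b64encode(artist.encode("utf-8")).decode("ascii")

def filename_to_name(filename):
    """Reverse name_to_filename()."""
    return base64.urlsafe_b64decode(filename.encode("ascii")).decode("utf-8")

encoded = name_to_filename("C/A/T")
print(encoded)                    # contains no "/" character
print(filename_to_name(encoded))  # round-trips to the original name
```

Note that Base64 output is case-sensitive, so this scheme may still collide on a case-insensitive file system.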

More suggestions: how about a hash? Or just use IDs from the database?

On Tue, May 7, 2013 at 8:58 PM, Andrew Berg <bahamut...@gmail.com> wrote:
> Currently, I keep Last.fm artist data caches to avoid unnecessary API calls and have been naming the files using the artist name. However,
> artist names can have characters that are not allowed in file names for most file systems (e.g., C/A/T has forward slashes). Are there any
> recommended strategies for naming such files while avoiding conflicts (I wouldn't want to run into problems for an artist named C-A-T or
> CAT, for example)? I'd like to make the files easily identifiable, and there really are no limits on what characters can be in an artist name.

> --
> CPython 3.3.1 | Windows NT 6.2.9200 / FreeBSD 9.1

--
Fábio Santos

MRAB

May 7, 2013, 6:21:17 PM

to pytho...@python.org

On 07/05/2013 20:58, Andrew Berg wrote:
> Currently, I keep Last.fm artist data caches to avoid unnecessary API calls and have been naming the files using the artist name. However,
> artist names can have characters that are not allowed in file names for most file systems (e.g., C/A/T has forward slashes). Are there any
> recommended strategies for naming such files while avoiding conflicts (I wouldn't want to run into problems for an artist named C-A-T or
> CAT, for example)? I'd like to make the files easily identifiable, and there really are no limits on what characters can be in an artist name.
>

Conflicts won't occur if:

1. All of the characters of the artist's name are mapped to an encoding.

2. Different characters map to different encodings.

3. No encoding is a prefix of another encoding.

In practice, you'll be mapping most characters to themselves.
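
The three conditions above can be checked mechanically. This is only a sketch: the helper names and the tiny alphabet are hypothetical, and the '%' escapes are one possible choice of encoding.

```python
from itertools import permutations

def build_mapping(alphabet, escapes):
    """Condition 1: map every character; unlisted characters map to themselves."""
    return {ch: escapes.get(ch, ch) for ch in alphabet}

def is_conflict_free(mapping):
    codes = list(mapping.values())
    # Condition 2: different characters must map to different encodings.
    if len(set(codes)) != len(codes):
        return False
    # Condition 3: no encoding may be a prefix of another encoding.
    return not any(a.startswith(b) for a, b in permutations(codes, 2))

alphabet = "CAT-/%25F"  # illustrative alphabet, not a real character set
good = build_mapping(alphabet, {"/": "%2F", "%": "%25"})
bad = build_mapping(alphabet, {"/": "%2F"})  # '%' maps to itself

print(is_conflict_free(good))  # True
print(is_conflict_free(bad))   # False: '%' is a prefix of '%2F'
```

The failing case shows why the escape character itself has to be encoded: otherwise an artist literally named "%2F" would collide with one named "/".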

Dan Stromberg

May 7, 2013, 6:29:32 PM

to Andrew Berg, comp.lang.python

On 5/7/13, Andrew Berg <bahamut...@gmail.com> wrote:
> Currently, I keep Last.fm artist data caches to avoid unnecessary API calls
> and have been naming the files using the artist name. However,
> artist names can have characters that are not allowed in file names for most
> file systems (e.g., C/A/T has forward slashes). Are there any
> recommended strategies for naming such files while avoiding conflicts (I
> wouldn't want to run into problems for an artist named C-A-T or
> CAT, for example)? I'd like to make the files easily identifiable, and there
> really are no limits on what characters can be in an artist name.

You might consider:
http://stromberg.dnsalias.org/svn/backshift/trunk/escape_mod.py
http://stromberg.dnsalias.org/svn/backshift/trunk/test-escape_mod

It doubles the length of the string, but it produces safe, easily
readable escaped strings - which tends to make debugging easier.

It requires a couple of other modules (easily obtained from the same
SVN repo) though.

Jens Thoms Toerring

May 7, 2013, 6:37:01 PM

to

It's not clear what context you need this for. You
could e.g. replace all characters not allowed by the file
system by their hexadecimal (ASCII) values, preceded by a
'%' (so '/' would be changed to '%2F', and also encode a '%'
itself in a name by '%25'). Then you have a well-defined
two-way mapping ("isomorphic" if I remember my math-learning
days correctly) between the original name and the
way you store it. E.g.

"C/A/T" would become "C%2FA%2FT"

and

"C%2FA/T" would become "C%252FA%2FT"

You can translate back and forth between them with not too
much effort.
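
A minimal sketch of this escaping scheme in Python 3. The `UNSAFE` set and the function names are illustrative assumptions; the sketch also assumes every escaped character is ASCII, so two hex digits suffice.

```python
import re

UNSAFE = '\\/:*?"<>|'  # assumed set of forbidden characters; adjust per file system
ESCAPE = "%"

def escape_name(name, unsafe=UNSAFE):
    """Replace unsafe characters (and the escape character itself) by %XX."""
    out = []
    for ch in name:
        if ch == ESCAPE or ch in unsafe:
            out.append("%{:02X}".format(ord(ch)))
        else:
            out.append(ch)  # everything else, including non-ASCII, stays as-is
    return "".join(out)

def unescape_name(escaped):
    """Reverse escape_name(): turn each %XX back into its character."""
    return re.sub(r"%([0-9A-F]{2})", lambda m: chr(int(m.group(1), 16)), escaped)

print(escape_name("C/A/T"))          # C%2FA%2FT
print(escape_name("C%2FA/T"))        # C%252FA%2FT
print(unescape_name("C%252FA%2FT"))  # C%2FA/T
```

Because the replacement text produced by `re.sub` is not rescanned, "%252F" decodes to the literal "%2F" rather than to "/".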

Of course, that assumes that '%' is a character allowed by
your file system - otherwise pick some other one, any one
will do in principle. It's a bit harder for a human to
interpret but rather likely not that much of a problem. You
probably will have seen that kind of scheme used in URLs.
The concept is rather old and called 'escape character',
i.e. have one character that assumes some special meaning
and that is itself also escaped.

If, on the other hand, those names are never to be translated back
to the original name another strategy would be to use the SHA1
hash value of the artist's name. Since clashes between SHA1 hash
values are rather hard to produce it's a rather safe method of
converting something (i.e. the artist's name) to a number. The
drawback, of course, is that you can't translate back from the
hash value to the original name (if that were simple the
whole thing wouldn't work ;-)
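
A sketch of the hash variant using the standard `hashlib` module. The function name, the UTF-8 encoding, and the ".cache" extension are assumptions made for illustration.

```python
import hashlib

def hashed_filename(artist, extension=".cache"):
    """Derive a file name from the SHA1 hex digest of the artist name."""
    digest = hashlib.sha1(artist.encode("utf-8")).hexdigest()
    return digest + extension  # 40 hex digits plus the (hypothetical) extension

for artist in ["C/A/T", "C-A-T", "CAT"]:
    print(hashed_filename(artist))
```

The result contains only the characters 0-9 and a-f, so it is safe on any file system, but finding the artist behind a given file would need a separate index.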

Regards, Jens
--
\ Jens Thoms Toerring ___ j...@toerring.de
\__________________________ http://toerring.de

Chris Angelico

May 7, 2013, 7:04:23 PM

to pytho...@python.org

On Wed, May 8, 2013 at 8:18 AM, Fábio Santos <fabiosa...@gmail.com> wrote:
> I suggest Base64. b64encode
> (http://docs.python.org/2/library/base64.html#base64.b64encode) and
> b64decode take an argument which allows you to eliminate the pesky "/"
> character. It's reversible and simple.

But it doesn't look anything like the original.

I'd be inclined to go for something like quoted-printable or
URL-encoding; special characters become much longer, but ordinary
characters (mostly) stay as themselves.

ChrisA

Andrew Berg

May 7, 2013, 7:10:25 PM

to comp.lang.python

On 2013.05.07 17:18, Fábio Santos wrote:
> I suggest Base64. b64encode
> (http://docs.python.org/2/library/base64.html#base64.b64encode) and
> b64decode take an argument which allows you to eliminate the pesky "/"
> character. It's reversible and simple.
>

> More suggestions: how about a hash? Or just use IDs from the database?

None of these would work because I would have no idea which file stores data for which artist without writing code to figure it out. If I
were to end up writing a bug that messed up a few of my cache files and noticed it with a specific artist (e.g., doing a "now playing" and
seeing the wrong tags), I would either have to manually match up the hash or base64 encoding in order to delete just that file so that it
gets regenerated or nuke and regenerate my entire cache.

Andrew Berg

May 7, 2013, 7:00:43 PM

to comp.lang.python

On 2013.05.07 17:01, Terry Jan Reedy wrote:
> Sounds like you want something like the html escape or urlencode
> functions, which serve the same purpose of encoding special chars.
> Rather than invent a new transformation, you could use the same scheme
> used for html entities. (Sorry, I forget the details.) It is possible
> that one of the functions would work for you as is, or with little
> modification.

This has the problem of mangling non-ASCII characters (and artist names with non-ASCII characters are not rare). I most definitely want to
keep as many characters untouched as possible so that the files are easy to identify by looking at the file name. Ideally, only characters
that file systems don't like would be transformed.

Dave Angel

May 7, 2013, 8:14:54 PM

to pytho...@python.org

On 05/07/2013 03:58 PM, Andrew Berg wrote:
> Currently, I keep Last.fm artist data caches to avoid unnecessary API calls and have been naming the files using the artist name. However,
> artist names can have characters that are not allowed in file names for most file systems (e.g., C/A/T has forward slashes). Are there any
> recommended strategies for naming such files while avoiding conflicts (I wouldn't want to run into problems for an artist named C-A-T or
> CAT, for example)? I'd like to make the files easily identifiable, and there really are no limits on what characters can be in an artist name.
>

So what you need first is a list of allowable characters for all your
target OS versions. And don't forget that the allowable characters may
vary depending on the particular file system(s) mounted on a given OS.

You also need to decide how to handle Unicode characters, since they're
different for different OS. In Windows on NTFS, filenames are in
Unicode, while on Unix, filenames are bytes. So on one of those, you
will be encoding/decoding if your code is to be mostly portable.

Don't forget that ls and rm may not use the same encoding you're using.
So you may not consider it adequate to make the names legal, but you
may also want them to be easily typeable in the shell.

--
DaveA

Roy Smith

May 7, 2013, 8:22:17 PM

to

In article <mailman.1428.1367972...@python.org>,

One possible tool that may help you here is unidecode
(https://pypi.python.org/pypi/Unidecode). It doesn't solve your whole
problem, but it does help get unicode text into a form which is both
7-bit clean and human readable.

Andrew Berg

May 7, 2013, 7:30:53 PM

to comp.lang.python

On 2013.05.07 17:37, Jens Thoms Toerring wrote:
> You
> could e.g. replace all characters not allowed by the file
> system by their hexadecimal (ASCII) values, preceded by a
> '%' (so '/' would be changed to '%2F', and also encode a '%'
> itself in a name by '%25'). Then you have a well-defined
> two-way mapping ("isomorphic" if I remember my math-learning
> days correctly) between the original name and the
> way you store it. E.g.
>
> "C/A/T" would become "C%2FA%2FT"
>
> and
>
> "C%2FA/T" would become "C%252FA%2FT"
>
> You can translate back and forth between them with not too
> much effort.
>
> Of course, that assumes that '%' is a character allowed by
> your file system - otherwise pick some other one, any one
> will do in principle. It's a bit harder for a human to
> interpret but rather likely not that much of a problem.

Yes, something like this is what I am trying to achieve. Judging by the responses I've gotten so far, I think I'll have to roll my own
transformation scheme since URL encoding and the like transform Unicode characters. I can memorize that 植松伸夫 is a Japanese composer who
is well-known for his works in the Final Fantasy series of video games. Trying to match up the URL-encoded version to an artist would be
almost impossible when I have several other artist names that have no ASCII characters.
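
To illustrate the point, here is what the standard library's URL quoting does to the two kinds of names, assuming Python 3's `urllib.parse.quote` with its default UTF-8 encoding:

```python
from urllib.parse import quote, unquote

# safe="" makes quote() escape "/" as well; by default it leaves "/" alone.
ascii_name = quote("C/A/T", safe="")
kanji_name = quote("\u690d\u677e\u4f38\u592b", safe="")  # the composer named above

print(ascii_name)  # C%2FA%2FT
print(kanji_name)  # %E6%A4%8D%E6%9D%BE%E4%BC%B8%E5%A4%AB
```

The mapping is reversible with `unquote`, but every non-ASCII character turns into three or more escape sequences, which is exactly the readability problem described.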

Andrew Berg

May 7, 2013, 8:51:24 PM

to comp.lang.python

On 2013.05.07 19:14, Dave Angel wrote:
> You also need to decide how to handle Unicode characters, since they're
> different for different OS. In Windows on NTFS, filenames are in
> Unicode, while on Unix, filenames are bytes. So on one of those, you
> will be encoding/decoding if your code is to be mostly portable.

Characters outside whatever sys.getfilesystemencoding() returns won't be allowed. If the user's locale settings don't support Unicode, my
program will be far from the only one to have issues with it. Any problem reports that arise from a user moving between legacy encodings
will generally be ignored. I haven't yet decided how I will handle artist names with characters outside UTF-8, but inside UTF-16/32 (UTF-16
is just fine on Windows/NTFS, but on Unix(-ish) systems, many use UTF-8 in their locale settings).
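
A rough sketch of such a check, assuming that "allowed" means the name can be encoded strictly with the file system encoding. The function name and the optional `encoding` parameter (for trying out other encodings) are hypothetical.

```python
import sys

def representable(name, encoding=None):
    """Return True if name can be encoded with the file system encoding."""
    encoding = encoding or sys.getfilesystemencoding()
    try:
        name.encode(encoding)  # strict error handling: raises on failure
    except UnicodeEncodeError:
        return False
    return True

print(sys.getfilesystemencoding())
print(representable("C-A-T"))
print(representable("\u690d\u677e\u4f38\u592b", encoding="ascii"))  # False
```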

> Don't forget that ls and rm may not use the same encoding you're using.
> So you may not consider it adequate to make the names legal, but you
> may also want them to be easily typeable in the shell.

I don't understand. I have no intention of changing Unicode characters.

This is not a Unicode issue since (modern) file systems will happily accept it. The issue is that certain characters (which are ASCII) are
not allowed on some file systems:
\ / : * ? " < > | @ and the NUL character
The first 9 are not allowed on NTFS, the @ is not allowed on ext3cow, and NUL and / are not allowed on pretty much any file system. Locale
settings and encodings aside, these 11 characters will need to be escaped.

Dave Angel

May 7, 2013, 9:13:11 PM

to pytho...@python.org

On 05/07/2013 08:51 PM, Andrew Berg wrote:
> On 2013.05.07 19:14, Dave Angel wrote:
>> You also need to decide how to handle Unicode characters, since they're
>> different for different OS. In Windows on NTFS, filenames are in
>> Unicode, while on Unix, filenames are bytes. So on one of those, you
>> will be encoding/decoding if your code is to be mostly portable.
> Characters outside whatever sys.getfilesystemencoding() returns won't be allowed. If the user's locale settings don't support Unicode, my
> program will be far from the only one to have issues with it. Any problem reports that arise from a user moving between legacy encodings
> will generally be ignored. I haven't yet decided how I will handle artist names with characters outside UTF-8,

There aren't any characters "outside UTF-8". But a character is not "in
utf-8", it can be encoded by utf-8.

> but inside UTF-16/32 (UTF-16

Nor outside UTF-16 or 32.

> is just fine on Windows/NTFS, but on Unix(-ish) systems, many use UTF-8 in their locale settings).
>> Don't forget that ls and rm may not use the same encoding you're using.
>> So you may not consider it adequate to make the names legal, but you
>> may also want them to be easily typeable in the shell.
> I don't understand. I have no intention of changing Unicode characters.

So you're comfortable typing arbitrary characters? What about all the
characters that have identical displays in your font? What about viewing
0x07 in the terminal window? Or 0x04?

>
>
> This is not a Unicode issue since (modern) file systems will happily accept it. The issue is that certain characters (which are ASCII) are
> not allowed on some file systems:
> \ / : * ? " < > | @ and the NUL character
> The first 9 are not allowed on NTFS, the @ is not allowed on ext3cow, and NUL and / are not allowed on pretty much any file system. Locale
> settings and encodings aside, these 11 characters will need to be escaped.
>

As soon as you have a small, finite list of invalid characters, writing
an escape system is pretty easy.

--
DaveA

Neil Hodgson

May 7, 2013, 9:28:11 PM

to

Andrew Berg:

> This is not a Unicode issue since (modern) file systems will happily accept it. The issue is that certain characters (which are ASCII) are
> not allowed on some file systems:
> \ / : * ? " < > | @ and the NUL character
> The first 9 are not allowed on NTFS, the @ is not allowed on ext3cow, and NUL and / are not allowed on pretty much any file system. Locale
> settings and encodings aside, these 11 characters will need to be escaped.

There's also the Windows device name hole. There may be trouble with
artists named 'COM4', 'CLOCK$', 'Con', or similar.

http://support.microsoft.com/kb/74496
http://en.wikipedia.org/wiki/Nul_%28band%29

Neil

Dave Angel

May 7, 2013, 9:45:08 PM

to pytho...@python.org

On 05/07/2013 09:28 PM, Neil Hodgson wrote:
> Andrew Berg:
>
>> This is not a Unicode issue since (modern) file systems will happily
>> accept it. The issue is that certain characters (which are ASCII) are
>> not allowed on some file systems:
>> \ / : * ? " < > | @ and the NUL character
>> The first 9 are not allowed on NTFS, the @ is not allowed on ext3cow,
>> and NUL and / are not allowed on pretty much any file system. Locale
>> settings and encodings aside, these 11 characters will need to be
>> escaped.
>
> There's also the Windows device name hole. There may be trouble with
> artists named 'COM4', 'CLOCK$', 'Con', or similar.
>

In MSDOS 2, there was a switch that would tell the OS to ignore such
names unless they were prefixed by \DEV. But like the switchar switch,
it was largely ignored by the ignorant, and probably doesn't exist in
current versions of M$OS

> http://support.microsoft.com/kb/74496
> http://en.wikipedia.org/wiki/Nul_%28band%29
>
> Neil

While we're looking for trouble, there's also case insensitivity.
Unclear if the user cares, but tom and TOM are the same file in most
configurations of NT.

--
DaveA

Andrew Berg

May 7, 2013, 10:20:13 PM

to comp.lang.python

On 2013.05.07 20:45, Dave Angel wrote:
> While we're looking for trouble, there's also case insensitivity.
> Unclear if the user cares, but tom and TOM are the same file in most
> configurations of NT.

Artist names on Last.fm cannot differ only in case. This does remind me to make sure to update the case of the artist name as necessary,
though. For example, if Sam becomes SAM again (I have seen Last.fm change the case for artist names), I need to make sure that I don't end
up with two file names differing only in case.
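
One way to spot such pairs before writing files is to group names by a case-folded key. This is a sketch only: `str.casefold()` (new in Python 3.3) approximates, but is not guaranteed to match, the comparison rules of any particular file system, and the function name is hypothetical.

```python
from collections import defaultdict

def case_collisions(names):
    """Return groups of names that differ only in case."""
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    return [group for group in groups.values() if len(group) > 1]

print(case_collisions(["Sam", "SAM", "tom"]))  # [['Sam', 'SAM']]
```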

Andrew Berg

May 7, 2013, 10:21:41 PM

to comp.lang.python

On 2013.05.07 20:13, Dave Angel wrote:
> So you're comfortable typing arbitrary characters? what about all the
> characters that have identical displays in your font?

Identification is more important than typing. I can copy and paste into a terminal if necessary. I don't foresee typing out one of the
filenames being anything more than a rare occurrence, but I will occasionally just read the list.

> What about viewing
> 0x07 in the terminal window? Or 0x04?

I don't think Last.fm will even send those characters. In any case, control characters in artist names are rare enough that it's not worth
the trouble to write the code to avoid the problems associated with them.

> As soon as you have a small, finite list of invalid characters, writing
> an escape system is pretty easy.

Probably. I was just hoping there was an existing system that would work, but as I said in a different reply, it would seem I need to roll
my own.

Roy Smith

May 7, 2013, 10:21:53 PM

to

In article <mailman.1435.1367977...@python.org>,

Dave Angel <da...@davea.name> wrote:

> While we're looking for trouble, there's also case insensitivity.
> Unclear if the user cares, but tom and TOM are the same file in most
> configurations of NT.

OSX, too.

Andrew Berg

May 7, 2013, 10:06:27 PM

to comp.lang.python

On 2013.05.07 20:28, Neil Hodgson wrote:
> http://support.microsoft.com/kb/74496
> http://en.wikipedia.org/wiki/Nul_%28band%29
I can indeed confirm that at least 'nul' cannot be used as a filename. However, I add an extension to the file names to identify them as caches.

Steven D'Aprano

May 7, 2013, 11:40:50 PM

to

On Tue, 07 May 2013 19:51:24 -0500, Andrew Berg wrote:

> On 2013.05.07 19:14, Dave Angel wrote:
>> You also need to decide how to handle Unicode characters, since they're
>> different for different OS. In Windows on NTFS, filenames are in
>> Unicode, while on Unix, filenames are bytes. So on one of those, you
>> will be encoding/decoding if your code is to be mostly portable.
>
> Characters outside whatever sys.getfilesystemencoding() returns won't be
> allowed. If the user's locale settings don't support Unicode, my program
> will be far from the only one to have issues with it. Any problem
> reports that arise from a user moving between legacy encodings will
> generally be ignored. I haven't yet decided how I will handle artist
> names with characters outside UTF-8, but inside UTF-16/32 (UTF-16 is
> just fine on Windows/NTFS, but on Unix(-ish) systems, many use UTF-8 in
> their locale settings).

There aren't any characters outside of UTF-8 :-) UTF-8 covers the entire
Unicode range, unlike other encodings like Latin-1 or ASCII.

Well, that is to say, there may be characters that are not (yet) handled
at all by Unicode, but there are no known legacy encodings that support
such characters.

To a first approximation, Unicode covers the entire set of characters in
human use, and for those which it does not, there is always the private
use area. So for example, if you wish to record the Artist Formerly Known
As "The Artist Formerly Known As Prince" as Love Symbol, you could pick
an arbitrary private use code point, declare that for your application
that code point means Love Symbol, and use that code point as the artist
name. You could even come up with a custom font that includes a rendition
of that character glyph.

However, there are byte combinations which are not valid UTF-8, which is
a different story. If you're receiving bytes from (say) a file name, they
may not necessarily make up a valid UTF-8 string. But this is not an
issue if you are receiving data from something guaranteed to be valid
UTF-8.
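
A short illustration of that distinction in Python 3; the helper name is made up for this example.

```python
def decode_or_none(raw):
    """Decode bytes as UTF-8, or return None if they are not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None

print(decode_or_none(b"C/A/T"))         # valid: ASCII bytes are valid UTF-8
print(decode_or_none(b"\xff\xfeCAT"))   # None: 0xFF never occurs in UTF-8
```

For bytes that really do come from file names, the `surrogateescape` error handler is an alternative that keeps the undecodable bytes recoverable instead of rejecting them.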

>> Don't forget that ls and rm may not use the same encoding you're using.
>> So you may not consider it adequate to make the names legal, but you
>> may also want them to be easily typeable in the shell.
>
> I don't understand. I have no intention of changing Unicode characters.

Of course you do. You even talk below about Unicode characters like *
and ? not being allowed on NTFS systems.

Perhaps you are thinking that there are a bunch of characters over here
called "plain text ASCII characters", and a *different* bunch of
characters with funny accents and stuff called "Unicode characters". If
so, then you are labouring under a misapprehension, and you should start
off by reading this:

http://www.joelonsoftware.com/articles/Unicode.html

then come back with any questions.

> This is not a Unicode issue since (modern) file systems will happily
> accept it. The issue is that certain characters (which are ASCII) are
> not allowed on some file systems:
> \ / : * ? " < > | @ and the NUL character

These are all Unicode characters too. Unicode is a subset of ASCII, so
anything which is ASCII is also Unicode.

> The first 9 are not allowed on NTFS, the @ is not allowed on ext3cow,
> and NUL and / are not allowed on pretty much any file system. Locale
> settings and encodings aside, these 11 characters will need to be
> escaped.

If you have an artist with control characters in their name, like newline
or carriage return or NUL, I think it is fair to just drop the control
characters and then give the artist a thorough thrashing with a halibut.

Does your mapping really need to be guaranteed reversible? If you have an
artist called "JoeBlow", and another artist called "Joe\0Blow", and a
third called "Joe\nBlow", does it *really* matter if your application
conflates them?
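
A sketch of what dropping control characters would look like, and of the conflation it causes. It assumes "control character" means Unicode general category Cc; the function name is hypothetical.

```python
import unicodedata

def drop_control_chars(name):
    """Remove characters in Unicode category Cc (NUL, newline, etc.)."""
    return "".join(ch for ch in name if unicodedata.category(ch) != "Cc")

names = ["JoeBlow", "Joe\0Blow", "Joe\nBlow"]
print({drop_control_chars(n) for n in names})  # {'JoeBlow'}
```

All three names collapse to one, so this mapping is not reversible.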

--
Steven

Dave Angel

May 8, 2013, 12:10:13 AM

to pytho...@python.org

On 05/07/2013 10:06 PM, Andrew Berg wrote:
> On 2013.05.07 20:28, Neil Hodgson wrote:
>> http://support.microsoft.com/kb/74496
>> http://en.wikipedia.org/wiki/Nul_%28band%29
> I can indeed confirm that at least 'nul' cannot be used as a filename. However, I add an extension to the file names to identify them as caches.
>

Won't help. NUL.txt is just as reserved as NUL is. Extensions are
ignored in this particular piece of historical nonsense.
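
A sketch of a check that takes this into account by looking only at the part of the name before the first dot. The list below is an assumption: CON, PRN, AUX, NUL, COM1-COM9 and LPT1-LPT9 are the commonly documented device names, CLOCK$ is the one mentioned earlier in this thread, and the function name is hypothetical.

```python
RESERVED = (
    {"CON", "PRN", "AUX", "NUL", "CLOCK$"}
    | {"COM%d" % i for i in range(1, 10)}
    | {"LPT%d" % i for i in range(1, 10)}
)

def is_reserved_windows_name(filename):
    """True if the name before the first dot is a DOS device name."""
    stem = filename.split(".", 1)[0]
    return stem.upper() in RESERVED  # device names are case-insensitive

print(is_reserved_windows_name("NUL.txt"))         # True
print(is_reserved_windows_name("Con"))             # True
print(is_reserved_windows_name("Confetti.cache"))  # False
```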

--
DaveA

Dave Angel

May 8, 2013, 12:13:20 AM

to pytho...@python.org

On 05/07/2013 11:40 PM, Steven D'Aprano wrote:
>
> <SNIP>

>
> These are all Unicode characters too. Unicode is a subset of ASCII, so
> anything which is ASCII is also Unicode.
>
>

Typo. You meant Unicode is a superset of ASCII.

--
DaveA

Steven D'Aprano

May 8, 2013, 12:47:24 AM

to

Damn. Yes, you're right. I was thinking superset, but my fingers typed
subset.

Thanks for the correction.

--
Steven

Andrew Berg

May 8, 2013, 12:49:46 AM

to comp.lang.python

On 2013.05.07 22:40, Steven D'Aprano wrote:
> There aren't any characters outside of UTF-8 :-) UTF-8 covers the entire
> Unicode range, unlike other encodings like Latin-1 or ASCII.

You are correct. I'm not sure what I was thinking.

>> I don't understand. I have no intention of changing Unicode characters.
>
> Of course you do. You even talk below about Unicode characters like *
> and ? not being allowed on NTFS systems.

I worded that incorrectly. What I meant, of course, is that I intend to preserve as many characters as possible and have no need to stay
within ASCII.

> If you have an artist with control characters in their name, like newline
> or carriage return or NUL, I think it is fair to just drop the control
> characters and then give the artist a thorough thrashing with a halibut.

While the thrashing with a halibut may be warranted (though I personally would use a rubber chicken), conflicts are problematic.

> Does your mapping really need to be guaranteed reversible? If you have an
> artist called "JoeBlow", and another artist called "Joe\0Blow", and a
> third called "Joe\nBlow", does it *really* matter if your application
> conflates them?

Yes and yes. Some artists like to be real cute with their names and make witch house artist names look tame in comparison, and some may
choose to use names similar to some very popular artists. I've also seen people scrobble fake artists with names that look like real artist
names (using things like a non-breaking space instead of a regular space) with different artist pictures in order to confuse and troll
people. If I could remember the user profiles with this, I'd link them. Last.fm is a silly place.
As I said before though, I don't think control characters are even allowed in artist names (likely for technical reasons).
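
Look-alike names of this kind can at least be told apart by listing their code points. A small sketch using the standard `unicodedata` module; the function name and the example names are made up for illustration.

```python
import unicodedata

def describe(name):
    """List each character as 'U+XXXX NAME' so look-alikes become visible."""
    return [
        "U+{:04X} {}".format(ord(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in name
    ]

real = "Joe Blow"       # regular space
fake = "Joe\u00a0Blow"  # non-breaking space, renders almost identically

print(real == fake)       # False
print(describe(real)[3])  # U+0020 SPACE
print(describe(fake)[3])  # U+00A0 NO-BREAK SPACE
```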

Message has been deleted

Roy Smith

May 8, 2013, 8:16:25 PM

to

In article <mailman.1465.1368056...@python.org>,
Dennis Lee Bieber <wlf...@ix.netcom.com> wrote:

> On Tue, 07 May 2013 18:10:25 -0500, Andrew Berg
> <bahamut...@gmail.com> declaimed the following in
> gmane.comp.python.general:

>
> > None of these would work because I would have no idea which file stores
> > data for which artist without writing code to figure it out. If I
> > were to end up writing a bug that messed up a few of my cache files and
> > noticed it with a specific artist (e.g., doing a "now playing" and
> > seeing the wrong tags), I would either have to manually match up the hash
> > or base64 encoding in order to delete just that file so that it
> > gets regenerated or nuke and regenerate my entire cache.
> >

> And now you've seen why music players don't show the user the
> physical file name, but maintain a database mapping the internal data
> (name, artist, track#, album, etc.) to whatever mangled name was needed
> to satisfy the file system.

Yup. At Songza, we deal with this crap every day. It usually bites us
the worst when trying to do keyword searches. When somebody types in
"Blue Oyster Cult", they really mean "Blue Oyster Cult", and our search
results need to reflect that. Likewise for Ke$ha, Beyonce, and I don't
even want to think about the artist formerly known as an unpronounceable
glyph.

Pro-tip, guys. If you want to form a band, and expect people to be able
to find your stuff in a search engine some day, don't play cute with
your name.

Chris Angelico

May 8, 2013, 8:27:14 PM

to pytho...@python.org

On Thu, May 9, 2013 at 10:16 AM, Roy Smith <r...@panix.com> wrote:
> Pro-tip, guys. If you want to form a band, and expect people to be able
> to find your stuff in a search engine some day, don't play cute with
> your name.

It's the modern equivalent of names like Catherine Withekay.

ChrisA

Steven D'Aprano

May 8, 2013, 9:49:22 PM

to

On Wed, 08 May 2013 20:16:25 -0400, Roy Smith wrote:

> Yup. At Songza, we deal with this crap every day. It usually bites us
> the worst when trying to do keyword searches. When somebody types in
> "Blue Oyster Cult", they really mean "Blue Oyster Cult",

Surely they really mean Blue Öyster Cult.

> and our search
> results need to reflect that. Likewise for Ke$ha, Beyonce, and I don't
> even want to think about the artist formerly known as an unpronounceable
> glyph.

Dropped or incorrect accents are no different from any other misspelling,
and good search engines (whether online or in a desktop application)
should be able to deal with a tolerable number of misspellings.

Googling for "Blue Oyster Cult" brings up four of the top ten hits
spelled correctly with the accent, "Blue Öyster Cult". Even misspelled as
"blew oytser cult", Google does the right thing.

Even Bing manages to find Ke$ha's wikipedia page, her official website,
youtube channel, facebook and myspace pages from the misspelling "kehsha".
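
A toy sketch of accent- and typo-tolerant matching with the standard library. It is not how Google, Bing or Songza work; the artist list, the function names and the 0.6 cutoff are assumptions chosen for the example.

```python
import difflib
import unicodedata

ARTISTS = ["Blue \u00d6yster Cult", "Ke$ha", "The The"]  # illustrative catalogue

def fold(text):
    """Strip accents (combining marks) and case so 'Öyster' matches 'oyster'."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

def search(query, artists=ARTISTS):
    """Return artists whose folded name is close to the folded query."""
    index = {fold(artist): artist for artist in artists}
    hits = difflib.get_close_matches(fold(query), list(index), n=3, cutoff=0.6)
    return [index[hit] for hit in hits]
```

With this catalogue, `search("Blue Oyster Cult")` and `search("kehsha")` each return the intended artist as the first hit.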

> Pro-tip, guys. If you want to form a band, and expect people to be able
> to find your stuff in a search engine some day, don't play cute with
> your name.

Googling for "the the" (including quotes) brings up 145 million hits,
nine of the first ten hits being relevant to the band.

On the other hand, I wouldn't want to be in a band called "The Beetles".

--
Steven

Roy Smith

May 8, 2013, 9:56:41 PM

to

In article <518b00a2$0$29997$c3e8da3$5496...@news.astraweb.com>,

Steven D'Aprano <steve+comp....@pearwood.info> wrote:

> > When somebody types in
> > "Blue Oyster Cult", they really mean "Blue Oyster Cult",
>
> Surely they really mean Blue Öyster Cult.

Yes. The oomlaut was there when I typed it. Who knows what happened to
it by the time it hit the wire.

Andrew Berg

May 8, 2013, 10:11:28 PM

to comp.lang.python

On 2013.05.08 19:16, Roy Smith wrote:
> Yup. At Songza, we deal with this crap every day. It usually bites us
> the worst when trying to do keyword searches. When somebody types in
> "Blue Oyster Cult", they really mean "Blue Oyster Cult", and our search
> results need to reflect that. Likewise for Ke$ha, Beyonce, and I don't
> even want to think about the artist formerly known as an unpronounceable
> glyph.
>
> Pro-tip, guys. If you want to form a band, and expect people to be able
> to find your stuff in a search engine some day, don't play cute with
> your name.

It's a thing (especially in witch house) to make names with odd glyphs in order to be harder to find and be more "underground". Very silly.
Try doing searches for these artists with names like these:
http://www.last.fm/music/%E2%96%BC%E2%96%A1%E2%96%A0%E2%96%A1%E2%96%A0%E2%96%A1%E2%96%A0
http://www.last.fm/music/ki%E2%80%A0%E2%80%A0y+c%E2%96%B2t
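
For reference, the artist names can be recovered from these URLs with `urllib.parse.unquote_plus`. This sketch assumes the last path segment is the UTF-8, percent-encoded artist name and that '+' stands for a space; the function name is hypothetical.

```python
from urllib.parse import unquote_plus

urls = [
    "http://www.last.fm/music/%E2%96%BC%E2%96%A1%E2%96%A0%E2%96%A1%E2%96%A0%E2%96%A1%E2%96%A0",
    "http://www.last.fm/music/ki%E2%80%A0%E2%80%A0y+c%E2%96%B2t",
]

def artist_from_url(url):
    """Decode the last path segment of a Last.fm artist URL."""
    return unquote_plus(url.rsplit("/", 1)[1])

for url in urls:
    # ascii() shows the code points even on terminals without these glyphs
    print(ascii(artist_from_url(url)))
```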

Steven D'Aprano

May 8, 2013, 11:08:43 PM

to

On Wed, 08 May 2013 21:11:28 -0500, Andrew Berg wrote:

> It's a thing (especially in witch house) to make names with odd glyphs
> in order to be harder to find and be more "underground". Very silly. Try
> doing searches for these artists with names like these:

Challenge accepted.

> http://www.last.fm/music/%E2%96%BC%E2%96%A1%E2%96%A0%E2%96%A1%E2%96%A0%E2%96%A1%E2%96%A0
> http://www.last.fm/music/ki%E2%80%A0%E2%80%A0y+c%E2%96%B2t

The second one is trivial. Googling for "kitty cat" "witch house"
(including quotes) gives at least 3 relevant links out of the top
4 hits. (I'm not sure about the Youtube page.) That gets you
the correct spelling, "ki††y c△t", and googling for that brings up many
more hits.

The first one is a tad trickier, since googling for "▼□■□■□■" brings up
nothing at all, and "mourning star" doesn't give any relevant hits on the
first page. But "mourning star" "witch house" (inc. quotes) is successful.

I suspect that the only way to be completely ungoogleable would be to
name yourself something common, not something obscure. Say, if you called
yourself "Hard Rock Band", and did hard rock. But then, googling for
"Heavy Metal" alone brings up the magazine as the fourth hit, so if you
get famous enough, even that won't work.

--
Steven

Roy Smith

May 9, 2013, 8:55:35 AM

In article <518b133b$0$29997$c3e8da3$5496...@news.astraweb.com>,

Steven D'Aprano <steve+comp....@pearwood.info> wrote:

> I suspect that the only way to be completely ungoogleable would be to
> name yourself something common, not something obscure.

http://en.wikipedia.org/wiki/The_band

Gregory Ewing

May 9, 2013, 8:04:58 PM

Nope... googling for "the band" brings that up as the
very first result.

The Google knows all. You cannot escape The Google...

--
Greg

Tim Chase

May 9, 2013, 8:17:07 PM

to Gregory Ewing, pytho...@python.org

On 2013-05-10 12:04, Gregory Ewing wrote:

> Roy Smith wrote:
> > http://en.wikipedia.org/wiki/The_band
>
> Nope... googling for "the band" brings that up as the
> very first result.
>
> The Google knows all. You cannot escape The Google...

That does it. I'm naming my band "Google". :-)

-tkc

Andrew Berg

May 8, 2013, 10:12:42 PM

to comp.lang.python

On 2013.05.08 18:37, Dennis Lee Bieber wrote:
> And now you've seen why music players don't show the user the
> physical file name, but maintain a database mapping the internal data
> (name, artist, track#, album, etc.) to whatever mangled name was needed
> to satisfy the file system.

Tags are used mainly for organization but a nice benefit of tags is that they are not subject to file system or URL or whatever other
limits. If an audio file has no metadata, most players will show the file name.
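
A minimal sketch of the mapping idea quoted above, applied to the artist cache: derive the on-disk name from a hash of the artist name and keep an index from file name back to the real name. The helper names and the ".json" extension are assumptions for illustration, not anything a particular player does.

```python
import hashlib
import json


def cache_filename(artist):
    """Derive a file-system-safe name from any artist name (illustrative helper)."""
    # The hex digest only contains 0-9 and a-f, so it is valid on any file system.
    digest = hashlib.sha256(artist.encode("utf-8")).hexdigest()
    return digest + ".json"


def build_index(artists):
    """Map each generated file name back to the artist name it stands for."""
    return {cache_filename(artist): artist for artist in artists}


index = build_index(["C/A/T", "C-A-T", "CAT"])
# The index itself can be stored as JSON next to the cache files.
print(json.dumps(index, ensure_ascii=False, indent=2))
```

The trade-off is that the files are no longer identifiable by eye; the index (or a tag inside each file) has to carry the real name.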

Chris Angelico

May 8, 2013, 11:53:27 PM

to pytho...@python.org

On Thu, May 9, 2013 at 1:08 PM, Steven D'Aprano
<steve+comp....@pearwood.info> wrote:
> I suspect that the only way to be completely ungoogleable would be to
> name yourself something common, not something obscure. Say, if you called
> yourself "Hard Rock Band", and did hard rock. But then, googling for
> "Heavy Metal" alone brings up the magazine as the fourth hit, so if you
> get famous enough, even that won't work.

Yeah, so why are ubergeneric domain names worth so much? Whatevs.

The best way to be findable in a web search is to have content on your
web site. Real crawlable content. I guarantee you'll be found. Even if
you're some tiny thing tucked away in a corner of teh interwebs, you
can be found.

http://www.google.com/search?q=minstrel+hall

The song is there, but so is an obscure little D&D MUD.

ChrisA

Albert van der Horst

May 28, 2013, 9:44:10 AM

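In article <Lvydneajg7LXNhTM...@westnet.com.au>,
Neil Hodgson <nhod...@iinet.net.au> wrote:
> There's also the Windows device name hole. There may be trouble with
>artists named 'COM4', 'CLOCK$', 'Con', or similar.
>
>http://support.microsoft.com/kb/74496
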
That applies to MS-DOS names. God forbid that this still holds on more modern
Microsoft operating systems?

>http://en.wikipedia.org/wiki/Nul_%28band%29
>
> Neil
--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

Chris Angelico

May 28, 2013, 9:53:03 AM

to pytho...@python.org

On Tue, May 28, 2013 at 11:44 PM, Albert van der Horst
<alb...@spenarnc.xs4all.nl> wrote:
> In article <Lvydneajg7LXNhTM...@westnet.com.au>,
> Neil Hodgson <nhod...@iinet.net.au> wrote:
>> There's also the Windows device name hole. There may be trouble with
>>artists named 'COM4', 'CLOCK$', 'Con', or similar.
>>
>>http://support.microsoft.com/kb/74496
>
> That applies to MS-DOS names. God forbid that this still holds on more modern
> Microsoft operating systems?

Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> open("com1","w").write("Test\n")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
FileNotFoundError: [Errno 2] No such file or directory: 'com1'
>>> open("con","w").write("Test\n")
Test
5
>>>

ChrisA

Grant Edwards

May 28, 2013, 12:03:16 PM

On 2013-05-28, Albert van der Horst <alb...@spenarnc.xs4all.nl> wrote:

>> There's also the Windows device name hole. There may be trouble with
>> artists named 'COM4', 'CLOCK$', 'Con', or similar.
>>
>>http://support.microsoft.com/kb/74496
>
> That applies to MS-DOS names. God forbid that this still holds on
> more modern Microsoft operating systems?

There are no more modern Microsoft operating systems. Only more
recent ones. There are still lots of reserved filenames in recent
versions of Windows.
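
A hedged sketch of how a cache-naming function might guard against those names. The set below is an assumption: the classic DOS device names plus CLOCK$, which was mentioned earlier in the thread. The exact list varies between Windows versions, so check Microsoft's documentation before relying on it. The helper names are made up.

```python
# Assumed list of reserved device names; not guaranteed to be complete.
_RESERVED = {"CON", "PRN", "AUX", "NUL", "CLOCK$"}
_RESERVED.update("COM%d" % i for i in range(1, 10))
_RESERVED.update("LPT%d" % i for i in range(1, 10))


def is_reserved_name(filename):
    """Return True if filename would collide with a reserved device name."""
    # Windows ignores case and anything after the first dot, so "con.txt"
    # still refers to the console device.
    stem = filename.split(".", 1)[0].rstrip(" ").upper()
    return stem in _RESERVED


def avoid_reserved(filename, prefix="_"):
    """Prepend a prefix when the name is reserved; otherwise return it unchanged."""
    return prefix + filename if is_reserved_name(filename) else filename


for name in ("Con", "COM4", "CLOCK$", "nul.json", "Contact"):
    print(name, "->", avoid_reserved(name))
```

Applying the check on every platform, not just on Windows, keeps a cache directory portable if it is ever copied between systems.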

--
Grant Edwards               grant.b.edwards        Yow! I've got an IDEA!!
                                  at               Why don't I STARE at you
                              gmail.com            so HARD, you forget your
                                                   SOCIAL SECURITY NUMBER!!