Getting an ifstream for a file with unicode chars in the file name

Zapanaz

Jan 25, 2008, 7:57:58 PM

I am working with an open-source library which reads mp3 tags
(id3lib).

It seems to handle unicode within the tags OK ... but to
instantiate the central class from a file, it first creates an
ifstream this way:

ID3_Err dami::openReadableFile(string fileName, fstream& file)
{
file.open(fileName.c_str(), ios::in | ios::binary | NOCREATE);

Which fails if the file name contains unicode characters.
std::fstream::open() doesn't accept wchar_t file names at all, nor
UTF-8.

I can change the source code and rebuild it, I think I am stuck with
that much, but unless I am going to massively redesign the project, I
need to end up with an fstream. Internally throughout the rest of the
library it is going to expect an fstream as a class member.

But for the life of me I can't find a way to create an fstream from a
file with unicode characters in the file name.

I found a few suggestions through Google, like this one:

std::ifstream stm(_wfopen(pwsz,L"rb"));

Where you use _wfopen() to open the file with wchar_t characters in
the file name, then pass the FILE * returned by that to the ifstream
constructor. But at least in MSVC++ 6.0, that just doesn't work. The
constructor for ifstream won't accept a FILE *.

Does anybody know a way to do this?

Thanks for any help

--
Joe Cosby
http://joecosby.com/
My cat seems to produce enough extraneous hair to create another whole
cat about twice a year. They actually seem to be trying to form under
various tables and bookshelves, but only in uncarpeted areas. They
never get much beyond a sort of misty transparent stage and then they
get blown out into the middle of the room.
- nenslo

:: Currently listening to Reflection, 2004, by Ravi Shankar, from "The Rough Guide To Ravi Shankar"

David Wilkinson

Jan 25, 2008, 9:02:14 PM

Zapanaz wrote:
>
> I am working with an open-source library which reads mp3 tags
> (id3lib).
>
> It seems to handle unicode within the tags OK ... but to
> instantiate the central class from a file, it first creates an
> ifstream this way:
>
> ID3_Err dami::openReadableFile(string fileName, fstream& file)
> {
> file.open(fileName.c_str(), ios::in | ios::binary | NOCREATE);
>
> Which fails if the file name contains unicode characters.
> std::fstream::open() doesn't accept wchar_t file names at all, nor
> UTF-8.
>
> I can change the source code and rebuild it, I think I am stuck with
> that much, but unless I am going to massively redesign the project, I
> need to end up with an fstream. Internally throughout the rest of the
> library it is going to expect an fstream as a class member.
>
> But for the life of me I can't find a way to create an fstream from a
> file with unicode characters in the file name.
>
> I found a few suggestions through Google, like this one:
>
> std::ifstream stm(_wfopen(pwsz,L"rb"));
>
> Where you use _wfopen() to open the file with wchar_t characters in
> the file name, then pass the FILE * returned by that to the ifstream
> constructor. But at least in MSVC++ 6.0, that just doesn't work. The
> constructor for ifstream won't accept a FILE *.
>
> Does anybody know a way to do this?

Zapanaz:

In VC, std::string contains 8-bit (char) characters, and unicode means 16-bit
(wchar_t) characters. So how can your filename contain unicode characters?

In VC9 (and I think VC8) it is possible to open a std::fstream by specifying the
file name as a wide string (const wchar_t*). That is, there are both

std::fstream::open(const char* filename, ...)

and

std::fstream::open(const wchar_t* filename, ...)

This is a (sane) extension to the C++ standard, which for some reason I have
never understood insists on having only const char* filenames. On earlier
versions of VC, you have to convert the wchar_t file name to an 8-bit string
using the current code page.
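
A minimal sketch of the two routes described above, assuming the file name is already available as a std::wstring. The helper name openReadableWide is hypothetical and not part of id3lib, and the NOCREATE flag from the original call is omitted because opening with ios::in alone does not create a file. On VC8 and later the sketch relies on the wide open() overload; elsewhere it falls back to std::wcstombs, which converts using the current C locale. On Windows, calling setlocale(LC_ALL, "") first should make that the system code page, and WideCharToMultiByte with CP_ACP would be the native alternative.

```cpp
#include <cassert>
#include <cstdlib>
#include <fstream>
#include <string>

// Hypothetical helper (not part of id3lib): opens an existing file for
// binary reading when the name is available as a wide string.
bool openReadableWide(const std::wstring& name, std::fstream& file)
{
    assert(!file.is_open());  // expects a stream that is not yet open
#if defined(_MSC_VER) && _MSC_VER >= 1400
    // VC8 and later: Microsoft's extension accepts const wchar_t* directly.
    file.open(name.c_str(), std::ios::in | std::ios::binary);
#else
    // Elsewhere: convert to a narrow string in the current C locale first.
    std::string narrow(name.size() * MB_CUR_MAX + 1, '\0');
    std::size_t n = std::wcstombs(&narrow[0], name.c_str(), narrow.size());
    if (n == static_cast<std::size_t>(-1))
        return false;  // some character has no narrow representation
    narrow.resize(n);
    file.open(narrow.c_str(), std::ios::in | std::ios::binary);
#endif
    return file.is_open();
}
```

When the conversion fails, the name has no representation in the narrow encoding and the function returns false without attempting to open anything.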

--
David Wilkinson
Visual C++ MVP

Giovanni Dicanio

Jan 26, 2008, 4:03:41 AM

"David Wilkinson" <no-r...@effisols.com> ha scritto nel messaggio
news:%23MEd$97XIH...@TK2MSFTNGP04.phx.gbl...

> In VC9 (and I think VC8) it is possible to open a std::fstream by specifying
> the file name as a wide string (const wchar_t*). That is, there are both
>
> std::fstream::open(const char* filename, ...)
>
> and
>
> std::fstream::open(const wchar_t* filename, ...)
>
> This is a (sane) extension to the C++ standard, which for some reason I
> have never understood insists on having only const char* filenames.

Is it possible that the C++ standard people wanted UTF-8 as Unicode
encoding?

Giovanni

Alex Blekhman

Jan 26, 2008, 4:13:48 AM

"Zapanaz" wrote:
> But for the life of me I can't find a way to create an fstream
> from a
> file with unicode characters in the file name.

> [...]

> Does anybody know a way to do this?

One way to cope with it is to upgrade to a newer version of
VC++, as David pointed out. However, if this is not an option, then
you can try to open the file via its short name. Look for the
description of the `GetShortPathName' function in MSDN. There is a
theoretical possibility that the user has disabled short name
generation for certain NTFS volumes, though.
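
A rough sketch of the short-name route, assuming Windows and a wide-character path as input. shortNarrowPath is a hypothetical helper name; GetShortPathNameW and WideCharToMultiByte are the Win32 calls involved. On other platforms the sketch simply returns an empty string.

```cpp
#include <cassert>
#include <string>
#ifdef _WIN32
#include <windows.h>
#include <vector>
#endif

// Hypothetical helper: returns the 8.3 short form of `longPath` converted to
// the ANSI code page, or an empty string if it cannot be obtained.
std::string shortNarrowPath(const std::wstring& longPath)
{
    assert(!longPath.empty());  // an empty path is a caller error
#ifdef _WIN32
    // First call asks for the required buffer size, terminator included.
    DWORD needed = GetShortPathNameW(longPath.c_str(), NULL, 0);
    if (needed == 0)
        return std::string();  // file missing, or the call failed
    std::vector<wchar_t> wide(needed);
    DWORD written = GetShortPathNameW(longPath.c_str(), &wide[0], needed);
    if (written == 0 || written >= needed)
        return std::string();
    // Convert the short path to the ANSI code page for a const char* open().
    int bytes = WideCharToMultiByte(CP_ACP, 0, &wide[0], -1, NULL, 0, NULL, NULL);
    if (bytes <= 0)
        return std::string();
    std::vector<char> narrow(bytes);
    if (WideCharToMultiByte(CP_ACP, 0, &wide[0], -1,
                            &narrow[0], bytes, NULL, NULL) == 0)
        return std::string();
    return std::string(&narrow[0]);
#else
    (void)longPath;  // short names are a Windows concept
    return std::string();
#endif
}
```

A non-empty result can be passed to file.open(result.c_str(), ...). If short-name generation is disabled on the volume, the returned path may still contain the long name, so the open can still fail.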

HTH
Alex

Zapanaz

Jan 26, 2008, 3:54:34 PM

On Sat, 26 Jan 2008 11:13:48 +0200, "Alex Blekhman"
<tkfx....@yahoo.com> wrote:

>One way to cope with it is to upgrade to a newer version of
>VC++, as David pointed out. However, if this is not an option, then
>you can try to open the file via its short name. Look for the
>description of the `GetShortPathName' function in MSDN. There is a
>theoretical possibility that the user has disabled short name
>generation for certain NTFS volumes, though.

On Fri, 25 Jan 2008 21:02:14 -0500, David Wilkinson
<no-r...@effisols.com> wrote:

>This is a (sane) extension to the C++ standard, which for some reason I have
>never understood insists on having only const char* filenames. On earlier
>versions of VC, you have to convert the wchar_t file name to an 8-bit string
>using the current code page.

Thanks David and Alex,

I can try both the code page and short path name versions of the file
name. I'd forgotten the short path name approach; that will probably
cover a large percentage of cases until I can get a later version of
Visual Studio.

I am wondering though, I know this is an impossible question to make a
definitive answer to, but roughly, how likely do you think it would be
that a file on a user's hard drive would not be representable in their
current code page?

These files would be MP3's. I know in my own library of MP3's,
something like 57 out of 8500 have unicode in the file names at all.

It -seems- like ifstream::open() should be able to open any file on the
user's hard drive ... how are the file names actually stored by the
file system, UTF-16 in NTFS? If I assume that some percentage of a
user's MP3's could have foreign characters in the file name ... I
guess the question is, can there be characters which could be in a
file name on a user's hard drive which would not be representable in
that same user's code page?

It seems like the answer would have to be yes. In fact in formulating
the question I think I have answered it myself. If file names are
stored in UTF-16, there could definitely be file names that could not
be represented in the user's code page.
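
The reasoning above can be turned into a check. This is a sketch only: fitsNarrowEncoding is a hypothetical helper name, and the Windows branch assumes a legacy (non-UTF-8) ANSI code page, where WideCharToMultiByte reports through its last parameter whether it had to substitute a default character. Elsewhere it falls back to std::wcstombs in the current C locale.

```cpp
#include <cstdlib>
#include <string>
#ifdef _WIN32
#include <windows.h>
#include <vector>
#endif

// Hypothetical helper: true if every character of `name` survives conversion
// to the narrow encoding that a const char* file name would use.
bool fitsNarrowEncoding(const std::wstring& name)
{
#ifdef _WIN32
    // Legacy ANSI code pages need at most two bytes per UTF-16 unit.
    std::vector<char> buf(name.size() * 2 + 1);
    BOOL usedDefault = FALSE;
    int bytes = WideCharToMultiByte(CP_ACP, WC_NO_BEST_FIT_CHARS,
                                    name.c_str(), -1,
                                    &buf[0], static_cast<int>(buf.size()),
                                    NULL, &usedDefault);
    // usedDefault is set when a character had to be replaced, e.g. by '?'.
    return bytes > 0 && !usedDefault;
#else
    // wcstombs returns (size_t)-1 when a character cannot be converted.
    std::string buf(name.size() * MB_CUR_MAX + 1, '\0');
    return std::wcstombs(&buf[0], name.c_str(), buf.size())
           != static_cast<std::size_t>(-1);
#endif
}
```

A name for which this returns false cannot be handed to a const char* open() without losing characters, whatever conversion is used.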

--
Joe Cosby
http://joecosby.com/

When you're running down the street on fire, people get out of your way.
- Richard Pryor

:: Currently listening to Ghost Riders on the Storm, 2004, by California Guitar Trio, from "Whitewater"

Alex Blekhman

Jan 26, 2008, 5:50:22 PM

"Zapanaz" wrote:
> I am wondering though, I know this is an impossible question to
> make a
> definitive answer to, but roughly, how likely do you think it
> would be
> that a file on a user's hard drive would not be representable in
> their
> current code page?

I think that nowadays it's pretty likely. With Windows XP/2K
everywhere, the code page is almost irrelevant. For example, I
never change my machine's regional settings. It's always US
English for both user and system locale. However, I have a lot of
files in "My Documents" that have non-English names.

With a system like 2K/XP, which supports Unicode natively, people
just don't care about "system friendly" names anymore. They name
files and folders just as they name physical files in their native
language. And there are more non-English users out there in the
world than English speakers/computer geeks.

Moreover, even among English speakers there are enough pedants
who write "coöperative" or "fiancé". The castrated charset that
we have employed for the last couple of decades in computers (and
before then in telegraphs) is finally fading away. People again
write their texts as they always used to, with funny
characters and in many languages.

> how are the file names actually stored by the
> file system, UTF-16 in NTFS?

Yes, NTFS stores a filename as a sequence of 16-bit values.
However, no validation of actual UTF-16 conformance is performed.
A filename may contain any value (except the reserved ones).
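
To illustrate what "no validation" means in practice, here is a small sketch that checks whether a sequence of 16-bit values is well-formed UTF-16, that is, whether every surrogate is part of a proper high/low pair. The function name isWellFormedUtf16 is hypothetical, and std::u16string assumes a C++11 compiler; a name read from NTFS is not guaranteed to pass this check.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if `name` contains no unpaired surrogates.
bool isWellFormedUtf16(const std::u16string& name)
{
    for (std::size_t i = 0; i < name.size(); ++i) {
        const char16_t u = name[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            // High surrogate: must be followed by a low surrogate.
            if (i + 1 >= name.size())
                return false;
            const char16_t next = name[i + 1];
            if (next < 0xDC00 || next > 0xDFFF)
                return false;
            ++i;  // skip the low half of the pair
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            return false;  // low surrogate with no preceding high surrogate
        }
    }
    return true;
}
```

Code that converts such names to another encoding may want to run a check like this first, since a strict converter can reject an unpaired surrogate.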

> If file names are
> stored in UTF-16, there could definitely be file names that
> could not
> be represented in the user's code page.

You're correct.

Alex