
Dealing with files with non-ASCII characters in the name


Arjen Markus

Sep 30, 2019, 7:57:11 AM
We are more and more often confronted with users who put their files in directories whose names contain characters beyond the basic ASCII table, and the names of these files also use accented characters or even Chinese and other characters.

I have been experimenting with such names - Français.txt or Smørrebrød.txt to give two very simple examples - and have had very mixed results. Mixed in the sense that reading these names from a file and then opening files using those names sometimes works, sometimes not.

! create_names.f90 --
!     Open files with non-ASCII names read from a UTF-8 encoded list
!
program create_names
    implicit none

    integer, parameter :: ucs4 = selected_char_kind( 'iso_10646' )
    !character(len=40, kind=ucs4) :: name
    character(len=40) :: name

    open( 10, file = 'names.txt', encoding = 'utf-8' )

    ! First name - skip the first three characters (presumably a byte order mark)
    read( 10, '(3x,a)' ) name
    write( *, * ) '>>', trim(name), '<<'
    open( 20, file = name )
    write( 20, * ) 'Success!'
    close( 20 )

    ! Second name
    read( 10, '(a)' ) name
    write( *, * ) '>>', trim(name), '<<'
    open( 20, file = name )
    write( 20, * ) 'Again success!'
    close( 20 )
end program create_names

The file read by the above program is UTF-8 encoded and - viewed the right way (not via the Windows console) - I get the right names in the program's output, but not the right file names on disk, at least not on Windows. I tried it with both Intel Fortran and gfortran. The resulting file names either contain the individual characters corresponding to the bytes of the non-ASCII characters or an unidentifiable centred dot.

On Linux I got the expected file names using Intel Fortran, but not using gfortran (though that is a very old version, unfortunately).

I have tried using the names directly in the program code (to avoid confusion over UTF-8) and to some extent that worked - but only on Windows.

And I have not even tried Chinese characters yet.

So, what is the recommended way of using non-ASCII characters in file names? (Mind you, we cannot continue to tell our customers that they should stick to the plain ASCII table.)

Regards,

Arjen


ga...@u.washington.edu

Sep 30, 2019, 8:25:50 AM
On Monday, September 30, 2019 at 4:57:11 AM UTC-7, Arjen Markus wrote:
> We are more and more often confronted with users who put their
> files in directories with names whose characters go beyond the
> basic ASCII table and the names of these files also use accented
> characters or even Chinese and other characters.

As far as I know, Fortran removes trailing blanks from file
names given to OPEN, but C doesn't. Assuming a file system that
allows trailing blanks, this means that there are some files
that C can write, but Fortran can't read.

Other than that, I don't see that Fortran should modify the
supplied file name before passing it to the OS. The OS might
itself have restrictions on file naming.

Arjen Markus

Sep 30, 2019, 8:41:21 AM
The restrictions posed by the OS are another factor here, indeed. Perhaps what I am seeing is the consequence of those restrictions and the limitations of the Fortran compilers to properly deal with non-ASCII characters.

Regards,

Arjen

Gary Scott

Sep 30, 2019, 10:06:23 PM
All compilers I've ever used support any binary value without
alteration. The exception is the unfortunate processing of C-style
escapes as default behavior in some cases. Otherwise, I would think
this is an OS interaction (e.g. an intermediate OS API getting in the way).

>
> Regards,
>
> Arjen
>

FortranFan

Sep 30, 2019, 11:07:43 PM
On Monday, September 30, 2019 at 7:57:11 AM UTC-4, Arjen Markus wrote:

> ..
> So, what is the recommended way of using non-ASCII characters in file names? (Mind you, we cannot continue to tell our customers that they should stick to the plain ASCII table.)
> ..


Have you tried a C wrapper that consumes a suitable class for file I/O (C++ ifstream?), or one that calls API(s) from the OS vendor (e.g., Microsoft/Linux), and then used that from your Fortran code?
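
For example, a rough sketch of the Fortran side of such a wrapper might look like the module below (c_open_utf8 is a hypothetical C helper that would convert the UTF-8 name and call the native OS routine, e.g. CreateFileW/_wfopen on Windows or plain open/fopen on Linux; the C side is not shown):

------------------------------------------------------------------
! Sketch only: binding to a hypothetical C helper that opens a file
! whose name is passed as a NUL-terminated, UTF-8 encoded C string
! and returns an OS file descriptor/handle (negative on failure).
module utf8_open
    use, intrinsic :: iso_c_binding
    implicit none

    interface
        function c_open_utf8( name ) result( fd ) bind(C, name='c_open_utf8')
            import :: c_char, c_int
            character(kind=c_char), intent(in) :: name(*)
            integer(c_int)                     :: fd
        end function c_open_utf8
    end interface

contains

    function open_utf8( name ) result( fd )
        character(len=*), intent(in) :: name
        integer                      :: fd

        ! Append the C string terminator and hand the name to the C side
        fd = c_open_utf8( trim(name) // c_null_char )
    end function open_utf8

end module utf8_open
------------------------------------------------------------------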

Arjen Markus

Oct 1, 2019, 2:30:22 AM
Yes, quite likely. The whole problem is convoluted, because what you get to see in the console/command window is different from what you see if you redirect the output and view the result with a file viewer/text editor.

Regards,

Arjen

Arjen Markus

Oct 1, 2019, 2:31:20 AM
Well, that might be an alternative - but it would mean that all I/O has to go via that wrapper and that is not a trivial matter.

Regards,

Arjen

FortranFan

Oct 1, 2019, 3:11:45 PM
On Tuesday, October 1, 2019 at 2:31:20 AM UTC-4, Arjen Markus wrote:

> ..
> Well, that might be an alternative - but it would mean that all I/O has to go via that wrapper and that is not a trivial matter.
> ..

It's not trivial only if the design is not simple!! You can - if you let yourself be so inclined - follow any number of simple options:

* say use C++ or other APIs to suitably *mirror* an ASCII-named equivalent of customer files and operate on these from the Fortran code and *make it appear* (e.g., copy-in/copy-out) to users as if their locale-based (Mandarin, Japanese, Arabic, European, etc.) files and folders are handled directly, or

* use C++ or other APIs to work with file (and data) 'streams' in memory and operate on them as 'internal files' in Fortran, etc.

Eugene Epshteyn

Oct 2, 2019, 5:56:18 PM
On Monday, September 30, 2019 at 7:57:11 AM UTC-4, Arjen Markus wrote:

> ...

The Intel compiler supports the USEROPEN specifier, which allows one to pass a special function (e.g., written in C) that does the actual open and returns a file handle or file descriptor:

https://software.intel.com/en-us/fortran-compiler-developer-guide-and-reference-open-useropen-specifier

This way, you could open the files using OS native function, which would presumably be friendly to UTF-8 encoded names.
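
A minimal sketch of what the Fortran side might look like (uopen_utf8 is a made-up name for the user routine, which would typically be written in C; its exact argument list and the flags it receives are spelled out in the documentation linked above):

------------------------------------------------------------------
! Sketch only: OPEN with Intel Fortran's USEROPEN specifier.
! The routine uopen_utf8 (hypothetical, normally written in C) is
! expected to perform the actual open call - e.g. via a native OS
! routine after converting the name - and return the file descriptor.
program useropen_sketch
    implicit none
    integer, external :: uopen_utf8
    integer           :: ios

    open( 20, file = 'Smørrebrød.txt', useropen = uopen_utf8, iostat = ios )
    if ( ios /= 0 ) then
        write( *, * ) 'open via USEROPEN failed, iostat = ', ios
    else
        write( 20, * ) 'Success!'
        close( 20 )
    end if
end program useropen_sketch
------------------------------------------------------------------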

--Eugene

Lynn McGuire

Oct 2, 2019, 8:37:55 PM
My favorite non-ASCII example is a UTF-8 path and UTF-8 file name such
as "π\pi π.psd".

Lynn

ga...@u.washington.edu

Oct 2, 2019, 9:16:15 PM
On Wednesday, October 2, 2019 at 2:56:18 PM UTC-7, Eugene Epshteyn wrote:

(snip)

> This way, you could open the files using OS native function,
> which would presumably be friendly to UTF-8 encoded names.

I haven't thought about this for many years.

HP-UX Fortran has some non-standard functions to help here.

There is FNUM(unit) which gets the OS file descriptor that
goes with an open unit.

FSET(UNIT, NEWFD, OLDFD) attaches a system file descriptor to
a logical unit.

And FSTREAM(unit) gets the C file stream pointer.

In f77 days, I used these along with popen() in a C program
to allow writing from Fortran into a pipe. (Specifically,
for input to lpr for print spooled output.)

Many Fortran systems now use the underlying C library to do the
actual I/O operations, such that these functions can work.

Earlier (and not HP-UX) I did mixed C and Fortran programming,
such that each had its own I/O buffers, resulting in a strange
mix of data in the stdout file.

ga...@u.washington.edu

Oct 2, 2019, 9:20:26 PM
On Wednesday, October 2, 2019 at 5:37:55 PM UTC-7, Lynn McGuire wrote:

(snip)

> My favorite non-ASCII example is a UTF-8 path and UTF-8 file name
> such as "π\pi π.psd".

My favorite, and very easy to do in Unix, is file names with
a backspace character in them. Some Unix users set their erase
character to ^?, that is, the ASCII DEL character. If that is used
with a terminal that sends the BS (backspace) character, the BS
ends up in file names. I think you can also get CR into file names.

Ev. Drikos

Oct 3, 2019, 6:53:14 AM
On 30/09/2019 2:57 PM, Arjen Markus wrote:
> ...
>
> So, what is the recommended way of using non-ASCII characters in file names? ...
>

Likely not the recommended way, but I once needed to read from a file
the payload of RPM packages with spaces in their paths. My choice was
to use the star character as a stand-in for the space, which worked in
both my Bash script and the RPM utilities.

In any case, you need to find a way to represent well-defined paths
with spaces in your file.

The program below worked on macOS with gcc-4.8 and on Windows 8.1 with
a binary built by gfortran in Cygwin.

Regards

------------------------------------------------------------------
! create_names.f90 --
!     Open files with a non-ASCII name; a '*' in the list of names
!     stands for a space in the actual file name
!
program create_names
    implicit none

    integer, parameter :: ucs4 = selected_char_kind( 'iso_10646' )
    !character(len=40, kind=ucs4) :: name
    character(len=40) :: name, iname
    integer           :: i

    open( 10, file = 'names.txt' )

    read( 10, * ) name

    ! Replace the stars by spaces to get the real file name
    do i = 1,40
        if ( name(i:i) == '*' ) then
            iname(i:i) = ' '
        else
            iname(i:i) = name(i:i)
        end if
    end do

    write( *, * ) '>>', iname, '<<'
    open( 20, file = iname )
    write( 20, * ) 'Success!'
    close( 20 )

    close( 10 )
end program create_names
------------------------------------------------------------------
# To run it in a Windows console, copy the required Cygwin libraries next to the executable:
for t in `/usr/bin/ldd a.exe|grep -o "/usr/bin/cyg.*.dll"` ; \
do cp $t . ; done ;

------------------------------------------------------------------
$ cat names.txt
pi*π.psd
pi*π.psd
------------------------------------------------------------------
$ cat pi\ π.psd
Success!

Arjen Markus

Oct 3, 2019, 8:45:59 AM
On Thursday, October 3, 2019 at 12:53:14 PM UTC+2, Ev. Drikos wrote:

Thanks for all the suggestions, but I think the problem is actually a combination of three unrelated problems that conspire.

First of all, how to fill a string with the appropriate bytes to represent the right non-ASCII characters (see the sketch below). In my naïveté I used two methods: a UTF-8 encoded external file and accented characters directly in the source code.

Second, the character set used by the operating system. IIUIC, Windows uses an extended ASCII set (at least the version I have) and Linux uses UTF-8. This quite probably carries over to the source code as well.

Third, displaying these characters is a challenge in itself. A text editor or viewer may use UTF-8 or some extended character set, and the Windows command window certainly uses a different character set than my editor/viewer.
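
A minimal sketch of the first step - filling a string from a UTF-8 encoded file - assuming a compiler that supports the ISO 10646 character kind (gfortran does; with other compilers selected_char_kind may return -1 and the declaration will not compile):

------------------------------------------------------------------
! Sketch only: read a UTF-8 encoded name into a UCS-4 string and echo
! it back in UTF-8. Whether such a string is accepted in FILE= of an
! OPEN statement is another matter and is not attempted here.
program read_ucs4_name
    use, intrinsic :: iso_fortran_env, only : output_unit
    implicit none

    integer, parameter :: ucs4 = selected_char_kind( 'ISO_10646' )
    character(len=40, kind=ucs4) :: name

    open( 10, file = 'names.txt', encoding = 'utf-8' )
    read( 10, '(a)' ) name

    ! Re-open standard output with UTF-8 encoding so the name survives
    open( output_unit, encoding = 'utf-8' )
    write( output_unit, '(a)' ) trim( name )
end program read_ucs4_name
------------------------------------------------------------------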

Well, I have never used USEROPEN; this might be a good opportunity to see how it works and whether it solves the immediate problem (no. 2, I'd say).

For now, my next step is to analyse more closely what is actually happening, and especially to get rid of the "trompe-l'œil" effects.

Regards,

Arjen

Ev. Drikos

Oct 3, 2019, 11:27:32 AM
On 03/10/2019 3:45 PM, Arjen Markus wrote:
> ...
> Second, the character set used by the operating system. IIUIC, Windows uses an extended ASCII set (at least the version I have) and Linux uses UTF-8. This transpires quite probably to the source code.
>

In my Windows 8.1 (cp437), non-ASCII file names are apparently saved in
UTF-8. I just repeated the test with Korean characters in file names, but
they are displayed properly only in Windows Explorer and Cygwin. Of course
this is a conclusion that needs some kind of confirmation at MSDN.

> Third, displaying these characters is challenging in itself. A text editor or viewer may use UTF-8 or it may use some extended character set and the Windows command window certainly uses a different character set than my editor/viewer.
>
> ...

A general-purpose solution that converts, e.g., UTF-8 to single-byte code
pages would need several conversion tables, which I guess are not available
in several Fortran compilers. So this would be a large project on its own.
For example, the Oracle JDK 8 provides 169 code pages and perhaps there
are more.




--------------------------------------------------------
$ cat names.txt
你好.created.txt
--------------------------------------------------------
$ cat 你好.created.txt
Success!
--------------------------------------------------------

Eugene Epshteyn

Oct 3, 2019, 12:01:12 PM
On Thursday, October 3, 2019 at 8:45:59 AM UTC-4, Arjen Markus wrote:

> Second, the character set used by the operating system. IIUIC, Windows uses an extended ASCII set (at least the version I have) and Linux uses UTF-8. This transpires quite probably to the source code.

Windows "under the hood" has supported some form of Unicode for ages. (2 bytes per Unicode char.) It's just default development settings for new apps have been to use some form of "multibyte character string", which included ASCII, extended ASCII, but could sort of translate to other multibyte encodings, if you are careful. This also resulted in a horrible situation, where many of the Windows API functions have "ASCII" version and "wide string" version (e.g., SetWindowTextA, SetWindowTextW), hiding behind a macro (SetWindowText), see here: https://docs.microsoft.com/en-us/windows/win32/intl/conventions-for-function-prototypes

You can read some of the gory details here: https://docs.microsoft.com/en-us/windows/win32/intl/unicode-in-the-windows-api

It should be possible to convert from UTF-8 to the local code page using MultiByteToWideChar() (pick the UTF-8 "code page" to convert UTF-8 to a "wide string") and then WideCharToMultiByte() (pick the "ANSI" code page).
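
For illustration, a rough sketch of what those two calls might look like when bound directly from Fortran via ISO_C_BINDING (this assumes a 64-bit Windows target, where the Win32 API uses the default C calling convention; buffer sizes and error handling are deliberately simplistic):

------------------------------------------------------------------
! Sketch only: convert a UTF-8 string to the active ANSI code page
! via the two Win32 calls mentioned above.
module codepage_convert
    use, intrinsic :: iso_c_binding
    implicit none

    integer(c_int), parameter :: CP_ACP  = 0       ! active ANSI code page
    integer(c_int), parameter :: CP_UTF8 = 65001   ! UTF-8 "code page"

    interface
        function MultiByteToWideChar( cp, flags, mb, cbmb, wc, cchwc ) &
                bind(C, name='MultiByteToWideChar') result( n )
            import :: c_int, c_char, c_int16_t
            integer(c_int), value                 :: cp, flags, cbmb, cchwc
            character(kind=c_char), intent(in)    :: mb(*)
            integer(c_int16_t), intent(inout)     :: wc(*)
            integer(c_int)                        :: n
        end function MultiByteToWideChar

        function WideCharToMultiByte( cp, flags, wc, cchwc, mb, cbmb, defchar, useddef ) &
                bind(C, name='WideCharToMultiByte') result( n )
            import :: c_int, c_char, c_int16_t, c_ptr
            integer(c_int), value                 :: cp, flags, cchwc, cbmb
            integer(c_int16_t), intent(in)        :: wc(*)
            character(kind=c_char), intent(inout) :: mb(*)
            type(c_ptr), value                    :: defchar, useddef
            integer(c_int)                        :: n
        end function WideCharToMultiByte
    end interface

contains

    function utf8_to_ansi( utf8 ) result( ansi )
        character(len=*), intent(in)  :: utf8
        character(len=:), allocatable :: ansi
        integer(c_int16_t)            :: wide(1024)
        character(kind=c_char)        :: buf(1024)
        integer(c_int)                :: nw, nb
        integer                       :: i

        ! UTF-8 -> UTF-16 ("wide") -> ANSI code page; -1 means the input
        ! is NUL-terminated and the terminator is included in the result
        nw = MultiByteToWideChar( CP_UTF8, 0_c_int, utf8 // c_null_char, &
                                  -1_c_int, wide, int(size(wide), c_int) )
        if ( nw <= 0 ) then
            ansi = ''
            return
        end if
        nb = WideCharToMultiByte( CP_ACP, 0_c_int, wide, -1_c_int, &
                                  buf, int(size(buf), c_int), c_null_ptr, c_null_ptr )

        allocate( character(len=max(int(nb)-1,0)) :: ansi )   ! strip the NUL
        do i = 1, len(ansi)
            ansi(i:i) = buf(i)
        end do
    end function utf8_to_ansi

end module codepage_convert
------------------------------------------------------------------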

> Third, displaying these characters is challenging in itself. A text editor or viewer may use UTF-8 or it may use some extended character set and the Windows command window certainly uses a different character set than my editor/viewer.

In many cases, it helps to differentiate the encoding of text in a file, vs. the encoding used by an application to store such text in memory, vs. the capabilities of the font used to display text. For example, a file may have text in UTF-8, the application may read this file and convert it to some form of UCS-2 or UTF-16 (2 bytes per char, with sometimes 2 chars per Unicode code point). The application may then pick a display font that has glyphs for the Unicode chars that need to be displayed. On Windows, there's "Character Map" application (may be a different name in earlier versions of Windows), which shows all glyphs supported by a particular font. Naturally, most fonts cover only a small subset of Unicode.

Yes, this can be very confusing, so it helps to know exact encodings of text in all stages (in file, read in memory, written to a different file) and all the required conversions that need to be done on the text in all operating environments.

Hope that was at least a bit helpful.

--Eugene

Ev. Drikos

Oct 7, 2019, 6:10:59 AM
On 03/10/2019 6:27 PM, Ev. Drikos wrote:
>> ...
> In my Windows 8.1 (cp437), non ASCII file names are apparently saved in
> utf-8.  Just repeated the test with Korean characters in file names but
> it's properly displayed only in Windows Explorer and Cygwin. Of course
> this is a conclusion that needs some kind of confirmation at MSDN.
>
>> ...

Just for the record,

This page explains that Cygwin converts UTF-8 filenames to/from UTF-16:
https://cygwin.com/cygwin-ug-net/using-specialnames.html

When I ran the first test on a USB floppy, I confirmed that Greek names
are displayed properly on macOS, which labels the USB as a FAT32 disk.
They are also displayed fine on CentOS 7.6, which reports VFAT, which in
turn uses UCS-2, not UTF-16 (see https://en.wikipedia.org/wiki/Filename).

It seems that any conversions of the file names are transparent to the user.
