
reading filenames from stdin - with umlauts?


Dan Stromberg

Jul 27, 2008, 6:54:46 PM

I wrote a small java program to read filenames from stdin (produced by
Linux' "find"), and then to divide those files up into like groups.

Actually, it was originally a python program, but I've been wanting to
expand my horizons a little, so I rewrote it in perl, and now I'm trying
to redo it in java to celebrate java going opensource, and I'll likely
rewrite it in Haskell and/or Objective Caml after the java version.

The java version of the program seems to work pretty well, and I have a
feeling it's going to prove faster than the python or perl versions
(which are at http://stromberg.dnsalias.org/~strombrg/equivalence-classes.html
- and I hope to put the java version there too after it's working a
little better).

However, to my disappointment, the java version of the program can't seem
to deal with filenames that have umlauts in them. Filenames using only
characters in the English alphabet seem fine.

I suspect the problem is that the file_name_, as it appears in a Linux
ext3 filesystem, has an 8 bit per character representation, but java
wants to convert the string I read from stdin to a 16 bit per character
representation, and then doesn't reverse the conversion when I go to open
the file by its name.

I've googled about this for around 4 hours now, and found little but
other people having similar issues - sometimes with files, sometimes with
files inside zip archives.

The error looks like:

find /home/dstromberg/Sound/Music/mp3/Bjork -type f -print | LANG=en_US java -jar equivs.jar equivs.main
Encoding on isr is ISO8859_1
IO error 1: java.io.FileNotFoundException: /home/dstromberg/Sound/Music/mp3/Bjork/Bj?rk_The Music From Drawing Restraint 9_06_Shimenawa.mp3 (No such file or directory)
java.io.FileNotFoundException: /home/dstromberg/Sound/Music/mp3/Bjork/Bj?rk_The Music From Drawing Restraint 9_06_Shimenawa.mp3 (No such file or directory)
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.<init>(FileInputStream.java:106)
        at Sortable_file.get_prefix(Sortable_file.java:63)
        at Sortable_file.compareTo(Sortable_file.java:266)
        at Sortable_file.compareTo(Sortable_file.java:1)
        at java.util.Arrays.mergeSort(Arrays.java:1144)
        at java.util.Arrays.mergeSort(Arrays.java:1155)
        at java.util.Arrays.sort(Arrays.java:1079)
        at equivs.main(equivs.java:54)

The code I'm reading filenames with looks like:

InputStreamReader isr = null;
try
{
    isr = (new InputStreamReader(System.in, "ISO-8859-1"));
}
catch (UnsupportedEncodingException uee)
{
    System.err.println("UnsupportedEncodingException: " + uee);
    uee.printStackTrace();
    java.lang.System.exit(1);
}
System.err.println("Encoding on isr is " + isr.getEncoding());
BufferedReader stdin = new BufferedReader (isr);
String line;

try
{
    while((line = stdin.readLine()) != null)
    {
        // System.out.println(line);
        // System.out.flush();
        lst.add(new Sortable_file(line));
    }
}
catch(java.io.IOException e)
{
    System.err.println("IO error 0.5: " + e);
    e.printStackTrace();
    java.lang.System.exit(1);
}

...and the code I'm opening the filenames with looks like:

byte[] buffer = new byte[128];
java.io.File this_file;
try
{
    this_file = new java.io.File(this.filename);
    java.io.FileInputStream file = new java.io.FileInputStream(this_file);
    file.read(buffer);
    // System.out.println("this.prefix.length " + this.prefix.length);
    file.close();
}
catch (java.io.IOException ioe)
{
    System.out.println( "IO error 1: " + ioe );
    ioe.printStackTrace();
    java.lang.System.exit(1);
}

(this is just one small part of the compareTo function - the goal was to
make things fast, and one of the optimizations is to compare just the
first 128 bytes of a file early in the comparison, and keep it cached in
memory to make the sort fast. Only if two files have the same prefix do
we do the expensive md5 hash - etc.).
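The prefix-first comparison described above could be sketched roughly as follows. This is an illustration, not the code from the program: the class and method names (PrefixCompare, readPrefix, comparePrefixes) are made up here, and the read loop is an assumption added because a single read() call is allowed to return fewer bytes than requested.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class PrefixCompare {
    // Hypothetical helper: read up to `limit` bytes, looping because one
    // read() call may deliver fewer bytes than asked for.
    public static byte[] readPrefix(InputStream in, int limit) throws IOException {
        byte[] buffer = new byte[limit];
        int filled = 0;
        while (filled < limit) {
            int n = in.read(buffer, filled, limit - filled);
            if (n == -1) {
                break; // end of stream before the prefix was full
            }
            filled += n;
        }
        return Arrays.copyOf(buffer, filled);
    }

    // Unsigned lexicographic comparison; on a tie the shorter prefix sorts first.
    public static int comparePrefixes(byte[] a, byte[] b) {
        int common = Math.min(a.length, b.length);
        for (int i = 0; i < common; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) throws IOException {
        // In-memory streams stand in for files so the sketch runs anywhere.
        byte[] p1 = readPrefix(new ByteArrayInputStream(new byte[] {1, 2, 3}), 128);
        byte[] p2 = readPrefix(new ByteArrayInputStream(new byte[] {1, 2, 4}), 128);
        System.out.println(comparePrefixes(p1, p2) < 0);
    }
}
```

Only when comparePrefixes returns 0 would the more expensive hash comparison be needed.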

Has anyone found a way to do:

find <options> -print | ./java-prog

...and have java-prog act on the files coming from stdin - including
opening them?

Thanks!

PS: I suspect I could write a class to read bytes and piece together
strings, but 1) That'd probably be slow and 2) I want to use the
established java class hierarchy where possible and 3) the byte arrays
still might get upconverted to a different encoding upon converting them
to a string anyway. But if that's the only way, that's fine.
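The byte-reading approach from the PS might look something like the sketch below. It assumes ISO-8859-1 is used purely as a lossless container (byte value N becomes char value N), and the class and method names (RawLines, splitLines, readAll) are invented for illustration. It covers only the reading half: opening a file through java.io.File still converts the String back to bytes using the JVM's platform filename encoding, so point 3 of the PS remains a real concern.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class RawLines {
    // Split raw bytes on '\n' and decode each piece as ISO-8859-1, which maps
    // every byte value to the char with the same value, so nothing is lost.
    public static List<String> splitLines(byte[] data) {
        List<String> lines = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\n') {
                lines.add(new String(data, start, i - start, StandardCharsets.ISO_8859_1));
                start = i + 1;
            }
        }
        if (start < data.length) { // last name without a trailing newline
            lines.add(new String(data, start, data.length - start, StandardCharsets.ISO_8859_1));
        }
        return lines;
    }

    // Read the whole stream into memory without any character decoding.
    public static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        for (String name : splitLines(readAll(System.in))) {
            // Encoding with the same charset restores the original bytes.
            byte[] back = name.getBytes(StandardCharsets.ISO_8859_1);
            System.out.println(name.length() + " chars, " + back.length + " bytes");
        }
    }
}
```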


Arne Vajhøj

Jul 27, 2008, 7:27:29 PM

> Has anyone found a way to do:
>
> find <options> -print | ./java-prog
>
> ...and have java-prog act on the files coming from stdin - including
> opening them?

Have you tried "UTF-8" instead of "ISO-8859-1" ?

Arne


Dan Stromberg

Jul 28, 2008, 1:05:23 AM

I had tried a handful of encodings but not UTF-8. I've now tried it, and
found that I got the same result as with other encodings - file not found.

Dan Stromberg

Jul 28, 2008, 1:04:41 AM
On Sun, 27 Jul 2008 23:25:01 +0000, Stefan Ram wrote:

> Dan Stromberg <dstromb...@gmail.com> writes:
>>The error looks like:
>
> We need to isolate the problem (as in »SSCCE«).
>
> Try this:
>
> echo "\0344" | java Main
>
> With
>
> public class Main
> { public static void main( final java.lang.String[] args )
>   throws java.lang.Throwable
>   { final java.io.InputStreamReader inputStreamReader
>     = new java.io.InputStreamReader( System.in, "ISO8859_1" );
>     final java.io.BufferedReader bufferedReader
>     = new java.io.BufferedReader( inputStreamReader );
>     final java.lang.String string = bufferedReader.readLine();
>     java.lang.System.out.println(
>       "\u00E4".equals( string.substring( 0, 1 ))); }}
>
> If it prints »false«, post the output of
>
> echo "\0344" | od -h

It printed false, but this prints true:

printf '\344' | ./foo

(This was with the gcj implementation of java. I get the same result
using OpenJDK though).

> and also the hexadecimal codes of the String »string« at the end of
> the block above.
>
> Additional information:
>
> 344 is the octal code of the letter LATIN SMALL LETTER A WITH
> DIAERESIS in ISO 8859-1.
>
> "\u00E4" is a Java String containing only the letter LATIN SMALL
> LETTER A WITH DIAERESIS.
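The request for "the hexadecimal codes of the String" could be met with a small dump utility along these lines. This is a sketch only; the class name HexDump and the method hexCodes are made up for illustration, and stdin is decoded as ISO-8859-1 to match the test program quoted above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class HexDump {
    // Render each 16-bit char of the string as four hex digits.
    public static String hexCodes(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.ISO_8859_1));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(hexCodes(line));
        }
    }
}
```

Run as `printf '\344\n' | java HexDump`, this should print 00E4. If `echo "\0344"` yields several codes instead, the shell's echo probably did not interpret the escape, which would explain why the echo and printf variants behaved differently.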

Dan Stromberg

Jul 28, 2008, 1:12:21 AM
On Mon, 28 Jul 2008 00:32:23 +0000, Stefan Ram wrote:

> Dan Stromberg <dstromb...@gmail.com> writes:
>>isr = (new InputStreamReader(System.in, "ISO-8859-1")
>

> Now, I become aware of another fact:
> »java.lang.System.in« already has an encoding.
>
> You might try to use this as a base instead:
>
> http://download.java.net/jdk7/docs/api/java/io/FileDescriptor.html#in

I tried this but still I get file not found with OpenJDK. gcj seems fine
though:

FileReader fr = null;
// isr = (new InputStreamReader(System.in, "ISO-8859-1"));
// isr = (new InputStreamReader(System.in, "UTF-8"));
fr = (new FileReader(java.io.FileDescriptor.in));
System.err.println("Encoding on fr is " + fr.getEncoding());
//BufferedReader stdin = new BufferedReader (fr);
StringBuffer line;

char ch;
int int_char;
try
{
    while (true)
    {
        line = new StringBuffer("");
        while(true)
        {
            int_char = fr.read();
            if (int_char == -1)
            {
                break;
            }
            ch = (char)int_char;
            System.out.println("" + ch);
            if (ch == (char)10)
            {
                break;
            }
            line.append(ch);
        }
        if (int_char == -1)
        {
            break;
        }
        System.out.println(new String(line));
        lst.add(new Sortable_file(new String(line)));
    }
}
catch(java.io.IOException e)
{

BTW, this code says the encoding is ASCII when I run it, whether using
OpenJDK or gcj.

Is the java String type -always- 16 bits per character? That is, if I
try to stick an 8 bit value into a String, is it always going to be
converted to a different encoding that maps back most of the time, but
not always?

Do java strings of any sort have an associated but variable encoding?
Are there different string types that have different encodings?

Is there any way of opening a filename that isn't stored in a String?
Short of something like SWIG, JNI or ctypes that is?


Daniele Futtorovic

Jul 28, 2008, 10:38:47 AM
On 28/07/2008 07:05, Dan Stromberg allegedly wrote:
> I had tried a handful of encodings but not UTF-8. I've now tried it, and
> found that I got the same result as with other encodings - file not found.

Have you tried not using any "encoding"? As others pointed out,
System.in is a Reader, that is something which already has some kind of
byte-to-char handling. Furthermore, if your solution ought to be
portable, it would seem to me as a bad idea to hardcode the charset. You
should rather rely on proper system configuration (java's file.encoding
being the same as the shell's) -- or maybe a runtime parameter.
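The "runtime parameter" idea could be sketched like this, assuming the charset name is passed as the first command-line argument and the JVM default is the fallback. The class name StdinCharset and the method choose are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class StdinCharset {
    // Hypothetical policy: take the charset name from the first argument if
    // one is given, otherwise use the JVM's default charset.
    public static Charset choose(String[] args) {
        if (args.length > 0) {
            return Charset.forName(args[0]); // throws if the name is not supported
        }
        return Charset.defaultCharset();
    }

    public static void main(String[] args) throws IOException {
        Charset cs = choose(args);
        System.err.println("Decoding stdin as " + cs.name());
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in, cs));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
    }
}
```

For example, `find . -print | java StdinCharset ISO-8859-1` would decode the names as Latin-1 regardless of the JVM default.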

--
DF.


John W Kennedy

Jul 28, 2008, 8:17:19 PM
Dan Stromberg wrote:
> However, to my disappointment, the java version of the program can't seem
> to deal with filenames that have umlauts in them. Filenames using only
> characters in the English alphabet seem fine.
>
> I suspect the problem is that the file_name_, as it appears in a Linux
> ext3 filesystem, has an 8 bit per character representation, but java
> wants to convert the string I read from stdin to a 16 bit per character
> representation, and then doesn't reverse the conversion when I go to open
> the file by its name.

No. Java /always/ uses 16-bit characters; if it did that, it couldn't
open files at all.

Try running this program:

import java.io.File;

public final class DirScan {

    public static void main(final String[] args) {
        for (final String dirName : args) {
            System.out.println(dirName);
            final File dir = new File(dirName);
            final File[] files = dir.listFiles();
            for (final File file : files) {
                final String fileName = file.toString();
                System.out.printf(" %-25s ", fileName);
                for (int i = 0; i < fileName.length(); ++i)
                    System.out.printf(" %04X", (int) fileName.charAt(i));
                System.out.println();
            }
        }

    }

}

...specifying one or more directories as arguments.


--
John W. Kennedy
"Never try to take over the international economy based on a radical
feminist agenda if you're not sure your leader isn't a transvestite."
-- David Misch: "She-Spies", "While You Were Out"

Lew

Jul 28, 2008, 8:25:13 PM
Daniele Futtorovic wrote:
> Have you tried not using any "encoding"? As others pointed out,
> System.in is a Reader, that is something which already has some kind of
> byte-to-char handling.

Ahem:
> public static final InputStream in
<http://java.sun.com/javase/6/docs/api/java/lang/System.html#in>

--
Lew

Daniele Futtorovic

Jul 28, 2008, 9:11:19 PM

<scratches head, walks to the nearest wall, bangs>

--
DF.


Daniele Futtorovic

Jul 29, 2008, 12:31:49 AM
On 29/07/2008 03:41, Stefan Ram allegedly wrote:

> Daniele Futtorovic <da.fut...@laposte.invalid> writes:
>>> Daniele Futtorovic wrote:
>>>> Have you tried not using any "encoding"? As others pointed out,
>>>> System.in is a Reader, that is something which already has some kind of
>>>> byte-to-char handling.
>> <scratches head, walks to the nearest wall, bangs>
>
> My fault. It seems as if I would have assumed that there
> is a symmetry between System.in and System.out.

No, mine really -- I should know the class of System.in by heart --, as
well as accumulated frustration over too many mistakes in posts lately,
perplexing me. I hate making mistakes. Especially in public. :)


> Still, allegedly java.lang.System.in sometimes /has/ some
> transcoding magic in it (based on a native method).
>
> For example:
>
> »Data read from [...] System.in, [...] are handled
> differently than data read from [...] other sources [...].
>
> [A] conversion is performed by the JVM on the data to
> convert from the normal character encoding of
> file.encoding to a CCSID matching the System i job CCSID.
>
> When System.in [...][is] redirected [...], this additional
> data conversion is not performed and the data remains in a
> character encoding matching file.encoding.«
>
> http://publib.boulder.ibm.com/infocenter/iseries/v5r4/topic/rzaha/charenc.htm

This appears to be specific to the iSeries. I can't find any other
reference to System.in and encoding on the Sun site. Furthermore, the
fact that System.in is an InputStream speaks squarely against any type
of byte-to-char conversion (<=> "encoding"), doesn't it? Or should there
be some magic hidden in the JVM that decides whether the process' input
is text? I don't think that's likely. I don't even see why that
would be a good idea.

--
DF.

Dan Stromberg

Jul 30, 2008, 10:33:57 PM
On Mon, 28 Jul 2008 05:53:20 +0000, Stefan Ram wrote:

> Dan Stromberg <dstromb...@gmail.com> writes:
>>Is the java String type -always- 16 bits per character?
>

> Yes (if we ignore surrogate pairs, which are rare and not used for
> umlauts).


>
>>That is, if I try to stick an 8 bit value into a String, is it always
>>going to be converted to a different encoding that maps back most of the
>>time, but not always?
>

> The Reader objects already take care to convert between raw bytes and
> characters. Strings contain characters, strictly speaking, they have no
> »encoding«. They might be converted to/from byte[] or streams to en-
> or decode them.


>
>>Do java strings of any sort have an associated but variable encoding?
>

> No. Ignoring surrogate pairs, a string is a sequence of characters;
> the value of each character /always/ is the corresponding Unicode code
> point.


>
>>Are there different string types that have different encodings?
>

> No (for the strings of the standard class »java.lang.String«).


>
>>Is there any way of opening a filename that isn't stored in a String?
>

> Not with the standard classes AFAIK.
>
> ~~
>
> To debug, try this:
>
> $mkdir d0
> $touch d0/ä
> $find d0 -name ä -print | od -h
> 0000000 6430 2fe4 0a00
> 0000005
>
> If the filesystem uses ISO 8859-1, you should see »e4« as above
> (»64302fe4« is »d0/ä«).
>
> Then, read the output of this find from Java and debug print it from
> Java to a sequence of hex codes.
>
> If it is »64302fe4«, then you have read it correctly (ISO 8859-1 code
> points agree with Unicode code points here). Otherwise, you might post
> here what it is instead.
>
> You can also bypass the Reader class, read the »raw bytes« from the
> stream, and use their hex dump to get an idea of the apparent encoding
> of the stream (post the hexdump here).

Often, at least on *ix, strace/truss/par/trace are a more direct route to
a solution than endless test programs.

I ran the OpenJDK version of my program under strace, and found that this
is what's being read:

[pid 11252] read(0, "/home/dstromberg/Sound/Music/mp3/Bjork/Bj\366rk_The
Music From Drawing Restraint 9_06_Shimenawa.mp3\n/home/dstromberg/Sound/
Music/mp3/Bjork/Bj\366rk_The Music From Drawing Restraint 9_10_Cetacea.mp3
\n/home/dstromberg/Sound/Music/mp3/Bjork/Bj\366rk_The Music From Drawing
Restraint 9_04_Bath.mp3\n/home/dstromberg/Sound/Music/mp3/Bjork/Bj
\366rk_The Music From Drawing Restraint 9_05_Hunter Vessel.mp3\n/home/
dstromberg/Sound/Music/mp3/Bjork/Bj\366rk_The Music From Drawing
Restraint 9_01_Gratitude.mp3\n/home/dstromberg/Sound/Music/mp3/Bjork/Bj
\366rk_The Music From Drawing Restraint 9_03_Ambergris March.mp3\n/home/
dstromberg/Sound/Music/mp3/Bjork/Bj\366rk_The Music From Drawing
Restraint 9_02_Pearl.mp3\n/home/dstromberg/Sound/Music/mp3/Bjork/Bj
\366rk_The Music From Drawing Restraint 9_09_Bolographic Entrypoint.mp3\n/
home/dstromberg/Sound/Music/mp3/Bjork/Bj\366rk_The Music From Drawing
Restraint 9_08_Storm.mp3\n/home/dstromberg/Sound/Music/mp3/Bjork/Bj
\366rk_The Music From Drawing Restraint 9_11_Antarctic Return.mp3\n/home/
dstromberg/Sound/Music/mp3/Bjork/"..., 8192) = 1089

...and this is what it's trying to open:

[pid 11252] open("/home/dstromberg/Sound/Music/mp3/Bjork/Bj�rk_The
Music From Drawing Restraint 9_06_Shimenawa.mp3", O_RDONLY|O_LARGEFILE) =
-1 ENOENT (No such file or directory)

In case your newsreader unmunged that for you, the read has one non-ASCII
byte for o+umlaut, and the open has 3 non-ASCII bytes for o+umlaut.

Any further suggestions, folks?
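One mechanism that would produce exactly this pattern (one byte in, three bytes out) is a UTF-8 decode of a byte that is not valid UTF-8, followed by a UTF-8 encode of the resulting replacement character. The sketch below demonstrates that mechanism in isolation; it is an assumption that this is what happened inside the JVM here, and the class name Mangle and its methods are invented for illustration.

```java
import java.nio.charset.StandardCharsets;

public class Mangle {
    // Lowercase hex rendering of a byte array, two digits per byte.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    // Decode as UTF-8, then encode as UTF-8 again. Invalid input bytes are
    // replaced by U+FFFD on the way in, so they cannot be recovered.
    public static byte[] viaUtf8(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] fromFind = {'B', 'j', (byte) 0xF6, 'r', 'k'}; // "Bj\366rk" as read
        System.out.println(hex(fromFind));          // 426af6726b
        System.out.println(hex(viaUtf8(fromFind))); // 426aefbfbd726b
    }
}
```

The single byte f6 becomes ef bf bd, the UTF-8 form of U+FFFD, which no longer matches the name stored on disk.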

stro...@gmail.com

Sep 14, 2008, 5:06:41 PM

I found some good help with this over on OpenJDK's i18n-dev mailing
list.

It turns out that in java (and perhaps other languages with
localization support) many locales do not guarantee correct round-trip
conversion from 8 bit filenames to 16 bit and back to 8 bit - so
you'll seem to get phantom files that seem to be there for one purpose
but not another. en_US.ISO-8859-1 is one of the few that does make
this guarantee - that is, no phantom files. I'd been trying that
locale among a handful of others, but it wasn't working because I
didn't have that locale configured on my system.
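The round-trip property can be checked at the charset level with a short experiment. This sketch counts how many single byte values survive bytes -> String -> bytes for a few charsets; it illustrates the general point rather than the JVM's actual filename handling, and the class name RoundTrip and method survivors are made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RoundTrip {
    // Count the single byte values (0..255) that come back unchanged after
    // being decoded to a String and encoded again with the same charset.
    public static int survivors(Charset cs) {
        int count = 0;
        for (int v = 0; v < 256; v++) {
            byte[] in = {(byte) v};
            byte[] out = new String(in, cs).getBytes(cs);
            if (out.length == 1 && out[0] == in[0]) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Charset[] candidates = {
            StandardCharsets.ISO_8859_1,
            StandardCharsets.US_ASCII,
            StandardCharsets.UTF_8
        };
        for (Charset cs : candidates) {
            System.out.println(cs.name() + ": " + survivors(cs) + " of 256");
        }
    }
}
```

ISO-8859-1 reports 256 of 256, while US-ASCII and UTF-8 each report 128, because every lone byte above 0x7F is replaced during decoding.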

The python, perl and java versions of the program are now at
http://stromberg.dnsalias.org/~strombrg/equivalence-classes.html

Thanks to all who took an interest in the project!


Roedy Green

Sep 15, 2008, 7:59:43 PM
On Sun, 27 Jul 2008 22:54:46 GMT, Dan Stromberg
<dstromb...@gmail.com> wrote, quoted or indirectly quoted someone
who said :

>
>I suspect the problem is that the file_name_, as it appears in a Linux
>ext3 filesystem, has an 8 bit per character representation, but java
>wants to convert the string I read from stdin to a 16 bit per character
>representation, and then doesn't reverse the conversion when I go to open
>the file by its name.

For background on your problem, see
http://mindprod.com/jgloss/encoding.html

I suggest you put your filenames in a file with UTF-8 encoding or some
encoding that supports umlauts. Then read it with a Reader. See
http://mindprod.com/applet/fileio.html for sample code.

Alternatively, encode your umlauts in some weird way for the console,
e.g. u^, and convert them back.

--

Roedy Green Canadian Mind Products
The Java Glossary
http://mindprod.com

Andreas Leitgeb

Sep 16, 2008, 5:02:10 AM
> I suggest you put your filenames in a file with UTF-8 encoding or some
> encoding that supports umlauts. Then read it with a Reader. See
> http://mindprod.com/applet/fileio.html for sample code.

to the OP:

My suggestion is, that you "migrate" your system to utf-8, by renaming
all files with iso-8859-whatever umlauts to utf-8 encoded filenames,
and having system's LANG set to something like de_AT.utf-8 or
en_US.utf-8 or whatever applies to your location.

When I did that a couple of years ago, I wrote some TCL-script to
do the renaming. The script is available, but isn't optimized for
fool-proof usage (no GUI, no "usage:" screen). Also, no warranties
whatsoever.
Anyway, (if still not scared/bored away) it's here:
<http://www.logic.at/people/avl/stuff/convertNamesToUtf8.tcl>
(tclsh should be available (if not preinstalled) on all linux-
distributions, anyway.) Just go to the root of a tree that contains
files with umlauts in their names, and run the script from there,
but of course only after having had a look at the script to verify
it doesn't install a trojan.
