TextStreamReader with transparent unicode BOM Support

X_AWemner_X

Jul 2, 2003, 5:27:56 AM

OK, here is a teaser for all Java I/O coders. Make us all happy and create a
FilterReader with proper Unicode BOM support.

As you know, _we_ have to tell InputStreamReader which Unicode charset to use
for read operations (UTF-8, UTF-16, ...). The reader does support the BOM for
the UTF-16 charset name and skips the first bytes, but we still must tell it
to use UTF-16. With UTF-8 files it fails: the BOM is not skipped.

Win2k Notepad writes a BOM at the start of UTF-8 files, and currently ISR
(InputStreamReader) cannot read it properly.
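
The difference between the two charset names can be checked with in-memory bytes. The following is only a minimal sketch; the class name BomBehaviourDemo and the helper firstChar are made up for illustration. It decodes a BOM-prefixed "A" once as UTF-8 and once as UTF-16 and prints the first character each reader returns.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;

public class BomBehaviourDemo {

    /** Hypothetical helper: decode the bytes and return the first char, or -1 if there is none. */
    static int firstChar(byte[] data, String charsetName) throws IOException {
        Reader r = new InputStreamReader(new ByteArrayInputStream(data), charsetName);
        try {
            return r.read();
        } finally {
            r.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // "A" preceded by the UTF-8 BOM (EF BB BF)
        byte[] utf8 = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'A' };
        // "A" preceded by the big-endian UTF-16 BOM (FE FF)
        byte[] utf16 = { (byte) 0xFE, (byte) 0xFF, 0, 'A' };

        // The UTF-8 decoder hands the BOM to the caller as U+FEFF
        System.out.println(Integer.toHexString(firstChar(utf8, "UTF-8")));   // feff
        // The UTF-16 decoder consumes the BOM and returns 'A'
        System.out.println(Integer.toHexString(firstChar(utf16, "UTF-16"))); // 41
    }
}
```

So with "UTF-8" the caller has to drop the leading U+FEFF itself, which is what a BOM-aware reader would hide.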

http://www.unicode.org/unicode/faq/utf_bom.html#22

Now, does anyone have a stream reader that supports BOMs fully transparently,
something like this?

String defaultEnc = "UTF-8"; // Java's default is the platform encoding (e.g. ISO-8859-1)
Reader in = new BestUnicodeTextStreamReader(
    new FileInputStream("myfile.txt"), defaultEnc);

This class would recognize all BOMs automatically and use the matching
encoding. If no BOM is found, it would use the given defaultEnc value.

I am sure we n00b coders would love to use such a reader implementation.

Thomas Weidenfeller

Jul 2, 2003, 7:47:01 AM

"X_AWemner_X" <ma...@mail.com> writes:
> Ok, here is a teaser for all java io coders. Make us all happy and create a
> filterreader with proper unicode bom support.

[...]

>
> I am sure we n00b coders would love to use such reader implementation.
>

Doesn't your company (that's ZenPark, isn't it?) have its own software
development department that can do such work? Please tell me where I
should send the bill for the following rough and inefficient
sketch. :-)

I have left out all exception handling and minor details:

class UnicodeReader extends Reader { // Reader is an abstract class, not an interface
    PushbackInputStream internalIn;
    InputStreamReader internalOut = null;
    String defaultEnc;

    private static final int BOM_SIZE = 3; // enough for UTF-8 and UTF-16

    UnicodeReader(InputStream in, String defaultEnc) {
        internalIn = new PushbackInputStream(in, BOM_SIZE);
        this.defaultEnc = defaultEnc;
    }

    protected void init() {
        if (internalOut != null) {
            return;
        }

        byte bom[] = new byte[BOM_SIZE];
        int n;
        int pos = 0;
        // read() may return fewer bytes than requested, so loop
        while (pos < BOM_SIZE &&
               (n = internalIn.read(bom, pos, BOM_SIZE - pos)) != -1)
        {
            pos += n;
        }
        // Note: this pushes back everything, including any BOM bytes.
        // To skip a UTF-8 BOM, push back only the bytes that follow it.
        internalIn.unread(bom, 0, pos);
        String encoding = ... // evaluate the content of bom[] here
                              // revert to defaultEnc if nothing found
        internalOut = new InputStreamReader(internalIn, encoding);
    }

    //
    // For all methods in class Reader, implement each method as:
    //
    // method(...) {
    //     init();
    //     return internalOut.method(...);
    // }
    //
}

/Thomas

Roedy Green

Jul 2, 2003, 10:47:44 AM

On Wed, 2 Jul 2003 12:27:56 +0300, "X_AWemner_X" <ma...@mail.com> wrote
or quoted :

>Win2k Notepad stores BOM mark at the start of UTF-8 files, and currently ISR
>cannot read it properly.

see http://mindprod.com/jgloss/encoding.html

What happens if you use UTF-8 or UTF-16 encoding with the code
suggested by the File I/O Amanuensis at
http://mindprod.com/fileio.html?

Java is not smart enough to flip between UTF-8 and UTF-16 automatically,
but it is smart enough to deal with endian markers, both BE and LE.

Ideally this should be implemented as yet another encoding:
Unicode-8-16. Does anyone know how you insert your own encoding into
the official list? You can't pass any parameters to the encoding such
as your preferred default big/little endian, so you must create
variant names for all the combinations.
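
For reference, the java.nio.charset.spi.CharsetProvider class is the documented hook for adding charsets. What follows is only a rough sketch of the registration side, not the BOM-sniffing decoder itself: the class names and the "X-Unicode-8-16" name are invented for illustration, and the charset simply delegates to UTF-8 so the example stays short.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.spi.CharsetProvider;
import java.util.Collections;
import java.util.Iterator;

public class CustomCharsetSketch {
    // Hypothetical charset name, chosen only for this illustration
    static final String NAME = "X-Unicode-8-16";

    /** Placeholder charset: delegates to UTF-8 instead of sniffing a BOM. */
    static final class DelegatingCharset extends Charset {
        private final Charset utf8 = Charset.forName("UTF-8");

        DelegatingCharset() {
            super(NAME, null); // canonical name, no aliases
        }

        @Override public boolean contains(Charset cs) { return utf8.contains(cs); }
        @Override public CharsetDecoder newDecoder() { return utf8.newDecoder(); }
        @Override public CharsetEncoder newEncoder() { return utf8.newEncoder(); }
    }

    /** Provider that exposes the single charset above. */
    public static final class Provider extends CharsetProvider {
        private final Charset charset = new DelegatingCharset();

        @Override public Iterator<Charset> charsets() {
            return Collections.singletonList(charset).iterator();
        }

        @Override public Charset charsetForName(String name) {
            return NAME.equalsIgnoreCase(name) ? charset : null;
        }
    }

    public static void main(String[] args) {
        // Calls the provider directly; no service registration is needed for this check
        Charset cs = new Provider().charsetForName(NAME);
        System.out.println(cs.name() + " decodes: "
                + cs.decode(ByteBuffer.wrap(new byte[] { 'o', 'k' })));
    }
}
```

To make Charset.forName(...) find such a charset, the provider's class name would be listed in a file named META-INF/services/java.nio.charset.spi.CharsetProvider on the class path. A provider still takes no parameters, so variant names per default endianness would remain necessary.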

--
Canadian Mind Products, Roedy Green.
Coaching, problem solving, economical contract programming.
See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.

NoName NoName

Jul 3, 2003, 5:38:53 PM

Thx for the good tip, I was not aware of the PushbackInputStream class. It
made everything really simple to do. Here is an implementation of what you
suggested.

/**
Original pseudocode : Thomas Weidenfeller
Implementation tweaked: Aki Nieminen

http://www.unicode.org/unicode/faq/utf_bom.html
BOMs:
00 00 FE FF = UTF-32, big-endian
FF FE 00 00 = UTF-32, little-endian
FE FF = UTF-16, big-endian
FF FE = UTF-16, little-endian
EF BB BF = UTF-8

Win2k Notepad:
Unicode format = UTF-16LE
***/

import java.io.*;

/**
 * Generic Unicode text reader, which uses the BOM
 * to identify the encoding to be used.
 */
public class UnicodeReader extends Reader {
    PushbackInputStream internalIn;
    InputStreamReader internalIn2 = null;
    String defaultEnc;

    private static final int BOM_SIZE = 4;

    UnicodeReader(InputStream in, String defaultEnc) {
        internalIn = new PushbackInputStream(in, BOM_SIZE);
        this.defaultEnc = defaultEnc;
    }

    public String getDefaultEncoding() {
        return defaultEnc;
    }

    public String getEncoding() {
        if (internalIn2 == null) return null;
        return internalIn2.getEncoding();
    }

    /**
     * Read ahead up to four bytes and check for a BOM. Extra bytes are
     * unread back to the stream; only BOM bytes are skipped.
     */
    protected void init() throws IOException {
        if (internalIn2 != null) return;

        String encoding;

        byte bom[] = new byte[BOM_SIZE];

        // read() may return fewer bytes than requested, so keep reading
        // until the buffer is full or the stream ends
        int n = 0, r, unread;
        while (n < BOM_SIZE &&
               (r = internalIn.read(bom, n, BOM_SIZE - n)) != -1) {
            n += r;
        }

        // Test UTF-32 before UTF-16: FF FE 00 00 (UTF-32LE) begins with
        // FF FE (UTF-16LE), so a UTF-16LE file whose first character is
        // U+0000 cannot be told apart from UTF-32LE. The checks on n keep
        // the zero padding in bom[] from being taken for BOM bytes.
        if ( (n >= 4) && (bom[0] == (byte)0x00) && (bom[1] == (byte)0x00) &&
             (bom[2] == (byte)0xFE) && (bom[3] == (byte)0xFF) ) {
            encoding = "UTF-32BE";
            unread = n - 4;
        } else if ( (n >= 4) && (bom[0] == (byte)0xFF) && (bom[1] == (byte)0xFE) &&
                    (bom[2] == (byte)0x00) && (bom[3] == (byte)0x00) ) {
            encoding = "UTF-32LE";
            unread = n - 4;
        } else if ( (n >= 3) && (bom[0] == (byte)0xEF) && (bom[1] == (byte)0xBB) &&
                    (bom[2] == (byte)0xBF) ) {
            encoding = "UTF-8";
            unread = n - 3;
        } else if ( (n >= 2) && (bom[0] == (byte)0xFE) && (bom[1] == (byte)0xFF) ) {
            encoding = "UTF-16BE";
            unread = n - 2;
        } else if ( (n >= 2) && (bom[0] == (byte)0xFF) && (bom[1] == (byte)0xFE) ) {
            encoding = "UTF-16LE";
            unread = n - 2;
        } else {
            // Unicode BOM not found, unread all bytes
            encoding = defaultEnc;
            unread = n;
        }

        // Push back the bytes that follow the BOM (or everything if no BOM)
        if (unread > 0) internalIn.unread(bom, (n - unread), unread);

        // Use given encoding
        if (encoding == null) {
            internalIn2 = new InputStreamReader(internalIn);
        } else {
            internalIn2 = new InputStreamReader(internalIn, encoding);
        }
    }

    public void close() throws IOException {
        init();
        internalIn2.close();
    }

    public int read(char[] cbuf, int off, int len) throws IOException {
        init();
        return internalIn2.read(cbuf, off, len);
    }

}
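
To try the push-back idea without touching the file system, it can be exercised on in-memory bytes. The following is a compact, self-contained sketch under some assumptions: BomSniffDemo, open and readAll are hypothetical names, and only the UTF-8 and UTF-16 BOMs are handled to keep it short.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;

public class BomSniffDemo {

    /** Hypothetical factory: picks the charset from a UTF-8/UTF-16 BOM, else uses defaultEnc. */
    static Reader open(InputStream in, String defaultEnc) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] bom = new byte[3];
        int n = 0, r;
        // Fill the buffer; a single read() call may return fewer bytes
        while (n < bom.length && (r = pin.read(bom, n, bom.length - n)) != -1) {
            n += r;
        }
        String enc = defaultEnc;
        int skip = 0; // number of BOM bytes to drop
        if (n >= 3 && bom[0] == (byte) 0xEF && bom[1] == (byte) 0xBB && bom[2] == (byte) 0xBF) {
            enc = "UTF-8";
            skip = 3;
        } else if (n >= 2 && bom[0] == (byte) 0xFE && bom[1] == (byte) 0xFF) {
            enc = "UTF-16BE";
            skip = 2;
        } else if (n >= 2 && bom[0] == (byte) 0xFF && bom[1] == (byte) 0xFE) {
            enc = "UTF-16LE";
            skip = 2;
        }
        // Push back only the bytes that follow the BOM
        if (n > skip) {
            pin.unread(bom, skip, n - skip);
        }
        return new InputStreamReader(pin, enc);
    }

    /** Hypothetical helper: decode a whole byte array into a String. */
    static String readAll(byte[] data, String defaultEnc) throws IOException {
        Reader reader = open(new ByteArrayInputStream(data), defaultEnc);
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            sb.append((char) c);
        }
        reader.close();
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] utf8 = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i' };
        byte[] utf16le = { (byte) 0xFF, (byte) 0xFE, 'h', 0, 'i', 0 };
        byte[] plain = { 'h', 'i' };
        System.out.println(readAll(utf8, "ISO-8859-1"));    // hi
        System.out.println(readAll(utf16le, "ISO-8859-1")); // hi
        System.out.println(readAll(plain, "ISO-8859-1"));   // hi
    }
}
```

All three inputs decode to the same text, and the default encoding is used only for the input without a BOM.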

> I have left out all exception handling and minor details:
>
> class UnicodeReader implements Reader {
> PushbackInputStream internalIn;
> InputStreamReader internalOut = null;
> String defaultEnc;

<...clip clip...>