As you know, _we_ have to tell InputStreamReader which Unicode charset to use
for read operations (UTF-8, UTF-16, ...). The reader does honor the BOM mark
when the UTF-16 charset is specified and skips the leading bytes, but we still
have to tell it to use UTF-16, and it fails with UTF-8 files.
Win2k Notepad stores a BOM mark at the start of UTF-8 files, and currently
InputStreamReader cannot read it properly.
http://www.unicode.org/unicode/faq/utf_bom.html#22
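The symptom is easy to reproduce without Notepad by simulating the file
contents with a byte array: Java's UTF-8 decoder passes the BOM through as an
ordinary U+FEFF character instead of skipping it.

```java
import java.io.*;

public class BomProblemDemo {
    public static void main(String[] args) throws IOException {
        // Simulate a file saved by Win2k Notepad as UTF-8: BOM + "Hi"
        byte[] data = {(byte)0xEF, (byte)0xBB, (byte)0xBF, 'H', 'i'};
        Reader in = new InputStreamReader(new ByteArrayInputStream(data), "UTF-8");

        // The UTF-8 decoder hands the BOM to the caller as U+FEFF
        int first = in.read();
        System.out.println(Integer.toHexString(first));
        System.out.println((char) in.read());
    }
}
```

This prints `feff` followed by `H`, so the first "character" the application
sees is the byte order mark itself.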
Now, does anyone have a stream reader that supports BOMs fully transparently,
something like this?
String defaultEnc = "UTF-8"; // fallback; the JVM default is platform-dependent
Reader in = new BestUnicodeTextStreamReader(
        new FileInputStream("myfile.txt"), defaultEnc);
-> This class would recognize all BOM marks automatically and use the matching
encoding. If no BOM is found, it would fall back to the given defaultEnc value.
I am sure we n00b coders would love to use such a reader implementation.
Doesn't your company (that's ZenPark, isn't it?) have its own software
development department that can do such contract work? Please tell
me where I should send the bill for the following rough and inefficient
sketch? :-)
I have left out all exception handling and minor details:
class UnicodeReader extends Reader { // Reader is an abstract class, not an interface
    PushbackInputStream internalIn;
    InputStreamReader internalOut = null;
    String defaultEnc;

    private static final int BOM_SIZE = 3; // enough for UTF-8 and UTF-16

    UnicodeReader(InputStream in, String defaultEnc) {
        internalIn = new PushbackInputStream(in, BOM_SIZE);
        this.defaultEnc = defaultEnc;
    }

    protected void init() {
        if (internalOut != null) {
            return;
        }

        byte bom[] = new byte[BOM_SIZE];
        int n;
        int pos = 0;
        // A single read() call may return fewer bytes than requested
        while (pos < BOM_SIZE &&
               (n = internalIn.read(bom, pos, BOM_SIZE - pos)) != -1) {
            pos += n;
        }
        internalIn.unread(bom, 0, pos);

        String encoding = ... // evaluate the content of bom[] here;
                              // revert to defaultEnc if nothing found
        internalOut = new InputStreamReader(internalIn, encoding);
    }

    //
    // For all abstract methods of Reader, implement each method as:
    //
    // method(...) {
    //     init();
    //     internalOut.method(...);
    // }
    //
}
/Thomas
>Win2k Notepad stores BOM mark at the start of UTF-8 files, and currently ISR
>cannot read it properly.
see http://mindprod.com/jgloss/encoding.html
What happens if you use UTF-8 or UTF-16 encoding on the code
suggested by the File IO amanuensis at
http://mindprod.com/fileio.html?
Java is not smart enough to flip between 8 and 16 automatically, but it is
smart enough to deal with endian markers, both BE and LE.
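The endian-marker handling is easy to verify: the plain "UTF-16" charset (as
opposed to "UTF-16BE"/"UTF-16LE") reads the BOM, picks the byte order from it,
and drops the mark from the decoded text.

```java
import java.io.UnsupportedEncodingException;

public class Utf16BomDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // The same character "A", once big-endian, once little-endian,
        // each preceded by the matching BOM
        byte[] be = {(byte)0xFE, (byte)0xFF, 0x00, 'A'};
        byte[] le = {(byte)0xFF, (byte)0xFE, 'A', 0x00};

        // "UTF-16" consumes the BOM and decodes the rest accordingly
        System.out.println(new String(be, "UTF-16"));
        System.out.println(new String(le, "UTF-16"));
    }
}
```

Both lines print `A`; no U+FEFF leaks into the decoded string.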
Ideally this should be implemented as yet another encoding:
Unicode-8-16. Does anyone know how you insert your own encoding into
the official list? You can't pass any parameters to an encoding, such
as your preferred default big/little endian, so you would have to create
variant names for all the combinations.
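Since JDK 1.4 there is a hook for exactly this: subclass
java.nio.charset.spi.CharsetProvider and register it via a
META-INF/services/java.nio.charset.spi.CharsetProvider file on the classpath.
The skeleton below is only a sketch: the name "X-Unicode-8-16" is made up, and
the charset simply delegates to UTF-8; a real implementation would put the BOM
sniffing into its own CharsetDecoder instead.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.spi.CharsetProvider;
import java.util.Collections;
import java.util.Iterator;

// Hypothetical charset; currently behaves exactly like UTF-8.
class Unicode816Charset extends Charset {
    private static final Charset UTF8 = Charset.forName("UTF-8");

    Unicode816Charset() {
        super("X-Unicode-8-16", new String[0]);
    }
    public boolean contains(Charset cs) {
        return UTF8.contains(cs);
    }
    public CharsetDecoder newDecoder() {
        // Sketch only: a real decoder would examine the first bytes for
        // a BOM here and switch between UTF-8 and UTF-16 accordingly.
        return UTF8.newDecoder();
    }
    public CharsetEncoder newEncoder() {
        return UTF8.newEncoder();
    }
}

public class Unicode816CharsetProvider extends CharsetProvider {
    private final Charset cs = new Unicode816Charset();

    public Iterator<Charset> charsets() {
        return Collections.singletonList(cs).iterator();
    }
    public Charset charsetForName(String name) {
        return cs.name().equalsIgnoreCase(name) ? cs : null;
    }

    public static void main(String[] args) {
        CharsetProvider p = new Unicode816CharsetProvider();
        System.out.println(p.charsetForName("X-Unicode-8-16").name());
    }
}
```

Once the provider is registered through the services file,
Charset.forName("X-Unicode-8-16") would find it; and as noted above, the
default endianness could only be expressed through variant names, since
charsets take no parameters.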
--
Canadian Mind Products, Roedy Green.
Coaching, problem solving, economical contract programming.
See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
/**
Original pseudocode : Thomas Weidenfeller
Implementation tweaked: Aki Nieminen
http://www.unicode.org/unicode/faq/utf_bom.html
BOMs:
00 00 FE FF = UTF-32, big-endian
FF FE 00 00 = UTF-32, little-endian
FE FF = UTF-16, big-endian
FF FE = UTF-16, little-endian
EF BB BF = UTF-8
Win2k Notepad:
Unicode format = UTF-16LE
***/
import java.io.*;
/**
* Generic unicode textreader, which will use BOM mark
* to identify the encoding to be used.
*/
public class UnicodeReader extends Reader {
    PushbackInputStream internalIn;
    InputStreamReader internalIn2 = null;
    String defaultEnc;

    private static final int BOM_SIZE = 4;

    public UnicodeReader(InputStream in, String defaultEnc) {
        internalIn = new PushbackInputStream(in, BOM_SIZE);
        this.defaultEnc = defaultEnc;
    }

    public String getDefaultEncoding() {
        return defaultEnc;
    }

    public String getEncoding() {
        if (internalIn2 == null) return null;
        return internalIn2.getEncoding();
    }

    /**
     * Read ahead four bytes and check for BOM marks. Extra bytes are
     * unread back to the stream; only BOM bytes are skipped.
     */
    protected void init() throws IOException {
        if (internalIn2 != null) return;

        String encoding;
        byte bom[] = new byte[BOM_SIZE];
        int n = 0, r, unread;

        // A single read() call may return fewer bytes than requested
        while (n < BOM_SIZE &&
               (r = internalIn.read(bom, n, BOM_SIZE - n)) != -1) {
            n += r;
        }

        // Check the four-byte UTF-32 marks first: FF FE 00 00 would
        // otherwise be mistaken for a UTF-16LE BOM (FF FE).
        if ((n == 4) && (bom[0] == (byte)0x00) && (bom[1] == (byte)0x00) &&
                (bom[2] == (byte)0xFE) && (bom[3] == (byte)0xFF)) {
            encoding = "UTF-32BE";
            unread = n - 4;
        } else if ((n == 4) && (bom[0] == (byte)0xFF) && (bom[1] == (byte)0xFE) &&
                (bom[2] == (byte)0x00) && (bom[3] == (byte)0x00)) {
            encoding = "UTF-32LE";
            unread = n - 4;
        } else if ((n >= 3) && (bom[0] == (byte)0xEF) && (bom[1] == (byte)0xBB) &&
                (bom[2] == (byte)0xBF)) {
            encoding = "UTF-8";
            unread = n - 3;
        } else if ((n >= 2) && (bom[0] == (byte)0xFE) && (bom[1] == (byte)0xFF)) {
            encoding = "UTF-16BE";
            unread = n - 2;
        } else if ((n >= 2) && (bom[0] == (byte)0xFF) && (bom[1] == (byte)0xFE)) {
            encoding = "UTF-16LE";
            unread = n - 2;
        } else {
            // Unicode BOM mark not found, unread all bytes
            encoding = defaultEnc;
            unread = n;
        }

        if (unread > 0) internalIn.unread(bom, (n - unread), unread);

        // Use given encoding; null means the platform default
        if (encoding == null) {
            internalIn2 = new InputStreamReader(internalIn);
        } else {
            internalIn2 = new InputStreamReader(internalIn, encoding);
        }
    }

    public void close() throws IOException {
        init();
        internalIn2.close();
    }

    public int read(char[] cbuf, int off, int len) throws IOException {
        init();
        return internalIn2.read(cbuf, off, len);
    }
}
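For anyone who wants to try the idea without pasting the whole class, here is
a condensed, self-contained version of the same detection step (the UTF-32
checks are left out for brevity) together with a usage example:

```java
import java.io.*;

public class BomSniffDemo {
    // Minimal stand-alone version of the detection step: peek at up to
    // four bytes, pick a charset, push back everything that is not a BOM.
    static Reader open(InputStream in, String defaultEnc) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 4);
        byte[] bom = new byte[4];
        int n = 0, r;
        while (n < 4 && (r = pin.read(bom, n, 4 - n)) != -1) n += r;

        String enc;
        int skip; // number of BOM bytes to swallow
        if (n >= 3 && (bom[0] & 0xFF) == 0xEF && (bom[1] & 0xFF) == 0xBB
                && (bom[2] & 0xFF) == 0xBF) {
            enc = "UTF-8"; skip = 3;
        } else if (n >= 2 && (bom[0] & 0xFF) == 0xFE && (bom[1] & 0xFF) == 0xFF) {
            enc = "UTF-16BE"; skip = 2;
        } else if (n >= 2 && (bom[0] & 0xFF) == 0xFF && (bom[1] & 0xFF) == 0xFE) {
            enc = "UTF-16LE"; skip = 2;
        } else {
            enc = defaultEnc; skip = 0; // no BOM: keep every byte
        }
        pin.unread(bom, skip, n - skip);
        return new InputStreamReader(pin, enc);
    }

    public static void main(String[] args) throws IOException {
        byte[] utf8Bom = {(byte)0xEF, (byte)0xBB, (byte)0xBF, 'H', 'i'};
        byte[] plain = {'H', 'i'};
        BufferedReader a = new BufferedReader(
                open(new ByteArrayInputStream(utf8Bom), "ISO-8859-1"));
        BufferedReader b = new BufferedReader(
                open(new ByteArrayInputStream(plain), "ISO-8859-1"));
        System.out.println(a.readLine()); // BOM skipped
        System.out.println(b.readLine()); // default encoding used
    }
}
```

Both streams come back as plain `Hi` — the BOM'd file is decoded as UTF-8
with the mark swallowed, and the BOM-less one falls back to the default.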
> I have left out all exception handling and minor details:
>
> class UnicodeReader implements Reader {
> PushbackInputStream internalIn;
> InputStreamReader internalOut = null;
> String defaultEnc;
<...clip clip...>