Issue 467 in zxing: Improving the charset guessing method for the QR code decoder.


zx...@googlecode.com

Jul 5, 2010, 4:36:28 AM
to zx...@googlegroups.com
Status: New
Owner: ----
Labels: Type-Defect Priority-Medium

New issue 467 by sangkkim78: Improving the charset guessing method for the
QR code decoder.
http://code.google.com/p/zxing/issues/detail?id=467

To address the charset problem in QR code decoding, I revised the
com.google.zxing.qrcode.decoder.DecodedBitStreamParser class as shown below.

----------------------------------------------------------------------
package com.google.zxing.qrcode.decoder;

// New imports needed by myGuessEncoding(); the file's existing imports are omitted here.
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

final class DecodedBitStreamParser {

  // The rest of the class, including the UTF8 and SHIFT_JIS constants, is omitted here.

  private static void decodeByteSegment(BitSource bits,
                                        StringBuffer result,
                                        int count,
                                        CharacterSetECI currentCharacterSetECI,
                                        Vector byteSegments,
                                        Hashtable hints) throws FormatException {
    byte[] readBytes = new byte[count];
    if (count << 3 > bits.available()) {
      throw FormatException.getFormatInstance();
    }
    for (int i = 0; i < count; i++) {
      readBytes[i] = (byte) bits.readBits(8);
    }
    String encoding;
    if (currentCharacterSetECI == null) {
      // I changed the charset guessing method from guessEncoding() to myGuessEncoding()
      encoding = myGuessEncoding(readBytes);
    } else {
      encoding = currentCharacterSetECI.getEncodingName();
    }
    try {
      String decodedString = new String(readBytes, encoding);
      result.append(decodedString);
    } catch (UnsupportedEncodingException uce) {
      throw FormatException.getFormatInstance();
    }
    byteSegments.addElement(readBytes);
  }

  private static String myGuessEncoding(byte[] readBytes) {
    try {
      CharsetDecoder decoder = Charset.forName("utf-8").newDecoder();
      // Report bad input as an exception instead of substituting replacement characters
      decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
      decoder.onMalformedInput(CodingErrorAction.REPORT);
      decoder.decode(ByteBuffer.wrap(readBytes));
      return UTF8;
    } catch (CharacterCodingException ex) {
      // Not valid UTF-8, so fall back to Shift_JIS
      return SHIFT_JIS;
    }
  }
}
----------------------------------------------------------------------

The previous method, guessEncoding(), guesses the charset by examining
the byte values in the array.
My method, myGuessEncoding(), instead validates the byte array with
the CharsetDecoder class.

It is actually very difficult to guess the charset from the byte array alone,
so I changed the code to validate the byte array with the CharsetDecoder.
In my case, the result was better than before.

I tested with the two QR code images below.
1. QR code in UTF-8. (English + Korean)
http://blog.naver.com/sangkkim78/40109826013
2. QR code in SHIFT-JIS. (Japanese)
http://blog.naver.com/sangkkim78/40109826258

Could you consider my approach for better results?

Thanks.


zx...@googlecode.com

Jul 5, 2010, 11:16:28 AM
to zx...@googlegroups.com
Updates:
Status: NotABug
Labels: -Priority-Medium Priority-Low

Comment #1 on issue 467 by sro...@gmail.com: Improving the charset guessing

Yes, but CharsetDecoder is not available in J2ME, and the code you're
referring to is common to all clients including J2ME.

It would probably be possible to plug in detection logic so that
the Android client can insert logic like this, because it can use this
class. It might be fixing something that isn't broken, though -- do you have
a test case that shows this fixes some guess? And have you validated that
this approach doesn't break unit tests?

This approach still does not work 100% because in both cases you are basing
the guess off the byte array, and some byte sequences are valid in multiple
encodings.

But most of all, this version never returns ISO-8859-1, which can't be
correct. That is the default and almost always the correct encoding to use.
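
As a rough sketch of the plug-in idea mentioned in this comment: the shared core could depend on a small interface, and only clients that have java.nio.charset would supply a strict implementation. The names EncodingGuesser and StrictUtf8Guesser are hypothetical and are not part of ZXing.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical extension point; not an existing ZXing type.
interface EncodingGuesser {
  /** Returns an encoding name for the bytes, or null to defer to the default heuristic. */
  String guess(byte[] bytes);
}

// A guesser that only a Java SE or Android client could plug in,
// because it depends on java.nio.charset.
final class StrictUtf8Guesser implements EncodingGuesser {
  public String guess(byte[] bytes) {
    try {
      Charset.forName("UTF-8").newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(bytes));
      return "UTF-8";
    } catch (CharacterCodingException e) {
      return null; // not valid UTF-8; let the shared heuristic decide
    }
  }
}
```

Returning null rather than a fallback charset keeps ISO-8859-1 and the existing heuristic in charge whenever the strict check does not apply.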

zx...@googlecode.com

Jul 6, 2010, 4:50:42 AM
to zx...@googlegroups.com

Comment #2 on issue 467 by sangkkim78: Improving the charset guessing

1. I'm sorry, I didn't realize that CharsetDecoder is not available in J2ME.
I was curious why you didn't use it to solve this problem... ;)

2. I have the test case you asked for.
When I tested with your code, I couldn't decode both of the above QR code
images without broken characters under the same settings.

2-1. When I set the charset to "UTF-8", the first QR code, in
UTF-8 (English + Korean), was decoded successfully.
But the other QR code, in SHIFT-JIS (Japanese), was decoded with some
broken characters.

2-2. When I set no charset, the first QR code was decoded
with some broken characters.
But the other QR code was decoded successfully.

2-3. Also, I couldn't decode the second QR code without broken characters
using BarcodeScanner v3.31.
The result was the same as in 2-1.

2-4. But with my code I was able to decode both of them with no broken
characters.

3. I revised my code as shown below and confirmed that it doesn't break any
unit tests.
(I attached my code.)

----------------------------------------------------------------------

package com.google.zxing.qrcode.decoder;

// New imports needed by myGuessEncoding2(); the file's existing imports are omitted here.
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

final class DecodedBitStreamParser {

  // The rest of the class, including the UTF8, SHIFT_JIS and ISO88591 constants, is omitted here.

  private static void decodeByteSegment(BitSource bits,
                                        StringBuffer result,
                                        int count,
                                        CharacterSetECI currentCharacterSetECI,
                                        Vector byteSegments,
                                        Hashtable hints) throws FormatException {
    byte[] readBytes = new byte[count];
    if (count << 3 > bits.available()) {
      throw FormatException.getFormatInstance();
    }
    for (int i = 0; i < count; i++) {
      readBytes[i] = (byte) bits.readBits(8);
    }
    String encoding;
    if (currentCharacterSetECI == null) {
      // I changed the charset guessing method from guessEncoding() to myGuessEncoding2()
      encoding = myGuessEncoding2(readBytes);
    } else {
      encoding = currentCharacterSetECI.getEncodingName();
    }
    try {
      result.append(new String(readBytes, encoding));
    } catch (UnsupportedEncodingException uce) {
      throw FormatException.getFormatInstance();
    }
    byteSegments.addElement(readBytes);
  }

  private static String myGuessEncoding2(byte[] readBytes) {
    try {
      CharsetDecoder decoder = Charset.forName("utf-8").newDecoder();
      // Report bad input as an exception instead of substituting replacement characters
      decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
      decoder.onMalformedInput(CodingErrorAction.REPORT);
      decoder.decode(ByteBuffer.wrap(readBytes));
      return UTF8;
    } catch (CharacterCodingException ex) {
      // This code was added: try Shift_JIS before defaulting to ISO-8859-1
      try {
        CharsetDecoder decoder = Charset.forName("shift_jis").newDecoder();
        decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
        decoder.onMalformedInput(CodingErrorAction.REPORT);
        decoder.decode(ByteBuffer.wrap(readBytes));
        return SHIFT_JIS;
      } catch (Exception ex2) {
        // Not valid Shift_JIS, or the Shift_JIS charset is unavailable
        return ISO88591;
      }
    }
  }
}

----------------------------------------------------------------------

When I ran the unit tests with your code, I got three test failures.
164 test cases were run and 3 of them failed.
The results with my code were the same as with yours. (164 passed, 3
failures.)
The failed test cases are listed below.
I think they are unrelated to our issue.

----------------------------------------------------------------------
Unit test logs.

com.google.zxing.negative.FalsePositivesBlackBoxTestCase
testBlackBox(com.google.zxing.negative.FalsePositivesBlackBoxTestCase)
junit.framework.AssertionFailedError: Rotation 0.0 degrees: Too many false positives found
at junit.framework.Assert.fail(Assert.java:47)
at junit.framework.Assert.assertTrue(Assert.java:20)
at com.google.zxing.common.AbstractNegativeBlackBoxTestCase.testBlackBox(AbstractNegativeBlackBoxTestCase.java:95)
...

com.google.zxing.negative.PartialBlackBoxTestCase
testBlackBox(com.google.zxing.negative.PartialBlackBoxTestCase)
junit.framework.AssertionFailedError: Rotation 0.0 degrees: Too many false positives found
at junit.framework.Assert.fail(Assert.java:47)
at junit.framework.Assert.assertTrue(Assert.java:20)
at com.google.zxing.common.AbstractNegativeBlackBoxTestCase.testBlackBox(AbstractNegativeBlackBoxTestCase.java:95)
...

com.google.zxing.oned.UPCABlackBox4TestCase
testBlackBox(com.google.zxing.oned.UPCABlackBox4TestCase)
junit.framework.AssertionFailedError: Rotation 0.0 degrees: Too many images failed
at junit.framework.Assert.fail(Assert.java:47)
at junit.framework.Assert.assertTrue(Assert.java:20)
at com.google.zxing.common.AbstractBlackBoxTestCase.testBlackBoxCountingResults(AbstractBlackBoxTestCase.java:223)
...

----------------------------------------------------------------------

4. To support J2ME, I have an idea.

If ZXing is running on J2ME, use your charset guessing code.
If it is running on J2SE, use my charset guessing code.

I think this is similar to the ASSUME_SHIFT_JIS flag in the
com.google.zxing.qrcode.decoder.DecodedBitStreamParser class.

I think you can get the JVM information from java.lang.System.

If you accept my idea, you can improve charset handling with no API
change.
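
A minimal sketch of how such a runtime check might look. Instead of reading a system property, it probes whether the CharsetDecoder class can be loaded. The class name CharsetSupport and its methods are hypothetical, not ZXing API.

```java
// Hypothetical helper; not part of ZXing.
final class CharsetSupport {
  private CharsetSupport() {}

  /** True if java.nio.charset.CharsetDecoder can be loaded on this runtime. */
  static boolean hasCharsetDecoder() {
    return isClassPresent("java.nio.charset.CharsetDecoder");
  }

  /** True if the named class can be loaded; false if the runtime does not provide it. */
  static boolean isClassPresent(String className) {
    try {
      Class.forName(className);
      return true;
    } catch (ClassNotFoundException e) {
      return false;
    }
  }
}
```

Note that a check like this only helps if the code that actually uses CharsetDecoder lives in a separate class that is loaded on demand; a class that references CharsetDecoder directly would most likely still fail to build or load for J2ME.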

Thanks.

Attachments:
DecodedBitStreamParser.java 15.4 KB

zx...@googlecode.com

Jul 6, 2010, 11:48:14 AM
to zx...@googlegroups.com

Comment #3 on issue 467 by sro...@gmail.com: Improving the charset guessing
Those failures are not unrelated -- look at the detailed output and I
imagine you'll find it's because many barcodes are decoding to the wrong
text.

I am not sure this approach works better; it looks too simple to just try a
series of encodings and take the first one that isn't invalid. You need to
be a bit more sophisticated at guessing, among several possible valid
encodings, which is most likely.

Example: say you have the following bytes:

6D C3 BC 6C 6C 65 72

This would conclude it is Shift JIS, encoding "mテシller". Those bytes do
validly encode that string in Shift JIS. However I think you'd agree it's
much more likely that this is ISO 8859 1 encoding "müller"

zx...@googlecode.com

Jul 6, 2010, 11:59:47 AM
to zx...@googlegroups.com

Comment #4 on issue 467 by sro...@gmail.com: Improving the charset guessing
I messed up my own example. Instead try:

4D DC 4C 4C 45 52

It can be interpreted as Shift JIS as "MワLLER" but is much more likely
meant to be "MÜLLER", interpreted as ISO 8859 1.
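
The ambiguity in this example can be checked with a short sketch that strictly decodes the same bytes under each candidate charset. The class AmbiguityDemo and the method strictDecode are illustrative names only, and the sketch assumes the runtime ships a Shift_JIS charset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Illustrative only; not part of ZXing.
final class AmbiguityDemo {
  private AmbiguityDemo() {}

  // The bytes from the example: 4D DC 4C 4C 45 52
  static final byte[] MULLER = {0x4D, (byte) 0xDC, 0x4C, 0x4C, 0x45, 0x52};

  /** Strictly decodes bytes in the named charset; returns null if they are not valid in it. */
  static String strictDecode(byte[] bytes, String charsetName) {
    try {
      return Charset.forName(charsetName).newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(bytes))
          .toString();
    } catch (CharacterCodingException e) {
      return null;
    }
  }
}
```

For these bytes, strict UTF-8 decoding fails, while both ISO-8859-1 and Shift_JIS decode without error, so a "first charset that validates" rule picks whichever of the two is tried first. Java's Shift_JIS decoder maps 0xDC to the half-width katakana U+FF9C.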

zx...@googlecode.com

Jul 7, 2010, 2:33:16 AM
to zx...@googlegroups.com

Comment #5 on issue 467 by sangkkim78: Improving the charset guessing

1. I fully agree with you that a more sophisticated approach is required... ^_^

2. I have another idea.

What do you think about using the CharsetDecoder as a charset
post-validator?

My idea is:

1) Guess a charset with your code, guessEncoding().
2) Validate the byte array against the guessed charset using the
CharsetDecoder.
3) If the CharsetDecoder finds a broken character, try another charset.

I think this can improve our charset problem.
At least, we can avoid invalid characters.
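
The three steps above can be sketched roughly as follows. This is not the attached code: GuessValidator, decodesCleanly and validate are hypothetical names, and the heuristic guess is simply passed in as a string to stand in for the result of guessEncoding().

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical post-validator; not part of ZXing.
final class GuessValidator {
  private GuessValidator() {}

  /** True if the bytes decode without malformed or unmappable input in the named charset. */
  static boolean decodesCleanly(byte[] bytes, String charsetName) {
    try {
      Charset.forName(charsetName).newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(bytes));
      return true;
    } catch (CharacterCodingException e) {
      return false; // broken characters in this charset
    } catch (IllegalArgumentException e) {
      return false; // charset name is illegal or unsupported on this runtime
    }
  }

  /**
   * Keeps the heuristic's guess if the bytes are valid in it. Otherwise returns the
   * first candidate in which they are valid, or ISO-8859-1 as a last resort.
   */
  static String validate(byte[] bytes, String guessed, String[] candidates) {
    if (decodesCleanly(bytes, guessed)) {
      return guessed;
    }
    for (String candidate : candidates) {
      if (!candidate.equals(guessed) && decodesCleanly(bytes, candidate)) {
        return candidate;
      }
    }
    return "ISO-8859-1";
  }
}
```

Because the heuristic's guess is only overridden when it is provably invalid, the validator cannot make a correct guess worse; when several candidates are valid, the order of the candidate list still decides the outcome.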

I attached my code implementing my idea.

3. If I had to consider every kind of charset, I would never try to solve
this.
But fortunately we are interested only in ISO8859-1, SHIFT-JIS and UTF-8.

I think it's very hard to find a perfect solution.
(It could be impossible.)

As your example shows, some inputs can be valid in several charsets.
But as a string gets longer, the probability that it is valid in
several encodings gets lower.

If we focus on improvement, I think the CharsetDecoder is useful.

Thank you for listening to my idea.


Attachments:
DecodedBitStreamParser.java 16.1 KB

zx...@googlecode.com

Jul 7, 2010, 4:39:23 AM
to zx...@googlegroups.com

Comment #6 on issue 467 by sro...@gmail.com: Improving the charset guessing

It's not a bad idea to add a 'validator' step. In practice it may not do
much. I think the encoding guessing works surprisingly well in the real
world so it would rarely change the guess.

zx...@googlecode.com

Jul 7, 2010, 7:26:05 AM
to zx...@googlegroups.com

Comment #7 on issue 467 by sangkkim78: Improving the charset guessing

1. I think so, too.

2. I attached my QR codes and photos of their results again to help you.

1) TC1-UTF8.png :
This is a QR code in UTF-8.

2) TC2-SHIFT-JIS.png :
This is a QR code in SHIFT-JIS.
(Sometimes it is not detected reliably, but if you keep trying you
will get a result.)

3. When I tested with BarcodeScanner v3.31, the results were:

1) TC1-UTF8.png was decoded successfully.
(Please see the attached TC1-UTF8-Result-Success.JPG)

2) TC2-SHIFT-JIS.png failed.
(Please see the attached TC2-SHIFT-JIS-Result-Failure.jpg)

4. Both of them should decode successfully without
DecodeHintType.CHARACTER_SET.

When I tested with ZXing-1.5.zip, I couldn't decode both of them
successfully under the same settings.

But after applying the charset validator, I can decode both of them
successfully without DecodeHintType.CHARACTER_SET.
(Please see TC1-UTF8-Result-Success.JPG and
TC2-SHIFT-JIS-Result-Success.JPG)

Thank you.

Attachments:
TC1-UTF8.png 2.1 KB
TC2-SHIFT-JIS.png 61.9 KB
TC1-UTF8-Result-Success.JPG 630 KB
TC1-UTF8-Result-Failure.JPG 651 KB
TC2-SHIFT-JIS-Result-Success.JPG 900 KB
TC2-SHIFT-JIS-Result-Failure.jpg 275 KB

zx...@googlecode.com

Jul 7, 2010, 7:58:56 AM
to zx...@googlegroups.com

Comment #8 on issue 467 by sro...@gmail.com: Improving the charset guessing

The second test actually works for me. From the picture it looks like this
code has multiple byte segments, some quite short, and so short that the
wrong encoding is guessed.

The real problem here is that you can't encode Shift JIS in a QR code,
technically. Strictly speaking this is not valid content; it should be
using Kanji mode, really. But of course in practice people do use byte mode
and Shift JIS. If your device was Japanese the scanner would prefer to
guess Shift JIS and I'm pretty sure it would work.

But your proposed change doesn't address this. Even a validator step
doesn't address this. So what do you think can or should be done? This is
more a problem with the QR code.

zx...@googlecode.com

Jul 7, 2010, 12:53:39 PM
to zx...@googlegroups.com

Comment #9 on issue 467 by sangkkim78: Improving the charset guessing

1. I got the second QR code from the internet.
I didn't make it myself.
I don't know the ISO 18004 spec in detail.
Therefore I can't say whether it is technically wrong. Sorry.

2. My device is Korean, not Japanese.
If you mean that the ASSUME_SHIFT_JIS flag was applied, I find that hard
to agree with.
(Of course, I'll check it again in my office tomorrow morning. It's 1:45
am now... ^^)

3. I think the validator step can handle this.
Because the CharsetDecoder in the validator step throws an exception
when it hits broken characters,
we get a chance to choose another charset.

In the DecodedBitStreamParser.java attached to comment 5,
I used the CharsetDecoder to look for another charset, and it found one.

What I mean is:
1) guessEncoding() guesses an encoding.
2) Broken characters are detected in the validator step.
3) The CharsetDecoder picks another charset, excluding the first one.

Through these steps, this case can be handled.

4. For the second QR code, please look at the bottom of the QR code image.
You can see http://qrcode.sourceforge.jp/ there,
and I found the second QR code on that site.
(I erased the image to the right of the second QR code to prevent confusion.)
I think the second QR code was generated by them.
They say their library is widely used in diverse areas.

Please let me know if I have misunderstood.
Thanks a lot.


zx...@googlecode.com

Jul 7, 2010, 7:50:10 PM
to zx...@googlegroups.com

Comment #10 on issue 467 by sro...@gmail.com: Improving the charset
guessing method for the QR code decoder.
http://code.google.com/p/zxing/issues/detail?id=467

No, this is exactly my point: some byte sequences are valid encodings of a
string in several encodings. Guessing by just taking the first one that
isn't broken does not work. That is what my example is showing.

zx...@googlecode.com

Jul 7, 2010, 11:03:22 PM
to zx...@googlegroups.com

Comment #11 on issue 467 by sangkkim78: Improving the charset guessing

1. I understand what you are saying.
The validator step can't handle your example.
I agree.

What I expect is that the validator step reduces mis-guessing by
catching invalid characters.
I don't expect a perfect solution that covers your example.
(Actually, that's very hard.)

2. To handle your example, I have an idea.

The problem bytes between ISO8859-1 and SHIFT-JIS are those from 0xC0 to
0xDF.

In ISO8859-1, they are mostly vowels. (Some consonants are included.)
In SHIFT-JIS, they are Katakana, which makes no distinction between
consonants and vowels.

Therefore, if they appear three or more times in a row, they are more
likely to be Katakana (SHIFT-JIS) than ISO8859-1.
If not, they may be ISO8859-1.

And if they are ISO8859-1, they generally appear together with other
consonants.
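
A minimal sketch of this run-length idea, under stated assumptions: the class KatakanaRunHeuristic and its methods are hypothetical, the threshold is left as a parameter because the right value is not established in this thread, and the check ignores every other signal the existing heuristic uses.

```java
// Hypothetical heuristic; not part of ZXing.
final class KatakanaRunHeuristic {
  private KatakanaRunHeuristic() {}

  /** Length of the longest run of consecutive bytes in the range 0xC0..0xDF. */
  static int longestAmbiguousRun(byte[] bytes) {
    int longest = 0;
    int current = 0;
    for (byte b : bytes) {
      int value = b & 0xFF; // treat the byte as unsigned
      if (value >= 0xC0 && value <= 0xDF) {
        current++;
        if (current > longest) {
          longest = current;
        }
      } else {
        current = 0;
      }
    }
    return longest;
  }

  /** Leans towards Shift_JIS when a run of at least minRun ambiguous bytes is present. */
  static String guess(byte[] bytes, int minRun) {
    return longestAmbiguousRun(bytes) >= minRun ? "Shift_JIS" : "ISO-8859-1";
  }
}
```

With a threshold of 3, the bytes 4D DC 4C 4C 45 52 from the earlier example contain a run of only one ambiguous byte and are classified as ISO-8859-1, while a sequence such as DD C6 C1 CA is classified as Shift_JIS.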

I have seen similar logic in your code.
I think this idea can enhance your charset guessing method.

(Of course, it will be very tough to implement this idea.)

Thank you.


zx...@googlecode.com

Nov 3, 2010, 10:53:22 AM
to zx...@googlegroups.com

Comment #12 on issue 467 by k.gnanag...@greatinnovus.com: Improving the
charset guessing method for the QR code decoder.
http://code.google.com/p/zxing/issues/detail?id=467

Hi,

I need your help.

We are using zxing-1.6 for an online QR code decoder.

Here are the sample paths for the online decoder using zxing-1.6.

1. This QR code was created from English text:

http://demo.greatinnovus.com/zxing-1.6a/decode1.php

2. This QR code was created from Korean text:

http://demo.greatinnovus.com/zxing-1.6a/decode2.php

QR codes:
http://demo.greatinnovus.com/zxing-1.6a/qrgis.png
http://demo.greatinnovus.com/zxing-1.6a/korean.png

When I open these URLs,
the decoded text for the QR code is correct at the first URL.
But at the second URL, the decoded text for the Korean QR code consists
only of question mark (?) symbols.

Here is the command I run:

java -cp javase/javase.jar:core/core.jar
com.google.zxing.client.j2se.CommandLineRunner qrgis.png

What is the solution to this issue?

Thanks in advance.
