Issue 108 in zxing: core decoder for Data Matrix should return byte[], not Unicode string.


codesite...@google.com

Nov 13, 2008, 11:19:58 AM
to zx...@googlegroups.com
Issue 108: core decoder for Data Matrix should return byte[], not Unicode
string.
http://code.google.com/p/zxing/issues/detail?id=108

New issue report by sanfordsquires:
What steps will reproduce the problem?
1. Look at the DecodeBitStreamParser.java source code, line 76 et seq.

What is the expected output? What do you see instead?
The Data Matrix standard defines a DM code as a simple binary/ASCII byte
carrier, not a Unicode content carrier. The current zxing Java source
code, however, accumulates the decoded stream of bytes into a Unicode
StringBuffer, possibly introducing unwarranted character translations
depending on JVM internals, and thus potentially corrupting binary byte
content. The data type would be better kept as a simple byte[] rather
than a StringBuffer.

Interpretation of a binary byte[] as a Java Unicode string can be done at
a later point by the application's processing, rather than here in the
core decoder processing. The conversion is simply a call like
'new String(byteArray)', with several optional additional parameters that
allow application control of how the translation is done. Note that
letting the application do the translation allows easy extraction of
specific subsets of bytes for translation to Unicode, while keeping other
bytes as binary entities or translating them some other way (e.g. as
Integer, Short, or Hex format).
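
For illustration only (this is not code from zxing): a minimal Java sketch of the application-side interpretation described above, assuming the decoder hands back a plain byte[]. The class name, the method names, and the payload layout (a 2-byte big-endian short followed by ASCII text) are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: different "views" of the same decoded payload.
final class PayloadViews {
    // Interpret bytes [offset, offset + length) as text in an explicitly chosen charset.
    static String textSlice(byte[] payload, int offset, int length) {
        return new String(payload, offset, length, StandardCharsets.US_ASCII);
    }

    // Interpret the two bytes starting at offset as a big-endian short.
    static short shortAt(byte[] payload, int offset) {
        return ByteBuffer.wrap(payload, offset, 2).order(ByteOrder.BIG_ENDIAN).getShort();
    }

    // Render every byte as two upper-case hex digits, with no translation at all.
    static String hex(byte[] payload) {
        StringBuilder sb = new StringBuilder();
        for (byte b : payload) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }
}
```

The point of the sketch is only that the charset and the byte ranges are chosen by the caller, not by the decoder.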

What version of the product are you using? On what operating system?
Current source code.

Please provide any additional information below.
Change requires very little effort in zxing, and will save effort and
possible insidious bugs for calling applications.

Issue attributes:
Status: New
Owner: ----
Labels: Type-Defect Priority-Medium


codesite...@google.com

Nov 13, 2008, 12:24:35 PM
to zx...@googlegroups.com

Comment #1 by srowen:
The payloads of QR Codes and Datamatrix codes are bytes, yes. The interpretation of
these bytes is almost always as text. The QR Code spec implies that even in 'byte'
mode, the content should be interpreted as an ISO88591 string. This is of course by
far the primary use case, to encode strings. So, I would have to strongly disagree
that it is not intended to carry text (not necessarily UCS-2 Unicode, no, but a
string, yes), since that is in fact precisely what they were designed for.

But more importantly it is not really possible to defer the interpretation of the
bytes either. They have no meaning per se; interpretation depends on the format that
was used. For instance a QR Code that only has a byte payload still contains a short
"byte mode" header. It does not seem reasonable to couple other parts of the client
downstream to knowledge about how to interpret this byte stream.

But all that said I 100% agree, no reason not to give access to the raw bytes. The
Result object already does this, see the 'raw bytes' property. Already done!


Issue attribute updates:
Status: WontFix
Owner: srowen

codesite...@google.com

Nov 14, 2008, 7:38:54 AM
to zx...@googlegroups.com

Comment #2 by sanfordsquires:
The 'raw bytes' returned by DecoderResult.getRawBytes() are bytes prior to applying
decompression. Decompression is done by DecodeBitStreamParser.decode(). So that
isn't it ... but adding a simple new public method 'getBytes()' would suffice for
now. All the new method needs to do is 'return text.getBytes()'.

... and you may want to fix the bug on line 426 of DecodeBitStreamParser.java that
currently produces incorrect decodes for all binary compression occurrences. The
line should pass the loop index 'i' instead of the byte count in the call to
unrandomize255State().
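
To make the reported bug concrete, here is a hedged Java sketch of a 255-state randomising step as I understand it from the Data Matrix specification. It is not the zxing source: the constants (149, 255, 256) are recalled from the spec, and the class name, the `randomize255State` counterpart and the `unrandomizeRun` helper with its `firstPosition` parameter are hypothetical. The pseudo-random offset depends on each codeword's own position, which is why passing a fixed value such as the byte count undoes the randomisation incorrectly.

```java
// Sketch of Base 256 randomising/unrandomising; names are illustrative only.
final class Base256Randomizer {
    // Encoder side: add a position-dependent offset to one codeword (0..255).
    static int randomize255State(int codeword, int position) {
        int pseudoRandom = ((149 * position) % 255) + 1;
        int temp = codeword + pseudoRandom;
        return temp <= 255 ? temp : temp - 256;
    }

    // Decoder side: subtract the same position-dependent offset again.
    static int unrandomize255State(int randomized, int position) {
        int pseudoRandom = ((149 * position) % 255) + 1;
        int temp = randomized - pseudoRandom;
        return temp >= 0 ? temp : temp + 256;
    }

    // Unrandomise a run of codewords whose first codeword sits at firstPosition.
    static byte[] unrandomizeRun(int[] randomized, int firstPosition) {
        byte[] out = new byte[randomized.length];
        for (int i = 0; i < randomized.length; i++) {
            // The position must advance with the loop; a constant here is the bug described above.
            out[i] = (byte) unrandomize255State(randomized[i], firstPosition + i);
        }
        return out;
    }
}
```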

codesite...@google.com

Nov 14, 2008, 7:48:55 AM
to zx...@googlegroups.com

Comment #3 by sanfordsquires:
Clarification: the new method 'getBytes()' would be in DecoderResult.

codesite...@google.com

Nov 14, 2008, 7:52:56 AM
to zx...@googlegroups.com

Comment #4 by srowen:
Yes you are right, but that is the only meaningful 'raw bytes' available
from the
barcode.

I do think there is a good point in here... say you really want to encode
binary data
in a QR Code. You would use one segment, in byte mode, which is a bit of a
hack
despite its name since it is really "text encoded as ISO88591, sort of"
mode. Not all
bytes are even legal in this mode. So from that perspective the spec
doesn't have any
real support for binary data.

But, assume the encoder/decoder overlook that last technicality. Seems like
it would
be nice to be able to get, directly, the actual bytes used in byte segments
only. I
agree.

I need to go back and read the Datamatrix spec to review how Datamatrix
behaves in
this regard. I can see a potential argument for a variation on what you are
proposing.

If the QR Code really was just one byte segment, then calling getBytes("ISO88591") on
the resulting String should give you those bytes back ("should" because this
technically shouldn't work out, but probably does in practice). But otherwise, the
result of getBytes() isn't meaningful -- it does not necessarily have anything to do
with the bytes in the QR Code. Hence I don't think the result of getBytes() should be
exposed in the Result object.
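
A small Java sketch of the round trip discussed above, for illustration only. It relies on the behaviour of Java's own ISO-8859-1 charset, which maps each of the 256 byte values to the char with the same numeric value; whether that matches the letter of the barcode specs is a separate question. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical check: does byte[] -> String -> byte[] through ISO-8859-1 preserve the bytes?
final class Latin1RoundTrip {
    static boolean survivesLatin1(byte[] original) {
        String asText = new String(original, StandardCharsets.ISO_8859_1);
        return Arrays.equals(original, asText.getBytes(StandardCharsets.ISO_8859_1));
    }

    // All 256 possible byte values, in order.
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    // A String that did not come from a single byte segment may not be encodable at all:
    // characters outside ISO-8859-1 are replaced with '?' (0x3F) by String.getBytes.
    static byte[] encodeLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }
}
```

The second half is the "otherwise" case: once the text contains anything that was not a single ISO-8859-1 byte, re-encoding it no longer says anything about the bytes in the symbol.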

I'll also take a look at the bug you mention, I am guessing you are right
though I
have never looked at this part of the algorithm. Thanks for that.

codesite...@google.com

Nov 15, 2008, 8:51:41 AM
to zx...@googlegroups.com

Comment #5 by srowen:
I fixed the bug, good catch.

I also added a new ResultMetadataType called BYTE_SEGMENTS. You can look for this in the Result metadata --
if there were byte-mode segments in a QR Code or Datamatrix code, it will contain their raw bytes. This is a
step better than looking at the raw bytes, which are already available, since these are just the bytes from the byte
segments.
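
A hedged sketch of how a caller might consume such byte segments. It assumes the metadata value is a list of byte arrays, one per byte-mode segment; that shape is an assumption here, not something stated in this thread, and the helper below takes the list as a plain parameter instead of calling the zxing API. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.List;

// Hypothetical helper: concatenate byte-mode segments into one payload.
final class ByteSegments {
    // In an application, `segments` would be whatever the Result metadata holds
    // under BYTE_SEGMENTS (assumed here to be a List<byte[]>); null means "none".
    static byte[] join(List<byte[]> segments) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (segments != null) {
            for (byte[] segment : segments) {
                out.write(segment, 0, segment.length);
            }
        }
        return out.toByteArray();
    }
}
```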


Issue attribute updates:
Status: Fixed

codesite...@google.com

Nov 15, 2008, 9:39:58 AM
to zx...@googlegroups.com

Comment #6 by sanfordsquires:
Huh!? It seems like you expect the method used to compress a message that is put
into a QR or DM code to carry semantic significance when the barcode is decoded...
I've not seen that before... and I'm curious whether that's something that Zxing has
decided to implement or whether I've missed something recent in barcode standards, or ...?

Simple hypothetical example of the issue:

Suppose my application wants to simply embed a common 32-bit binary integer into a
barcode, and my application decides that it will pass four bytes (say, little-endian)
to an encoder to create the DataMatrix code (or QR).

If my application encodes the integer 0x0001, it's likely the encoder will simply
use text encoding rather than binary encoding (since 0x00 and 0x01 are valid as
normal ASCII characters in Data Matrix), since text encoding gives a shorter binary
message length than encoding the bytes in Data Matrix's binary format.

In other words, the encoder is under no obligation to maintain any semantics of what
my application meant by the four-byte message. The encoder doesn't know whether my
application thinks of those four bytes as a binary integer in little-endian, an ASCII
message, or even, say, two 16-bit integers in big-endian format. The semantics of
the message are known only to the application, and are not carried in the barcode.
The method used by the barcode encoder to compress the message is not normally
something that should be visible outside the decoder (except possibly to testing
software). Or am I missing something here?
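
The hypothetical above can be sketched in Java: the same four bytes read three different ways, with nothing in the bytes themselves saying which reading was intended. The class and method names are made up for this illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Three application-level readings of one four-byte message.
final class FourBytes {
    // One 32-bit integer, least significant byte first.
    static int asLittleEndianInt(byte[] b) {
        return ByteBuffer.wrap(b).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    // Two 16-bit integers, most significant byte first.
    static short[] asTwoBigEndianShorts(byte[] b) {
        ByteBuffer buf = ByteBuffer.wrap(b).order(ByteOrder.BIG_ENDIAN);
        return new short[] {buf.getShort(), buf.getShort()};
    }

    // Four ASCII characters (control characters included).
    static String asAscii(byte[] b) {
        return new String(b, StandardCharsets.US_ASCII);
    }
}
```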

codesite...@google.com

Nov 15, 2008, 10:00:00 AM
to zx...@googlegroups.com

Comment #7 by srowen:
Here's an example that may illustrate my thinking. Let's say you want to
encode UTF-8 text in a QR Code. I've
seen people do it, and it's not clearly within or outside the spec. You
encode the text as bytes in UTF-8, then
put them into a 'byte mode' segment for the QR Code. The resulting stream
of bytes is

[byte mode header] [ UTF-8-encoded text ... ] [terminator header]

Right now, the Result from decoding would return from getText() the String
representing the text that was
encoded. It would return from getRawBytes() the entire byte stream above.

I had interpreted your suggestion to mean, please return the result of
getText().getBytes(encoding) in the API
-- to retrieve the original bytes in the byte segment above. This begins to
uncover all kinds of problems.
Which encoding? If the intent is to support applications that actually just
stuck in binary data, then we'd have
to assume they used ISO88591 (or another character set with a 1-1 mapping
between characters and bytes
for values 0-255... so this is even ambiguous, and not even 100% true for
ISO88591). If I decode the text as
ISO88591, I do not get the original bytes. It doesn't necessarily work at
all since the String is Unicode. The list
of problems goes on from there.

The raw bytes above do contain the bytes of the byte segment that you are interested in. So that's why I said
"it already exists". But then you would have to re-parse the stream above to pick out the byte segments. That
seems less than ideal.
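
To show what "re-parsing the stream" would involve, here is a deliberately limited Java sketch, not the zxing parser. It assumes the input is the error-corrected data bit stream of a small QR Code (versions 1 to 9, where the byte-mode character count is 8 bits wide), that the byte-mode indicator is 0100 and the terminator is 0000, and that no other modes occur; anything else is rejected. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: pick byte-mode segments out of a raw QR data bit stream.
final class QrByteSegmentSketch {
    private final byte[] data;
    private int bitPos = 0;

    private QrByteSegmentSketch(byte[] data) {
        this.data = data;
    }

    private int available() {
        return data.length * 8 - bitPos;
    }

    // Read n bits (n <= 16 here), most significant bit first.
    private int readBits(int n) {
        int value = 0;
        for (int i = 0; i < n; i++) {
            int bit = (data[bitPos >> 3] >> (7 - (bitPos & 7))) & 1;
            value = (value << 1) | bit;
            bitPos++;
        }
        return value;
    }

    static List<byte[]> byteSegments(byte[] rawDataCodewords) {
        QrByteSegmentSketch in = new QrByteSegmentSketch(rawDataCodewords);
        List<byte[]> segments = new ArrayList<>();
        while (in.available() >= 4) {
            int mode = in.readBits(4);
            if (mode == 0x0) {
                break; // terminator: the rest is padding
            }
            if (mode != 0x4) {
                throw new UnsupportedOperationException("only byte mode is handled in this sketch");
            }
            if (in.available() < 8) {
                throw new IllegalArgumentException("truncated character count");
            }
            int count = in.readBits(8); // 8-bit count: assumption valid for versions 1-9 only
            if (in.available() < count * 8) {
                throw new IllegalArgumentException("truncated byte segment");
            }
            byte[] segment = new byte[count];
            for (int i = 0; i < count; i++) {
                segment[i] = (byte) in.readBits(8);
            }
            segments.add(segment);
        }
        return segments;
    }
}
```

Even this toy version has to know the mode indicators and count widths, which is the coupling to format details that the paragraph above calls less than ideal.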

The same thing generally goes for Datamatrix though there the Base-256
encodation seems legitimately
appropriate for binary data, though it does still suggest this is to be
interpreted as text encoded as ISO88591.

I am not suggesting the entire byte stream above has any sensible
interpretation other than data encoded
according to the QR Code or Datamatrix spec, no.

I am acknowledging the use case you describe, where you might want to get
the raw bytes from just the byte
segment above -- that is, without the headers. That is what I have
implemented.


In your example, to be semantically correct, you need to use byte mode in
QR Code or Base-256 in
Datamatrix. You / the encoder need to encode what you mean. If you do not
intend this to be interpreted as
text, and the encoder interprets it as text, and encodes as ASCII, it is no
surprise that you get the wrong thing
out: the decoder (correctly) interprets the encoded data as text.

If you have an encoder that is forcing you to treat all input as text, then
I suppose this illustrates my over-
arching point, that both these formats "really" operate on text, period.
They have ambiguous, half-baked
provision for carrying true binary data. Even using byte mode / Base-256,
the specs imply that the decoder
should impute an interpretation as text. But hopefully you are in a
position to use the encoder to encode
exactly what you intend.

(What the text, or bytes, mean is indeed up to the application. But the
question of text vs. bytes is relevant to
encoding/decoding.)

codesite...@google.com

Nov 16, 2008, 9:58:00 AM
to zx...@googlegroups.com

Comment #8 by sanfordsquires:
The Zxing application wants text input and output for QR and DM codes, because the
structure of messages and their interpretations that are supported in the
application are all based on text format. This is fine, and the design will clearly
fulfill the primary mission of ZXing to support e-commerce and Web access using this
text-based strategy.

It seems plausible that Zxing goals also include:

1. Zxing application support for use in all countries with any language.

2. Provide decoder support for any application, not just the Zxing
application -
that is, the decoder should support decoding of any valid QR or DM code,
not just
codes that carry payload that the Zxing application will recognize.

By providing a simple means to recover the bytes that were put into the code, the
decoder now supports #2, and folks like courier services that want to put mixed
binary and text information into codes, or folks that want encrypted code content, are
satisfied and will hop on the Zxing bandwagon (at least the decoder part, and help
in the push towards the goal of e-commerce and internet access from cell phones).

For #1, however, there's still a thorny problem (as we all recognize, but
from
different viewpoints.) The key is that #1 is only about the Zxing
application and
its ability to enable e-commerce and Web access globally.

Strawman solution of #1 - for discussion:

Suppose the Zxing application were to take the stance "Zxing always
interprets the
bytes of DM, QR, and other codes as UTF-8 text."

That would seem to solve #1 entirely, with no impact whatsoever on any other
applications.

That would seem to not require any major changes to existing Zxing
application
code. (? is that right?)

That could be implemented with a trivial change to the Zxing decoder that
translates
the StringBuffer containing bytes from the code into a Unicode string (by
specifying
UTF-8 as the byte interpretation.)

codesite...@google.com

Nov 16, 2008, 11:37:14 AM
to zx...@googlegroups.com

Comment #9 by srowen:
I am quite confused, since I think you are still operating under mistaken impressions of how Datamatrix and
QR Code work, which I tried to clarify above, and yet, I have also implemented exactly the functionality you
want, so I am not sure what the remaining issue is. Maybe the issue is that you're using a different copy of
the Datamatrix spec from 1997? We are using ISO/IEC 16022:2006, and using the ECC200 format (which is
Datamatrix as we know it today), not ECC100 or ECC140.

First, and most importantly, no, it is entirely incorrect to say it is this library's choice to treat the input and
output of QR Code and Datamatrix as strings rather than bytes. Both formats explicitly operate on character
data. This has absolutely nothing to do with project goals; it has everything to do with the specifications.

QR Code includes four modes: numeric, alphanumeric, kanji and "byte" mode.
The latter, however, is
supposed to be interpreted as the encoding of a string in ISO88591. Or
according to the encoding in force
from an earlier ECI segment. Datamatrix has a quite similar story - section
4.1.e says it quite clearly. Even its
"byte mode", Base-256 encodation, is supposed to be interpreted as ISO88591.

If we disagree here, I can go over the specs with you in more detail, but I
think you'll see what I mean by
reading them. Stop here unless you agree with the above.


Let me repeat: these symbologies *do not technically support binary data*.
There is no way to signal to a
reader that part of the payload is just uninterpreted bytes. If you
disagree, please show me in either of the
specs.

Therefore: a strictly correct API would not include any notion of "just the
raw data bytes" since there is no
such notion in the specifications. The API *already* let you get at the raw
QR Code or Datamatrix encoding,
which includes its internal headers and signal bits and so on, which is not
the same thing.

Again let me repeat: all of this means that no, it would be incorrect to implement what you are saying,
technically, according to the specifications.


All that said -- it is not such a gross hack to consider sticking binary
data in a "byte mode" segment of a QR
Code. In general this will be meaningless to readers, since they are
supposed to read it as text.

But presumably you are thinking of a specific application using custom readers -- these codes would never
be consumed by standard reader software. OK.

Strawman: no, this is completely wrong. The raw bytes of the code include
short headers (4-bit segment
identifiers) which have no relationship to UTF-8 text. No QR Code is
encoded this way. This would cause the
reader to fail 100% of the time!

Do you mean, assume that the bytes in the "byte mode" segments are UTF-8?
OK, but, they aren't! Read the
spec. They are ISO88591. If you did this, you would misread some text.
(Note that the decoder does try to
intelligently guess when someone has, in fact, used Shift_JIS or UTF-8
encoding in a byte mode segment. It's
wrong, but people do it, so we try to accommodate. But you certainly can't
assume UTF-8.)


But the thing that really bewilders me is, you already have access to the
raw bytes at two levels -- what are
you asking for then? You can get the complete raw QR Code or Datamatrix
encoding bytes from the existing
getRawBytes() method. You can get the raw bytes from byte mode segments
from the new mechanism I
added.

Please review my changes and the discussion above, and the specs, and then
let's talk more if you have questions.

codesite...@google.com

Nov 17, 2008, 7:58:26 AM
to zx...@googlegroups.com

Comment #10 by sanfordsquires:
Perhaps we could get on the same wavelength faster with a simple phone
conversation.

I do understand both QR and DM at a technical level. I've implemented several QR
and DM decoders targeted for cell phone usage in generic and proprietary
applications. I've also implemented encoders for both. So your technical
descriptions are fully understood, and seem perfectly correct...

One big difference in thinking is your belief that the method by which the contents
of a code are compressed (e.g. base-256 encodation vs. ASCII/ISO88591, etc.) matters
to the caller. I think this is a 'red herring' issue, and may not be important to
resolve for now... but to outline the difference, if I understand you correctly, you
believe that the means used to compress portions of the message in a QR or DM code:

1. has significance (semantic value) to the application calling a
decoder, and
2. has semantics that are preserved by most encoders.

My view is that the method by which the contents of a code is compressed is
like a
private method or field in Java ... an internal detail that should NOT be
visible
outside the encoder or decoder... and I have not seen existing commercial
encoders
and decoders that will guarantee support for #2... (but maybe they do
exist, and are
becoming much more common than I'm aware of...)

Unless there's more evidence to discuss on this topic, I don't think this
particular
difference in viewpoints needs to be resolved immediately.

Thinking about the issue, though, leads to the bigger question of how Zxing
will
cope with (support) various countries and languages (Greek, Russian,
Chinese, etc.)
that are problematic with ASCII/ISO8859-1. I'm sorry if I introduced more
misunderstanding by shifting the discussion to this bigger picture topic...

It probably would be constructive for someone to outline how Zxing will (or
won't?)
support the worldwide problem of character sets. If there is already a
strategy for
this, then discussion of the UTF-8 idea is not needed.

The UTF-8 idea depends on accepting a philosophy:

1. The decoder needs to support decoding of non-Zxing codes (i.e. codes that are not
UTF-8). By supporting byte[], you guarantee this, so no problem.

2. It is perfectly fine for Zxing (as an individual application) to
always
interpret code content as UTF-8. This is saying, if you have a
communications
channel (e.g. - a QR or DM code) that ensures transmission of any byte
pattern, it's
ok to use that channel to communicate whatever byte pattern you want, and
place
whatever interpretation semantics you want on that byte pattern.

Just because the standard talks about ASCII and ISO8859-1 doesn't require any
particular application to use that interpretation... it just requires a decoder to
respect and support that interpretation (if the decoder is going to support the
standard, and not be totally proprietary).

... But the important discussion is how Zxing will support worldwide character
sets, not whether a UTF-8 messaging idea is the right solution...

I am not asking for a change here. I'm seeking understanding of Zxing's goals,
viewpoints, and any constraints those imply. Perhaps I'm jumping the gun by throwing
out an idea to solve a problem that may already have been thought about and
resolved, and I just haven't happened to run across that discussion anywhere on the
Zxing site. If so, my apologies!

codesite...@google.com

Nov 17, 2008, 8:26:07 AM
to zx...@googlegroups.com

Comment #11 by sanfordsquires:
Oh - I see from issue 103 that you have thought about this issue and feel
it should
be addressed by the three byte prefix to indicate UTF-8, or possibly by
ECI...

So, no need for further discussion. That's the answer I was seeking, and I
should
have read 103, rather than just looked at the title and jumped to
conclusions with
the 'Won't fix' label... :-)

codesite...@google.com

Nov 17, 2008, 12:48:02 PM
to zx...@googlegroups.com

Comment #12 by srowen:
I think we're getting to the real central question, which is character
encoding issues.

About internal representation -- indeed, I generally agree, which is why
the API didn't even expose the raw
bytes until a few versions ago, and why I initially didn't want to add more
access to the internal encoding.

Do the bytes have semantic value that readers understand? the raw bytes do,
yes. The 4-bit mode header in
QR Code that says "following this is kanji-encoded stuff" -- yes, that has
meaning to the reader. Does the
Kanji itself have meaning to the reader? no. (Interpreting the contents as
a URL, for example, is something a
reader might do but is outside the scope of QR Code per se.) Do the bytes
in a byte mode have any meaning
to the reader? *yes*, insofar as the reader is supposed to construe them as
characters according to some
encoding.

And here we come to the real problem: what encoding? It is supposed to be
ISO-8859-1. What are we going
to do about other character sets? we're not making up standards, so the
real question is what does QR Code /
Datamatrix provide, and the answers aren't so great.

Yes, the answer they both provide is "use ECI indicators". The bad news is
that I cannot find a single reference
on which ECI values map to which encodings. Even the for-pay specification
for ECI *does not specify this!*
We reverse engineered and guessed a few values. We do support ECI, but
nobody really uses it.

What's worse, the specs seem to leave some wiggle room about whether you
really have to use ISO-8859-1 in
byte mode. In practice I can tell you that Japan puts Shift-JIS-encoded
text in byte mode segments in QR
Codes all the time. I see UTF-8 too. So in practice we try to guess the
encoding -- using clues like, yes, a
byte-order marker header for UTF-8. But this is less than ideal.
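
As an illustration of the kind of clue mentioned above, a minimal Java sketch that treats a leading UTF-8 byte-order mark (EF BB BF) as a signal and otherwise falls back to ISO-8859-1. This is not zxing's actual guessing logic, which is not shown in this thread; the class and method names are hypothetical, and a real heuristic would also need to consider Shift_JIS and BOM-less UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Sketch: BOM-based guess for the text encoding of one byte-mode segment.
final class CharsetGuessSketch {
    static boolean hasUtf8Bom(byte[] b) {
        return b.length >= 3
            && (b[0] & 0xFF) == 0xEF
            && (b[1] & 0xFF) == 0xBB
            && (b[2] & 0xFF) == 0xBF;
    }

    static String decode(byte[] segment) {
        if (hasUtf8Bom(segment)) {
            // Skip the three BOM bytes and decode the rest as UTF-8.
            return new String(segment, 3, segment.length - 3, StandardCharsets.UTF_8);
        }
        // No clue found: use the default interpretation.
        return new String(segment, StandardCharsets.ISO_8859_1);
    }
}
```

Without the marker, the same two bytes C3 A9 come out as two ISO-8859-1 characters instead of one accented letter, which is why guessing is described as less than ideal.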

I suppose the right answer is use ISO-8859-1 if you can, and if you
absolutely can't, at least include an ECI
segment.


Always assuming UTF-8 doesn't work because...
- Well, it's supposed to be ISO-8859-1, and reading as UTF-8 is only partly
compatible with that. Shift_JIS is
definitely not going to be read the same as UTF-8
- Not all sequences of bytes are valid encodings of a string in UTF-8. So
this outright fails in some cases.
- It is possible for two byte sequences to encode the same character string
in UTF-8. So reversing the
encoding might not give you the original.
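
The second and third points can be observed with plain Java, sketched here under the assumption of a current JDK (class and method names are hypothetical). A strict decoder rejects malformed input, while the lenient `new String(bytes, UTF_8)` silently substitutes U+FFFD, so different malformed inputs collapse to the same string and re-encoding does not return the original bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch: why arbitrary bytes cannot safely travel through a UTF-8 String.
final class Utf8Pitfalls {
    // Strict check: true only if the bytes are a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Lenient decode, then re-encode: malformed input becomes U+FFFD and is lost.
    static boolean survivesLenientRoundTrip(byte[] bytes) {
        String text = new String(bytes, StandardCharsets.UTF_8);
        return Arrays.equals(bytes, text.getBytes(StandardCharsets.UTF_8));
    }
}
```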

ISO-8859-1, the proper default, has only the first problem, but, if it's
binary data you're encoding, the
resulting interpretation as text is garbage anyhow. Readers are going to
read it as garbage. But, presumably
you are thinking of specialized readers which understand the interpretation
you wish them to apply somehow.

I guess to repeat a point -- yes, readers *do* need to interpret the
contents as text, according to the spec. I
don't think this affects your reasoning, or means that readers can't
provide additional info, of course.

So this is why I give you the raw contents of the byte mode segments
themselves. I think this plus giving you
access to the raw bytes of the whole QR Code payload, should enable you to
create any specialized reader
application you want from this library.


Issue attribute updates:
Labels: I

Message has been deleted

srowen

Nov 17, 2008, 5:06:16 PM
to zxing


On Nov 17, 7:33 pm, "sanfordsqui...@gmail.com"
<sanfordsqui...@gmail.com> wrote:
> The decoder should not prevent use of anyone's client by corrupting
> any message, although certain features in the standard may not (or
> may) be supported by Zxing.  For example, is FNC1 supported?  The
> early implementations of 2D codes and part of the standard were
> oriented towards transmission of the codes across a serial
> communications channel after decoding, so required adding bytes before
> the code content.  I would think the Zxing decoder and Zxing
> application have no need for those, but the Zxing decoder could
> legitimately support indicators of barcode type that would be useful
> to applications needing to support these transmission oriented bytes,
> etc.

Yeah we ignore FNC1. There is nothing meaningful to do with it -- it
doesn't form part of a printable string. We could (should?) at least
record their presence and report them to the caller but don't at the
moment. In practical terms, Datamatrix as used in the consumer/
commercial sphere does not use FNC1. You would be right in saying that
this project does not focus on industrial or specialized applications
of barcodes, no. That's why nobody has bothered to write, say, a
PDF417 decoder here.


> The standards have many optional features like 'structured append'
> which should not, in my opinion, be thought of as part of the decoder.
> Structured append is a way of stitching together the payload from
> several different barcode symbols into a single message.  I think if
> that's to be supported, it should be a client function, not a decoder
> function  - otherwise you'd have to support multiple image scans via a
> single call to the decoder, etc.  The decoder might (or might not)
> sense and indicate that the byte sequence it read contains a
> 'structured append' indicator...

Your point is that decoding a single QR Code symbol is one level of
functionality, while building more layers on top, like interpreting
the string that results from the decode, is another thing? Indeed I
agree. In the code the Reader abstraction is your "decoder" -- image
in, String (+ some other metadata) out. There is code to assess
whether the payload appears to be a URL, etc., but that is separated
out and called by client code only, like the Android client.


> For kanji, I would make that an issue for localization of the Zxing
> application for the Japanese market.  Pragmatically, the decoder only
> has to pass on the bytes to be interpreted (without destroying or
> changing the message) and might post an indicator flag saying 'kanji
> detected', without interpreting the bytes - leaving that to the
> localized client.  The localized client can be totally adapted to
> local conventions like you cite for Japan.

I do not follow this part of the argument. There are several steps in
decoding:

1) read the bits out of the code, unmask, error-correct, etc. to get a
byte stream
2) parse the byte stream into segments (e.g. byte mode header, count,
then bytes)
3) interpret the segments and produce the string they encode

These are all part of the spec. Why would the decoder do only part of
this? Parsing a QR Code entails all of them. The result of parsing is
*not* bytes. It is a string -- 3).

If you are saying, could I get the output of 1) or 2), then fine, I
have agreed with you already. The code lets you get these intermediate
results. So again, I am not following what else you want me to do? You
are free to ignore the decoder's result from 2) or 3) if you want.



>
> Opinion - I think you are thinking the wrong way in saying 'UTF-8
> doesn't work because... it's supposed to be ISO-8859-1'...  you are
> letting the tail wag the dog.  Zxing needs to support all character
> sets simply and easily.  Zxing is using a 'standardized' code format
> as the vehicle for carrying its messages.  It should 'respect' the
> standard, in that it should not encode the modules in a way that would
> cause other readers to crash, or create huge confusion in the
> marketplace, but it does not have to make its data payload conform to
> how everyone else might want to view it with an ISO-8859-1 default
> viewer.  I don't view any harm in interpreting bytes differently in
> the ZXing client.  Other clients are not affected by how Zxing uses
> codes, and Zxing is very open about how to view and interpret its
> codes.  You should think of Zxing as establishing an optional
> extension of the standard that has a lot of valuable, practical uses!

This doesn't make sense to me. I'm saying that one is supposed to
encode with ISO-8859-1 in byte mode -- therefore, believe me, lots of
QR Codes do. You accept that interpreting such an encodation as UTF-8
does not lead to the right character sequence in some cases? then, you
see at least one of the problems here?

To recap, there are more problems here. Interpreting the bytes of an
image file, say, as an ISO-8859-1-encoded string may result in
gibberish, but always results in a string. Not so for UTF-8. There is
a further problem with your suggestion to merely de-code the string to
get the bytes back, if the encoding is UTF-8 -- you are not guaranteed
to get the original bytes back.

Again, I think you are asking for functionality that already exists,
so I am getting really confused. You want the raw bytes of the QR
Code? also available. The raw bytes of the byte segments? available
already to you as a caller. Use them if you want. Ignore the string
interpretation if you like.

This renders your opinion about whether one can reasonably assume
UTF-8 encoding -- and the answer is definitely no -- moot, since you
are free to do something else with the bytes if you like.


> As for 'not all sequences of bytes are valid encodings of a string in
> UTF-8'.  I believe you are right, but the problem is easily dealt
> with, by a 'try-catch' statement in Java, and rejecting the code as a
> non-Zxing code.  No problem.

... but then I reject some of the very codes you want to encode? Not
good.
I could just ignore such byte segments... but a much easier solution
is to not assume UTF-8! doubly so since the spec says ISO-8859-1 is
right. There seems to be only disadvantage to assuming UTF-8. I am
really not following this.

>
> Yes, there's more than one way to encode the same (human interpreted)
> message into UTF-8.  If the client can only handle one way (non-
> economical to support all possible ways), then it's simply a matter of
> declaring how an encoder must encode the bytes, in order for the Zxing
> client to understand them.  The encoders I'm familiar with all have an
> option to allow you to specify any arbitrary byte message, so there
> shouldn't be many people who have a problem producing any particular
> UTF-8 message.  (That's not to say that some decoders don't want input
> in some particular character set, and make you do a lot of escape
> sequences for particular binary bytes, if you're typing in a one-off
> message!)  Again, not a hard problem, as I see it.

Don't follow. For some Unicode strings, the UTF-8 encoding is not
unique. The client just calls out to Java to decode bytes into a
Unicode string, so the issue is not what the client can handle. It's
that then re-encoding that string does not necessarily give you the
original bytes. Those several encodings are all valid, so how/why
would you declare some of them invalid?

What the QR Code encoder does is irrelevant. The issue is in your
proposed change to the decoder to produce the "original" bytes by re-
encoding the result string. That doesn't work, but, again, I have
already provided you direct access to the raw bytes anyway so this is
not needed.

> Bottom line, I would respect the standards and their character set
> requirements by a 'do no harm' philosophy, but I would not let the
> 'default' interpretation of ISO-8859-1 etc., stand in the way of a
> compelling need for Zxing to support all character sets worldwide.

See above, I think your arguments don't yet make full sense to me,
but, I also think they are moot as you have access to everything you
seem to want from this decoder.

Please answer clearly if we are to continue: what can you *not* do,
that you want to do, given the current state of the code? what bytes
do you want to get that you can't?