New issue report by sanfordsquires:
What steps will reproduce the problem?
1. Look at the DecodeBitStreamParser.java source code, line 76 et seq.
What is the expected output? What do you see instead?
The Data Matrix standard defines a DM code as a simple binary/ASCII byte
carrier, not a Unicode content carrier. The current zxing Java source
code, however, accumulates the decoded stream of bytes into a Unicode
StringBuffer, possibly introducing unwarranted character translations
depending on JVM internals, thus potentially corrupting binary byte
content. The data type would be better kept as a simple byte[], rather
than a StringBuffer.
Interpretation of a binary byte[] as a Java Unicode string can be done at
a later point by the application's processing, rather than here in the
core decoder processing. The conversion is simply a call like
'new String(byteArray)', with several optional additional parameters that
allow application control of how the translation is done. Note that
allowing the application to do the translation allows easy extraction of
specific subsets of bytes for translation to Unicode, while keeping other
bytes as binary entities or applying some other translation (e.g. as
Integer, Short, or Hex format).
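As an illustration of the point above, here is a minimal sketch of an application interpreting different parts of one decoded byte[] in different ways. The payload layout (a 4-byte big-endian integer followed by ISO-8859-1 text) and all class and method names are hypothetical, chosen only for this example. Note that 'new String(byteArray)' without a charset argument uses the platform default charset, so the sketch passes the charset explicitly.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical payload layout: bytes 0-3 are a big-endian int, the rest is text.
final class MixedPayloadExample {

    // Interpret the first four bytes as a binary integer.
    static int readHeader(byte[] payload) {
        return ByteBuffer.wrap(payload, 0, 4).order(ByteOrder.BIG_ENDIAN).getInt();
    }

    // Interpret the remaining bytes as text, with an explicit charset.
    static String readText(byte[] payload) {
        return new String(payload, 4, payload.length - 4, StandardCharsets.ISO_8859_1);
    }

    // Render the whole payload as lowercase hex, two digits per byte.
    static String toHex(byte[] payload) {
        StringBuilder sb = new StringBuilder();
        for (byte b : payload) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }
}
```

None of these interpretations is possible once the bytes have been turned into a String with an unknown charset, which is the concern raised in the report.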
What version of the product are you using? On what operating system?
Current source code.
Please provide any additional information below.
Change requires very little effort in zxing, and will save effort and
possible insidious bugs for calling applications.
Issue attributes:
Status: New
Owner: ----
Labels: Type-Defect Priority-Medium
Comment #1 by srowen:
The payload of QR Codes and Datamatrix codes is bytes, yes. The
interpretation of these bytes is almost always as text. The QR Code spec
implies that even in 'byte' mode, the content should be interpreted as an
ISO-8859-1 string. This is of course by far the primary use case, to
encode strings. So, I would have to strongly disagree that it is not
intended to carry text (not necessarily UCS-2 Unicode, no, but a string,
yes), since that is in fact precisely what they were designed for.
But more importantly it is not really possible to defer the interpretation
of the bytes either. They have no meaning per se; interpretation depends
on the format that was used. For instance a QR Code that only has a byte
payload still contains a short "byte mode" header. It does not seem
reasonable to couple other parts of the client downstream to knowledge
about how to interpret this byte stream.
But all that said I 100% agree, no reason not to give access to the raw
bytes. The Result object already does this, see the 'raw bytes' property.
Already done!
Issue attribute updates:
Status: WontFix
Owner: srowen
Comment #2 by sanfordsquires:
The 'raw bytes' returned by DecoderResult.getRawBytes() are bytes prior to
applying decompression. Decompression is done by
DecodeBitStreamParser.decode(). So that isn't it ... but adding a simple
new public method 'getBytes()' would suffice for now. All the new method
needs to do is 'return text.getBytes()'
... and you may want to fix the bug on line 426 of
DecodeBitStreamParser.java that currently produces incorrect decodes for
all binary compression occurrences. The line should pass the loop index
'i' instead of the byte count in the call to unrandomize255State().
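For readers unfamiliar with the step being discussed, the following is a sketch of the 255-state randomizing used for Data Matrix Base 256 codewords, as I understand it from ISO/IEC 16022. It is not the zxing source; the method name unrandomize255State comes from the comment above, while randomize255State and unrandomizeAll are hypothetical helpers added for the example. The pseudo-random offset depends on each codeword's own position, which is why passing one fixed value (such as the byte count) for every codeword would decode incorrectly.

```java
// Sketch of Base 256 (un)randomizing; positions are 1-based codeword positions.
final class Base256Randomizer {

    // Encoder side: add a position-dependent pseudo-random offset, modulo 256.
    static int randomize255State(int codeword, int position) {
        int pseudoRandom = ((149 * position) % 255) + 1;
        int temp = codeword + pseudoRandom;
        return temp <= 255 ? temp : temp - 256;
    }

    // Decoder side: subtract the same offset for the same position.
    static int unrandomize255State(int randomized, int position) {
        int pseudoRandom = ((149 * position) % 255) + 1;
        int temp = randomized - pseudoRandom;
        return temp >= 0 ? temp : temp + 256;
    }

    // Each codeword is unrandomized with its own position: firstPosition + i.
    static int[] unrandomizeAll(int[] randomized, int firstPosition) {
        int[] out = new int[randomized.length];
        for (int i = 0; i < randomized.length; i++) {
            out[i] = unrandomize255State(randomized[i], firstPosition + i);
        }
        return out;
    }
}
```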
Comment #3 by sanfordsquires:
Clarification: the new method 'getBytes()' would be in DecoderResult.
Comment #4 by srowen:
Yes you are right, but that is the only meaningful 'raw bytes' available
from the barcode.
I do think there is a good point in here... say you really want to encode
binary data in a QR Code. You would use one segment, in byte mode, which
is a bit of a hack despite its name since it is really "text encoded as
ISO-8859-1, sort of" mode. Not all bytes are even legal in this mode. So
from that perspective the spec doesn't have any real support for binary
data.
But, assume the encoder/decoder overlook that last technicality. Seems
like it would be nice to be able to get, directly, the actual bytes used
in byte segments only. I agree.
I need to go back and read the Datamatrix spec to review how Datamatrix
behaves in this regard. I can see a potential argument for a variation on
what you are proposing.
If the QR Code really was just one byte segment, then calling
getBytes("ISO-8859-1") on the resulting String should give you those bytes
back ("should" because this technically shouldn't work out, but probably
does in practice). But otherwise, the result of getBytes() isn't
meaningful -- it does not necessarily have anything to do with the bytes
in the QR Code. Hence I don't think the result of getBytes() should be
exposed in the Result object.
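A small sketch of the round trip described above, using only the JDK. In Java's ISO-8859-1 charset every byte value 0-255 maps to the char with the same numeric value, so decoding and re-encoding returns the original bytes. The class and method names are hypothetical, and this says nothing about whether a given barcode's content was a single byte segment in the first place.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Latin1RoundTrip {

    // Builds an array holding every possible byte value once.
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    // Decodes bytes as ISO-8859-1 text, re-encodes, and compares with the input.
    static boolean survivesRoundTrip(byte[] original) {
        String text = new String(original, StandardCharsets.ISO_8859_1);
        byte[] back = text.getBytes(StandardCharsets.ISO_8859_1);
        return Arrays.equals(original, back);
    }
}
```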
I'll also take a look at the bug you mention. I am guessing you are right,
though I have never looked at this part of the algorithm. Thanks for that.
Comment #5 by srowen:
I fixed the bug, good catch.
I also added a new ResultMetadataType called BYTE_SEGMENTS. You can look
for this in the Result metadata -- if there were byte-mode segments in a
QR Code or Datamatrix code, it will contain their raw bytes. This is a
step better than looking at the raw bytes, which are already available,
since these are just the bytes from the byte segments.
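A caller who wants one contiguous byte[] from several byte-mode segments could join them as sketched below. This assumes the metadata value is delivered as a list of byte arrays, one per segment; the thread does not state the type, so treat that as an assumption. The class and method names are hypothetical, and the sketch deliberately avoids calling any zxing API.

```java
import java.io.ByteArrayOutputStream;
import java.util.List;

final class ByteSegments {

    // Concatenates the segments in order into a single array.
    static byte[] concatenate(List<byte[]> segments) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] segment : segments) {
            out.write(segment, 0, segment.length);
        }
        return out.toByteArray();
    }
}
```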
Issue attribute updates:
Status: Fixed
Comment #6 by sanfordsquires:
Huh!? It seems like you expect that the method used to compress a message
to be put into a QR or DM code carries semantic significance when the
barcode is decoded... I've not seen that before... and I'm curious if
that's something that Zxing has decided to implement or if I've missed
something recent in barcode standards, or ...?
Simple hypothetical example of the issue:
Suppose my application wants to simply embed a common 32-bit binary
integer into a barcode, and my application decides that it will pass four
bytes (say, little-endian) to an encoder to create the DataMatrix code (or
QR).
If my application encodes the integer 0x0001, it's likely the encoder will
simply use text encoding rather than binary encoding (since 0x00 and 0x01
are valid as normal ASCII characters in Data Matrix), since text encoding
gives a shorter binary message length than encoding the bytes in Data
Matrix's binary format.
In other words, the encoder is under no obligation to maintain any
semantics of what my application meant by the four-byte message. The
encoder doesn't know whether my application thinks of those four bytes as
a binary integer in little-endian, an ASCII message, or even, say, two
16-bit integers in big-endian format. The semantics of the message are
known only to the application, and are not carried in the barcode.
The method used by the barcode encoder to compress the message is not
normally something that should be visible outside the decoder (except
possibly to testing software). Or am I missing something here?
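The hypothetical example above can be made concrete with java.nio.ByteBuffer. The sketch shows the same four bytes read as one little-endian 32-bit integer or as two big-endian 16-bit integers; nothing in the bytes themselves says which reading is intended. Class and method names are made up for the illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class IntPayload {

    // Application side: serialize a 32-bit integer as four little-endian bytes.
    static byte[] toLittleEndian(int value) {
        return ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(value).array();
    }

    // One possible reading: a single little-endian 32-bit integer.
    static int fromLittleEndian(byte[] bytes) {
        return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    // Another equally valid reading: two big-endian 16-bit integers.
    static short[] asTwoBigEndianShorts(byte[] bytes) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN);
        return new short[] {buffer.getShort(), buffer.getShort()};
    }
}
```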
Comment #7 by srowen:
Here's an example that may illustrate my thinking. Let's say you want to
encode UTF-8 text in a QR Code. I've seen people do it, and it's not
clearly within or outside the spec. You encode the text as bytes in UTF-8,
then put them into a 'byte mode' segment for the QR Code. The resulting
stream of bytes is
[byte mode header] [ UTF-8-encoded text ... ] [terminator header]
Right now, the Result from decoding would return from getText() the String
representing the text that was encoded. It would return from getRawBytes()
the entire byte stream above.
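To make the layout above concrete, here is a minimal sketch that packs and unpacks a single QR Code byte-mode segment: a 4-bit mode indicator (0100), an 8-bit character count (the width used by symbol versions 1-9), the data bytes, and a 4-bit terminator (0000). It ignores padding codewords, error correction and multi-segment streams, and the class and method names are hypothetical rather than part of zxing.

```java
final class ByteModeStream {

    // Packs [0100][8-bit count][data][0000]; 16 + 8n bits fit exactly in n + 2 bytes.
    static byte[] build(byte[] data) {
        if (data.length > 255) {
            throw new IllegalArgumentException("sketch supports at most 255 data bytes");
        }
        byte[] out = new byte[data.length + 2];
        int bitPos = writeBits(out, 0, 0b0100, 4);      // byte-mode indicator
        bitPos = writeBits(out, bitPos, data.length, 8); // character count
        for (byte b : data) {
            bitPos = writeBits(out, bitPos, b & 0xFF, 8);
        }
        writeBits(out, bitPos, 0b0000, 4);               // terminator
        return out;
    }

    // Recovers only the data bytes, skipping the header bits.
    static byte[] extractByteSegment(byte[] raw) {
        if (readBits(raw, 0, 4) != 0b0100) {
            throw new IllegalArgumentException("not a byte-mode segment");
        }
        int count = readBits(raw, 4, 8);
        byte[] data = new byte[count];
        for (int i = 0; i < count; i++) {
            data[i] = (byte) readBits(raw, 12 + 8 * i, 8);
        }
        return data;
    }

    // Writes numBits of value, most significant bit first; returns the new bit position.
    private static int writeBits(byte[] out, int bitPos, int value, int numBits) {
        for (int i = numBits - 1; i >= 0; i--) {
            if (((value >> i) & 1) != 0) {
                out[bitPos >> 3] |= (byte) (0x80 >> (bitPos & 7));
            }
            bitPos++;
        }
        return bitPos;
    }

    // Reads numBits starting at bitPos, most significant bit first.
    private static int readBits(byte[] in, int bitPos, int numBits) {
        int value = 0;
        for (int i = 0; i < numBits; i++) {
            int bit = (in[bitPos >> 3] >> (7 - (bitPos & 7))) & 1;
            value = (value << 1) | bit;
            bitPos++;
        }
        return value;
    }
}
```

Because the header is 12 bits long, the data bytes are not byte-aligned inside the raw stream, which is why the raw stream has to be re-parsed to recover them.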
I had interpreted your suggestion to mean, please return the result of
getText().getBytes(encoding) in the API -- to retrieve the original bytes
in the byte segment above. This begins to uncover all kinds of problems.
Which encoding? If the intent is to support applications that actually
just stuck in binary data, then we'd have to assume they used ISO-8859-1
(or another character set with a 1-1 mapping between characters and bytes
for values 0-255... so this is even ambiguous, and not even 100% true for
ISO-8859-1). If I decode the text as ISO-8859-1, I do not get the original
bytes. It doesn't necessarily work at all since the String is Unicode. The
list of problems goes on from there.
The raw bytes above do contain the bytes of the byte segment that you are
interested in. So that's why I said "it already exists". But then you
would have to re-parse the stream above to pick out the byte segments.
That seems less than ideal.
The same thing generally goes for Datamatrix, though there the Base-256
encodation seems legitimately appropriate for binary data, though it does
still suggest this is to be interpreted as text encoded as ISO-8859-1.
I am not suggesting the entire byte stream above has any sensible
interpretation other than data encoded according to the QR Code or
Datamatrix spec, no.
I am acknowledging the use case you describe, where you might want to get
the raw bytes from just the byte segment above -- that is, without the
headers. That is what I have implemented.
In your example, to be semantically correct, you need to use byte mode in
QR Code or Base-256 in Datamatrix. You / the encoder need to encode what
you mean. If you do not intend this to be interpreted as text, and the
encoder interprets it as text, and encodes as ASCII, it is no surprise
that you get the wrong thing out: the decoder (correctly) interprets the
encoded data as text.
If you have an encoder that is forcing you to treat all input as text,
then I suppose this illustrates my overarching point, that both these
formats "really" operate on text, period. They have ambiguous, half-baked
provision for carrying true binary data. Even using byte mode / Base-256,
the specs imply that the decoder should impute an interpretation as text.
But hopefully you are in a position to use the encoder to encode exactly
what you intend.
(What the text, or bytes, mean is indeed up to the application. But the
question of text vs. bytes is relevant to encoding/decoding.)
Comment #8 by sanfordsquires:
The Zxing application wants text input and output for QR and DM codes,
because the structure of messages and their interpretations that are
supported in the application are all based on text format. This is fine,
and the design will clearly fulfill the primary mission of ZXing to
support e-commerce and Web access using this text based strategy.
It seems plausible that Zxing goals also include:
1. Zxing application support for use in all countries with any language.
2. Provide decoder support for any application, not just the Zxing
application - that is, the decoder should support decoding of any valid QR
or DM code, not just codes that carry payload that the Zxing application
will recognize.
By providing a simple means to recover the bytes that were put into the
code, the decoder now supports #2, and folks like courier services that
want to put mixed binary and text information into codes or folks that
want encrypted code content are satisfied and will hop on the Zxing
bandwagon (at least the decoder part, and help in the push towards the
goal of e-commerce and internet access from cell phones).
For #1, however, there's still a thorny problem (as we all recognize, but
from different viewpoints). The key is that #1 is only about the Zxing
application and its ability to enable e-commerce and Web access globally.
Strawman solution of #1 - for discussion:
Suppose the Zxing application were to take the stance "Zxing always
interprets the bytes of DM, QR, and other codes as UTF-8 text."
That would seem to solve #1 entirely, with no impact whatsoever on any
other applications.
That would seem to not require any major changes to existing Zxing
application code. (? is that right?)
That could be implemented with a trivial change to the Zxing decoder that
translates the StringBuffer containing bytes from the code into a Unicode
string (by specifying UTF-8 as the byte interpretation).
Comment #9 by srowen:
I am quite confused, since I think you are still operating under mistaken
impressions of how Datamatrix and QR Code work, which I tried to clarify
above, and yet, I have also implemented exactly the functionality you
want, so I am not sure what the remaining issue is? Maybe the issue is
that you're using a different copy of the Datamatrix spec from 1997? We
are using ISO/IEC 16022:2006, and using the ECC200 format (which is
Datamatrix as we know it today), not ECC100 or ECC140.
First, and most importantly, no, it is entirely incorrect to say it is
this library's choice to treat the input and output of QR Code and
Datamatrix as strings rather than bytes. Both formats explicitly operate
on character data. This has absolutely nothing to do with project goals;
it has everything to do with the specifications.
QR Code includes four modes: numeric, alphanumeric, kanji and "byte" mode.
The latter, however, is supposed to be interpreted as the encoding of a
string in ISO-8859-1, or according to the encoding in force from an
earlier ECI segment. Datamatrix has a quite similar story - section 4.1.e
says it quite clearly. Even its "byte mode", Base-256 encodation, is
supposed to be interpreted as ISO-8859-1.
If we disagree here, I can go over the specs with you in more detail, but
I think you'll see what I mean by reading them. Stop here unless you agree
with the above.
Let me repeat: these symbologies *do not technically support binary data*.
There is no way to signal to a reader that part of the payload is just
uninterpreted bytes. If you disagree, please show me in either of the
specs.
Therefore: a strictly correct API would not include any notion of "just
the raw data bytes" since there is no such notion in the specifications.
The API *already* lets you get at the raw QR Code or Datamatrix encoding,
which includes its internal headers and signal bits and so on, which is
not the same thing.
Again let me repeat: all of this means that no, it would be incorrect to
implement what you are saying, technically, according to the
specifications.
All that said -- it is not such a gross hack to consider sticking binary
data in a "byte mode" segment of a QR Code. In general this will be
meaningless to readers, since they are supposed to read it as text.
But presumably you are thinking of a specific application using custom
readers -- these codes would never be consumed by standard reader
software. OK.
Strawman: no, this is completely wrong. The raw bytes of the code include
short headers (4-bit segment identifiers) which have no relationship to
UTF-8 text. No QR Code is encoded this way. This would cause the reader to
fail 100% of the time!
Do you mean, assume that the bytes in the "byte mode" segments are UTF-8?
OK, but, they aren't! Read the spec. They are ISO-8859-1. If you did this,
you would misread some text. (Note that the decoder does try to
intelligently guess when someone has, in fact, used Shift_JIS or UTF-8
encoding in a byte mode segment. It's wrong, but people do it, so we try
to accommodate. But you certainly can't assume UTF-8.)
But the thing that really bewilders me is, you already have access to the
raw bytes at two levels -- what are you asking for then? You can get the
complete raw QR Code or Datamatrix encoding bytes from the existing
getRawBytes() method. You can get the raw bytes from byte mode segments
from the new mechanism I added.
Please review my changes and the discussion above, and the specs, and then
let's talk more if you have questions.
Comment #10 by sanfordsquires:
Perhaps we could get on the same wavelength faster with a simple phone
conversation.
I do understand both QR and DM at a technical level. I've implemented
several QR and DM decoders targeted for cell phone usage in generic and
proprietary applications. I've also implemented encoders for both. So your
technical descriptions are fully understood, and seem perfectly correct...
One big difference in thinking concerns the method by which the contents
of a code are compressed (e.g. base-256 encodation vs. ASCII/ISO-8859-1,
etc.). I think this is a 'red herring' issue, and may not be important to
resolve for now... but to outline the difference, if I understand you
correctly, you believe that the means used to compress portions of the
message in a QR or DM code:
1. has significance (semantic value) to the application calling a
decoder, and
2. has semantics that are preserved by most encoders.
My view is that the method by which the contents of a code are compressed
is like a private method or field in Java ... an internal detail that
should NOT be visible outside the encoder or decoder... and I have not
seen existing commercial encoders and decoders that will guarantee support
for #2... (but maybe they do exist, and are becoming much more common than
I'm aware of...)
Unless there's more evidence to discuss on this topic, I don't think this
particular difference in viewpoints needs to be resolved immediately.
Thinking about the issue, though, leads to the bigger question of how
Zxing will cope with (support) various countries and languages (Greek,
Russian, Chinese, etc.) that are problematic with ASCII/ISO-8859-1. I'm
sorry if I introduced more misunderstanding by shifting the discussion to
this bigger picture topic...
It probably would be constructive for someone to outline how Zxing will
(or won't?) support the worldwide problem of character sets. If there is
already a strategy for this, then discussion of the UTF-8 idea is not
needed.
The UTF-8 idea depends on accepting a philosophy:
1. The decoder needs to support decoding of non-Zxing codes (i.e. codes
that are not UTF-8). By supporting byte[], you guarantee this, so no
problem.
2. It is perfectly fine for Zxing (as an individual application) to always
interpret code content as UTF-8. This is saying, if you have a
communications channel (e.g. a QR or DM code) that ensures transmission of
any byte pattern, it's ok to use that channel to communicate whatever byte
pattern you want, and place whatever interpretation semantics you want on
that byte pattern.
Just because the standard talks about ASCII and ISO-8859-1 doesn't require
any particular application to use that interpretation... it just requires
a decoder to respect and support that interpretation (if the decoder is
going to support the standard, and not be totally proprietary).
... But, the important discussion is how Zxing will support worldwide
character sets, not whether a UTF-8 messaging idea is the right
solution...
I am not asking for a change here. I'm seeking understanding of Zxing's
goals, viewpoints, and any constraints those imply. Perhaps I'm jumping
the gun throwing out an idea to solve a problem that may already have been
thought about and resolved, and I just haven't happened to run across that
discussion anywhere on the Zxing site. If so, my apologies!
Comment #11 by sanfordsquires:
Oh - I see from issue 103 that you have thought about this issue and feel
it should be addressed by the three byte prefix to indicate UTF-8, or
possibly by ECI...
So, no need for further discussion. That's the answer I was seeking, and I
should have read 103, rather than just looked at the title and jumped to
conclusions with the 'Won't fix' label... :-)
Comment #12 by srowen:
I think we're getting to the real central question, which is character
encoding issues.
About internal representation -- indeed, I generally agree, which is why
the API didn't even expose the raw bytes until a few versions ago, and why
I initially didn't want to add more access to the internal encoding.
Do the bytes have semantic value that readers understand? The raw bytes
do, yes. The 4-bit mode header in QR Code that says "following this is
kanji-encoded stuff" -- yes, that has meaning to the reader. Does the
Kanji itself have meaning to the reader? No. (Interpreting the contents as
a URL, for example, is something a reader might do but is outside the
scope of QR Code per se.) Do the bytes in a byte mode have any meaning to
the reader? *Yes*, insofar as the reader is supposed to construe them as
characters according to some encoding.
And here we come to the real problem: what encoding? It is supposed to be
ISO-8859-1. What are we going to do about other character sets? We're not
making up standards, so the real question is what does QR Code /
Datamatrix provide, and the answers aren't so great.
Yes, the answer they both provide is "use ECI indicators". The bad news is
that I cannot find a single reference on which ECI values map to which
encodings. Even the for-pay specification for ECI *does not specify this!*
We reverse engineered and guessed a few values. We do support ECI, but
nobody really uses it.
What's worse, the specs seem to leave some wiggle room about whether you
really have to use ISO-8859-1 in byte mode. In practice I can tell you
that Japan puts Shift-JIS-encoded text in byte mode segments in QR Codes
all the time. I see UTF-8 too. So in practice we try to guess the encoding
-- using clues like, yes, a byte-order marker header for UTF-8. But this
is less than ideal.
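As a rough illustration of one such clue, the sketch below treats a leading UTF-8 byte-order mark (EF BB BF) as a signal to decode as UTF-8 and otherwise falls back to ISO-8859-1. This is not the library's actual heuristic, which the thread does not describe in detail; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class BomGuess {

    // True if the segment starts with the three-byte UTF-8 byte-order mark.
    static boolean hasUtf8Bom(byte[] bytes) {
        return bytes.length >= 3
            && (bytes[0] & 0xFF) == 0xEF
            && (bytes[1] & 0xFF) == 0xBB
            && (bytes[2] & 0xFF) == 0xBF;
    }

    // Decodes as UTF-8 (without the mark) when a BOM is present, else as ISO-8859-1.
    static String decodeByteSegment(byte[] bytes) {
        if (hasUtf8Bom(bytes)) {
            return new String(bytes, 3, bytes.length - 3, StandardCharsets.UTF_8);
        }
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}
```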
I suppose the right answer is to use ISO-8859-1 if you can, and if you
absolutely can't, at least include an ECI segment.
Always assuming UTF-8 doesn't work because...
- Well, it's supposed to be ISO-8859-1, and reading as UTF-8 is only
partly compatible with that. Shift_JIS is definitely not going to be read
the same as UTF-8.
- Not all sequences of bytes are valid encodings of a string in UTF-8. So
this outright fails in some cases.
- It is possible for two byte sequences to encode the same character
string in UTF-8. So reversing the encoding might not give you the
original.
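The second and third points can be checked with the JDK alone. The sketch below uses a strict CharsetDecoder to test whether bytes are valid UTF-8, and shows that Java's lenient String constructor replaces malformed input with U+FFFD, so re-encoding does not return the original bytes. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Check {

    // Strict check: any malformed sequence makes the decoder throw.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Lenient decode then encode; malformed bytes come back as EF BF BD (U+FFFD).
    static byte[] lenientRoundTrip(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }
}
```

For example, the single byte 0xE9 is a valid ISO-8859-1 character but not valid UTF-8, and the single byte 0xFF is also invalid; after a lenient round trip both become the same three bytes.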
ISO-8859-1, the proper default, has only the first problem, but, if it's
binary data you're encoding, the resulting interpretation as text is
garbage anyhow. Readers are going to read it as garbage. But, presumably
you are thinking of specialized readers which understand the
interpretation you wish them to apply somehow.
I guess to repeat a point -- yes, readers *do* need to interpret the
contents as text, according to the spec. I don't think this affects your
reasoning, or means that readers can't provide additional info, of course.
So this is why I give you the raw contents of the byte mode segments
themselves. I think this, plus giving you access to the raw bytes of the
whole QR Code payload, should enable you to create any specialized reader
application you want from this library.
Issue attribute updates:
Labels: I