Barcode Scanner App: Custom Search URL not showing bytes greater than 0x7F


Christian Leutloff

Sep 20, 2016, 12:20:05 PM
to zxing
The Barcode Scanner App is not showing bytes greater than 0x7F in the Custom Search URL. What can I do to solve this?

The custom search related code is using the URLEncoder.encode(text, "UTF-8") to encode the data for the URL:
https://github.com/zxing/zxing/blob/2f11529aa35e01354f9036c2aa7747ab23a604ef/android/src/com/google/zxing/client/android/result/ResultHandler.java

The contained binary data is obviously not UTF-8. Would the correct way to handle binary data be to add a raw variant (e.g. %p for plain percent-encoding) in addition to %s?


I have attached a QR-Code with binary data. The first five bytes can be ignored (the first four are the length as decimal digits). After that, for each value from 0x00 to 0xFF, the code stores the two characters of the hex number followed by the value itself as a single byte. The QR-Code contains 773 bytes in total.

The first bytes, up to 0x7F, are correct. Every larger value is shown as %EF%BF%BD, which is the percent-encoded UTF-8 form of the Unicode replacement character (U+FFFD).

Here is the returned Custom Search URL (added some line breaks):
0773%07
00%0001%0102%0203%0304%0405%0506%0607%0708%08
09%09
0A%0A
0B%0B
0C%0C0D%0D0E%0E0F%0F10%10
11%1112%1213%1314%1415%1516%1617%1718%1819%191A%1A1B%1B1C%1C1D%1D1E%1E1F%1F
20+
21%2122%2223%2324%2425%2526%2627%2728%2829%29
2A*
2B%2B2C%2C
2D-2E.2F%2F
300
311
322333344355366377388399
3A%3A3B%3B3C%3C3D%3D3E%3E3F%3F40%40
41A42B43C44D45E46F47G48H49I4AJ4BK4CL4DM4EN4FO50P51Q52R53S54T55U56V57W58X59Y5AZ
5B%5B5C%5C5D%5D5E%5E
5F_
60%60
61a62b63c64d65e66f67g68h69i6Aj6Bk6Cl6Dm6En6Fo70p71q72r73s74t75u76v77w78x79y7Az
7B%7B7C%7C7D%7D7E%7E7F%7F
80%EF%BF%BD
81%EF%BF%BD
82%EF%BF%BD
83%EF%BF%BD
84%EF%BF%BD
[... more removed ...]
FF%EF%BF%BD

TiA
Christian

Bytes00..FF.png

Lachezar Dobrev

Sep 21, 2016, 7:10:35 AM
to Christian Leutloff, zxing
  ZXing tries to _decode_ the binary data as UTF-8 upon reading the QR-Code, in order to present it as a String. This is considered the most common case. Your byte stream contains non-ASCII bytes that are not well-formed UTF-8 sequences. Each of those gets replaced with the Unicode code point U+FFFD (Replacement Character), which is then encoded as the byte sequence 0xEF 0xBF 0xBD in the URL.
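
The effect described above can be reproduced with the Java standard library alone. The following is a minimal sketch, not ZXing code: the class and method names are made up for illustration, and it only assumes that the bytes are decoded as UTF-8 and the resulting text is then passed to URLEncoder.encode(text, "UTF-8").

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of ZXing.
class ReplacementCharDemo {

    // Decode raw bytes as UTF-8 (malformed input becomes U+FFFD),
    // then URL-encode the resulting text as UTF-8.
    static String decodeThenUrlEncode(byte[] raw) {
        String text = new String(raw, StandardCharsets.UTF_8);
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "7F" + byte 0x7F, then "80" + byte 0x80, as in the test QR-Code.
        byte[] raw = {'7', 'F', 0x7F, '8', '0', (byte) 0x80};
        System.out.println(decodeThenUrlEncode(raw)); // prints 7F%7F80%EF%BF%BD
    }
}
```

The byte 0x7F survives because it is valid ASCII, while the lone byte 0x80 is not a valid UTF-8 sequence and is lost before URLEncoder ever sees it.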

Christian


Lachezar Dobrev

Sep 21, 2016, 8:10:29 AM
to Christian Leutloff, zxing
  I was about to suggest replacing the Custom Search with a 'Scan' page that sets up a Character Set for the parsing, but it didn't work, because CHARACTER_SET is a hint that is explicitly *ignored*.

Sean Owen

Sep 22, 2016, 10:24:31 AM
to zxing
Not quite sure what you're asking because the contents of the barcode have nothing to do with the app's custom search URL. Any URL you use should be correctly URL-escaped. 

QR codes contain text, not bytes. The default encoding is ISO-8859-1, which means you can usually get away with encoding bytes as 'text'. It's not going to be read as anything but text though.

None of that is related to URL-escaping though.

Christian Leutloff

Sep 22, 2016, 11:41:55 AM
to zxing
Thank you all for your support.


> Not quite sure what you're asking because the contents of the barcode have nothing to do with the app's custom search URL. Any URL you use should be correctly URL-escaped. 

So far, the Custom Search URL is used only for testing purposes.

We are using QR-Codes to encode and decode binary information. With a handheld scanner this process works smoothly. Now we want to read the QR-Codes using the camera of an Android tablet. On the tablet we are using the app KioskBrowser. They have integrated zxing to read QR-Codes and forward the decoded information to a provided URL, similar to the Custom Search URL of the zxing app. But the forwarded information is not as expected. They asked us to check our codes with the zxing app.


> QR codes contain text, not bytes. The default encoding is ISO-8859-1, which means you can usually get away with encoding bytes as 'text'. It's not going to be read as anything but text though.

That is my understanding, too.

> None of that is related to URL-escaping though.



Would you accept a patch with another encoding that can be applied in addition to the existing format characters (%s, %t), e.g. %r or %p? The encoding would just replace each byte with a percent sign followed by its two-digit hex value, i.e. percent-encoding (https://en.wikipedia.org/wiki/Percent-encoding).
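
A minimal sketch of what such a raw variant could compute is shown below. The class and method names are hypothetical and nothing here exists in ZXing; it also assumes the raw bytes are available to the caller, which the decoded text alone does not guarantee.

```java
// Hypothetical helper illustrating the proposed raw (%p / %r) variant.
class RawPercentEncoder {

    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // Encode every byte as %XX without interpreting it in any character set.
    static String percentEncode(byte[] raw) {
        StringBuilder sb = new StringBuilder(raw.length * 3);
        for (byte b : raw) {
            sb.append('%')
              .append(HEX[(b >> 4) & 0x0F])  // high nibble
              .append(HEX[b & 0x0F]);        // low nibble
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] raw = {0x7F, (byte) 0x80, (byte) 0xFF};
        System.out.println(percentEncode(raw)); // prints %7F%80%FF
    }
}
```

Because no decoding step is involved, values above 0x7F pass through unchanged, at the cost of three characters per byte.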

Regards
Christian

Sean Owen

Sep 22, 2016, 3:44:49 PM
to zxing
If I understand the issue, no that's not the problem. Are you saying URLEncoder doesn't work? I doubt that; I'd suspect the contents of the barcode aren't read as bytes in the way you expect.

Lachezar Dobrev

Sep 26, 2016, 8:53:07 AM
to Sean Owen, zxing
  Sean, I took a stab at this.
  * URL Encoder works correctly.
  * QR Code is properly read (at least the BYTE_SEGMENTS result meta-data contains what Christian expects)

  The problem (seemingly) comes from first decoding the byte string as UTF-8 (which fails and inserts lots of \uFFFD characters) and then encoding that string back to bytes as UTF-8, which leads to data loss.

  If one specifies ISO-8859-1 as CHARACTER_SET in the decoding hints, Result.getText() actually returns the \u007F … \u00FF characters in the text (still not binary, but getBytes("ISO-8859-1") returns what looks like the expected content).
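
The difference between the two character sets can be checked without ZXing. The sketch below uses only the Java standard library; the class and method names are illustrative. It decodes all 256 byte values to a String and encodes them back, once with ISO-8859-1 and once with UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of ZXing.
class Latin1RoundTrip {

    // Returns the 256 byte values 0x00..0xFF in order.
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    // True if decoding to a String and encoding back yields the same bytes.
    static boolean survivesRoundTrip(byte[] raw, Charset charset) {
        byte[] back = new String(raw, charset).getBytes(charset);
        return Arrays.equals(raw, back);
    }

    public static void main(String[] args) {
        byte[] all = allByteValues();
        System.out.println(survivesRoundTrip(all, StandardCharsets.ISO_8859_1)); // prints true
        System.out.println(survivesRoundTrip(all, StandardCharsets.UTF_8));      // prints false
    }
}
```

In Java, ISO-8859-1 maps each byte to the code point with the same value, so the text form is a reversible carrier for binary data; UTF-8 is not.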

  After some blunt debugging I hit c.g.z.common.StringUtils.guessEncoding, which, because no hints are given, tries to detect the character set. That fails, because the ISO-8859-1 detection does not allow for 0x7F to 0x9F bytes. Then it resorts to the PLATFORM_DEFAULT_ENCODING, which ends up being UTF-8…

  Also Sean, I noticed that encoding binary data with ISO-8859-1 specified as the character set does *not* add an ECI segment with the character set, and decoding then uses PLATFORM_DEFAULT_ENCODING, which might not be ISO-8859-1.
 
  It looks like… Maybe returning ISO-8859-1 instead of the PLATFORM_DEFAULT_ENCODING might be a better match since it assumes ISO-8859-1 when encoding?


Lachezar Dobrev

Sep 26, 2016, 8:55:42 AM
to Sean Owen, zxing
  Whoops.
  Returning ISO-8859-1 might be the wrong choice, since that is used in more than one place.

  Probably adding a guessEncoding() with a default character set would be a better approach.

Sean Owen

Sep 26, 2016, 9:27:08 AM
to zxing, sro...@gmail.com
Yeah, that's a good point about those bytes not being allowed in ISO-8859-1. You could consider base-64 encoding of course, though that makes it 33% bigger.
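
The base-64 overhead mentioned above can be checked for the 773-byte payload from this thread. This is a small sketch using java.util.Base64 from the standard library; the class and method names are illustrative.

```java
import java.util.Base64;

// Hypothetical demo class for measuring base-64 size overhead.
class Base64Overhead {

    // Length of the padded base-64 text for a payload of rawLength bytes.
    static int encodedLength(int rawLength) {
        return Base64.getEncoder().encodeToString(new byte[rawLength]).length();
    }

    public static void main(String[] args) {
        int raw = 773; // payload size reported for the test QR-Code
        int encoded = encodedLength(raw);
        System.out.println(encoded); // prints 1032
        System.out.printf("%.1f%% larger%n", 100.0 * (encoded - raw) / raw);
    }
}
```

Every 3 input bytes become 4 output characters, so 773 bytes grow to 1032 characters, roughly a third more. The standard alphabet also contains '+', '/' and '=', which still need escaping in a URL; Base64.getUrlEncoder() uses a URL-safe alphabet instead.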

I think it would be better to set the ECI segment if any non-default segment is specified, not just if it doesn't match the default of ISO-8859-1. If that solves your problem I can do that.

Lachezar Dobrev

Sep 26, 2016, 10:11:04 AM
to Sean Owen, zxing
  Christian: Sean is asking whether you're using ZXing to encode the QR codes, or whether you're getting them from somewhere else, because the QR Code you showed does not contain an ECI part that would specify ISO-8859-1.

  I managed to /hack/ the QR Code generation by specifying character set "ISO8859_1" as an encoding hint, which is an alias for ISO-8859-1, but does not hit the 'default encoding' check and explicitly adds an ECI segment with the ISO-8859-1 encoding.
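
Why an alias could slip past such a check can be illustrated with the standard library. The sketch below is not ZXing code, and how ZXing actually compares the hint with its default is not shown in this thread; it merely assumes a comparison by name and contrasts that with a comparison of resolved Charset objects. The class and method names are made up.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of ZXing.
class CharsetAliasDemo {

    // Compares the names literally, as a plain string check would.
    static boolean sameName(String a, String b) {
        return a.equals(b);
    }

    // Resolves both names (including aliases) and compares the charsets.
    static boolean sameCharset(String a, String b) {
        return Charset.forName(a).equals(Charset.forName(b));
    }

    public static void main(String[] args) {
        System.out.println(sameName("ISO8859_1", "ISO-8859-1"));    // prints false
        System.out.println(sameCharset("ISO8859_1", "ISO-8859-1")); // prints true
    }
}
```

Under that assumption, the alias names the same character set but does not look like the default, which would explain why the ECI segment is written.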
