Question on Character Conversion Behavior (ICU 3.2 vs 7.8)

Issei Ikejiri

Jun 22, 2026, 5:52:53 AM

to icu-support

Hi ICU Support,

I have a question regarding character conversion behavior when using ICU versions 3.2 and 7.8 for converting from EBCDIC to code page 943.

Input encoding: ibm-16684_P110-2003

Output encoding: ibm-943_P130-1999

Original source data (GRAPHIC(10)):

Hex '5440 4040 4040 4040 4040 4040 4040 4040' (Length: 20)

(Note: Hex '5440' represents a garbled/invalid EBCDIC character.)

Converted data:

ICU 3.2:

Hex 'FCFC 4080 4080 4080 4080 4080 4080 4080' (Length: 20)

ICU 7.8:

Hex 'FCFC 4080 4080 4080 4080 4080 4080 4080 FCFC' (Length: 22)

In ICU 7.8, the invalid EBCDIC character ('5440') is replaced with the DBCS substitution character ('FCFC'), and an additional 'FCFC' appears at the end, resulting in a longer output.

Could you please confirm the following:

1. Has the specification or behavior of ICU changed between versions 3.2 and 7.8 regarding this type of conversion?

2. Is the behavior observed in ICU 7.8 expected?

Best regards,

Issei Ikejiri

Markus Scherer

Jun 22, 2026, 6:03:57 PM

to Issei Ikejiri, icu-support

On Mon, Jun 22, 2026 at 2:52 AM Issei Ikejiri <isseyj...@gmail.com> wrote:

I have a question regarding character conversion behavior when using ICU versions 3.2 and 7.8 for converting from EBCDIC to code page 943.

FYI: ICU 3.2 was released about 20 years ago. ICU 78 (not 7.8) is from last year.

Input encoding: ibm-16684_P110-2003

which is a "DBCS" charset, that is, every character is encoded with 2 bytes

Original source data (GRAPHIC(10)):
Hex '5440 4040 4040 4040 4040 4040 4040 4040' (Length: 20)
(Note: Hex '5440' represents a garbled/invalid EBCDIC character.)

Converted data:
ICU 3.2:
Hex 'FCFC 4080 4080 4080 4080 4080 4080 4080' (Length: 20)

ICU 7.8:
Hex 'FCFC 4080 4080 4080 4080 4080 4080 4080 FCFC' (Length: 22)

In ICU 7.8, the invalid EBCDIC character ('5440') is replaced with the DBCS substitution character ('FCFC'), and an additional 'FCFC' appears at the end, resulting in a longer output.

Could you please confirm the following:
1. Has the specification or behavior of ICU changed between versions 3.2 and 7.8 regarding this type of conversion?

Evidently it has.

For details, you could look over the download pages, or at the history of the implementation code:

https://github.com/unicode-org/icu/commits/main/icu4c/source/common/ucnvmbcs.cpp

You could also look at ICU Jira tickets with component=conversion, status=Done, resolution=Fixed.

There have not been a lot of changes in the conversion code (other than cleanups) in the last 15 years, partly because the industry has nearly fully switched to Unicode.

2. Is the behavior observed in ICU 7.8 expected?

I suspect that it is.

I believe that the original code read two bytes at a time, and if the combination was ill-formed or unassigned, it did error handling for the two bytes.

This kind of behavior has raised security concerns, because one bad byte could lead to "swallowing" part of a well-formed following character. In particular, in ASCII-based charsets, ASCII syntax characters could be consumed in the error handling. This was especially bad when inconsistent: Some implementations may consume fewer bytes, and then the following character could be syntactically relevant.

The industry, and ICU, have generally moved to not include a byte in the error handling if it could not continue a well-formed sequence. The converter should stop before such a byte.

In this case, it looks like modern ICU treats only the first single byte 0x54 as an ill-formed sequence, then it's off by one byte but still sees "40 40" pairs, then there is only one single 0x40 byte left, which is another ill-formed sequence.
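
The off-by-one effect described above can be reproduced with a toy model. The Python sketch below is not ICU code: it assumes a made-up DBCS charset in which only the pair 40 40 is assigned (labelled "SP"), and it uses "SUB" as a stand-in for the substitution character. The function names are hypothetical. It only contrasts "consume two bytes per error" with "consume one byte per error and resynchronize".

```python
# Toy assumption: only the byte pair 40 40 is a valid character.
VALID_PAIR = b"\x40\x40"


def decode_old(data: bytes) -> list[str]:
    """Older style: always consume two bytes per step, even on error."""
    out = []
    i = 0
    while i < len(data):
        out.append("SP" if data[i:i + 2] == VALID_PAIR else "SUB")
        i += 2
    return out


def decode_new(data: bytes) -> list[str]:
    """Newer style: on error, consume only the first byte, then retry."""
    out = []
    i = 0
    while i < len(data):
        if data[i:i + 2] == VALID_PAIR:
            out.append("SP")
            i += 2
        else:
            # Covers both a bad pair and a single leftover byte at the end.
            out.append("SUB")
            i += 1
    return out


# Same shape as the reported input: 54 40 followed by seven 40 40 pairs.
data = bytes.fromhex("5440" + "4040" * 7)
print(len(decode_old(data)), decode_old(data))
print(len(decode_new(data)), decode_new(data))
```

In this model the older style yields 8 output units and the newer style yields 9, with the extra "SUB" coming from the single 0x40 left over at the end.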

Hope this helps / best regards,

markus

Issei Ikejiri

Jun 23, 2026, 5:52:55 AM

to Markus Scherer, icu-support

Hi Markus,

Thank you very much for your response.
The statement you provided appears to be correct. In fact, when we perform the conversion after removing the last 0x40, the final converted x'FCFC' is no longer present.
---
In this case, it looks like modern ICU treats only the first single byte 0x54 as an ill-formed sequence. Then it becomes offset by one byte but still processes the "40 40" pairs, and finally only one single 0x40 byte remains, which is treated as another ill-formed sequence.
---
Regarding the original issue, the converted length becomes longer, which causes the application to fail with an SQL error when inserting data into a database column. This has a significant impact on many customers' production systems. The same scenario worked without issue in ICU 3.2, and in general, an increase in length after conversion is not expected.

Since characters in a GRAPHIC column should be handled as DBCS characters, the length ideally should not change. Would it be possible to revert to the behavior of the previous version, even if this change has gone unnoticed for many years?
Alternatively, would it be possible to provide an option to preserve the previous behavior so that the length does not change?

So far, I cannot find anything that explains this change in behavior in the history.

Regards,
Issei Ikejiri

On Tue, Jun 23, 2026 at 7:03, Markus Scherer <marku...@gmail.com> wrote:

Issei Ikejiri

Jun 24, 2026, 8:27:54 PM

to icu-support, Issei Ikejiri, icu-support, Markus Scherer

Hi Markus,

I hope this message finds you well.
I apologize for reaching out again, but I would greatly appreciate it if you could provide any updates or responses regarding my previous questions.
If this matter falls outside your area of responsibility, I would be grateful if you could kindly advise me on the appropriate contact or next steps to obtain the necessary information.
Thank you very much.

Kind regards,
Issei Ikejiri

On Tuesday, June 23, 2026 at 18:52:55 UTC+9, Issei Ikejiri wrote:

Steven R. Loomis

Jun 24, 2026, 8:49:54 PM

to Issei Ikejiri, icu-support, Markus Scherer, Issei Ikejiri

Hello,

Please note that it is ICU 78.1, not 7.8.

A couple of notes:

- If you are interested in which ICU version had a change, you might try other versions.

- I notice you are using the code page-to-code page APIs. Internally, these convert through Unicode (UTF-16). You might instead try using a two step conversion, that is, ibm-16684 to UTF-16, and then from UTF-16 to 943 — and compare those results.

- If you use ucnv_setFromUCallBack and ucnv_setToUCallBack (https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ucnv_8h.html#acf5d877019d10500135f3baa95aa94b4), you can precisely control the behavior in case of an invalid character. If you would like pointers to sample code, let me know.

Hope this helps,

Steven

Markus Scherer

Jul 2, 2026, 8:37:50 PM

to Issei Ikejiri, icu-support

On Tue, Jun 23, 2026 at 2:52 AM Issei Ikejiri <isseyj...@gmail.com> wrote:

Regarding the original issue, the converted length becomes longer, which causes the application to fail with an SQL error when inserting data into a database column. This has a significant impact on many customers' production systems. The same scenario worked without issue in ICU 3.2, and in general, an increase in length after conversion is not expected.

I agree that the output (and its length) should follow the usual expectations as long as the input is well-formed.

When the input is broken, then these expectations compete with security considerations.

The problem is that we cannot know why the input is ill-formed. Was a byte omitted and we are out of sync? Are all the bytes there but one had a bit flip? Does the selected encoding not match the input?

ICU has followed evolving industry practice in this case, which is to avoid consuming more than a prefix of a well-formed subsequence (but always at least one byte).

To give another example, this change has affected decoding of UTF-8 as well (and I remember this more distinctly, since the older behavior seemed "obvious" looking at the UTF-8 spec).

Consider the bytes E0 80 80 -- a lead byte indicating a 3-byte sequence, and two trail bytes -- but it is a "non-shortest form" (for code point U+0000). Old versions of ICU would report the 3-byte sequence as one error. With the newer logic, only the initial byte E0 is reported for the first error, the decoding restarts after that, and ends up reporting two more errors. If your error handling emitted a U+FFFD each time and you re-encoded the result into UTF-8, then an old ICU version would have output 3 bytes for one U+FFFD, while a newer version would give you 9 bytes for 3 U+FFFDs.
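
The newer counting can be observed outside ICU as well. As a small illustration, the sketch below uses Python 3's built-in UTF-8 decoder (not ICU) with the "replace" error handler, which emits one U+FFFD per reported error. It assumes a CPython 3 interpreter; the variable names are only for this example.

```python
# E0 is a 3-byte lead byte, but 80 cannot follow it in well-formed UTF-8
# (the second byte after E0 must be in A0..BF), so E0 alone is one error.
raw = bytes.fromhex("E08080")

# errors="replace" substitutes U+FFFD for each ill-formed sequence reported.
decoded = raw.decode("utf-8", errors="replace")
reencoded = decoded.encode("utf-8")

print(len(decoded))    # number of U+FFFD characters produced
print(len(reencoded))  # byte length after re-encoding to UTF-8

# For comparison: a decoder that reported all three bytes as ONE error
# would have produced a single U+FFFD, which is 3 bytes in UTF-8.
single = "\ufffd".encode("utf-8")
print(len(single))
```

Here the three input bytes become three U+FFFD characters (9 bytes when re-encoded), compared with 3 bytes for a single U+FFFD under the older one-error interpretation.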

Would it be possible to revert to the behavior of the previous version, even if this change has gone unnoticed for many years?

It has not gone unnoticed. While it would take me a little time to "blame" the code and dig up the ticket and the commit(s), I am quite sure that this was a deliberate change in all of the not-just-single-byte input decoders (both in C++ and in Java) in order to align the behavior with the shifting industry practice.

Alternatively, would it be possible to provide an option to preserve the previous behavior so that the length does not change?

Maybe, but (a) that would again be a complex change across many functions and files, and (b) since the industry has nearly completely given up (before about 2010) on storing and processing non-Unicode text, this kind of significant work would have trouble getting prioritized.

Steven's idea of some kind of workaround using the error-callback API could help. I don't think it lets you customize the number of ill-formed bytes consumed, but it might work well enough to check, in your case, if the input is two bytes or one. If two, you could let the code continue with the usual substitution-character behavior. If one, you could stop the conversion, have the calling code (a workaround wrapper) consume one more byte, and restart the conversion after that.

You can look at the unit test code for examples of custom callback functions.
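
The control flow of such a wrapper might look roughly like the following Python sketch. It is a toy model under stated assumptions, not ICU code and not a tested workaround: convert_chunk is a hypothetical stand-in for "run the converter with a callback that stops at the first error", only the pair 40 40 is treated as valid ("SP"), and "SUB" stands for the substitution character. With the real ICU C API the stop and restart would go through the error callback and the converter's source pointers.

```python
VALID_PAIR = b"\x40\x40"  # toy assumption: the only assigned character


def convert_chunk(data: bytes, start: int) -> tuple[list[str], int, int]:
    """Hypothetical converter run that stops at the first error.

    Returns (units, position, bad_len). bad_len is 0 when the input was
    consumed completely, otherwise the number of ill-formed bytes reported
    (always 1 in this toy model, like the newer behavior).
    """
    out = []
    i = start
    while i < len(data):
        if data[i:i + 2] == VALID_PAIR:
            out.append("SP")
            i += 2
        else:
            return out, i, 1
    return out, i, 0


def convert_keep_length(data: bytes) -> list[str]:
    """Wrapper: one substitution per bad 2-byte unit, then restart."""
    out = []
    pos = 0
    while True:
        units, pos, bad_len = convert_chunk(data, pos)
        out.extend(units)
        if bad_len == 0:
            return out
        out.append("SUB")
        # If only one byte was reported, the wrapper consumes one more
        # so that the restart stays on a 2-byte boundary.
        pos += 2 if bad_len < 2 else bad_len


data = bytes.fromhex("5440" + "4040" * 7)
print(convert_keep_length(data))
```

In this model the 16 input bytes map to 8 output units again, at the cost of deliberately swallowing the byte after each single-byte error, which is the trade-off against the security considerations discussed above.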

Issei Ikejiri

Jul 6, 2026, 9:04:00 AM

to icu-support, Markus Scherer, icu-support, Issei Ikejiri

Hi Markus,

Thank you for the explanation.
If we have additional questions, we will ask you again.

Regards,
Issei Ikejiri

On Friday, July 3, 2026 at 9:37:50 UTC+9, Markus Scherer wrote: