Question on Character Conversion Behavior (ICU 3.2 vs 7.8)


Issei Ikejiri

Jun 22, 2026, 5:52:53 AM
to icu-support

Hi ICU Support, 

 I have a question regarding character conversion behavior when using ICU versions 3.2 and 7.8 for converting from EBCDIC to code page 943. 

Input encoding: ibm-16684_P110-2003 
 Output encoding: ibm-943_P130-1999 

 Original source data (GRAPHIC(10)): 
 Hex '5440 4040 4040 4040 4040 4040 4040 4040' (Length: 20) 
 (Note: Hex '5440' represents a garbled/invalid EBCDIC character.) 

Converted data: 
ICU 3.2: 
Hex 'FCFC 4080 4080 4080 4080 4080 4080 4080' (Length: 20) 

ICU 7.8: 
 Hex 'FCFC 4080 4080 4080 4080 4080 4080 4080 FCFC' (Length: 22) 

In ICU 7.8, the invalid EBCDIC character ('5440') is replaced with the DBCS substitution character ('FCFC'), and an additional 'FCFC' appears at the end, resulting in a longer output. 

Could you please confirm the following: 
1. Has the specification or behavior of ICU changed between versions 3.2 and 7.8 regarding this type of conversion? 
2. Is the behavior observed in ICU 7.8 expected? 

Best regards, 
 Issei Ikejiri

Markus Scherer

Jun 22, 2026, 6:03:57 PM
to Issei Ikejiri, icu-support
On Mon, Jun 22, 2026 at 2:52 AM Issei Ikejiri <isseyj...@gmail.com> wrote:
 I have a question regarding character conversion behavior when using ICU versions 3.2 and 7.8 for converting from EBCDIC to code page 943. 

FYI: ICU 3.2 was released about 20 years ago. ICU 78 (not 7.8) is from last year.

Input encoding: ibm-16684_P110-2003

This is a "DBCS" charset; that is, every character is encoded with 2 bytes.

 Original source data (GRAPHIC(10)): 
 Hex '5440 4040 4040 4040 4040 4040 4040 4040' (Length: 20) 
 (Note: Hex '5440' represents a garbled/invalid EBCDIC character.) 

Converted data: 
ICU 3.2: 
Hex 'FCFC 4080 4080 4080 4080 4080 4080 4080' (Length: 20) 

ICU 7.8: 
 Hex 'FCFC 4080 4080 4080 4080 4080 4080 4080 FCFC' (Length: 22) 

In ICU 7.8, the invalid EBCDIC character ('5440') is replaced with the DBCS substitution character ('FCFC'), and an additional 'FCFC' appears at the end, resulting in a longer output. 

Could you please confirm the following: 
1. Has the specification or behavior of ICU changed between versions 3.2 and 7.8 regarding this type of conversion? 

Evidently it has.
For details, you could look over the download pages, or at the history of the implementation code:

You could also look at ICU Jira tickets with component=conversion, status=Done, resolution=Fixed.

There have not been a lot of changes in the conversion code (other than cleanups) in the last 15 years, partly because the industry has nearly fully switched to Unicode.

2. Is the behavior observed in ICU 7.8 expected? 

I suspect that it is.

I believe that the original code read two bytes at a time, and if the combination was ill-formed or unassigned, it did error handling for the two bytes.

This kind of behavior has raised security concerns, because one bad byte could lead to "swallowing" part of a well-formed following character. In particular, in ASCII-based charsets, ASCII syntax characters could be consumed in the error handling. This was especially bad when inconsistent: Some implementations may consume fewer bytes, and then the following character could be syntactically relevant.
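To make the "swallowing" concern concrete, here is a minimal sketch in Python. It is a toy model, not ICU code: the charset layout (lead bytes 0x81..0x9F, trail bytes 0x40..0xFC, ASCII below 0x80), the function names, and the `greedy` flag are all assumptions made for illustration.

```python
# Toy model of an ASCII-based double-byte charset (NOT a real ICU converter).
# Assumptions: lead bytes are 0x81..0x9F, trail bytes are 0x40..0xFC,
# and every byte below 0x80 is a single-byte ASCII character.

def is_lead(b: int) -> bool:
    return 0x81 <= b <= 0x9F

def is_trail(b: int) -> bool:
    return 0x40 <= b <= 0xFC

def decode(data: bytes, greedy: bool) -> str:
    """Decode with U+FFFD on errors.

    greedy=True  : a bad lead byte also consumes the byte after it.
    greedy=False : only the bad byte itself is consumed.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(chr(b))          # plain ASCII
            i += 1
        elif is_lead(b) and i + 1 < len(data) and is_trail(data[i + 1]):
            out.append("\u4e00")        # placeholder for any valid pair
            i += 2
        else:
            out.append("\ufffd")
            i += 2 if (greedy and is_lead(b)) else 1
    return "".join(out)

data = b'a=\x81">'                      # 0x22 (") cannot be a trail byte
print(decode(data, greedy=True))        # a=\ufffd>   (the quote is gone)
print(decode(data, greedy=False))       # a=\ufffd">  (the quote survives)
```

In the greedy variant the quotation mark disappears along with the bad lead byte, which is the kind of discrepancy between implementations that the paragraph above describes.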

The industry, and ICU, have generally moved to not include a byte in the error handling if it could not continue a well-formed sequence. The converter should stop before such a byte.

In this case, it looks like modern ICU treats only the first single byte 0x54 as an ill-formed sequence, then it's off by one byte but still sees "40 40" pairs, then there is only one single 0x40 byte left, which is another ill-formed sequence.
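The off-by-one effect can be reproduced with a small Python sketch. This is a toy model under stated assumptions, not ICU's implementation: the well-formedness rule (a pair is 0x40 0x40, or both bytes are in 0x41..0xFE) and the `pairwise` flag are assumptions, and the output units simply reuse the hex strings quoted in this thread.

```python
# Toy model of the two error-handling strategies (NOT ICU code).
SUB = "FCFC"     # DBCS substitution character, as quoted in the thread
SPACE = "4080"   # converted form of 0x4040, as quoted in the thread

def well_formed(lead: int, trail: int) -> bool:
    # Assumed rule: 0x4040 is the DBCS space; otherwise both bytes
    # must lie in 0x41..0xFE.
    if lead == 0x40:
        return trail == 0x40
    return 0x41 <= lead <= 0xFE and 0x41 <= trail <= 0xFE

def convert(data: bytes, pairwise: bool) -> list:
    """pairwise=True : an error consumes two bytes (older behavior, as described).
    pairwise=False: an error consumes one byte (newer behavior, as described)."""
    out = []
    i = 0
    while i < len(data):
        if i + 1 < len(data) and well_formed(data[i], data[i + 1]):
            out.append(SPACE if data[i] == 0x40 else "????")  # "????" = other valid pair
            i += 2
        else:
            out.append(SUB)
            i += 2 if pairwise else 1
    return out

data = bytes.fromhex("5440" + "4040" * 7)    # the 16 bytes shown in the thread

print(" ".join(convert(data, pairwise=True)))
# FCFC 4080 4080 4080 4080 4080 4080 4080
print(" ".join(convert(data, pairwise=False)))
# FCFC 4080 4080 4080 4080 4080 4080 4080 FCFC
```

In this model the one-byte strategy produces one extra substitution unit because the lone trailing 0x40 cannot form a pair; dropping that final byte from the input yields eight units again.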

Hope this helps / best regards,
markus

Issei Ikejiri

Jun 23, 2026, 5:52:55 AM
to Markus Scherer, icu-support
Hi Markus,

Thank you very much for your response.
The statement you provided appears to be correct. In fact, when we perform the conversion after removing the last 0x40, the final converted x'FCFC' is no longer present.
---
In this case, it looks like modern ICU treats only the first single byte 0x54 as an ill-formed sequence. Then it becomes offset by one byte but still processes the "40 40" pairs, and finally only one single 0x40 byte remains, which is treated as another ill-formed sequence.
---
Regarding the original issue, the converted length becomes longer, which causes the application to fail with an SQL error when inserting data into a database column. This has a significant impact on many customers' production systems. The same scenario worked without issue in ICU 3.2, and in general, an increase in length after conversion is not expected.

Since characters in a GRAPHIC column should be handled as DBCS characters, the length ideally should not change. Would it be possible to revert to the behavior of the previous version, even if this change has gone unnoticed for many years?
Alternatively, would it be possible to provide an option to preserve the previous behavior so that the length does not change?
 
So far, I cannot find anything that explains this change in behavior in the history.
 
Regards,
Issei Ikejiri

On Tue, Jun 23, 2026 at 7:03 Markus Scherer <marku...@gmail.com> wrote:

Issei Ikejiri

Jun 24, 2026, 8:27:54 PM
to icu-support, Issei Ikejiri, icu-support, Markus Scherer
Hi Markus,

I hope this message finds you well.
I apologize for reaching out again, but I would greatly appreciate it if you could provide any updates or responses regarding my previous questions.
If this matter falls outside your area of responsibility, I would be grateful if you could kindly advise me on the appropriate contact or next steps to obtain the necessary information.
Thank you very much.

Kind regards,
Issei Ikejiri

On Tuesday, Jun 23, 2026 at 18:52:55 UTC+9 Issei Ikejiri wrote:

Steven R. Loomis

Jun 24, 2026, 8:49:54 PM
to Issei Ikejiri, icu-support, Markus Scherer, Issei Ikejiri
Hello,

Please note that it is ICU 78.1, not 7.8.

A couple of notes:

- If you are interested in which ICU version introduced the change, you might try other versions.
- I notice you are using the code page-to-code page APIs. Internally, these convert through Unicode (UTF-16). You might instead try a two-step conversion, that is, ibm-16684 to UTF-16, and then UTF-16 to 943, and compare those results.
- If you use ucnv_setFromUCallBack and ucnv_setToUCallBack (https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ucnv_8h.html#acf5d877019d10500135f3baa95aa94b4) then you can precisely control the behavior in case of an invalid character. If you would like pointers to sample code, let me know.
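As a rough illustration of the callback idea in the last point, here is a minimal Python sketch. It is a toy model only and does not call ICU: the `convert` function, the `on_error` signature, and the well-formedness rule are assumptions. The three handlers are loosely named after ICU's stock to-Unicode callbacks (UCNV_TO_U_CALLBACK_SUBSTITUTE, UCNV_TO_U_CALLBACK_SKIP, UCNV_TO_U_CALLBACK_STOP), but they do not reproduce their exact semantics.

```python
# Toy model of pluggable error handling (NOT the ICU API).
SUB = "FCFC"     # DBCS substitution character, as quoted in the thread
SPACE = "4080"   # converted form of 0x4040, as quoted in the thread

class IllFormed(Exception):
    """Raised by the 'stop' handler when ill-formed input is found."""

def substitute(bad: bytes, offset: int) -> list:
    return [SUB]                         # emit a substitution unit

def skip(bad: bytes, offset: int) -> list:
    return []                            # emit nothing for the bad byte

def stop(bad: bytes, offset: int) -> list:
    raise IllFormed(f"ill-formed byte {bad.hex()} at offset {offset}")

def well_formed(lead: int, trail: int) -> bool:
    # Assumed rule: 0x4040 is the DBCS space; otherwise both bytes
    # must lie in 0x41..0xFE.
    if lead == 0x40:
        return trail == 0x40
    return 0x41 <= lead <= 0xFE and 0x41 <= trail <= 0xFE

def convert(data: bytes, on_error) -> list:
    out = []
    i = 0
    while i < len(data):
        if i + 1 < len(data) and well_formed(data[i], data[i + 1]):
            out.append(SPACE if data[i] == 0x40 else "????")  # "????" = other valid pair
            i += 2
        else:
            out.extend(on_error(data[i:i + 1], i))  # the handler decides the output
            i += 1
    return out

data = bytes.fromhex("5440" + "4040" * 7)
print(len(convert(data, substitute)))    # 9 units
print(len(convert(data, skip)))          # 7 units
```

With the `stop` handler the conversion fails on the first bad byte instead of producing output, which lets the caller reject or repair the data before it reaches a fixed-length column.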

Hope this helps, 
Steven


On Jun 24, 2026, at 7:27 p.m., Issei Ikejiri <isseyj...@gmail.com> wrote:
