CP-1357 background

Mathieu Malaterre

Oct 21, 2025, 10:39:21 AM
to DICOM Forum
Hi gurus,

I see that CP-1357 is going in the right direction in adding back some symbols that were missed. But I still fail to understand the original scope of the sentence:

[...]
but shall be drawn from the code points U+0020 through U+1FFF of [ISO/IEC 10646]
[...]

What was the point of the original CP (or Supp?)? Could someone track down the diff for me?

I see an identical problem with the '€' symbol being valid in latin9 but suddenly becoming invalid in UTF-8. I /think/ the original intent was to remove the Unicode groups: punctuation, super/subscripts, currency symbols ... since they make no sense for PN. But why restrict this to the multi-byte Unicode family (utf-8, gbk & gb18030) only?

Thanks,

David Gobbi

Oct 30, 2025, 6:10:57 PM
to DICOM Forum
Hi Mathieu,

On Tuesday, 21 October 2025 at 08:39:21 UTC-6 Mathieu Malaterre wrote:
> I see that CP-1357 is going in the right direction in adding back some symbols that were missed. But I still fail to understand the original scope of the sentence:
>
> [...]
> but shall be drawn from the code points U+0020 through U+1FFF of [ISO/IEC 10646]
> [...]
>
> What was the point of the original CP (or Supp?)? Could someone track down the diff for me?

You mean CP-964 "Correct alphabetic name encoding for Unicode"?  According to its "Rationale" section, at the time it was written the "alphabetic" part of PN was restricted to single-byte encodings.  But since conversion of any single-byte encoding other than "ISO_IR 6" to UTF-8 results in multi-byte UTF-8 chars,  at least some multi-byte UTF-8 chars had to be allowed in the "alphabetic" part, hence CP-964.
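
The byte-length effect described above can be illustrated with a minimal Python sketch. It is not taken from CP-964; it assumes that Python's `latin-1` codec is an adequate stand-in for "ISO_IR 100", and the sample name is made up.

```python
# Assumption: Python's "latin-1" codec stands in for DICOM "ISO_IR 100".
# The name is a hypothetical PN alphabetic component group.
name = "Müller^Jürgen"

single_byte = name.encode("latin-1")  # one byte per character
utf8 = name.encode("utf-8")           # non-ASCII characters grow to two bytes

print(len(name), len(single_byte), len(utf8))

# Characters that need more than one byte once converted to UTF-8
multi_byte_chars = sorted({ch for ch in name if len(ch.encode("utf-8")) > 1})
print(multi_byte_chars)
```

Only the plain ASCII characters keep a one-byte UTF-8 form, which matches the remark that "ISO_IR 6" is the one single-byte encoding that converts without producing multi-byte characters.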

> I see an identical problem with the '€' symbol being valid in latin9 but suddenly becoming invalid in UTF-8.

DICOM did not add latin9 (ISO_IR 203) until 2021, several years after CP-964, so '€' was not an issue when CP-964 was written.  I notice that '€' exists within a block of currency symbols U+20A0 through U+20BF, and the use of currency symbols in names does seem to be in vogue, so maybe they should all be allowed... (yes, that's a joke).
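
For reference, the facts about '€' used in this exchange can be checked with a short Python sketch. It assumes that Python's `iso8859_15` codec corresponds to "ISO_IR 203", and it tests only the bare range U+0020 through U+1FFF from the sentence quoted earlier in the thread, not any symbols that CP-1357 adds back.

```python
import unicodedata

euro = "€"

# Assumption: Python's "iso8859_15" codec stands in for DICOM "ISO_IR 203".
latin9_bytes = euro.encode("iso8859_15")  # single byte in Latin-9
code_point = ord(euro)                    # Unicode code point of the same character
category = unicodedata.category(euro)     # "Sc" means "Symbol, currency"

# Bare range from the quoted sentence only
in_quoted_range = 0x0020 <= code_point <= 0x1FFF

print(latin9_bytes.hex(), f"U+{code_point:04X}", category, in_quoted_range)
```

The character is a single byte in Latin-9 but maps to a code point above U+1FFF, which is the conflict being discussed.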
  
> I /think/ the original intent was to remove the Unicode groups: punctuation, super/subscripts, currency symbols ... since they make no sense for PN. But why restrict this to the multi-byte Unicode family (utf-8, gbk & gb18030) only?

Before, there had been no restriction on any single-byte character sets, so adding a restriction on them would have made many old DICOM data sets suddenly become invalid.

It looks like the current conflict is limited to one character in one encoding: '€' in "ISO_IR 203", and the number of datasets with this encoding and with this character in PN is probably tiny or even nonexistent.  I'd keep the restriction as narrow as possible: disallow '€' in "ISO_IR 203" in PN alphabetic part.

Mathieu Malaterre

Oct 31, 2025, 5:01:04 AM
to DICOM Forum
On Thursday, October 30, 2025 at 11:10:57 PM UTC+1 David Gobbi wrote:

> You mean CP-964 "Correct alphabetic name encoding for Unicode"?  According to its "Rationale" section, at the time it was written the "alphabetic" part of PN was restricted to single-byte encodings.  But since conversion of any single-byte encoding other than "ISO_IR 6" to UTF-8 results in multi-byte UTF-8 chars,  at least some multi-byte UTF-8 chars had to be allowed in the "alphabetic" part, hence CP-964.

> DICOM did not add latin9 (ISO_IR 203) until 2021, several years after CP-964, so '€' was not an issue when CP-964 was written.  I notice that '€' exists within a block of currency symbols U+20A0 through U+20BF, and the use of currency symbols in names does seem to be in vogue, so maybe they should all be allowed... (yes, that's a joke).

Now that I see CP-964 and the wording "alphabetic representation", that answers my original concern. All symbols such as $, £ or even € are disallowed in the alphabetic group. I think it is fair that the hyphen '-', period '.', space ' ' and apostrophe (’) fall under the "alphabetic representation" definition of CP-964.
  
> It looks like the current conflict is limited to one character in one encoding: '€' in "ISO_IR 203", and the number of datasets with this encoding and with this character in PN is probably tiny or even nonexistent.  I'd keep the restriction as narrow as possible: disallow '€' in "ISO_IR 203" in PN alphabetic part.

As explained above, I believe this was already the intention of CP-964. So the only missing piece in CP-1357 is now the halfwidth katakana. And if one wants to be overly pedantic, I suspect it would make sense to add the typographically correct apostrophe U+2019.
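
As a quick check of where these characters sit, the following Python sketch tests a few samples against the bare range U+0020 through U+1FFF from the sentence quoted at the start of the thread. It deliberately ignores whatever CP-1357 adds back, and the sample characters are chosen here for illustration only.

```python
import unicodedata

LOW, HIGH = 0x0020, 0x1FFF  # bare range from the quoted sentence

# Illustrative samples: one halfwidth katakana letter and two apostrophes
samples = ["\uFF94", "\u2019", "'"]

results = {}
for ch in samples:
    inside = LOW <= ord(ch) <= HIGH
    results[ch] = inside
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: inside range = {inside}")
```

Both the halfwidth katakana letter and U+2019 lie above U+1FFF, whereas the ASCII apostrophe U+0027 is inside the range.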

Will write a CP ASAP.

Thanks again for your kind help!

David Gobbi

Oct 31, 2025, 6:06:58 PM
to DICOM Forum
On Friday, 31 October 2025 at 03:01:04 UTC-6 Mathieu Malaterre wrote:

> Now that I see CP-964 and the wording "alphabetic representation", that answers my original concern. All symbols such as $, £ or even € are disallowed in the alphabetic group. I think it is fair that the hyphen '-', period '.', space ' ' and apostrophe (’) fall under the "alphabetic representation" definition of CP-964.

Even though I see "alphabetic representation" in the standard, it isn't so narrowly defined.  That is, I do not see any wording that indicates that all symbols are disallowed.  Isn't the entire Default Repertoire, including its symbols, allowed apart from '=', '^', '\', and ESC TAB CR LF FF?
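
Taking the question above at face value, that reading could be sketched in Python as follows. This is an assumption-laden illustration, not wording from the standard: the function name is hypothetical, the Default Repertoire is approximated as space plus printable ASCII, and the check is applied to text in which the excluded characters are treated purely as forbidden.

```python
# Characters named in the question: '=', '^', '\' and ESC TAB CR LF FF
FORBIDDEN = set("=^\\") | {"\x1b", "\t", "\r", "\n", "\x0c"}


def allowed_in_default_repertoire(text: str) -> bool:
    """Hypothetical check: space or printable ASCII, minus the forbidden set.

    Control characters other than those listed are also rejected here,
    simply because they fall outside 0x20..0x7E.
    """
    return all(0x20 <= ord(ch) <= 0x7E and ch not in FORBIDDEN for ch in text)


print(allowed_in_default_repertoire("O'Brien $r."))  # symbols such as '$' pass
print(allowed_in_default_repertoire("Smith=Jones"))  # '=' is rejected
```

Under this reading a symbol such as '$' is accepted, which is the point being made: the Default Repertoire by itself does not exclude symbols.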
  
> As explained above, I believe this was already the intention of CP-964. So the only missing piece in CP-1357 is now the halfwidth katakana. And if one wants to be overly pedantic, I suspect it would make sense to add the typographically correct apostrophe U+2019.

It's a fine point, but CP-1357 was very specific to UTF-8, and I can't see that it had intentions regarding other character sets.  So a dataset with SpecificCharacterSet=ISO_IR 203 can use the EURO symbol in the alphabetic representation.  I think it should be disallowed, since it impedes ISO_IR 203 to UTF-8 conversions of PN if present, but currently the standard does allow it.

David Gobbi

Oct 31, 2025, 6:12:20 PM
to DICOM Forum
I meant to say "CP-964 was very specific to UTF-8" (CP-1357 is, too, but it is CP-964 that we were discussing).

Thanks for writing the new CP, by the way.
 

Mathieu Malaterre

Nov 2, 2025, 11:12:55 AM
to DICOM Forum
The only funky example I know off the top of my head is `X Æ A-12`; it turns out the birth certificate (AFAIK) defines it as "X,AE A-XII". So I'd like to keep the non-"alphabetic representation" characters outside the scope of this CP. I believe some other IT standards use the term "UB" for Undefined Behavior. It is acceptable for an implementation to preserve '€' (or a number such as '12') through a compatible conversion, but this does not need to be required IMHO, since those should not be used in the first place for proper PN values. Thus it is also acceptable for an implementation to discard characters that are not "alphabetic representation" during charset conversion.
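
The two behaviours described above (preserve or discard) can be sketched in Python as follows. Everything here is hypothetical: the function names, the `policy` parameter, and in particular the `looks_alphabetic` predicate, which is only one possible reading of "alphabetic representation" (letters plus the hyphen, period, space and apostrophe mentioned earlier in the thread). Python's `iso8859_15` codec is assumed to stand in for "ISO_IR 203".

```python
ALLOWED_SYMBOLS = set(" -.'\u2019")  # assumed symbol set, see lead-in


def looks_alphabetic(ch: str) -> bool:
    """Hypothetical predicate: a letter or one of the assumed symbols."""
    return ch.isalpha() or ch in ALLOWED_SYMBOLS


def convert_component(raw: bytes, source_encoding: str, policy: str = "preserve") -> bytes:
    """Convert one PN component to UTF-8, keeping or dropping other characters."""
    text = raw.decode(source_encoding)
    if policy == "discard":
        text = "".join(ch for ch in text if looks_alphabetic(ch))
    elif policy != "preserve":
        raise ValueError(f"unknown policy: {policy!r}")
    return text.encode("utf-8")


raw = "X Æ A-12".encode("iso8859_15")
print(convert_component(raw, "iso8859_15", "preserve").decode("utf-8"))
print(convert_component(raw, "iso8859_15", "discard").decode("utf-8"))
```

With the "discard" policy the digits are dropped and the letters, spaces and hyphen survive; with "preserve" the value round-trips unchanged. Either outcome would be acceptable under the position stated above.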

I have not checked the wording in IHE or other standards, but "alphabetic representation" (with a limited set of symbols) is clear to me.
 

> Thanks for writing the new CP, by the way.

Sure, thanks for catching the halfwidth issue in the first place.
 