I see that CP-1357 is going in the right direction by adding back some symbols that were missed. But I still fail to understand the original scope of the sentence:
[...] but shall be drawn from the code points U+0020 through U+1FFF of [ISO/IEC 10646] [...]
What was the point of the original CP (Supp?)? Could someone track down the diff for me?
I see an identical problem with the '€' symbol being valid in latin9 but suddenly becoming invalid in UTF-8.
I /think/ the original intent was to remove the Unicode groups: punctuation, super/subscripts, currency symbols ... since they make no sense for PN. But why restrict this to the multi-byte Unicode family (utf-8, gbk & gb18030) only?
Hi Mathieu,

On Tuesday, 21 October 2025 at 08:39:21 UTC-6 Mathieu Malaterre wrote:
> I see that CP-1357 is going in the right direction in adding back some symbols that were missed. But still I fail to understand the original scope of the sentence:
> [...] but shall be drawn from the code points U+0020 through U+1FFF of [ISO/IEC 10646] [...]
> What was the point of the original CP (Supp?). -could someone track the diff for me-

You mean CP-964 "Correct alphabetic name encoding for Unicode"? According to its "Rationale" section, at the time it was written the "alphabetic" part of PN was restricted to single-byte encodings. But since conversion of any single-byte encoding other than "ISO_IR 6" to UTF-8 results in multi-byte UTF-8 chars, at least some multi-byte UTF-8 chars had to be allowed in the "alphabetic" part, hence CP-964.

> I see identical problem with '€' symbol being valid in latin9 but suddenly becoming invalid in UTF-8.

DICOM did not add latin9 (ISO_IR 203) until 2021, several years after CP-964, so '€' was not an issue when CP-964 was written. I notice that '€' exists within a block of currency symbols U+20A0 through U+20BF, and the use of currency symbols in names does seem to be in vogue, so maybe they should all be allowed... (yes, that's a joke).
> I /think/ the original intent was to remove the Unicode groups: punctuation, super/subscripts, currency symbols ... since they make no sense for PN. But why restrict this to the multi-byte Unicode family (utf-8, gbk & gb18030) only?

Before, there had been no restriction on any single-byte character sets, so adding a restriction on them would have made many old DICOM data sets suddenly become invalid.

It looks like the current conflict is limited to one character in one encoding: '€' in "ISO_IR 203", and the number of datasets with this encoding and with this character in PN is probably tiny or even nonexistent. I'd keep the restriction as narrow as possible: disallow '€' in "ISO_IR 203" in PN alphabetic part.
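To make the two points above concrete, here is a minimal Python sketch. It assumes only the inclusive range quoted earlier in the thread (U+0020 through U+1FFF) and does not model any other rule in the standard; `in_quoted_range` is a hypothetical helper name, and `latin-1` and `iso8859_15` are Python's codec names standing in for Latin-1 and Latin-9.

```python
# Inclusive range quoted in the thread: U+0020 through U+1FFF.
RANGE_LO, RANGE_HI = 0x0020, 0x1FFF


def in_quoted_range(ch: str) -> bool:
    """Hypothetical helper: True if ch lies in U+0020..U+1FFF."""
    return RANGE_LO <= ord(ch) <= RANGE_HI


# 'é' is a single byte (0xE9) in Latin-1 but two bytes in UTF-8.
e_acute = b"\xe9".decode("latin-1")
# '€' is a single byte (0xA4) in Latin-9 but three bytes in UTF-8.
euro = b"\xa4".decode("iso8859_15")

for ch in (e_acute, euro):
    utf8_len = len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} utf8_bytes={utf8_len} in_range={in_quoted_range(ch)}")
# prints:
# U+00E9 utf8_bytes=2 in_range=True
# U+20AC utf8_bytes=3 in_range=False
```

So a character that was a plain single byte before conversion becomes multi-byte in UTF-8 either way, but only '€' ends up outside the quoted range.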
Now that I see CP-964 and its wording of "alphabetic representation", that answers my original concern. All symbols such as $, £ or even € are disallowed in the alphabetic group. I think it is fair that the hyphen '-', period '.', space ' ' and apostrophe (’) fall under the "alphabetic representation" definition of CP-964.
>> I /think/ the original intent was to remove the Unicode groups: punctuation, super/subscripts, currency symbols ... since they make no sense for PN. But why restrict this to the multi-byte Unicode family (utf-8, gbk & gb18030) only?
> Before, there had been no restriction on any single-byte character sets, so adding a restriction on them would have made many old DICOM data sets suddenly become invalid.
> It looks like the current conflict is limited to one character in one encoding: '€' in "ISO_IR 203", and the number of datasets with this encoding and with this character in PN is probably tiny or even nonexistent. I'd keep the restriction as narrow as possible: disallow '€' in "ISO_IR 203" in PN alphabetic part.

As explained above, I believe this was already the intention of CP-964. So the only missing bit in CP-1357 is now the halfwidth katakana. And if one wants to be overly pedantic, I suspect it would make sense to add the typographically correct apostrophe U+2019.
Thanks for writing the new CP, by the way.
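For reference, a small sketch of where the characters discussed above sit relative to the quoted range. It assumes only the inclusive bounds U+0020 through U+1FFF taken from the sentence under discussion; `in_quoted_range` is a hypothetical helper, and U+FF71 is just one sample halfwidth katakana letter picked for illustration.

```python
import unicodedata

# Inclusive range quoted in the thread: U+0020 through U+1FFF.
RANGE_LO, RANGE_HI = 0x0020, 0x1FFF


def in_quoted_range(ch: str) -> bool:
    """Hypothetical helper: True if ch lies in U+0020..U+1FFF."""
    return RANGE_LO <= ord(ch) <= RANGE_HI


# ASCII apostrophe, typographic apostrophe, one halfwidth katakana letter,
# and two currency symbols mentioned in the thread.
samples = ["'", "\u2019", "\uff71", "$", "\u00a3"]

report = {ch: in_quoted_range(ch) for ch in samples}
for ch, ok in report.items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: in_range={ok}")
```

Under that assumption, U+2019 and the halfwidth katakana fall outside the range and would need to be listed explicitly, while '$' and '£' fall inside it, so their exclusion has to come from the "alphabetic representation" wording rather than from the range itself.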