UnicodeString and equivalent to iconv's TRANSLIT / IGNORE options?

Joachim Tuchel

May 6, 2026, 3:40:47 AM
to VAST Community Forum
Hi,

I am in the early experiments of converting an existing Seaside application to full UTF-8 support. It feels a bit scary, as there are so many layers to change, test and check, from Seaside to file access down to the database...

I just encountered something that surprised me, and I am almost sure I am missing something.

Here is the situation:
My Application currently runs in the local codepage ISO8859-15 and people can upload files via their web browser. There is zero reliable indication as to what encoding the file is in, but it is relatively probable that it is utf-8 or some western european code page like Windows-1252.

Until one isn't. 

In the concrete case, the file has no BOM and contains German umlauts as well as a Czech character. 

One of the ideas for our transition to pure Unicode is to work with UnicodeString for everything from/to Seaside and files as well as the database, and not touch too much in the "middle" for a start. (There is so much WriteStream on: String new and such that it feels scary to touch everything in one step.) But I digress...

So here is what I tried: 

"Create a unicode String with only German Umlauts"
(UnicodeString escaped: '\u{FC}\u{E4}\u{F6}') asSBString --> 'üäö'
"Now add a character not present in the local codepage"
(UnicodeString escaped: '\u{FC}\u{E4}\u{F6}\u{10D}') asSBString-->  nil
(UnicodeString escaped: '\u{FC}\u{E4}\u{F6}\u{10D}') isAscii --> false
(Note: \u{10D} is a Czech character, č, which is not in ISO-8859-15.)

The idea being that there must be a way like in the old convertFromCodePage:... methods to keep as much as possible from the original and reduce the amount of "broken" characters in the result.

But: if my local codepage doesn't support a character, UnicodeString simply gives up, returning nil instead of a String resembling a close approximation of the original.
 
I couldn't find a way to reproduce what the iconv library provides with options like this:

aString convertFromCodePage: 'UTF-8' toCodePage: 'ISO-8859-15//TRANSLIT'

What am I missing? Is there really no way of converting a UnicodeString to a "local" String with TRANSLIT or IGNORE?

Joachim

Seth Berman

May 6, 2026, 8:16:16 AM
to VAST Community Forum
Hi Joachim,

You're not missing anything about #asSBString. That method is intentionally strict and answers a single-byte <String> only when the whole <UnicodeString> can be represented in the current code page. If even one character cannot be represented, #asSBString answers nil.

For the lossy/transliterating case, you can use the following code page conversion method:

-------
| u |

u := UnicodeString escaped: '\u{FC}\u{E4}\u{F6}\u{10D}'.

u
    convertToSingleByteCodePage: EsAbstractCodePageConverter currentCodePage
    strict: false
-------

The 'strict:' argument is the key bit here. 'strict: true' means "convert only if no transliteration is needed." 'strict: false' uses the non-strict policy, which defaults to transliteration.
On Linux this maps to iconv with '//TRANSLIT'. 
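
A minimal sketch of how the strict and non-strict conversions could be combined, using only the selectors mentioned in this thread. The variable names are illustrative, and the fallback policy (strict first, transliteration second) is an assumption about what an application might want, not something the base system prescribes.

```smalltalk
| u local |

u := UnicodeString escaped: '\u{FC}\u{E4}\u{F6}\u{10D}'.

"Strict attempt first: answers nil if any character is not representable."
local := u asSBString.

"Fall back to the lossy, transliterating conversion only when needed."
local isNil ifTrue: [
    local := u
        convertToSingleByteCodePage: EsAbstractCodePageConverter currentCodePage
        strict: false].

local
```

Written this way, the lossless result is kept whenever one exists, and the lossy path stays an explicit decision in the calling code.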

- Seth

Joachim Tuchel

May 7, 2026, 3:49:43 AM
to VAST Community Forum
Hi Seth,

thanks, that's exactly what I am looking for:


(UnicodeString escaped: '\u{FC}\u{E4}\u{F6}\u{10D}') convertToSingleByteCodePage: EsAbstractCodePageConverter currentCodePage strict: false "--> 'üäöc'"
(UnicodeString escaped: '\u{FC}\u{E4}\u{F6}\u{10D}') convertToSingleByteCodePage: EsAbstractCodePageConverter currentCodePage strict: true "--> nil"

It is a private method, however. How safe is it to use it? Or the other way round: would it make sense to either make it public or add a public method that delegates to it (something like #convertedToLocalCodePage)?

Is there an equivalent for //IGNORE as well that would simply return a String with only the Characters that can be converted? 

My idea behind the question: I think it might be helpful to use the semantics of //IGNORE for testing whether a UnicodeString can be converted to the local codepage without loss.
The test #isAscii is obviously not enough, because my local codepage does support German umlauts and '\u{FC}\u{E4}\u{F6}' can be translated to my local codepage without problems, but #isAscii is false nevertheless (which is correct).
So a method like #isFullyTranslatableToCodePage: or #canLosslesslyBeConvertedToCodePage: would probably make a lot of sense, at least for as long as we are not all running on Unicode natively.

Note: I am not eager to adopt the iconv concept of returning a converted String and throwing an error at the same time when using //IGNORE. A call either works or it doesn't, but not both at the same time...

Joachim

Henry Johansen

May 19, 2026, 12:44:02 PM
to va-sma...@googlegroups.com
Converting to the local codepage *without loss* is what you get using #asSBString; //IGNORE without an error thrown would silently drop characters :)

I'm not sure an additional option in the base system would be valuable.
Either you set the conversion to not be strict and accept that some characters will be transliterated,
or, you must handle the case where there are unrepresentable characters (asSBString returns nil) in whatever custom manner you see fit (warn users, drop characters calling iconv directly with //IGNORE, etc).
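
For illustration, the "can this be converted without loss" test discussed in this thread could be sketched as a nil check on #asSBString. The block name canConvertLosslessly is hypothetical and not an existing VAST method; the results in the comments are the ones implied by the #asSBString results reported earlier in this thread, with ISO8859-15 as the current code page.

```smalltalk
| canConvertLosslessly |

"Hypothetical helper block, not part of VAST:
 true when the strict conversion succeeds, false when it answers nil."
canConvertLosslessly := [:aUnicodeString |
    aUnicodeString asSBString notNil].

"Umlauts only: representable, so the strict conversion succeeds."
canConvertLosslessly value: (UnicodeString escaped: '\u{FC}\u{E4}\u{F6}').
    "--> true"

"Adding \u{10D}: #asSBString answers nil, so the check fails."
canConvertLosslessly value: (UnicodeString escaped: '\u{FC}\u{E4}\u{F6}\u{10D}').
    "--> false"
```

What to do when the check fails (warn the user, transliterate with strict: false, or drop characters) is left to the caller.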

Cheers,
Henry