UnicodeString and equivalent to iconv's TRANSLIT / IGNORE options?


Joachim Tuchel

May 6, 2026, 3:40:47 AM
to VAST Community Forum
Hi,

I am in the early experiments of converting an existing Seaside application to full UTF-8 support. It feels a bit scary, as there are so many layers to change, test and check, from Seaside to file access down to the database...

I just encountered something that surprised me, and I am almost sure I am missing something.

Here is the situation:
My application currently runs in the local code page ISO8859-15, and people can upload files via their web browser. There is no reliable indication of what encoding a file is in, but it is fairly likely to be UTF-8 or some Western European code page such as Windows-1252.

Until one isn't. 

In the concrete case, the file has no BOM and contains German umlauts as well as a Czech character. 

One of the ideas for our transition to pure Unicode is to work with UnicodeString for everything from/to Seaside and files as well as the database, and not touch too much in the "middle" for a start. (There is so much WriteStream on: String new and such that it feels scary to touch everything in one step.) But I am digressing...

So here is what I tried: 

"Create a unicode String with only German Umlauts"
(UnicodeString escaped: '\u{FC}\u{E4}\u{F6}') asSBString --> 'üäö'
"Now add a character not present in the local codepage"
(UnicodeString escaped: '\u{FC}\u{E4}\u{F6}\u{10D}') asSBString-->  nil
(UnicodeString escaped: '\u{FC}\u{E4}\u{F6}\u{10D}') isAscii --> false
(note: \u{10D} is not in ISO-8859-15; it is the Czech character č)

The idea is that there must be a way, as with the old convertFromCodePage:... methods, to keep as much of the original as possible and reduce the number of "broken" characters in the result.

But: if my local codepage doesn't support a character, UnicodeString simply gives up, returning nil instead of a String resembling a close approximation of the original.
 
I couldn't find a way to reproduce what the iconv library provides with options like this:

aString convertFromCodePage: 'UTF-8' toCodePage: 'ISO-8859-15//TRANSLIT'

What am I missing? Is there really no way of converting a UnicodeString to a "local" String with TRANSLIT or IGNORE?

Joachim

Seth Berman

May 6, 2026, 8:16:16 AM
to VAST Community Forum
Hi Joachim,

You're not missing anything about #asSBString. That method is intentionally strict and answers a single-byte <String> only when the whole <UnicodeString> can be represented in the current code page. If even one character cannot be represented, #asSBString answers nil.

For the lossy/transliterating case, you can use the following code page conversion method:

-------
| u |

u := UnicodeString escaped: '\u{FC}\u{E4}\u{F6}\u{10D}'.

u
    convertToSingleByteCodePage: EsAbstractCodePageConverter currentCodePage
    strict: false
-------

The 'strict:' argument is the key bit here. 'strict: true' means "convert only if no transliteration is needed." 'strict: false' uses the non-strict policy, which defaults to transliteration.
On Linux this maps to iconv with '//TRANSLIT'. 
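
For uploaded files, the two policies can be combined: try the strict conversion first and fall back to the lossy one only when it fails, so the caller knows whether anything was approximated. This is only a sketch built from the calls shown above; the variable names are illustrative, and it assumes that 'strict: true' answers nil whenever transliteration would be needed.

```smalltalk
| u codePage converted wasLossy |

u := UnicodeString escaped: '\u{FC}\u{E4}\u{F6}\u{10D}'.
codePage := EsAbstractCodePageConverter currentCodePage.

"Strict pass: assumed to answer nil as soon as one character needs transliteration."
converted := u convertToSingleByteCodePage: codePage strict: true.
wasLossy := converted isNil.

"Lossy fallback: only used when the strict pass failed."
wasLossy ifTrue: [
    converted := u convertToSingleByteCodePage: codePage strict: false].
```

After this runs, wasLossy tells you whether the result in converted is an exact representation or an approximation.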

- Seth

Joachim Tuchel

May 7, 2026, 3:49:43 AM
to VAST Community Forum
Hi Seth,

thanks, that's exactly what I am looking for:


(UnicodeString escaped: '\u{FC}\u{E4}\u{F6}\u{10D}') convertToSingleByteCodePage: EsAbstractCodePageConverter currentCodePage strict: false "--> 'üäöc'"
(UnicodeString escaped: '\u{FC}\u{E4}\u{F6}\u{10D}') convertToSingleByteCodePage: EsAbstractCodePageConverter currentCodePage strict: true "--> nil"

It is a private method, however. How safe is it to use it? Or the other way round: would it make sense to either make it public or add a public method that delegates to it (something like #convertedToLocalCodePage)?

Is there an equivalent for //IGNORE as well that would simply return a String with only the Characters that can be converted? 

My idea behind the question: I think it might be helpful to use the semantics of //IGNORE for testing whether a UnicodeString can be converted to the local codepage without loss.
The test #isAscii is obviously not enough, because my local codepage does support German umlauts and '\u{FC}\u{E4}\u{F6}' can be translated to my local codepage without problems, but #isAscii is false nevertheless (which is correct).
So a method like #isFullyTranslatableToCodePage: or #canLosslesslyBeConvertedToCodePage: would probably make a lot of sense, at least for as long as we are not all running on Unicode natively.
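
A minimal sketch of such a test, built only on the strict conversion shown above. The selector is the hypothetical one proposed here, not an existing VAST method, and the sketch assumes that #convertToSingleByteCodePage:strict: keeps answering nil under strict: true. Because it delegates to a private method, it could break in a future release.

```smalltalk
canLosslesslyBeConvertedToCodePage: aCodePage
    "Hypothetical extension on UnicodeString; not part of VAST.
     Answer true if the receiver converts to aCodePage without transliteration.
     Assumes the strict conversion answers nil when any character is unrepresentable."

    ^(self convertToSingleByteCodePage: aCodePage strict: true) notNil
```

It would be sent with EsAbstractCodePageConverter currentCodePage as the argument to test against the local codepage.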

Note: I am not eager to adopt the iconv concept of returning a converted String and throwing an error at the same time when using //IGNORE. A call either works or it doesn't, but not both at the same time...

Joachim