Hi Xbliters,
Since it is by tradition a time for resolutions, one of mine is to have XBLite support UTF-16 LE code generation by 2015's year-end.
Since 2008, I have been working on understanding how Windows handles Unicode. I believe that, at this point, I know enough to handle the following areas:
- Wide-character WinAPI
- File formats UTF-8, UTF-16 LE (Windows' native Unicode)
- Conversions between ASCII and Unicode
As a proof of concept, a successful port of Xsed to wide-character support would be the visible tip of the iceberg, except that we would have to replace our good old ASCII Scintilla custom control with a Unicode-friendly big brother, if one is available for download.
Here are some of the design decisions I made:
1. To ensure backwards compatibility, the $$STRING type (value of 19) will represent the legacy ASCII type (character size = 1 byte).
2. I added 2 new string types: $$ASCII (= 20) and $$UTF16_LE (= 21).
- $$ASCII is a misnomer, as $$UTF8 is more appropriate in the eyes of purists, though not in mine; however, should the majority of Xbliters settle for $$UTF8, I'll be glad to oblige and can can $$ASCII (“Oh yes, I can can!”).
- $$UTF16_LE, rather than $$UNICODE, is my preferred choice, as Windows' native Unicode is fully defined as UTF-16 LE (Little Endian).
3. The character size of an ASCII string is 1 byte (8 bits), whereas the character size of a UTF-16 LE string is 2 bytes (16 bits), with the following consequence: the ASCII null terminator is '\0', while the UTF-16 LE null terminator is “\0\0”.
Consequences:
- allocated UTF-16 LE strings must have a “\0\0” terminator
- SIZE(strW$) = 2 * LEN(strW$), because SIZE(strW$) always returns a number of bytes
- headW$ + tailW$ is done the ASCII way, and so does not strip the first zero byte of the “\0\0” terminator
These are the 3 main issues that must be carefully addressed.
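To make these three points concrete, here is a small sketch in Python (not XBLite) that models strings as raw byte buffers. The helper names (to_utf16le_z, char_len, naive_concat, wide_concat) are made up for illustration, and the assumption that an ASCII-style concatenation drops exactly one terminator byte is only my reading of the issue, not a description of the actual XBLite runtime.

```python
# Sketch only: models UTF-16 LE strings as Python bytes objects.
# All helper names are hypothetical; none of them exist in XBLite.

def to_utf16le_z(text):
    """Encode text as UTF-16 LE and append the two-byte null terminator."""
    return text.encode("utf-16-le") + b"\x00\x00"

def char_len(buf):
    """Number of 16-bit units before the terminator (the LEN equivalent)."""
    return (len(buf) - 2) // 2

def naive_concat(head, tail):
    """ASSUMED ASCII-style concatenation: drops only ONE terminator byte."""
    return head[:-1] + tail

def wide_concat(head, tail):
    """Wide-aware concatenation: drops head's full two-byte terminator."""
    return head[:-2] + tail

head = to_utf16le_z("AB")
tail = to_utf16le_z("CD")

# Issue 1: the buffer ends with two zero bytes, not one.
print(head)

# Issue 2: the byte size (terminator excluded) is twice the character count.
print(len(head) - 2, char_len(head))

# Issue 3: the naive result has an odd byte count, so every 16-bit unit
# after the join is misaligned; the wide-aware result decodes cleanly.
print(len(naive_concat(head, tail)))
print(wide_concat(head, tail)[:-2].decode("utf-16-le"))
```

The odd length of the naive result is the symptom to watch for: a stray zero byte sits between the two halves, so everything after it is shifted by one byte.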
Bye! Guy