UNICODE in HVM


Przemysław Czerpak

Nov 10, 2011, 5:16:43 AM
to Harbour developers
Hi all,

I plan to begin some modifications to add basic support for UNICODE to HVM.
I would like to agree on some important things now.
Let's imagine that we use UTF8 as the internal unistring representation.
Should we change functions like LEN(), SUBSTR(), LEFT(), RIGHT(), PADR(),
PADC(), ... to operate on character indexes or leave them as byte ones?
What should we do with the CHR() and ASC() functions? Keep them operating
on ASCII values or switch to UNICODE?
What is your preferred behavior for INKEY() and unicode values?
If we want to keep compatibility then we need to introduce a new
inkey flag to retrieve UNICODE values. We can also define one
inkey value, K_UNICODE, to indicate that there is a unicode value
which can be retrieved by the HB_UNIKEY() function.
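
As an illustration (a minimal C sketch, not existing Harbour code), byte
length and character count differ for a UTF-8 string:

#include <stdio.h>
#include <string.h>

/* count UTF-8 characters by skipping continuation bytes (10xxxxxx) */
static size_t utf8_char_count( const char * s )
{
   size_t n = 0;
   for( ; * s; ++s )
      if( ( ( unsigned char ) * s & 0xC0 ) != 0x80 )
         ++n;
   return n;
}

int main( void )
{
   const char * s = "za\xC5\xBC\xC3\xB3\xC5\x82\xC4\x87"; /* "zażółć" in UTF-8 */

   printf( "bytes: %u, characters: %u\n",
           ( unsigned ) strlen( s ), ( unsigned ) utf8_char_count( s ) );
   return 0; /* prints: bytes: 10, characters: 6 */
}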

Please also think about updating upper-level core code like the GET
system to work with UNICODE values. The new PRG API should make
such an update easy.

best regards,
Przemek

Viktor Szakáts

Nov 10, 2011, 5:31:14 AM
to harbou...@googlegroups.com
Hi Przemek,


> I would like to agree some important things now.
> Let's imagine that we use UTF8 as internal unistring representation.
> Should we change functions like LEN(), SUBSTR(), LEFT(), RIGHT(), PADR(),
> PADC(), ... to operate on character indexes or leave them as byte ones?
> What should we make with CHR() and ASC() functions? Keep then operating
> on ASCII values or switch to UNICODE?

Ideally I think these should check if they are working on a 
raw byte vector or UNICODE string and behave accordingly.

If such a distinction is not made internally, it will be very hard 
to make a single judgment here. Probably character indexes / UNICODE, 
to allow for a sleek .prg level interface for the future, but in 
such a case code compatibility will be a concern and we'll need 
equivalent functions to work on raw strings.

> What is your preferred behavior for INKEY() and unicode values?
> If we want to keep compatibility then we need to introduce new
> inkey flag to retrieve UNICODE values. We can also define one
> inkey value K_UNICODE to indicate that there is unicode value
> which can be retrieved by HB_UNIKEY() function.

Probably the former would allow for slicker code on the user's 
side, so I'd prefer a new inkey flag.

Viktor

Massimo Belgrano

Nov 10, 2011, 5:53:20 AM
to harbou...@googlegroups.com
In ADS this is managed at the field level.
So will Harbour create a new type, nstring?
Or is that necessary only for UTF-16 encoding, because it doubles the space occupation?

+1 for inkey flag to retrieve UNICODE values


2011/11/10 Viktor Szakáts <harbo...@syenar.hu>



--
Massimo Belgrano


Massimo Belgrano

Nov 10, 2011, 6:04:11 AM
to harbou...@googlegroups.com
Regarding the GET system, is an evolution towards something more modern possible?
Currently in a GT program it is not possible to select one or more characters from a GET and copy them.
Try hbmk2 testget in tests and select "EL" from "HELLO" with Shift+Right, Right.
2011/11/10 Massimo Belgrano <mbel...@deltain.it>


--
Massimo Belgrano


Przemysław Czerpak

Nov 10, 2011, 6:40:51 AM
to harbou...@googlegroups.com
On Thu, 10 Nov 2011, Massimo Belgrano wrote:

Hi,

I added support for these fields in Harbour's RDD ADS over a year ago:
2010-10-09 19:07 UTC+0200 Przemyslaw Czerpak (druzus/at/priv.onet.pl)
* harbour/contrib/rddads/ads1.c
+ added support for new ADS 10.0 UNICODE fields: NChar, NVarChar, NMemo
They are supported in all ADS* RDDs.

and also in core DBF* RDDs:
2010-10-13 13:21 UTC+0200 Przemyslaw Czerpak (druzus/at/priv.onet.pl)
* harbour/src/rdd/dbf1.c
* harbour/src/rdd/dbffpt/dbffpt1.c
+ added support for UNICODE fields compatible with the ones used
by ADS

> so will in harbour create a new type nstring
> or is neccessary only for UTF-16 encoding because double the space
> occupation

It's completely independent. The internal representation is invisible
to applications using the Harbour STR API, so code which uses this API can
work with any HVM encoding.

> > > What is your preferred behavior for INKEY() and unicode values?
> > > If we want to keep compatibility then we need to introduce new
> > > inkey flag to retrieve UNICODE values. We can also define one
> > > inkey value K_UNICODE to indicate that there is unicode value
> > > which can be retrieved by HB_UNIKEY() function.
> > Probably the former would allow for slicker code on the user's
> > side, so I'd prefer a new inkey flag.

> +1 for inkey flag to retrieve UNICODE values

OK, but please remember that this means we have to introduce a
completely new set of K_* macros, because the current ones create conflicts
with UNICODE values. In some applications the required modifications
will be very deep.

Now HB_INKEY_EXTENDED is completely unused.
HB_INKEY_RAW is partially used in GTDOS, GTOS2 and GTSLN, but it's old
dummy code which works differently in each of these GTs and is not compatible
with the upper-level GT code, so we can safely remove it and introduce a new
flag, e.g. HB_INKEY_EXT. When it's used, completely new keycode
values are returned. These new keycode values will be used internally
by all low-level GTs and the core GT code. If INKEY() is called without the
HB_INKEY_EXT flag, then they are translated to the old Clipper INKEY values,
and UNICODE values which do not have a corresponding character in the active
CP are converted to K_UNICODE.
Core PRG code should be updated to work correctly with any _SET_EVENTMASK setting.
Anyhow, such a unicode value has to be converted to a string, so this question:


> > > What should we make with CHR() and ASC() functions? Keep then operating
> > > on ASCII values or switch to UNICODE?

is very important.
If we leave CHR() as is, then we need to introduce new functions, e.g.:
HB_UNICHR()
HB_UNICODE()
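
For illustration only, a rough C-level sketch of what such an HB_UNICHR()
might do, assuming the usual extend API (HB_FUNC, hb_parni(), hb_retclen());
the function name and the missing argument checking are mine:

#include "hbapi.h"

HB_FUNC( HB_UNICHR_SKETCH )   /* hypothetical name */
{
   unsigned int u = ( unsigned int ) hb_parni( 1 );
   char buf[ 4 ];
   int len;

   if( u < 0x80 )                /* 1-byte sequence (ASCII) */
   {
      buf[ 0 ] = ( char ) u;
      len = 1;
   }
   else if( u < 0x800 )          /* 2-byte sequence */
   {
      buf[ 0 ] = ( char ) ( 0xC0 | ( u >> 6 ) );
      buf[ 1 ] = ( char ) ( 0x80 | ( u & 0x3F ) );
      len = 2;
   }
   else if( u < 0x10000 )        /* 3-byte sequence */
   {
      buf[ 0 ] = ( char ) ( 0xE0 | ( u >> 12 ) );
      buf[ 1 ] = ( char ) ( 0x80 | ( ( u >> 6 ) & 0x3F ) );
      buf[ 2 ] = ( char ) ( 0x80 | ( u & 0x3F ) );
      len = 3;
   }
   else                          /* 4-byte sequence */
   {
      buf[ 0 ] = ( char ) ( 0xF0 | ( u >> 18 ) );
      buf[ 1 ] = ( char ) ( 0x80 | ( ( u >> 12 ) & 0x3F ) );
      buf[ 2 ] = ( char ) ( 0x80 | ( ( u >> 6 ) & 0x3F ) );
      buf[ 3 ] = ( char ) ( 0x80 | ( u & 0x3F ) );
      len = 4;
   }
   hb_retclen( buf, len );       /* return the UTF-8 bytes as a string */
}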

best regards,
Przemek

Maurilio Longo

Nov 10, 2011, 7:38:29 AM
to harbou...@googlegroups.com
Przemyslaw,

I think Asc(), given its name, should not handle unicode chars.

My 2c.

Maurilio.

--
__________
| | | |__| Maurilio Longo
|_|_|_|____| farmaconsult s.r.l.

homar

Nov 10, 2011, 8:57:40 AM
to harbou...@googlegroups.com
Hi all,
In general, I feel no need for unicode in HVM (I don't know it, and don't know what it would give me). When I prepare a website I use UTF-8 (is this the same as UNICODE?):

fwrite( www_file, '
<!doctype html>
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
    <title>Tytuł strony</title>
</head>
<body>
    <h1>Jeden REGON - r&#243;&#380;ne ID_PODMIOTu</h1>
.........
National characters in database fields I translate like this:

...
aadd( translator, { 'Â', '&#262;'})  // c        'ć'
aadd( translator, { '­', '&#280;'})  // E        'Ę'
aadd( translator, { 'ě', '&#281;'})  // e        'ę'
aadd( translator, { '═', '&#321;'})  // L        'Ł'
....
aeval( translator, {| x| _f1 := StrTran( _f1, x[ 1], x[ 2])})

........

What else could UNICODE be useful for?

Regards,
Marek Horodyski

Massimo Belgrano

Nov 10, 2011, 9:40:12 AM
to harbou...@googlegroups.com
Multilingual and global applications.
Interoperability with other unicode systems without conversion.

Globalization Step-by-Step


Unicode Enabled

Overview and Description

The complex programming methods required for working with mixed-byte encodings, the involved process of creating new code pages every time another language requires computer support, and the importance of mixing and sharing information in a variety of languages across different systems were some of the factors motivating the creators of the Unicode encoding standard. Unicode originated through collaboration between Xerox and Apple. An ad hoc committee of several companies then formed, and others, including IBM and Microsoft, rapidly joined. In 1991, this group founded the Unicode Consortium whose membership now includes several leading Information Technology (IT) companies. (For more information on Unicode, visit the Unicode Consortium's site at http://www.Unicode.org.)

Unicode is an especially good fit for the age of the Internet, since the worldwide nature of the Internet demands solutions that work in any language. The World Wide Web Consortium (W3C) has recognized this fact and now expects all new RFCs to use Unicode for text. Many other products and standards now require or allow use of Unicode; for example, XML, HTML, Microsoft JScript, Java, Perl, Microsoft C#, and Microsoft Visual Basic 7 (VB.NET). Today, Unicode is the de facto character encoding standard accepted by all major computer companies, while ISO 10646 is the corresponding worldwide de jure standard approved by all ISO member countries. The two standards include identical character repertoires and binary representations.

Unicode encompasses virtually all characters used widely in computers today. It is capable of addressing more than 1.1 million code points. The standard has provisions for 8-bit, 16-bit and 32-bit encoding forms. The 16-bit encoding is used as its default encoding and allows for its million plus code points to be distributed across 17 "planes" with each plane addressing over 65,000 characters each. The characters in Plane 0-or as it is commonly called the "Basic Multilingual Plane" (BMP)-are used to represent most of the world's written scripts, characters used in publishing, mathematical and technical symbols, geometric shapes, basic dingbats (including all level-100 Zapf Dingbats), and punctuation marks. But in addition to the support for characters in modern languages and for the symbols and shapes just mentioned, Unicode also offers coverage for other characters, such as less commonly used Chinese, Japanese, and Korean (CJK) ideographs, Arabic presentation forms, and musical symbols. Many of these additional characters are mapped beyond the original plane using an extension mechanism called "surrogate pairs." With Unicode 3.2, over 95,000 code points have already been assigned characters; the rest have been set aside for future use. Unicode also provides Private Use Areas of over 131,000 locations available to applications for user-defined characters, which typically are rare ideographs representing names of people or places.

The figure below shows the Unicode encoding layout for the BMP (Plane 0) in abstract form.

Figure 1: Unicode encoding layout for the BMP (Plane 0)


Unicode rules, however, are strict about code-point assignment-each code point has a distinct representation. There are also many cases in which Unicode deliberately does not provide code points. Variants of existing characters are not given separate code points, because to do so would represent duplicate encoding of what is underlying the same character. Examples are font variants (such as bold and italic) and glyph variants, which basically are different ways of representing the same characters.

For the most part, Unicode defines characters uniquely, but some characters can be combined to form others, such as accented characters. The most common accented characters, which are used in French, German, and many other European languages, exist in their precomposed forms and are assigned code points. These same characters can be expressed by combining a base character with one or more nonspacing diacritic marks. For example, "a" followed by a nonspacing accent mark is displayed as "à." Nonspacing accent marks make it possible to have a large set of accented characters without assigning them all distinct code points. This is useful for representing accented characters in written languages that are less widely used, such as some African languages. It's also useful for creating a variety of mathematical symbols. The precomposed characters are encoded in the Unicode Standard primarily for compatibility with other encodings. The Unicode Standard contains strict rules for determining the equivalence of precomposed characters to combining character sequences. The Win32 API function FoldStringW maps multiple combining characters into precomposed forms. Also, MultiByteToWideChar can be used with either the MB_PRECOMPOSED or the MB_COMPOSITE flags for mapping characters to their precomposed or composite forms.
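
A short illustrative example (mine, not from the quoted article) of folding
the combining sequence "a" + U+0300 into the precomposed "à" with FoldStringW:

#include <windows.h>

void fold_example( void )
{
    const wchar_t composite[] = { L'a', 0x0300, 0 }; /* "a" + combining grave accent */
    wchar_t precomposed[ 8 ];

    /* map the combining sequence to its precomposed form (U+00E0, "à") */
    FoldStringW( MAP_PRECOMPOSED, composite, -1, precomposed, 8 );
}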

For all their advantages, Unicode Standards are far from a panacea for internationalization. The code-point positions of Unicode elements do not imply a sort order, and Unicode does not encode font information. It is the operating system that defines these rules, as in the case of Win32-based applications, which need to obtain sorting and font information from the operating system.

In addition, basing your software on the Unicode Standard is only one step in the internationalization process. You still need to write code that adapts to cultural preferences or language rules. (For more information on other globalization considerations, see "Locale Model", "Input, Display, and Output", and "Multilanguage User Interface [MUI]".)

As a further caveat, not all Unicode-based text processing is a matter of simple character-by-character parsing. Complex text-based operations such as hyphenation, line breaking, and glyph formation need to take into account the context in which they are being used (the relation to surrounding characters, for instance). The complexity of these operations hinges on language rules and has nothing to do with Unicode as an encoding standard. Instead, the software implementation should define a higher-level protocol for handling these operations.

In contrast, there are unusual characters that have very specific semantic rules attached to them; these characters are detailed in The Unicode Standard. Some characters always allow a line break (for example, most spaces), whereas others never do (for instance, nonspacing or nonbreaking characters). Still other characters, including many used in Arabic and Hebrew, are defined as having strong or weak text directionality. The Unicode Standard defines an algorithm for determining the display order of bidirectional text, and it also defines several "directional formatting codes" as overrides for cases not handled by the implicit bidirectional ordering rules to help create comprehensible bidirectional text. These formatting codes allow characters to be stored in logical order but displayed appropriately depending on their directionality. Neutral characters, such as punctuation marks, assume the directionality of the strong or weak characters nearby. Formatting codes can be used to delineate embedded text or to specify the directionality of characters. (For more information on displaying bidirectional Unicode-based text, see The Unicode Standard book.)

Figure 2: Precomposed and composite characters


You have now seen some of the capabilities that Unicode offers. The sections that follow delve deeper into Unicode's functions to provide helpful information as you work with Unicode Standards and encodings. For instance, what is the function of byte-order marks (BOMs)? What are surrogate pairs, and how do they enable you to go from encoding 65,000 characters to over 1 million additional characters? These and other questions will be explored in the following sections.

Transformations of Unicode Code Points

There are different techniques to represent each one of the Unicode code points in binary format. Each of the following techniques uses a different mapping to represent unique Unicode characters. The Unicode encodings are:

  • UTF-8: To meet the requirements of byte-oriented and ASCII-based systems, UTF-8 has been defined by the Unicode Standard. Each character is represented in UTF-8 as a sequence of up to 4 bytes, where the first byte indicates the number of bytes to follow in a multibyte sequence, allowing for efficient string parsing. UTF-8 is commonly used in transmission via Internet protocols and in Web content.
  • UTF-16: This is the 16-bit encoding form of the Unicode Standard where characters are assigned a unique 16-bit value, with the exception of characters encoded by surrogate pairs, which consist of a pair of 16-bit values. The Unicode 16-bit encoding form is identical to the International Organization for Standardization/International Electrotechnical Commision (ISO/IEC) transformation format UTF-16. In UTF-16, any characters that are mapped up to the number 65,535 are encoded as a single 16-bit value; characters mapped above the number 65,535 are encoded as pairs of 16-bit values. (For more information on surrogate pairs, see "Surrogate Pairs" later in this chapter.) UTF-16 little-endian is the encoding standard at Microsoft (and in the Windows operating system).
  • UTF-32: Each character is represented as a single 32-bit integer.

The figure below shows two characters encoded in both code pages and Unicode, using UTF-16 and UTF-8.

Figure 3: The character "A" and a kanji character encoded in code pages and in Unicode with both UTF-16 and UTF-8.


Since UTF-8 is so commonly used in Web content, it's helpful to know how Unicode code points get mapped into this encoding without introducing the hassle of MBCS characters. Table 1 shows the relationship between Unicode code points and a UTF-8-encoded character. The starting byte of a chain of bytes in a UTF-8 encoded character tells how many bytes are used to encode that character. All the following bytes start with the mark "10" and the xxx's denote the binary representation of the encoding within the given range.

Table 1: Relationship between Unicode code points and a UTF-8-encoded character. In UTF-8, the first byte indicates the number of bytes to follow in a multibyte-encoded sequence.
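
As a small illustrative sketch (not part of the quoted article), the lead byte
alone is enough to tell how long a UTF-8 sequence is:

/* returns the total length in bytes of the UTF-8 sequence starting with 'lead' */
int utf8_sequence_length( unsigned char lead )
{
    if( lead < 0x80 )             return 1; /* 0xxxxxxx : plain ASCII           */
    if( ( lead & 0xE0 ) == 0xC0 ) return 2; /* 110xxxxx + 1 continuation byte   */
    if( ( lead & 0xF0 ) == 0xE0 ) return 3; /* 1110xxxx + 2 continuation bytes  */
    if( ( lead & 0xF8 ) == 0xF0 ) return 4; /* 11110xxx + 3 continuation bytes  */
    return -1;                              /* 10xxxxxx continuation or invalid */
}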


Byte-Order Marks

Another concept to be familiar with as you work with Unicode is that of byte-order marks. A BOM is used to indicate how a processor places serialized text into a sequence of bytes. If the least significant byte is placed in the initial position, this is referred to as "little-endian," whereas if the most significant byte is placed in the initial position, the method is known as "big-endian." A BOM can also be used as a reference to identify the encoding of the text file. Notepad, for example, adds the BOM to the beginning of each file, depending on the encoding used in saving the file. This signature will allow Notepad to reopen the file later. Table 2 shows byte-order marks for various encodings. The UTF-8 BOM identifies the encoding format rather than the byte order of the document, since each character is represented by a sequence of bytes.

Table 2: Binary representation of the byte-order mark (U+FEFF) for specific encodings.
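
An illustrative sketch (my addition, not from the quoted article) of detecting
a BOM at the start of a byte buffer; UTF-32 BOMs are left out for brevity:

#include <stddef.h>

const char * detect_bom( const unsigned char * p, size_t len )
{
    if( len >= 3 && p[ 0 ] == 0xEF && p[ 1 ] == 0xBB && p[ 2 ] == 0xBF )
        return "UTF-8";
    if( len >= 2 && p[ 0 ] == 0xFF && p[ 1 ] == 0xFE )
        return "UTF-16 little-endian";
    if( len >= 2 && p[ 0 ] == 0xFE && p[ 1 ] == 0xFF )
        return "UTF-16 big-endian";
    return "no BOM";
}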


Surrogate Pairs

With the Unicode 16-bit encoding system, over 65,000 characters can be encoded (2^16 = 65536). However, the total number of characters that needs to be encoded has actually exceeded that limit (mainly to accommodate the CJK extension of characters). To find additional place for new characters, developers of the Unicode Standard decided to introduce the notion of surrogate pairs. With surrogate pairs, a Unicode code point from range U+D800 to U+DBFF (called "high surrogate") gets combined with another Unicode code point from range U+DC00 to U+DFFF (called "low surrogate") to generate a whole new character, allowing the encoding of over 1 million additional characters. Unlike MBCS characters, high and low surrogates cannot be interpreted when they do not appear as part of a surrogate pair (one of the major challenges with lead-byte and trail-byte processing of MBCS text).
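
A minimal sketch (my addition) of the surrogate-pair arithmetic described above:

/* split a code point above U+FFFF into a UTF-16 surrogate pair */
void to_surrogate_pair( unsigned long cp, unsigned short * hi, unsigned short * lo )
{
    cp -= 0x10000;                                         /* 20 bits remain */
    * hi = ( unsigned short ) ( 0xD800 + ( cp >> 10 ) );   /* high surrogate */
    * lo = ( unsigned short ) ( 0xDC00 + ( cp & 0x3FF ) ); /* low surrogate  */
}
/* Example: U+1D11E (musical symbol G clef) becomes the pair D834 DD1E. */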

For the first time, in Unicode 3.1, characters are encoded beyond the original 16-bit code space, or the BMP (Plane 0). These new characters, encoded at code positions of U+10000 or higher, are synchronized with the international standard ISO/IEC 10646-2. In addition to two Private Use Areas, plane 15 (U+F0000 - U+FFFFD) and plane 16 (U+100000 - U+10FFFD), Unicode 3.1 and 10646-2 define three new supplementary planes:

  • Supplementary Multilingual Plane (SMP), with code positions from U+10000 through U+1FFFF
  • Supplementary Ideographic Plane (SIP), with code positions from U+20000 through U+2FFFF
  • Supplementary Special-purpose Plane (SSP), with code positions from U+E0000 through U+EFFFF

The SMP, or Plane 1, contains several historic scripts and several sets of symbols: Old Italic, Gothic, Deseret, Byzantine Musical Symbols, (Western) Musical Symbols, and Mathematical Alphanumeric Symbols. Together these comprise 1,594 newly encoded characters. The SIP, or Plane 2, contains a very large collection of additional unified Han ideographs known as "CJK Extension B," comprising 42,711 characters, as well as 542 additional CJK Compatibility ideographs. The SSP, or Plane 14, contains a set of 97 tag characters.


Creating Win32 Unicode Applications

You've now learned more about the benefits and capabilities that Unicode offers, in addition to looking more closely at its functionality. You might also be wondering about the extent to which Windows supports Unicode's features. Microsoft Windows NT 3.1 was the first major operating system to support Unicode, and since then Microsoft Windows NT 4, Microsoft Windows 2000, and Microsoft Windows XP have extended this support, with Unicode being their native encoding. In fact, when you run a non-Unicode application on them, the operating system converts the application's text internally to Unicode before any processing is done. The operating system then converts the text back to the expected code-page encoding before passing the information back to the application.

In addition, Windows XP supports a majority of the Unicode code points with fonts, keyboard drivers, and other system files necessary to the input and display of content in all supported languages. Once again, the fundamental representation of text in Windows NT-based operating systems is UTF-16, and the WCHAR data type is a UTF-16 code unit. Windows does provide interfaces for other encodings in order to be backward-compatible, but converts such text to UTF-16 internally. The system also provides interfaces to convert between UTF-16 and UTF-8 and to inquire about the basic properties of a UTF-16 code point (for example, whether it is a letter, a digit, or a punctuation mark). Since Microsoft Windows 95, Microsoft Windows 98, and Windows Me are not Unicode-based, they provide only a small subset of the Unicode support available in the Windows NT-based versions of Windows. Thus by working with Unicode and Windows NT-based operating systems, you are yet one step closer toward the goal of creating world-ready applications. The remaining sections will show you practical techniques and examples for creating Win32 Unicode applications, as well as tips for using encoding for Web pages, in the .NET Framework, and in console or text-mode programming.

Porting existing code page-based applications to Unicode is easier than you might think. In fact, Unicode was implemented in such a way as to make writing Unicode applications almost transparent to developers. Unicode also needed to be implemented in such a way as to ensure that non-Unicode applications remain functional whenever running in a pure Unicode platform. To accommodate these needs, the implementation of Unicode required changes in two major areas:

  • Creation of a data-type variable (WCHAR) to handle 16-bit characters
  • Creation of a set of APIs that accept string parameters with 16-bit character encoding

WCHAR, a 16-Bit Data Type

Most string operations for Unicode can be coded with the same logic used for handling the Windows character set. The difference is that the basic unit of operation is a 16-bit quantity instead of an 8-bit one. The header files provide a number of type definitions that make it easy to create sources that can be compiled for Unicode or the Windows character set.

For 8-bit (ANSI) and double-byte characters:
typedef char CHAR; // 8-bit character
typedef char *LPSTR; // pointer to 8-bit string

For Unicode (wide) characters:

typedef unsigned short WCHAR; // 16-bit character
typedef WCHAR *LPWSTR; // pointer to 16-bit string

The figure below shows the method by which the Win32 header files define three sets of types:

  • One set of generic type definitions (TCHAR, LPTSTR), which depend on the state of the _UNICODE manifest constant.
  • Two sets of explicit type definitions (one set for those that are based on code pages or ANSI and one set for Unicode).

With generic declarations, it is possible to maintain a single set of source files and compile them for either Unicode or ANSI support.

Figure 4: WCHAR, a new data type.


W Function Prototypes for Win32 APIs

All Win32 APIs that take a text argument either as an input or output variable have been provided with a generic function prototype and two definitions: a version that is based on code pages or ANSI (called "A") to handle code page-based text argument and a wide version (called "W ") to handle Unicode. The generic function prototype consists of the standard API function name implemented as a macro. The generic prototype gets resolved into one of the explicit function prototypes ("A " or "W "), depending on whether the compile-time manifest constant UNICODE is defined in a #define statement. The letter "W" or "A" is added at the end of the API function name in each explicit function prototype.

// windows.h
#ifdef UNICODE
#define SetWindowText SetWindowTextW
#else
#define SetWindowText SetWindowTextA
#endif // UNICODE

With this mechanism, an application can use the generic API function to work transparently with Unicode, depending on the #define UNICODE manifest constant. It can also make mixed calls by using the explicit function name with "W" or "A."

One function of particular importance in this dual compile design is RegisterClass (and RegisterClassEx). A window class is implemented by a window procedure. A window procedure can be registered with either the RegisterClassA or RegisterClassW function. By using the function's "A" version, the program tells the system that the window procedure of the created class expects messages with text or parameters that are based on code pages; other objects associated with the window are created using a Windows code page as the encoding of the text. By registering the window class with a call to the wide-character version of the function, the program can request that the system pass text parameters of messages as Unicode. The IsWindowUnicode function allows programs to query the nature of each window.

On Windows NT 4, Windows 2000, and Windows XP, "A" routines are wrappers that convert text that is based on code pages or ANSI to Unicode (using the system-locale code page) and that then call the corresponding "W" routine. On Windows 95, Windows 98, and Windows Me, "A" routines are native, and most "W" routines are not implemented. If a "W" routine is called and yet not implemented, the ERROR_CALL_NOT_IMPLEMENTED error message is returned. (For more information on how to write Unicode-based applications for non-Unicode platforms, see "Microsoft Layer for Unicode (MSLU).")

Unicode Text Macro

Visual C++ lets you prefix a literal with an "L" to indicate it is a Unicode string, as shown here:

LPWSTR str = L"This is a Unicode string";

In the source file, the string is expressed in the code page that the editor or compiler understands. When compiled, the characters are converted to Unicode. The Win32 SDK resource compiler also supports the "L" prefix notation, even though it can interpret Unicode source files directly. WINDOWS.H defines a macro called TEXT() that will mark string literals as Unicode, depending on whether the UNICODE compile flag is set.

#ifdef UNICODE
#define TEXT(string) L##string
#else
#define TEXT(string) string
#endif // UNICODE

So, the generic version of a string of characters should become:

LPTSTR str = TEXT("This is a generic string");

C Run-Time Extensions

The Unicode data type is compatible with the wide-character data type wchar_t in ANSI C, thus allowing access to the wide-character string functions. Most of the C run-time (CRT) libraries contain wide-character versions of the strxxx string functions. The wide-character versions of the functions all start with wcs.

Table 3: Examples of C run-time library routines used for string manipulation.
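
For illustration (my example, not from the article), the same length call in
its ANSI, wide-character and generic-text forms:

#include <string.h>
#include <wchar.h>
#include <tchar.h>

void length_examples( void )
{
    size_t a = strlen( "hello" );         /* 8-bit (ANSI) version               */
    size_t w = wcslen( L"hello" );        /* wide-character (wcs) version       */
    size_t t = _tcslen( _T( "hello" ) );  /* generic form, resolved by _UNICODE */

    ( void ) a; ( void ) w; ( void ) t;   /* silence unused-variable warnings   */
}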


The C run-time library also provides such functions as mbtowc and wctomb, which can translate the C character set to and from Unicode. The more general set of functions of the Win32 API can perform the same functions as the C run-time libraries including conversions between Unicode, Windows character sets, and MS-DOS code pages. In Windows programming, it is highly recommended that you use the Win32 APIs instead of the CRT libraries in order to take advantage of locale-aware functionality provided by the system, as described in Use Locale Model.

Table 4: Equivalent Win32 API functions for the C run-time library routines.


Conversion Functions Between Code Page and Unicode

Since a large number of applications are still code page-based, and since you might want to support Unicode internally, there are a lot of occasions where a conversion between code-page encodings and Unicode is necessary. The pair of Win32 APIs, MultiByteToWideChar and WideCharToMultiByte, allow you to convert code-page encoding to Unicode and Unicode data to code-page encoding, respectively. Each of these APIs takes as an argument the value of the code page to be used for that conversion. You can, therefore, either specify the value of a given code page (example: 1256 for Arabic) or use predefined flags such as:

  • CP_ACP: for the currently selected system Windows code page
  • CP_OEMCP: for the currently selected system OEM code page
  • CP_UTF8: for conversions between UTF-16 and UTF-8

(For more information, see the Microsoft Developer Network [MSDN] documentation at http://msdn2.microsoft.com.)

By using MultiByteToWideChar and WideCharToMultiByte consecutively, using the same code-page information, you do what is called a "round trip." If the code-page number that is used in this encoding conversion is the same as the code-page number that was used in encoding the original string, the round trip should allow you to retrieve the initial character string.
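
A hedged sketch (my example) of such a round trip between UTF-8 and UTF-16
using CP_UTF8; the fixed buffer sizes are for brevity only:

#include <windows.h>

void round_trip_utf8( const char * utf8 )
{
    wchar_t wide[ 256 ];
    char    back[ 256 ];

    /* code-page encoding (UTF-8) -> Unicode (UTF-16) */
    int wlen = MultiByteToWideChar( CP_UTF8, 0, utf8, -1, wide, 256 );

    /* Unicode (UTF-16) -> the same code-page encoding; since the same code
       page number is used in both directions, the original string returns */
    if( wlen > 0 )
        WideCharToMultiByte( CP_UTF8, 0, wide, -1, back, 256, NULL, NULL );
}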

Compiling Unicode Applications in Visual C++

By using the generic data types and function prototypes, you have the liberty of creating a non-Unicode application or compiling your software as Unicode. To compile an application as Unicode in Visual C/C++, go to Project/Settings/C/C++ /General, and include UNICODE and _UNICODE in Preprocessor Definitions. The UNICODE flag is the preprocessor definition for all Win32 APIs and data types, and _UNICODE is the preprocessor definition for C run-time functions.


Migration to Unicode

Creating a new program based on Unicode is fairly easy. Unicode has a few features that require special handling, but you can isolate these in your code. Converting an existing program that uses code-page encoding to one that uses Unicode or generic declarations is also straightforward. Here are the steps to follow:

  1. Modify your code to use generic data types. Determine which variables declared as char or char* are text, and not pointers to buffers or binary byte arrays. Change these types to TCHAR and TCHAR*, as defined in the Win32 file WINDOWS.H, or to _TCHAR as defined in the Visual C++ file TCHAR.H. Replace instances of LPSTR and LPCH with LPTSTR and LPTCH. Make sure to check all local variables and return types. Using generic data types is a good transition strategy because you can compile both ANSI and Unicode versions of your program without sacrificing the readability of the code. Don't use generic data types, however, for data that will always be Unicode or always stays in a given code page. For example, one of the string parameters to MultiByteToWideChar and WideCharToMultiByte should always be a code page-based data type, and the other should always be a Unicode data type.
  2. Modify your code to use generic function prototypes. For example, use the C run-time call _tcslen instead of strlen, and use the Win32 API SetWindowText instead of SetWindowTextA. This rule applies to all APIs and C functions that handle text arguments.
  3. Surround any character or string literal with the TEXT macro. The TEXT macro conditionally places an "L" in front of a character literal or a string literal definition. Be careful with escape sequences. For example, the Win32 resource compiler interprets L\" as an escape sequence specifying a 16-bit Unicode double-quote character, not as the beginning of a Unicode string.
  4. Create generic versions of your data structures. Type definitions for string or character fields in structures should resolve correctly based on the UNICODE compile-time flag. If you write your own string-handling and character-handling functions, or functions that take strings as parameters, create Unicode versions of them and define generic prototypes for them.
  5. Change your build process. When you want to build a Unicode version of your application, both the Win32 compile-time flag -DUNICODE and the C run-time compile-time flag -D_UNICODE must be defined.
  6. Adjust pointer arithmetic. Subtracting char* values yields an answer in terms of bytes; subtracting wchar_t* values yields an answer in terms of 16-bit chunks. When determining the number of bytes (for example, when allocating memory for a string), multiply the length of the string in symbols by sizeof(TCHAR). When determining the number of characters from the number of bytes, divide by sizeof(TCHAR). You can also create macros for these two operations, if you prefer. C makes sure that the ++ and -- operators increment and decrement by the size of the data type. Or even better, use Win32 APIs CharNext and CharPrev.
  7. Check for any code that assumes a character is always 1 byte long. Code that assumes a character's value is always less than 256 (for example, code that uses a character value as an index into a table of size 256) must be changed. Make sure your definition of NULL is 16 bits long.
  8. Add code to support special Unicode characters. These include Unicode characters in the compatibility zone, characters in the Private Use Area, combining characters, and characters with directionality. Other special characters include the Private Use Area noncharacter U+FFFF, which can be used as a placeholder, and the byte-order marks U+FEFF and U+FFFE, which can serve as flags that indicate a file is stored in Unicode. The byte-order marks are used to indicate whether a text stream is little-endian or big-endian. In plaintext, the line separator U+2028 marks an unconditional end of line. Inserting a paragraph separator, U+2029, between paragraphs makes it easier to lay out text at different line widths.
  9. Debug your port by enabling your compiler's type-checking. Do this with and without the UNICODE flag defined. Some warnings that you might be able to ignore in the code page-based world will cause problems with Unicode. If your original code compiles cleanly with type-checking turned on, it will be easier to port. The warnings will help you make sure that you are not passing the wrong data type to code that expects wide-character data types. Use the Win32 National Language Support API (NLS API) or equivalent C run-time calls to get character typing and sorting information. Don't try to write your own logic for handling locale-specific type checking-your application will end up carrying very large tables!

In the following example, a string is loaded from the resources and is used in two scenarios:

  • As a body to a message box
  • To be drawn at run time in a given window

For the purpose of simplification, this example will ignore where and how irrelevant variables have been defined. Suppose you want to migrate the following code page-based code to Unicode:

char g_szTemp[MAX_STR]; // Definition of a char data type

// Loading IDS_SAMPLE from the resources in our char variable
LoadString(g_hInst, IDS_SAMPLE, g_szTemp, MAX_STR);

// Using the loaded string as the body of the message box
MessageBox(NULL, g_szTemp, "This is an ANSI message box!", MB_OK);

// Using the loaded string in a call to TextOut for drawing at
// run time
ExtTextOut(hDC, 10, 10, ETO_CLIPPED , NULL, g_szTemp,
strlen(g_szTemp), NULL);

Migrating this code to Unicode is as easy as following the generic coding conventions and properly replacing the data type, Win32 APIs, and C run-time API definitions. You can see the changes in bold typeface.

#include <tchar.h>
// Include the generic-text (wchar specific) header file
TCHAR g_szTemp[MAX_STR]; // Definition of the data type as a
// generic variable

// Calling the generic LoadString and not W or A versions explicitly
LoadString(g_hInst, IDS_SAMPLE, g_szTemp, MAX_STR);

// Using the appropriate text macro for the title of our message box
MessageBox(NULL, g_szTemp, TEXT("This is a Unicode message box."),
MB_OK);

// Using the generic run-time version of strlen
ExtTextOut(hDC, 10, 10, ETO_CLIPPED , NULL, g_szTemp,
_tcslen(g_szTemp), NULL);

After implementing these simple steps, all that is left to do in order to create a Unicode application is to compile your code as Unicode by defining the compiling flags UNICODE and _UNICODE.


Options to Migrate to Unicode

Depending on your needs and your target operating systems, there are several options for migrating from an application that is based on code pages to one that is based on Unicode. Some of these options do have certain caveats, however.

  • Create two binaries: default compile for Windows 95, Windows 98, and Windows Me, and Unicode compile for Windows NT, Windows 2000, and Windows XP.
    Disadvantage: Maintaining two versions of your software is messy and goes against the principle of a single, worldwide binary.
  • Always register as a non-Unicode application, converting to and from Unicode as needed.
    Disadvantage: Since Windows does not support the creation of custom code pages, you will not be able to use scripts that are supported only through Unicode (such as those in the Indic family of languages, Armenian, and Georgian). Also, this option makes multilingual computing impossible since, when it comes to displaying, your application is always limited to the system's code page.
  • Create a pure Unicode application.
    Disadvantage: This works only on Windows NT, Windows 2000, and Windows XP, since only limited Unicode support is provided on legacy platforms. This is the preferred approach if you are only targeting Unicode platforms.
  • Use Microsoft Layer for Unicode (MSLU). In this easy approach, you merely create a pure Unicode application, and then link the Unicows.lib file provided by the SDK platform to your project. You will also need to ship the Unicows.dll file along with your deliverables. MSLU is essentially wrapping all explicit "W" version calls made in your code, at run time, to "A" versions, if a non-Unicode platform is detected at run time. This approach is by far the best solution for migrating to Unicode and for ensuring backward-compatibility. (For more information, see "MSLU".)


Best Practices

When writing Unicode code, there are many points to consider, such as when to use UTF-16, when to use UTF-8, what to use for compression, and so forth. The following are recommended practices that will help ensure you choose the best method based on the circumstance at hand.

  • Choose UTF-16 as the fundamental representation of text in your application. UTF-8 should be used for application interoperability only (for example, for content sent to be displayed in browsers that do not support Unicode, or over networks and servers that do not support Unicode). Avoid character-by-character processing and use the existing WCHAR system interfaces and resources wherever possible. The interaction between characters in some languages requires expert knowledge of those languages. Microsoft has developed and tested the system interfaces with most of the languages represented by Unicode-unless you are a multilingual expert, it will be difficult to reproduce this support.
  • If your application must run on Windows 95, Windows 98, or Windows Me, keep UTF-16 as your fundamental text representation and use MSLU on these operating systems. If you must support non-Unicode text, keep data internally in UTF-16 and convert to other encodings via a gateway. Use system interfaces such as MultiByteToWideChar to convert when necessary. Ensure your application supports Unicode characters that require two UTF-16 code points (surrogate pairs). This should be automatic if you use existing system interfaces, but will require careful development and testing when you do not. Avoid the trap of likening surrogate pairs to the older East Asian double-byte encodings (DBCS). Instead, centralize the needed string operations in a few subroutines. These subroutines should take surrogate pairs into consideration, but should also handle combining characters and other characters that require special handling. A well-written application can confine surrogate processing to just a few such routines. Don't use UTF-8 for compression-it actually expands the size of the data for most languages. If you need a real compression algorithm for Unicode, refer to the Unicode Consortium technical standard "Unicode Technical Standard #6: A Standard Compression Scheme for Unicode" available on their site, http://www.unicode.org.
  • Don't choose UTF-32 merely to avoid surrogate processing. Data size will double and the processing benefits are elusive. If you follow the earlier advice on surrogate processing, UTF-16 should be adequate.
  • Test your Unicode support with a mix of unrelated languages such as Arabic, Hindi, and Korean. For a well-written Unicode application, the system-locale setting should be irrelevant-test to verify this is the case.

You've now seen techniques and code samples for creating Win32 Unicode applications. Unicode is also extremely useful for dealing with Web content in the global workplace and market. Knowing how to handle encoding in Web pages will help bridge the gap between the plethora of languages that are in use today within Web content.




2011/11/10 homar <marek.h...@gmail.com>



--
Massimo Belgrano

Viktor Szakáts

Nov 10, 2011, 10:00:37 AM
to harbou...@googlegroups.com
Probably it'd be better to quote (or even LINK!) a non-Microsoft 
page, as I think many of the misunderstandings regarding 
UNICODE are coming from the one-sided view of the topic from 
Windows programming.

UNICODE is good because you can mix _any_ characters 
inside one string and/or one app. F.e. a Cyrillic name in an 
otherwise English language application. It also relieves most 
of the pain of dealing with legacy (8-bit) codepages.

Nowadays all important languages, platforms and development 
environments support UNICODE.

IMO ASC() can well support UNICODE if we choose to go 
in this direction. The name is a legacy; the point of the function, 
however, is timeless: to return a numeric representation 
of a character.

Viktor

Mindaugas Kavaliauskas

Nov 10, 2011, 12:55:00 PM
to harbou...@googlegroups.com
Hi,


On 2011.11.10 12:16, Przemysław Czerpak wrote:
> I plan to begin some modifications to add basic support for UNICODE to HVM.
> I would like to agree some important things now.
> Let's imagine that we use UTF8 as internal unistring representation.
> Should we change functions like LEN(), SUBSTR(), LEFT(), RIGHT(), PADR(),
> PADC(), ... to operate on character indexes or leave them as byte ones?
> What should we make with CHR() and ASC() functions? Keep then operating
> on ASCII values or switch to UNICODE?

It is very important to have the whole picture before making any final
agreements. I'll try to share my thoughts, though I still do not have
the whole final picture of unicode support. So, these will be a few
brainstorming-style ideas. It would also be nice to look at the
implementations in other products like Java, PHP and Python before reinventing
the wheel.

PRG level code should not depend on the internal unicode representation in
HVM. We can use different internal representations; UTF-8 is only one of
them. Another representation can be the Windows wide char little-endian
unicode representation. Both representations have drawbacks and
advantages. E.g., UTF-8 saves memory, but obtaining a
character offset is more complex. Windows wide char usage lets us avoid
numerous string conversions in a Windows application, if the Windows API is
used often.

The independence from the internal string representation gives an answer to
the question of how LEN(), SUBSTR(), LEFT(), etc. should work. They should work
on characters, not bytes! Otherwise, we'd get different results for
the same "š" (s caron, U+0161) character:
ASC(LEFT("š", 1)) == 0xC5 // since the UTF-8 encoding is C5 A1
or
ASC(LEFT("š", 1)) == 0x61 // since the little-endian UTF-16 encoding is 61 01

Using bytes in SUBSTR(), etc., would make PRG level code more hacky: we
could split a string in the middle of a character. The LEN(FIELD->CHARACTER)
result would depend on the field content even for the current fixed-width DBF
character fields, etc.

Byte operations can be useful for those who work with binary strings,
because you have strict control over the binary data representation you
store in memory, but in this case I'd say you do not need unicode support at
all. Just do some binary transformation to UTF-8 or another encoding
before passing a string parameter to the Cairo API, to wide char in the case
of the Windows API, etc.

I would expect ASC() and CHR() to work on characters. I.e.,
ASC("š") is equal to 353, and CHR(353) returns a one-character string
containing š.


The mess begins when I try to think about binary strings. We will
need such strings even if we have unicode strings. Many functions, like
file read/write, socket operations, etc., operate on raw bytes, not on
characters. I think the following conversion should be done in such cases:
cBin := hb_translate(cString,, "to_encoding")
FWRITE(hFile, cBin, LEN(cBin))
and
FREAD(hFile, @cBin, LEN(cBin))
cString := hb_translate(cBin, "from_encoding")

The question is how we will keep binary strings in HVM. Will we use some
flag to indicate whether a string is binary or not? We could store all strings
in a unicode representation and not use a binary flag at all. E.g., the binary
string "\x55\xAA" can be encoded and stored as 3 bytes in UTF-8. In this
case, we would have to do a char-to-byte translation in functions like
FWRITE() by obtaining the integer code of every character and building a
binary string of bytes (code % 256). This complicates a little the functions
which operate on a "memory buffer", like FREAD(). I can still understand that
this is not very difficult to solve. Still, storing all strings
(including binary ones) in a unicode representation could be suboptimal if
the application does a lot of binary data processing.

One more question I cannot solve in my head is the result of functions
like hb_parc() and others. We have a huge number of C level function
calls that operate on strings and do not use the String API. What result
is expected from such functions? It would be nice to have some setting
for non-String-API functions, e.g. that we expect the result to be returned in
CP437. The bad thing is that the value returned by hb_parc() is never freed.
So, the returned value cannot be obtained by some transcoding and has to
return the internal HVM representation. This has serious
consequences. If we want to have the possibility of different internal HVM
string representations, hb_parc() is a completely useless function, since
I may obtain the string in UTF-8 or in little-endian wide-char format. I fail
to imagine any useful usage of the old string API. Maybe it has some limited
application if we say that UTF-8 is the only possible internal encoding
for a unicode HVM.


Regards,
Mindaugas

Bacco

Nov 10, 2011, 1:17:47 PM
to harbou...@googlegroups.com
I had to delete Massimo's message to read this thread properly. I
believe most of us can search on Google, and the ones who can't would
be fine with a link, not the full documentation.


Some considerations: I believe we need separate types for strings and
streams, so with strings the behaviour should always be 1 byte per character.
There are plenty of uses of Asc(), Left(), Right(), SubStr() to handle binary
data that would be unusable simultaneously with UTF8. With two separate
types, it would be easy to use the proper implementation of Left()/Right(),
etc.

Although I believe that internal UTF8 is good, I feel a little worried
about the implementation.


Regards,
Bacco

Przemysław Czerpak

Nov 10, 2011, 3:11:19 PM
to harbou...@googlegroups.com
On Thu, 10 Nov 2011, Mindaugas Kavaliauskas wrote:

Hi,

> It is very important to have the whole picture before doing some
> final agreements. I'll try put share imaginations, though I still do
> not have the whole final picture of unicode support. So, it will be
> a little brain-storming style ideas. It would be also nice to see at
> implementation of other products like Java, PHP, Python before
> inventing a wheel.

So far, looking at some of them, I haven't found answers or interesting
solutions to the real problems. There are just a few arbitrary decisions, or
the problem is not touched at all in the low-level code, or everything is
redirected to ICU.

> PRG level code should not depend on internal unicode representation
> in HVM. We can use different internal representations. UTF-8 is only
> one of internal representations. Another representation can be
> windows wide char little-endian unicode representation. Both

BTW, Windows uses native endianness, not little-endian; at least in the
documentation we have arrays of TCHARs. Of course for x86 machines
it's the same.

> representations has drawback and advantages. E.g., UTF-8 saves
> memory, but obtaining character offset is more complex. Windows wide
> char usage let us avoid a numerous string conversion in Windows
> application, if windows API is used often.

Yes, it is. The most important advantage of UTF8 used as the internal
encoding is direct casting to char * strings, so C code using the old
string API (hb_parc*()) can work with such strings and can be updated
over the longer term. It also allows us to keep the current 'char *' pointers
in existing HVM structures and functions, so we will not have to update
them when adding UNICODE strings to HVM.
Anyhow, the final representation of UNICODE strings in HVM should be
fully independent from the public API, so I will want to touch this
subject too, though maybe later.
UTF8 also simplifies string constants in .prg and .c code.
As you said, the most important disadvantage is the much more complex
character access by index in UTF8 strings.

> The independence of internal string representation gives an answer
> to question, how LEN(), SUBSTR(), LEFT(), etc should work. It should
> work on characters, not bytes! Otherwise, we'll have different
> results for the same "s caron" (U+0161) character:
> ASC(LEFT(s caron, 1)) == 0xC5 // since UTF8 is C5 A1
> or
> ASC(LEFT(s caron, 1)) == 0x61 // since little-endian is 61 01
>
> Using of byte in SUBSTR(), etc, will make PRG level code more hacky
> - we can split string in the middle of the character.
> LEN(FIELD->CHARCTER) result will depend on field content even for
> current fixed DBF character fields, etc.
>
> Byte operations can be useful for those, who works with binary
> strings, because you have a strict control on the binary data
> representation you store in memory, but in this case I'd say do not
> need unicode support at all. Just let's do some binary
> transformation to UTF-8 or other encoding before passing string
> parameter to Cairo API, to wide char in case of Windows API, etc.
>
> I would expect ASC() and CHR() to work on characters. I.e.,
> ASC(s caron) is equal to 353, and CHR(353) return one character
> string containing s caron.

So if we want to separate the internal representation from PRG code, we
have to change the string functions to operate on character indexes
instead of bytes and use UNICODE character values instead of ASCII
ones.
Support for binary strings as a separate type is also important, but
it's not a solution for all cases. Sooner or later someone will add a
UNICODE string to a binary string, and we have to decide what the
final result is and which conversions should be done on both strings
before concatenation. We can also forbid such an operation and generate an
RTE, which seems reasonable if we add a set of functions for
conversions between byte and unicode strings, so the user can change the
type of the arguments before the operation. It forces code updating, but
the final code should be much cleaner, without unexpected
runtime results.

> The mess begins after I'm trying to thing about binary strings. We
> will need such strings even if we have unicode strings. Many
> functions like file read write, socket operations, etc. operates on
> raw bytes, not on characters. I think the following conversion
> should be done in such cases:
> cBin := hb_translate(cString,, "to_encoding")
> FWRITE(hFile, cBin, LEN(cBin))
> and
> FREAD(hFile, @cBin, LEN(cBin))
> cString := hb_translate(cBin, "from_encoding")
>
> The question is how we will keep binary strings in HVM. Will we use
> some flag to indicate if string is binary or not? We can store all
> strings in unicode representations and do not use binary flag at
> all. E.g., binary string "\x55\xAA" can be encoded and stored as 3
> bytes in UTF-8. In this case, we will have to do char to byte
> translation in functions like FWRITE() by obtaining integer code for
> every characters and making binary string of bytes (code%255). This
> complicates a little functions which operates on "memory buffer"
> like FREAD(). I still can understant this is not very difficult to
> solve. Well, storing all strings (including binary) in unicode
> representation could be not optimal in case application to a lot of
> binary data processing.

Or maybe we can also generate an RTE here, when a non-binary string is passed
to such a function. In that case the programmer will have to make the
conversion himself to the form he needs.

> One more question I can not solve in my head is the result of
> functions like hb_parc() and other. We have a huge number of C level
> functions calls, that operates on strings and do not use String API.
> What result is expected for such functions? It would be nice to have
> some setting for not String API functions. E.g., we expect result be
> returned in CP437. The bad thing is that hb_parc() returned value is
> never freed. So, returned value should not be obtained by some
> transcoding and should return internal HVM representation. This
> causes some serious
> outcome. If we want to have a possibility of different internal HVM
> string representation, hb_parc() is completely useless function.
> Since, I may obtain string in UTF-8, o little-endian widechar
> format. I fail to image any useful usage of old API for strings.
> Maybe it has some limited application if we say, that UTF-8 is the
> only possible internal encoding for unicode HVM.

As I said, the nice side effect of the UTF8 representation is the fact
that it is still a valid char * string, so we can think about it later.
This can be resolved in a few ways:
1) add to the function frame a list of strings allocated by hb_parc*()
functions for different parameters, which are freed when the function
returns.
2) extend the asString item structure and add support for alternative
string representations, which will be freed by hb_itemClean()
or by code which modifies the item string buffer. Such a solution works
also for the hb_itemGetCPtr*() functions.
3) eliminate the old API from the whole core code and use only the new one.
Probably preferred, because the STR API is MT safe and theoretically
allows simultaneous write access to the same item from different
threads, if we decide to create such an HVM version in the future.

best regards,
Przemek

hua

Nov 10, 2011, 11:31:20 PM
to Harbour Developers
On Nov 10, 11:00 pm, Viktor Szakáts <harbour...@syenar.hu> wrote:
> Probably it'd be better to quote (or even LINK!) a non-Microsoft
> page

Here's one for anyone interested that I'd enjoyed reading,
http://www.joelonsoftware.com/articles/Unicode.html
--
hua

Bacco

Nov 11, 2011, 12:31:33 AM
to harbou...@googlegroups.com
I believe that creating a new type would be the first step, leaving
current C type for binary strings, allowing 100% backward
compatibility.

The Stream type (or whatever name) can be used for unicode, and can store
the data itself plus one byte to specify the type, so one can start by
implementing e.g. 0x08 as a UTF-8 indicator, allowing room for future
encodings.
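
A purely hypothetical sketch of that idea (names and layout are mine; nothing
like this exists in Harbour today):

/* one extra byte identifies the encoding of the stored data */
#define HB_ENC_BINARY  0x00   /* raw byte string, current behaviour */
#define HB_ENC_UTF8    0x08   /* UTF-8 text, as suggested above     */
/* ... room for future encodings ... */

typedef struct
{
   unsigned char encoding;    /* one of the HB_ENC_* tags above     */
   unsigned long length;      /* length of the payload in bytes     */
   char *        data;        /* the payload itself                 */
} HB_STREAM_SKETCH;           /* hypothetical name */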

Also, stream and character types should never be added together by simple
operators; if the user doesn't know exactly what he is doing, he
probably should not be doing it. If one needs to store some unicode
stream as a character string, the conversion must be an explicit call to a
function.

Then we need to consider the sources. People should use the
encoding that best fits their needs. If I'm developing for Qt and
MySQL, and don't need Japanese or Chinese, I probably want to edit my
source in 1252, and so on (thinking about an ideal implementation, not the
current one). Other users may need mostly UTF-8. Maybe we will need some
#pragma in the future to specify the source encoding too.

marek.h...@interia.pl

Nov 11, 2011, 2:34:21 AM
to harbou...@googlegroups.com
Hi,

"Bacco" <carlo...@gmail.com> pisze:


> I believe that creating a new type would be the first step, leaving
> current C type for binary strings, allowing 100% backward
> compatibility.

e.g.

Local txt1 as Unicode, txt2 := Unicode(), txt0 := ''

? Valtype( txt1 ), txt2:Classname, ValType( txt0 )  // S STRING|UNICODE ? C[HARACTER]

but what about when

? txt1 + txt0 // ????????

etc

Regards,
Marek Horodyski


Bacco

Nov 11, 2011, 9:38:06 AM
to harbou...@googlegroups.com
> but wath when
>
> ? txt1 + txt0 // ????????

It should RTE (raise a runtime error), exactly as I said in my last email.

You can't mix the two without functions. You need to do, for example:
? someconversion( txt1, HB_UTF8, HB_LATIN1 ) + txt0

The HVM shouldn't assume anything to do with characters above 255,
because we have database encodings, console encodings, GUI encodings,
printer encodings and one can use completely different encodings for
each one simultaneously.

IMHO it's the job of whoever is programming to deal with these. Anything
"automatic" will cause more confusion than it solves. Besides, people
will need to learn to do it right from the start, which I think is
better for everyone, instead of tons of people getting unexpected
results without a clue.


Regards
Bacco
