
How to capitalize a UTF-8 string?


Rolf Lampa [RIL]

Jun 28, 2008, 9:20:00 PM
I'm working with raw Utf-8 strings, and I sometimes need to capitalize
title strings.

How do I, with the standard Delphi 7 units, capitalize a UTF-8 title in
the simplest thinkable manner, without damaging the string?

Regards,

// Rolf Lampa

Rob Kennedy

Jun 28, 2008, 11:18:26 PM

Convert it to WideString, capitalize it, and convert it back.

Remember that capitalizing a single character doesn't necessarily give
you a single character in return.
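A minimal sketch of that round trip, assuming the Delphi 7 System unit's UTF8Decode/UTF8Encode and a hypothetical helper name, might look like this:

```pascal
// Sketch only: decode UTF-8 to a WideString, uppercase the first
// character, and encode back. Copy() keeps any extra characters the
// uppercasing may produce.
function CapitalizeUtf8(const Utf8: string): string;
var
  wS: WideString;
begin
  wS := UTF8Decode(Utf8);
  if wS <> '' then
    wS := WideUpperCase(Copy(wS, 1, 1)) + Copy(wS, 2, MaxInt);
  Result := UTF8Encode(wS);
end;
```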

--
Rob

Rolf Lampa [RIL]

Jun 29, 2008, 7:06:38 PM
Rob Kennedy skrev:


Yes, the multiple-character conversion result is the tricky part, and
that's also why I asked, since it wasn't obvious to me how to handle
those cases.

Now I have tried the following code, which seems to work, at least with
a few test strings. Feel free to suggest more efficient code than what I
came up with on first try:

procedure TForm1.CapitalizeString_FromDisk;

  function _GetAllButFirstChar(S: Widestring): Widestring;
  var
    i: Integer;
  begin
    Result := '';
    if S = '' then Exit;
    SetLength(Result, Length(S)-1);
    for i := 2 to Length(S) do
      Result[i-1] := S[i];
  end;

var
  S: String;
  wS, tmp: AnsiString; //Widestring;
  i: Integer;
begin
  memo1.Lines.LoadFromFile('TestStrings.txt');
  Memo2.Lines.Clear;
  Memo3.Lines.Clear;

  // Memo3 is for checking if I can restore an initial
  // lowercase back to lower case after Capitalization,
  // without data loss.

  for i := 0 to memo1.Lines.Count-1 do
  begin
    S := Memo1.Lines[i];

    // Need to convert raw UTF-8 into Ansi in order to
    // be able to deal with single (Capital) character:

    wS := Utf8ToAnsi(S);

    // = Capitalize

    wS := AnsiUpperCase(wS[1]) + _GetAllButFirstChar(wS);
    Memo2.Lines.Add(wS);

    // = Restore again

    wS[1] := AnsiLowerCase(wS[1])[1];
    Memo3.Lines.Add(S);
  end;
end;

Below is the test data (loaded into Memo1), with the intermediate and
final processing results displayed in Memo2 and Memo3:

Memo1:
--------------------------------------------
östersund (utf8) ; Same as "östersund" in Win-1252 encoding
Ã¥kerman (utf8) ; Same as åkerman
ärkefiende (utf8) ; same as ärkefiende

Memo2:
--------------------------------------------
Östersund (utf8) ; Uppercase OK
Åkerman (utf8) ; Uppercase Ok
Ärkefiende (utf8) ; Uppercase OK

Memo3: (restored to Memo1-state)
--------------------------------------------
östersund (utf8) ; OK
Ã¥kerman (utf8) ; OK
ärkefiende (utf8) ; OK


Tomorrow I will try the code above on seven million entries (English
Wikipedia), with a breakpoint branch in the code to trap cases where the
original string can't be restored to the original after capitalization.

Regards,

// Rolf Lampa

Rolf Lampa [RIL]

Jun 29, 2008, 7:42:07 PM
Rolf Lampa [RIL] skrev:

> Now I have tried the following code, which seems to work, at least with
> a few test strings. Feel free to suggest more efficient code than what I
> came up with on first try:
>
> procedure TForm1.CapitalizeString_FromDisk;

> ...


Oops, some old redundant code was left in the procedure (commented out).
The following, more compact code gives the same desired result:

procedure TForm1.CapitalizeString_FromDisk;
var
  i: Integer;
  S: String;
  wS: AnsiString; //Widestring;
begin
  memo1.Lines.LoadFromFile('TestStrings.txt');
  Memo2.Lines.Clear;
  Memo3.Lines.Clear;

  for i := 0 to memo1.Lines.Count-1 do
  begin
    S := Memo1.Lines[i];

    wS := Utf8ToAnsi(S);

    //wS := AnsiUpperCase(wS[1]) + _GetAllButFirst(wS);
    wS[1] := AnsiUpperCase(wS[1])[1];
    Memo2.Lines.Add(wS);

Rolf Lampa [RIL]

Jun 29, 2008, 8:40:48 PM
Rolf Lampa [RIL] skrev:

> Rob Kennedy skrev:
>> Rolf Lampa [RIL] wrote:
>>> I'm working with raw Utf-8 strings, and I sometimes need to
>>> capitalize title strings.
>>>
>>> How do I, with std Delphi7 units, Capitalize an Utf-8 title in
>>> simplest thinkable manner, without making damage to the string?
>>
>> Convert it to WideString, capitalize it, and convert it back.
>>
>> Remember that capitalizing a single character doesn't necessarily give
>> you a single character in return.
>
>
> Yes, the multiple char converted result is the tricky part, and that's
> also why I asked since it wasn't obvious to me how to handle those cases.
>
> Now I have tried the following code, which seems to work, at least with
> a few test strings. Feel free to suggest more efficient code than what I
> came up with on first try:
>
> procedure TForm1.CapitalizeString_FromDisk;

Whoops, the previous post had crappy code; here it is corrected and
compacted (tested), giving the desired result -- well, for my small
test case:

procedure TForm1.CapitalizeString_FromDisk;
var
  i: Integer;
  S: String;
  wS: AnsiString; //Widestring;
begin
  memo1.Lines.LoadFromFile('TestStrings.txt');
  Memo2.Lines.Clear;
  Memo3.Lines.Clear;

  for i := 0 to memo1.Lines.Count-1 do
  begin
    // Starting from raw lowercase Utf-8:

    S := Memo1.Lines[i];
    wS := Utf8ToAnsi(S);

    // Capitalize

    wS[1] := AnsiUpperCase(wS[1])[1];
    Memo2.Lines.Add(wS);

    // Restore original lowercase

    wS[1] := AnsiLowerCase(wS[1])[1];

    S := AnsiToUtf8(wS);
    Memo3.Lines.Add(S);
  end;
end;

Regards,

// Rolf Lampa

Adem

Jun 29, 2008, 11:23:11 PM
Rolf Lampa [RIL] wrote:

> Yes, the multiple char converted result is the tricky part, and
> that's also why I asked since it wasn't obvious to me how to handle
> those cases.

You have to bear in mind that, in corner cases, case folding is
language/locale dependent -- the 'dotless i' and 'dotted I' stuff would
give you headaches if you have any Turkish strings.

Secondly, I would not convert my UTF-8 strings/chars to ANSI; I'd
rather use some lib that helps me do the case folding directly on UTF-8
string/char. Because, otherwise, you might lose some information and
end up with an irrelevant char.

Remy Lebeau (TeamB)

Jun 30, 2008, 1:38:27 AM

"Rolf Lampa [RIL]" <rolf....@rilnetwilldofine.com> wrote in message
news:48681579$1...@newsgroups.borland.com...

> Now I have tried the following code, which seems to work, at least
> with a few test strings. Feel free to suggest more efficient code than
> what I came up with on first try:

Your _GetAllButFirstChar() function is not needed. Use Copy() instead to
extract a substring starting at any index. Also, use UTF8Decode() instead
of Utf8ToAnsi(). Try this code:

procedure TForm1.CapitalizeString_FromDisk;
var
  S: String;
  wS: WideString;
  aS: AnsiString;
  i: Integer;
begin
  Memo1.Lines.LoadFromFile('TestStrings.txt');
  Memo2.Lines.Clear;

  for i := 0 to Memo1.Lines.Count-1 do
  begin
    S := Memo1.Lines[i];

    wS := UTF8Decode(S);
    wS := WideUpperCase(wS[1]) + Copy(wS, 2, MaxInt);

    aS := UTF8Encode(wS);
    Memo2.Lines.Add(aS);
  end;
end;


Gambit


Rolf Lampa [RIL]

Jun 30, 2008, 10:15:54 AM
Remy Lebeau (TeamB) skrev:

> Your _GetAllButFirstChar() function is not needed.

Very true, I corrected that (and more) in the subsequent post.


> Use Copy() instead to extract a substring starting at any index.

Not even that is needed (see my subsequent post). Well, that is, unless
copying [2..n] is the only way to avoid data loss for certain characters
(like Turkish, mentioned by Adem).

> Also, use UTF8Decode() instead of Utf8ToAnsi(). Try this code:
>
> procedure TForm1.CapitalizeString_FromDisk;
> var
> S: String;
> wS: WideString;
> aS: AnsiString;
> i: Integer;
> begin
> Memo1.Lines.LoadFromFile('TestStrings.txt');
> Memo2.Lines.Clear;
>
> for i := 0 to Memo1.Lines.Count-1 do
> begin
> S := Memo1.Lines[i];
>
> wS := UTF8Decode(S);
> wS := WideUpperCase(wS[1]) + Copy(wS, 2, MaxInt);
>
> aS := UTF8Encode(wS);
> Memo2.Lines.Add(aS);
> end;
> end;

Ah yes, UTF8Decode returns WideString. I end up with the following code
when using appropriate types for the variables:

procedure TForm1.CapitalizeString_FromDisk;
var
  i: Integer;
  S: String;

  wS: WideString;
  utf8S: Utf8String;
begin
  memo1.Lines.LoadFromFile('TestStrings.txt');
  Memo2.Lines.Clear;
  Memo3.Lines.Clear;

  for i := 0 to memo1.Lines.Count-1 do
  begin
    S := Memo1.Lines[i]; // Original UTF8 lowercase string

    wS := UTF8Decode(S);
    wS[1] := WideUpperCase(wS[1])[1]; // Capitalize
    Memo2.Lines.Add(wS);

    // Verify that the original string can be restored

    wS[1] := WideLowerCase(wS[1])[1];
    utf8S := UTF8Encode(wS);

    Memo3.Lines.Add(utf8S);
  end;
end;

I'll test this on EN-Wikipedia titles to verify that no info is lost
during conversion, for characters like the Turkish ones mentioned by Adem.

Regards,

// Rolf Lampa

Rob Kennedy

Jun 30, 2008, 10:35:22 AM
Rolf Lampa [RIL] wrote:
> Remy Lebeau (TeamB) skrev:
>> Your _GetAllButFirstChar() function is not needed.
>
> Very true, I corrected that (and more) in the subsequent post.
>
>> Use Copy() instead to extract a substring starting at any index.
>
> Not even that is needed (see my subsequent post). Well, that is, unless
> copying [2..n] is the only way to avoid data loss for certain characters
> (like Turkish, mentioned by Adem).

It's necessary because your code is simply overwriting the first
character of the string with the first character of the uppercase
version. Like I told you before, capitalizing a character may yield
_multiple_ characters. You can't just overwrite one because you'll be
discarding the other uppercase characters.

If Memo1 contains UTF-8 characters, then shouldn't S be declared as a
UTF8String as well?

> wS := UTF8Decode(S);
> wS[1] := WideUpperCase(wS[1])[1]; // Capitalize
> Memo2.Lines.Add(wS);

Note that you're putting a Unicode string into a non-Unicode control.
When you call Add, the WideString value in wS will be converted to an
AnsiString using your operating system's current default code page.

> // Verify that the original string can be restored
>
> wS[1] := WideLowerCase(wS[1])[1];
> utf8S := UTF8Encode(wS);
>
> Memo3.Lines.Add(utf8S);
> end;
> end;
>
> I'll test this on EN-Wikipedia titles to verify that no info is lost
> during converion, like Turkish, mentioned by Adam.
>
> Regards,
>
> // Rolf Lampa


--
Rob

Rolf Lampa [RIL]

Jun 30, 2008, 2:46:32 PM
Rob Kennedy skrev:
> Rolf Lampa [RIL] wrote:

>>> Use Copy() instead to extract a substring starting at any index.
>>
>> Not even that is needed (see my subsequent post). Well, that is,
>> unless copying [2..n] is the only way to avoid data loss for certain
>> characters (like Turkish, mentioned by Adem).
>
> It's necessary because your code is simply overwriting the first
> character of the string with the first character of the uppercase
> version. Like I told you before, capitalizing a character may yield
> _multiple_ characters.

Yes. But this is true only while the string is still in its UTF-8
encoding. This is why we first convert it to a WideString, and while it
is a WideString we convert the "first character" - which in UTF-8
representation may well be multiple bytes.

And after uppercasing the WIDESTRING[1] character we convert it back to
UTF-8 again.

And back in the UTF-8 encoding we will notice that we often have changed
more than one byte while modifying only one (1) character in WideString
mode.


> You can't just overwrite one because you'll be discarding the other
> uppercase characters.

Well, you can.

I just verified that I can do this on ~seven million titles in the
English Wikipedia without any data loss (I can convert the uppercase
back to lowercase again, without any data loss).

My test just confirmed this.

> If Memo1 contains UTF-8 characters, then shouldn't S be declared as a
> UTF8String as well?

No, not necessarily. The Help says the following (System unit):

"
Delphi syntax:
type UTF8String = type string;
"

>> wS := UTF8Decode(S);
>> wS[1] := WideUpperCase(wS[1])[1]; // Capitalize
>> Memo2.Lines.Add(wS);
>
> Note that you're putting a Unicode string into a non-Unicode control.
> When you call Add, the WideString value in wS will be converted to an
> AnsiString using your operating system's current default code page.

Maybe, but the memo is only for visual display. OTOH, my "deep test" of
the data integrity of strings converted with this method doesn't bother
with showing the strings in a memo; it only compares the original
string with the uppercased-then-lowercased result, and if it couldn't
restore the string back to its original bytes, it enters a breakpoint.

But no breakpoints were entered for any of the ~7 million WP titles
in the en-wp, which to me seems to be a fairly good result, even proving
that we are not suffering from any data loss. I think. <thumbs crossed> =)

Regards,

// Rolf Lampa

Remy Lebeau (TeamB)

Jun 30, 2008, 2:54:43 PM

"Rolf Lampa [RIL]" <rolf....@rilnetwilldofine.com> wrote in message
news:48692a02$1...@newsgroups.borland.com...

> Yes. But this is true only for the string when it's still in
> its UTF-8 encoding. This is why we first convert it to
> widestring, and while being a widestring we convert the
> "first character" - which in Utf8 representation may well
> be multiple characters.

If the first character in the decoded data is part of a surrogate pair, then
you have to change both the first and second characters in order to get a
proper capital letter. Not every Unicode character can be represented by a
single WideChar.

> And after uppecase of the WIDESTRING[1] character
> we convert it back to Utf8 again.

But in the case of a multi-character letter, you are only converting the
first part of the letter, not the latter part.

> I just verified that I can do this on ~seven million titles in the English
> Wikipedia without any data loss (I can convert the
> uppercase back to lowercase again, without any data loss).

You are only focusing on English, though. There are other languages in
Unicode where your logic would fail to work properly.


Gambit


Rolf Lampa [RIL]

Jun 30, 2008, 3:20:30 PM
Remy Lebeau (TeamB) skrev:

> "Rolf Lampa [RIL]" <rolf....@rilnetwilldofine.com> wrote in message
> news:48692a02$1...@newsgroups.borland.com...
>
>> Yes. But this is true only for the string when it's still in
>> its UTF-8 encoding. This is why we first convert it to
>> widestring, and while being a widestring we convert the
>> "first character" - which in Utf8 representation may well
>> be multiple characters.
>
> If the first character in the decoded data is part of a surrogate pair, then
> you have to change both the first and second characters in order to get a
> proper capital letter. Not every Unicode character can be represented
> by a single WideChar.

Well, I thought that WideString[n] would represent a multibyte char.

But, in any case, having converted seven million titles and managed to
restore them back to their original characters, it seemingly works to
uppercase the first[1] WideChar. By mistake or not, it works.


> But in the case of a multi-character letter, you are only converting the
> first part of the letter, not the latter part.


Maybe, but if the first part of the letter is what's affected (since it
works, seemingly even without data loss) then... well, then it seems to
me to be working "good enough".


>> I just verified that I can do this on ~seven million titles in the English
>> Wikipedia without any data loss (I can convert the
>> uppercase back to lowercase again, without any data loss).
>
> You are only focusing on English, though. There are other languages in
> Unicode where your logic would fail to work properly.

Hm, I do understand your argument, but the English Wikipedia contains
"all kinds of stuff", I promise. =)

As a matter of fact, I discovered this very problem on the English WP,
meaning that it obviously contains very "odd" characters which need to
be taken care of.

For example, there are interwiki links (plain-text links inside the
article text, linking to the corresponding articles in /all other
languages/). These interwiki links are titles written in the native
language/chars (they are shown in the "Languages" list to the lower
right on WP).

See an example here (switch to edit mode, and scroll down to the bottom
of the page text):
http://en.wikipedia.org/wiki/Leuven

However, if there really exists a case in which the Capitalize method
does NOT work, then HOW should the method be coded? (That is, back to my
original question in the subject line.)

Regards,

// Rolf Lampa

Remy Lebeau (TeamB)

Jun 30, 2008, 4:10:16 PM

"Rolf Lampa [RIL]" <rolf....@rilnetwilldofine.com> wrote in message
news:486931f8$1...@newsgroups.borland.com...

> Well, I thought that WideString[n] would represent a multibyte char.

WideString holds UTF-16 encoded Unicode characters. That uses 2 bytes per
character. Some Unicode characters require more than 2 bytes, hence the
existence of surrogate pairs, which are 2 UTF-16 characters that are used
together to map to a single Unicode codepoint. Also, a single UTF-8
character sequence can decode to a Unicode character up to, like, 6 bytes
(but WideString can't hold those characters).
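For reference, a surrogate pair maps back to a code point as CP = $10000 + ((Hi - $D800) shl 10) + (Lo - $DC00); in Delphi that could be sketched as:

```pascal
// Recover the Unicode code point encoded by a UTF-16 surrogate pair
// (Hi in $D800..$DBFF, Lo in $DC00..$DFFF).
function SurrogateToCodePoint(Hi, Lo: WideChar): Cardinal;
begin
  Result := $10000 + ((Cardinal(Hi) - $D800) shl 10) +
            (Cardinal(Lo) - $DC00);
end;
```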

> But, in any case, after having tried to convert seven million titles, and
> managed to restore them back to original characters, thus
> seemingly it works to uppercase the First[1] WideChar. By
> mistake or not, it works.

Only for the subset of Unicode that do not require more than 2 bytes per
character.

> Maybe, but if the first part of the letter is what's concerned

It is not. Both parts working together are what map to a single character.
Change one part but not the other and you map to a completely different
letter.


Gambit


Rolf Lampa [RIL]

Jun 30, 2008, 5:11:43 PM
Remy Lebeau (TeamB) wrote:
> "Rolf Lampa [RIL]" <rolf....@rilnetwilldofine.com> wrote in message

>> Maybe, but if the first part of the letter is what's concerned


>
> It is not. Both parts working together are what map to a single character.
> Change one part but not the other and you map to a completely different
> letter.

OK, let's sum it up then:

1. WideChar(Capitalize[1]) works. For all titles in the entire English
Wikipedia.

2. This (1.) is no guarantee for that it will work on all language
versions of, like, Wikipedia.

3. Which in turn means that in reality I still have no (generic)
solution for a CapitalizeString method for UTF8 string data.

:(

But (4.) we have come much closer to a working solution than what I had
before asking, so thank you all anyway!

:)

Regards,

// Rolf Lampa

Rudy Velthuis [TeamB]

Jun 30, 2008, 6:34:26 PM
Rob Kennedy wrote:

> > Yes. But this is true only for the string when it's still in its
> > UTF-8 encoding.
>

> No, that's not the case. I was referring specifically to the German
> sharp s, but also ligature characters. The lowercase code point does
> not have a single corresponding uppercase code point.

FWIW, not YET. They are thinking of creating an uppercase version of
the German sharp s, in Germany.


--
Rudy Velthuis [TeamB] http://www.teamb.com

"When you do the common things in life in an uncommon way, you
will command the attention of the world."
-- George Washington Carver (1864-1943)

Rob Kennedy

Jun 30, 2008, 6:31:13 PM
Rolf Lampa [RIL] wrote:
> Rob Kennedy skrev:
>> Rolf Lampa [RIL] wrote:
>>>> Use Copy() instead to extract a substring starting at any index.
>>>
>>> Not even that is needed (see my subsequent post). Well, tyhat is,
>>> unless copying [2..n] is the only way to avoid data loss for certain
>>> characters (like Turkish, mentioned by Adam).
>>
>> It's necessary because your code is simply overwriting the first
>> character of the string with the first character of the uppercase
>> version. Like I told you before, capitalizing a character may yield
>> _multiple_ characters.
>
> Yes. But this is true only for the string when it's still in its UTF-8
> encoding.

No, that's not the case. I was referring specifically to the German
sharp s, but also ligature characters. The lowercase code point does not
have a single corresponding uppercase code point.
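To illustrate, here is a sketch that special-cases the sharp s (assuming the Wide* RTL functions do not perform the one-to-two expansion themselves; a complete solution would need the Unicode SpecialCasing data, and the character code used here is just the well-known U+00DF):

```pascal
// Sketch only: the first code point may expand to TWO characters
// when uppercased, e.g. 'ß' (U+00DF) -> 'SS', so overwriting S[1]
// in place cannot work in general.
function CapitalizeWide(const S: WideString): WideString;
begin
  Result := S;
  if S = '' then Exit;
  if S[1] = WideChar($00DF) then
    Result := 'SS' + Copy(S, 2, MaxInt)
  else
    Result := WideUpperCase(Copy(S, 1, 1)) + Copy(S, 2, MaxInt);
end;
```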

> This is why we first convert it to widestring, and while being
> a widestring we convert the "first character" - which in Utf8
> representation may well be multiple characters.

It helps if you don't think of the things in a UTF-8 string as
characters. They're bytes that make up code points.

> And after uppercase of the WIDESTRING[1] character we convert it back
> to Utf8 again.
>
> And being back in Utf8 encoding we will notice that we often have
> changed more than one character while modifying only one (1) while in
> WideString mode.
>
>
>> You can't just overwrite one because you'll be discarding the other
>> uppercase characters.
>
> Well, you can.

No, you can't.

> I just verified that I can do this on ~seven million titles in the
> English Wikipedia without any data loss (I can convert the uppercase
> back to lowercase again, without any data loss).

That's also not necessarily true. Not all characters can roundtrip.

> My test just confirmed this.

If you don't understand what you were testing, then you can't really
claim it was verified.

>> If Memo1 contains UTF-8 characters, then shouldn't S be declared as a
>> UTF8String as well?
>
> No, not necceraily. The Help says the following (System unit):
>
> "
> Delphi syntax:
> type UTF8String = type string;
> "

I know that. So clearly it's not declared to differentiate anything to
the compiler. Instead, it's for differentiating things for the
programmer. If you have data that you know is in UTF-8, then use a
UTF8String to hold it. That way, when you look at your variables,
you'll be less likely to do things to UTF8String variables that aren't
valid on UTF-8-encoded data, such as calling any of the Ansi* functions.

>>> wS := UTF8Decode(S);
>>> wS[1] := WideUpperCase(wS[1])[1]; // Capitalize
>>> Memo2.Lines.Add(wS);
>>
>> Note that you're putting a Unicode string into a non-Unicode control.
>> When you call Add, the WideString value in wS will be converted to an
>> AnsiString using your operating system's current default code page.
>
> Maybe,

Not maybe. Definitely. That's how WideString and AnsiString work in Delphi.

> but the memo is only for visual display. OTOH, my "deep-test" of
> data integrity of any strings converted with this method doesn't bother
> about showing the strings in a memo, it only compares the original
> string with the Uppercased-Lowercased result, and if it couldn't restore
> the string back to its original byte chars, it enters a breakpoint.
>
> But no breakpoints were entered for any of all the ~7 million WP titles
> in the en-wp, which to me seems to be a fairly good result even proving
> that we are not suffering from any data loss. I think. <thumbs crossed> =)

Isn't it Wikipedia policy that all subjects are capitalized anyway? It
certainly was a year ago when I was last active with it.

Also, what percentage of your tests involved strings that were different
in UTF-8 than they would be in plain ASCII? We're talking about English,
after all.

--
Rob

Remy Lebeau (TeamB)

Jun 30, 2008, 6:38:13 PM

"Rolf Lampa [RIL]" <rolf....@rilnetwilldofine.com> wrote in message
news:48694c09$1...@newsgroups.borland.com...

> 1. WideChar(Capitalize[1]) works.

Only for Unicode characters that fit in a single UTF-16 WideChar without
having to use a surrogate pair. Keep in mind that not only could a
lowercase surrogate pair map to an uppercase surrogate pair, but a
non-surrogate lowercase character can map into an uppercase surrogate pair
as well. Or worse, things like the non-surrogate German eszett ("ß") that
uppercases to a non-surrogate "SS". So you have to be ready to handle the
situation where you have 1 WideChar as input but require 2 WideChar as
output when uppercasing. Also take into account that some letters simply do
not have uppercase equivalents at all.

> Which in turn means that in reality I still have no (generic) solution for
> a CapitalizeString method for UTF8 string
> data.

Yes, you do. Simply take surrogate pairs into account, ie:

function IsHighSurrogate(wch: WideChar): Boolean; inline;
begin
  Result := (Integer(wch) >= $D800) and (Integer(wch) <= $DBFF);
end;

function IsLowSurrogate(wch: WideChar): Boolean; inline;
begin
  Result := (Integer(wch) >= $DC00) and (Integer(wch) <= $DFFF);
end;

function IsSurrogatePair(hs, ls: WideChar): Boolean; inline;
begin
  Result := IsHighSurrogate(hs) and IsLowSurrogate(ls);
end;

var
  Ch: PWideChar;
begin
  Ch := PWideChar(TheString);
  if IsSurrogatePair(Ch^, (Ch+1)^) then
  begin
    // use surrogate pair to look up uppercase character in Unicode charts...
  end else
  begin
    // use single character to look up uppercase character in Unicode charts...
  end;
end;

Refer to the Unicode standard for what each given character and surrogate
pair uppercases to:

http://www.unicode.org/faq/casemap_charprop.html


Gambit


Rolf Lampa [RIL]

Jun 30, 2008, 8:42:15 PM
Rob Kennedy skrev:
> Rolf Lampa [RIL] wrote:
>> Rob Kennedy skrev:
>>> Rolf Lampa [RIL] wrote:
>>>>> Use Copy() instead to extract a substring starting at any index.
>>>>
>>>> Not even that is needed (see my subsequent post). Well, that is,
>>>> unless copying [2..n] is the only way to avoid data loss for certain
>>>> characters (like Turkish, mentioned by Adem).

>>>
>>> It's necessary because your code is simply overwriting the first
>>> character of the string with the first character of the uppercase
>>> version. Like I told you before, capitalizing a character may yield
>>> _multiple_ characters.
>>
>> Yes. But this is true only for the string when it's still in its UTF-8
>> encoding.
>
> No, that's not the case. I was referring specifically to the German
> sharp s, but also ligature characters. The lowercase code point does not
> have a single corresponding uppercase code point.

OK, also Remy just explained these things.

My ignorance of the deepest secrets of character encoding is obvious to
all, first by me asking the question in the first place, and second, by
not knowing why seven million title strings passed a black box test.

BUT, I don't think it's fair to pretend that the black box test didn't
pass, regardless of whether I can explain why it passed. Unless the
intent is only to make fun of my ignorance about the deep secrets of
character encoding.


>>> You can't just overwrite one because you'll be discarding the other
>>> uppercase characters.
>>
>> Well, you can.
>
> No, you can't.

Well, the black box test passed.

I understand that the test is not exhaustive (that is, you and Remy
have explained that there really exist characters which won't revert
back), BUT, again, when I say that my black box test passed, it did.

So what did I say that my test verified?


>> I just verified that I can do this on ~seven million titles in the
>> English Wikipedia without any data loss (I can convert the uppercase
>> back to lowercase again, without any data loss).
>
> That's also not necessarily true. Not all characters can roundtrip.

Here I point out what my test verified: that *all* the characters
which were ACTUALLY tested made it through the round trip. No more, no
less. A black box test.

Like so:

begin
  OriginalUtf8S := Utf8S;

  // Black box here:
  //   Swap encoding
  //   Capitalize
  //   Assign result
  //   + Test: Try to convert the case back again (if the first letter
  //     has changed):

  ModifiedRestoredUtf8S := WideLowerCase(Result);
  if OriginalUtf8S <> ModifiedRestoredUtf8S then
    RaiseOrSetBreakpointHere;
end;

Seven million known title strings passed this test.

I had understanding enough to test if the *Utf8_versions* of the
original string and the modified string differed. The test verified that
that never happened. That is, no breakpoint was entered.

So why do you insist that I don't understand what I test? I do
understand what I test. I tested that in = out - for all cases tested.
What I did not test was my detailed knowledge of character encodings,
though.

My ignorance of character encodings is not the same thing as not
knowing or understanding what my test verified. My understanding of my
ignorance shows that I knew how to perform a test anyway - I chose a
black box test.

And my test verified that a UTF-8 title from enwiki as of 20080524
entering the method can, in ALL cases tested, be restored back to its
original UTF-8 characters, with no exceptions (because in my test I
tested doing just that).


>> My test just confirmed this.
>
> If you don't understand what you were testing, then you can't really
> claim it was verified.

While it is true that there's a not-so-well-known, or understood,
problem involved with testing in general (which has to do with not
really knowing what one is testing, yes), that doesn't imply that I
don't understand what I have confirmed with my specific test. I have
confirmed that:

I restored 7 million UTF-8 titles back to UTF-8, and I compared the
original UTF-8 chars with the Modified + Restored string (after being
decoded back to UTF-8). That confirms that I didn't lose any bits,
which is testable by comparing UTF-8 strings.

Exactly that. No more, no less.

It is implied that an extended test could have given a different result
than my test produced, but that was not what I claimed that my test did
verify.

>>> If Memo1 contains UTF-8 characters, then shouldn't S be declared as a
>>> UTF8String as well?
>>
>> No, not necceraily. The Help says the following (System unit):
>>
>> "
>> Delphi syntax:
>> type UTF8String = type string;
>> "
>
> I know that. So clearly it's not declared to differentiate anything to
> the compiler. Instead, it's for differentiating things for the
> programmer.

I second your thoughts on code clarity.

But we were NOT talking about coding style, instead we were onto
preserving bits while swapping between different encodings.

Therefore my answer on this is still essentially the same, although
somewhat rephrased;

"No not necessarily, because there is no type mismatch
involved with exchanging String and Utf8String".

Which is what was relevant, in context.


> If you have data that you know is in UTF-8, then use a
> UTF8String to hold it.


Good advice, and good practice.

But still no necessity involved, none at all. We were not losing bits
due to bad coding style.


>>> Note that you're putting a Unicode string into a non-Unicode control.
>>> When you call Add, the WideString value in wS will be converted to an
>>> AnsiString using your operating system's current default code page.
>>
>> Maybe,
>
> Not maybe. Definitely. That's how WideString and AnsiString work in Delphi.
>> but the memo is only for visual display.


I already told you that in my code library, where I also performed my
test, I don't usually send all my strings via memos. Nor did I do it in
this case. So why are you insisting?


>> But no breakpoints were entered for any of all the ~7 million WP
>> titles in the en-wp, which to me seems to be a fairly good result
>> even proving that we are not suffering from any data loss. I think.
>> <thumbs crossed> =)
>
> Isn't it Wikipedia policy that all subjects are capitalized anyway? It
> certainly was a year ago when I was last active with it.


My index deals with #redirect titles also, and those #redirect titles
are often poorly written (but since they are not so "visible" they are
not always fixed very soon).

But the problem is that my app uses these poorly written #redirect
titles, and when my app realizes that something is wrong with such a
title it frankly tries to fix it.

Hence the CapitalizeString (and other related tidy-up functions).

You probably know what a redirect is, but for those who do not, I give
an example below. Say that there is an article with the title "Rolf
lampa". The last "l" should be uppercased, otherwise Wikipedia won't be
able to redirect to the correct article, so we manually make a new
redirect article, pointing at the properly spelled article, like so:

#REDIRECT [[rolf LampA]]

OK?

No. Because this time I didn't get the casing correct either, and that's
what happens to Wikipedia users very often. And my problem is that
my index uses these poorly written redirects, because in my Index table
I need to be able to return "EffectivePage" to my application.

An example result from EffectivePage would in this case be the correctly
spelled (end) title [[Rolf Lampa]], even if my app asked for the bad
spelling:

S := EffectivePageByTitle['rolf lampA'];

My app would typically have picked up the bad format above from a wiki
[[link]] in an article. My index resolves these cases.
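The lookup described above can be sketched in Python (illustrative only; the names `titles`, `redirects` and `effective_page_by_title` are made up for the example, while the real index is a Delphi library):

```python
# Hypothetical sketch of EffectivePageByTitle: uppercase the first
# character MediaWiki-style, then follow a redirect if one exists.
titles = {"Rolf Lampa"}                   # canonical, case-sensitive titles
redirects = {"Rolf lampa": "Rolf Lampa"}  # redirect page -> target page

def capitalize_first(s):
    # MediaWiki touches only the first character, never [2..n].
    return s[:1].upper() + s[1:] if s else s

def effective_page_by_title(ref):
    t = capitalize_first(ref)
    target = redirects.get(t, t)          # follow a redirect where it goes
    return target if target in titles else None

print(effective_page_by_title("rolf lampa"))   # -> Rolf Lampa
```

A reference with bad casing in positions [2..n] (such as "rolf lampA") resolves to nothing, matching the thread's point that only the first character may be fixed up.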


> Also, what percentage of your tests involved strings that were different
> in UTF-8 than they would be in plain ASCII? We're talking about English,
> after all.

The percentage is irrelevant. My index was corrupted as soon as it held
a single title which had lost some bits. But not so anymore.

With the last posted CapitalizeString code, no such corruptions occur
anymore - at least not any corruptions which can be detected by
comparing the UTF8 versions of the string.
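The round-trip comparison described here can be sketched in Python; this is a translation of the idea, not the poster's Delphi code. Note that the round trip is not guaranteed for every character, which is exactly the kind of loss the test is meant to catch:

```python
def roundtrip_ok(utf8_bytes):
    # Decode, capitalize the first character, lowercase it again,
    # then compare the re-encoded UTF-8 against the original bytes.
    s = utf8_bytes.decode("utf-8")
    capped = s[:1].upper() + s[1:]
    restored = capped[:1].lower() + capped[1:]
    return restored.encode("utf-8") == utf8_bytes

print(roundtrip_ok("ärlig".encode("utf-8")))   # True: ä -> Ä -> ä
print(roundtrip_ok("ßeta".encode("utf-8")))    # False: ß uppercases to "SS"
```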

Regards,

// Rolf Lampa

Adem

unread,
Jun 30, 2008, 8:46:14 PM6/30/08
to
Remy Lebeau (TeamB) wrote:

> > 1. WideChar(Capitalize[1]) works.
>
> Only for Unicode characters that fit in a single UTF-16 WideChar
> without having to use a surrogate pair.

Rolf,

I can feel the pain you must be going through..

Here is what I'd suggest you do:

1) Get a copy of DIUnicode from here:
http://www.yunqa.de/delphi/doku.php/products/unicode/index

2) Convert your UTF-8 (or whatever) data to UCS-4, so that you have a
4-byte (LongInt) value per character. This will save you headaches
such as surrogate pairs and combining characters and a whole lot of
others.

3) Then, use CaseFolding.txt from here:
http://unicode.org/Public/UNIDATA/CaseFolding.txt
to do the case folding (for the applicable chars and cases), so that
you get a normalized string (actually, an array of LongInts)

4) Then do all your processing with these arrays of LongInts.

5) When you need to display something to the user, convert from UCS-4
to WideChars or whatever.

I recommend these --strongly.

It takes a little more RAM, but is a lot less hassle and is much faster.
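In Python the same pipeline is nearly free to sketch, since `ord` already yields the codepoint; in Delphi the DIUnicode conversions Adem links to would do steps 2 and 5. A rough illustration of the UCS-4 idea:

```python
def utf8_to_ucs4(b):
    # One 32-bit integer per character: no multi-byte sequences,
    # no UTF-16 surrogate pairs to worry about at this level.
    return [ord(c) for c in b.decode("utf-8")]

def ucs4_to_utf8(codes):
    return "".join(chr(c) for c in codes).encode("utf-8")

codes = utf8_to_ucs4("Ärlig".encode("utf-8"))
print(codes)                                   # [196, 114, 108, 105, 103]
assert ucs4_to_utf8(codes) == "Ärlig".encode("utf-8")
```

All processing (case folding included) then happens on plain integers, and the encoding only reappears at the edges.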

HTH,
Adem

Herre de Jonge

unread,
Jul 1, 2008, 1:17:30 AM7/1/08
to
Rolf,

Your test shows you can capitalize the first letter of a string and
lowercase it again. But you don't know if your capitalized word
makes sense.

Typical Dutch example (at first I thought this was a ligature, but it
is in fact a digraph): the combination "ij" is one sound and is seen
as one letter. It should also be capitalized as such. We have a lake
in Holland with a name starting with "ij". This is the IJsselmeer
(http://en.wikipedia.org/wiki/IJsselmeer). If you just go changing
the capitalization of the first letter, you'd end up with
'Ijsselmeer' or 'iJsselmeer', both of which look really strange
(actually I have a Webster's dictionary at home that has the first
form in it; in a later edition they corrected it).

The point this illustrates is that your test shows it is possible to
change the capitalization of a UTF-8 string without loss of information,
taking into account that some characters take up a different number
of characters when capitalized. What your test does _not_ show is whether
your capitalized string makes sense or not.
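Digraphs like the Dutch "ij" can only be handled with an exception table, since no general rule covers them. A minimal Python sketch (the table here is illustrative and far from complete):

```python
# Digraph-aware capitalization: consult an exception table before
# falling back to plain first-letter uppercasing.
DIGRAPHS = {"ij": "IJ"}   # Dutch: "ijsselmeer" -> "IJsselmeer"

def capitalize_title(s):
    low = s.lower()
    for digraph, capped in DIGRAPHS.items():
        if low.startswith(digraph):
            return capped + s[len(digraph):]
    return s[:1].upper() + s[1:]

print(capitalize_title("ijsselmeer"))   # IJsselmeer
print(capitalize_title("amsterdam"))    # Amsterdam
```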

I'm not sure what you're trying to achieve, but it seems like opening
a "Pandora's box" to me.

Good luck,
Herre

Rolf Lampa [RIL]

unread,
Jul 1, 2008, 2:08:07 AM7/1/08
to
Herre de Jonge wrote:
> Rolf,

> The point this illustrates is that your test shows it is possible to
> change the capitalization of an UTF-8 string without loss of information

Well, some UTF-8 strings can, but not every UTF-8 string.

> and taking into account that some characters take up a different number
> of characters when capitalized. What you test does _not_ show is whether
> your capitalized string makes sense or not.
>
> I'm not sure what you're trying to achieve, but it seems like opening
> a "Pandora's box" to me.

Thank you for your good example of where single-character Capitalization
of titles won't make sense. This is useful information for me.

How do you solve such problems without having a table of the words
which have such double "CCapitalization"? (I assume that there's no
simple rule or algorithm covering this entirely?)

The particular index I'm working on does not aim to change
Capitalization for the "original" titles (they are already capitalized).
Instead my application needs to "tidy up" other *references* to these
original titles in order to be able to perform case sensitive matches.

The need for tidying up titles is common for titles picked up from
#Redirect pages, or from free-text [[links]] in article wiki text. I
need to match these references against original titles, which are
case-sensitive unique identifiers.

Thus I would benefit from being able to handle also the cases you mention.

Unfortunately links and redirects are often typed like this:

#REDIRECT [[Template:name]]

(lowercase on "name"). Internally in my app this string won't match
the existing original title "Template:Name". Therefore I need to
decouple the namespace from the title part ("Template:" from "name"),
capitalize "name" into "Name", and then at last put them back together
again into "Template:Name".

If this redirect example had been on the Swedish Wikipedia, my tidy-up
would also have changed "Template:name" into "Mall:Name" (the English
namespaces work on all language versions, but string matches are faster
if all namespaces are localized to identical names).
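The decouple/capitalize/reassemble step, including the namespace localization mentioned for the Swedish Wikipedia, might look roughly like this in Python (the alias table and function names are made up for illustration):

```python
# Hypothetical tidy-up of a wiki reference: split off the namespace,
# capitalize both parts, and localize the namespace name.
NS_ALIASES = {"template": "Mall"}   # e.g. on the Swedish Wikipedia

def cap_first(s):
    return s[:1].upper() + s[1:] if s else s

def tidy_title(ref):
    if ":" in ref:
        ns, name = ref.split(":", 1)
        ns = NS_ALIASES.get(ns.lower(), cap_first(ns))
        return ns + ":" + cap_first(name)
    return cap_first(ref)

print(tidy_title("template:name"))   # Mall:Name
print(tidy_title("somename"))        # Somename
```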

Regards,

// Rolf Lampa

Herre de Jonge

unread,
Jul 1, 2008, 5:45:53 AM7/1/08
to
Rolf Lampa [RIL] wrote:
> Herre de Jonge wrote:
>> Rolf,
>
>> The point this illustrates is that your test shows it is possible to
>> change the capitalization of an UTF-8 string without loss of information
>
> Well, some UTF-8 strings can but not any utf8 string.
>
>> and taking into account that some characters take up a different number
>> of characters when capitalized. What you test does _not_ show is whether
>> your capitalized string makes sense or not.
>>
>> I'm not sure what you're trying to achieve, but it seems like opening
>> a "Pandora's box" to me.
>
> Thank you for your good example of were single character Capitalization
> of titles won't make sense. This is useful information for me.
>
> How do you solve such problems without having a table such words which
> have such double "CCapitalization"? (I assume that there's no simple
> rule or algorithm covering this entirely?)

I guess so. For some project I looked into hyphenation. That is
maybe even trickier, especially when extra letters appear or
disappear when the word is hyphenated. Most algorithms can come
quite close, but usually there also exists an extra list of
exceptions.

> The particular index I'm working on does not aim to change
> Capitalization for the "original" titles (they are already capitalized).
> Instead my application needs to "tidy up" other *references* to these
> original titles in order to be able to perform case sensitive matches.
>
> The need for tidying up of titles is common for titles picked up from
> #Redirect pages, or from free text [[links]] in article wiki text. I
> need to match these reference against original titles, which are case
> sensitive unique identifiers.
>
> Thus I would benefit from being able to handle also the cases you mention.

I think you should maybe change your approach. Basically you already
have the correct capitalization (your database with original links).
You could try to match against them, I guess. First try to match
case-sensitively. If you find your match, there is nothing to
correct. If you can't find a match, try to look it up
case-insensitively.

Maybe for speed it is best to find all case-insensitive match(es) and
use some rules to determine the correct version:
1) If the same capitalization exists --> no change
2) If only one result --> change to that one
3) Multiple matches, so try to choose the best one (but what are
   the criteria? Only first letter capitalized? Fewest
   differences?)
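The three rules above can be sketched as a two-stage lookup, exact match first and then a case-insensitive index (Python, illustrative only):

```python
# Exact case-sensitive match first (rule 1), then a case-insensitive
# fallback: a unique candidate wins (rule 2), multiple are ambiguous (rule 3).
titles = ["IJsselmeer", "Ijsselmeer", "Amsterdam"]
ci_index = {}
for t in titles:
    ci_index.setdefault(t.lower(), []).append(t)

def resolve(ref):
    if ref in titles:
        return ref                    # rule 1: exists as-is, no change
    candidates = ci_index.get(ref.lower(), [])
    if len(candidates) == 1:
        return candidates[0]          # rule 2: only one result
    return candidates or None         # rule 3: ambiguous (or no match)

print(resolve("amsterdam"))           # Amsterdam
print(resolve("ijsselmeer"))          # ['IJsselmeer', 'Ijsselmeer']
```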

At first I thought case 3 might not be possible in the Wikipedia
database, but
http://en.wikipedia.org/w/index.php?title=Ijsselmeer&redirect=no
(again our lake) shows that multiple versions of the same term
can exist. So maybe you should take into account whether the match
is a redirect page or not...

Take care,
Herre

Rolf Lampa [RIL]

unread,
Jul 1, 2008, 1:42:29 PM7/1/08
to
Herre de Jonge wrote:
> Rolf Lampa [RIL] wrote:
>> Herre de Jonge wrote:

>> The particular index I'm working on does not aim to change
>> Capitalization for the "original" titles (they are already capitalized).
>> Instead my application needs to "tidy up" other *references* to these
>> original titles in order to be able to perform case sensitive matches.
>>
>> The need for tidying up of titles is common for titles picked up from
>> #Redirect pages, or from free text [[links]] in article wiki text. I
>> need to match these reference against original titles, which are case
>> sensitive unique identifiers.
>>
>> Thus I would benefit from being able to handle also the cases you
>> mention.
>
> I think you should maybe change your approach. Basically you already
> have the correct capitalization (your database with original links).

No. My application does two things: one, it must be able to mimic
Mediawiki behavior; and two, log bad syntax.

And in so doing (mimicking), my app will on the fly #1. recognize false
leads in redirects (redirects which due to bad casing will NOT find the
intended target article, not even on WP). Therefore my title index also
*should* return "IndexOf = -1" in such cases.

This means that my app can optionally #2. log that a syntactically bad
reference exists in the Wikipedia (or whatever database it is
processing), and #3. try to suggest a solution.

Now, since many syntactic errors can be ambiguous, one cannot always
automagically just fix them. Thus the log file should be very helpful in
formatting its findings and suggestions in *wiki syntax*, so that one can
copy&paste the log info into a wiki page (on the site where the data
comes from) and then manually click on the links to the titles concerned
in order to look up the logged errors and fix them, all according to
the clear-text hints provided by my log.

> You could try to match against them, I guess. First try to match
> case sensitive. If you can find your match, there should be nothing
> to correct. If you can't find a match, try to look it up case
> insensitive.

My data processing needs to do so many lookups that what takes
hours today with my current solution would take days if I had to try
several times.

Especially when I use the COM server version of the index it simply
wouldn't work; it would take too long. Instead of prolonging the time
spent on lookups, my app needs to reduce such time, and it has even
become several times faster still, in order to provide the intended
functionality (the processing time of a dump must fit into a small
time window for a live database).

All in all, putting more load on the title index would make that
impossible from the start. But that is not the only aspect. I also
need to be able to mimic Mediawiki behavior. And since indexed titles in
MW are ALWAYS Propercased, and case-sensitive only on the [2..n]
positions of each part of a "Namespace:Title" string, the simplest and
fastest rule to match a title is: #1. Always Capitalize any attempt to
reference a valid page title, but *never* touch any of the
([2..n]:)[2..n] characters in the title string to look up. (AFAIK this
is relevant also for the Dutch CCapitals you mentioned; that is, if the
title doesn't have the correct case in pos [2] then there will be no
match in WP either. Try inserting a link on a page (the search input box
has different behavior) like so: "[[WIkipedia]]". You'll find that the
link will be red, that is, the article doesn't exist, while the article
[[Wikipedia]] DOES exist.)

So, being able to Capitalize correctly is ALWAYS (no exceptions)
essential for all links/references my app picks up in any text anywhere
in the MW data.

Now, valid Mediawiki syntax for references IN TEXT (not how they are
stored as valid indexed titles) can be very different, and thus they
need to be "tidied up" on the fly (well, the capitals) in order to find
their target pages. Examples:

[[:Template:somename]] - OK, works. MW will automagically uppercase to
"Somename" and thus give a match (I call this "Propercase").

[[:template:somename]] - Same here; MW will Propercase to
"Template:Somename".

{{somename}} - MW assumes that no namespace means "Template:", and thus
it will insert "Template:" and Propercase the name into
"Template:Somename".

{{sOmename}} - This template call will be Propercased by MW into
"Template:SOmename", and thus MW (or WP) will NOT find the template (if
its title is "Template:Somename").

Same thing for manually typed redirects:

"#REDIRECT [[template:sOmename]] - MW will Propercase into
"Template:SOmename" and thus will NOT find the target article (that is,
this redirect reference is a "Nil pointer" so to say).

SO, if a string which my app picks up and tries to look up is not
Capitalized (in two places in the string: "First:Second") then I know in
advance that I will NOT find a match (the indexed titles ARE always
Capitalized, no exceptions).

But my app deals not only with "mimicking" MW; it also logs invalid
syntax in all data it handles, because such knowledge is of great value.
And because my log system tries to be helpful in creating "TODO lists"
of things to fix manually, my app also attempts to match mismatches, and
if it finds any thinkable candidates it adds those too to the log.

In this way my app produces logs with valuable information for manual
corrections to syntax errors in any MW xml dump it processes.

> 3) Multiple matches, so try to choose best one (but what are
> the criteria? Only first letter is capitalized? Least
> differences?)
>
> At first I thought case 3 might not be possible in the Wikipedia
> database, but
> http://en.wikipedia.org/w/index.php?title=Ijsselmeer&redirect=no
> (again our lake) shows that multiple versions of the same term
> can exist. So maybe you should take into account whether the match
> is a redirect page or not...

Normally I must mimic MW behavior (apart from that I tend to log
suspicious syntactic errors in the text), and that means that there are
two Mediawiki-compliant ways of accessing a page. One is the "raw brute
force" approach, which looks up the data on disk, loads a page object,
and returns the text:
sArticleText := PageByTitle['Ijsselmeer'].AsXml;

...which will return the text:
#REDIRECT [[IJsselmeer]]
See: http://en.wikipedia.org/w/index.php?title=Ijsselmeer&redirect=no

Or, the other way (also MW behavior, when desired)

sArticleText := *EffectivePage*ByTitle['Ijsselmeer'].AsXml;

...which will "follow redirects where they go" and return the effective
desired target article text, which starts like:

"{{coor title dm|52|49|N|5|15|E|region:NL_type:waterbody}}
[[Image:Zuider1.jpeg|thumb|right|250 px|Landsat photo]]
'''[[IJsselmeer]]''' (sometimes translated as '''Lake IJssel''',
alternative international spelling..."

And so on. See: http://en.wikipedia.org/wiki/IJsselmeer

So, all in all, the fastest approach, and the approach which always
gives valid results (mimicking MW), is to always Capitalize[1] the first
char of references trying to look up a valid Mediawiki title. This is
true also for the Dutch double CCapitalizations.*

:)

Regards,

// Rolf Lampa

* It would be a bug if I forced Dutch CCapitalization, but on a
mismatching reference it would be a useful feature if attempting such a
CCapitalization suggested a match for the mismatching reference
(since I could log that as a hint for a manual fix/tidy-up of a poorly
cased link!)

Rolf Lampa [RIL]

unread,
Jul 3, 2008, 12:37:12 AM7/3/08
to
Adem wrote:

> 3) Then, use CaseFolding.txt from here:
> http://unicode.org/Public/UNIDATA/CaseFolding.txt
> to do the case folding (for those applicable chars and cases), so that
> you get a normalized string (actually, array of LongInts)
>
> 4) Then do all your processing with these arrays of LongInts.

OK, assuming that you meant that if I use UCS-4 I can always go for
(F)ULL folding. Thus I could just strip out the (S)imple alternatives
from the list and entirely disregard high/low bits?

If I understood the table correctly I can use the info from a row in the
file CaseFolding.txt to go both ways, that is, from (0041-->0061) and
back again (0061-->0041), and for codes not in this list I can leave
them as is. ( Row example: "0041; C; 0061; # LATIN CAPITAL LETTER A" )

Is this all I need to regard?
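For what it's worth, rows in that format can be loaded into a lookup table along these lines (a Python sketch; the status letters are CaseFolding.txt's C/S/F/T convention, and only the first three fields are used):

```python
# Parse rows like "0041; C; 0061; # LATIN CAPITAL LETTER A" into a
# codepoint -> folded-codepoints mapping, keeping common (C) and
# full (F) foldings and skipping the simple (S) alternatives.
def parse_case_folding(lines):
    fold = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()   # drop comments
        if not line:
            continue
        code, status, mapping = [f.strip() for f in line.split(";")[:3]]
        if status in ("C", "F"):
            fold[int(code, 16)] = [int(c, 16) for c in mapping.split()]
    return fold

rows = ["0041; C; 0061; # LATIN CAPITAL LETTER A",
        "00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S"]
print(parse_case_folding(rows))   # {65: [97], 223: [115, 115]}
```

Note that a folded character may map to several codepoints (sharp s above), so going "back again" is only safe for the one-to-one rows.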

As for increased memory consumption when using the intermediate format
UCS-4 for small pieces of text, like words and/or titles, it doesn't
really matter. Converting entire articles to UCS-4 OTOH can start to
consume memory since I plan to keep some articles (like static templates
etc) in an internal cache to speed up things (instead of retrieving the
same page from disk again and again).

Btw, if I convert from Utf8 to UCS-4, and put the result into a regular
string (a "convenient byte array"), can't I just mask this string
byte-wise on the first four bytes using the hex code from the code table?

Like so:

procedure Utf8Capitalize_Test(var sUtf8: String);

function ConvertUtf8ToUCS4(sUtf8: Utf8String): String;
begin
// Using DIUnicode lib.
// ...
Result := sUtf8;
end;

function ConvertUCS4ToUtf8(sUCS4: String): String;
begin
// Using DIUnicode lib.
// ...
Result := sUCS4;
end;

function GetFoldCodeForUCS4Char(
aCharCode: LongInt;
out FoldCode: LongInt): Boolean;
var
tmp: LongInt;
begin
// ... look up the code in the CaseFolding.txt table
FoldCode := {GetFoldCode}(aCharCode);
Result := FoldCode <> aCharCode;
end;

var
sUcs4: String; // Yes, string :b
FoldCode, FirstCharCode: LongInt;
begin
sUcs4 := ConvertUtf8ToUCS4(sUtf8); // Using DIUnicode

// Byte-wise extract the encoding of the first UCS-4 character

FirstCharCode := (Ord(sUcs4[1]) shl 24);
FirstCharCode := (Ord(sUcs4[2]) shl 16) or FirstCharCode;
FirstCharCode := (Ord(sUcs4[3]) shl 8 ) or FirstCharCode;
FirstCharCode := Ord(sUcs4[1]) or FirstCharCode;

// Now, look up the new code in the CaseFolding table, and modify
// only if the new 4-byte char code is different:

if GetFoldCodeForUCS4Char(FirstCharCode, FoldCode) then
begin
// Byte-wise mask the first UCS-4 character to uppercase
sUcs4[1] := Char( FoldCode and $F000 );
sUcs4[2] := Char( FoldCode and $0F00 );
sUcs4[3] := Char( FoldCode and $00F0 );
sUcs4[4] := Char( FoldCode and $000F );
sUtf8 := ConvertUCS4ToUtf8(sUcs4); // Using DIUnicode
end {
else
sUtf8 := not modified; }
end;

Would something like this work? Does anyone have a better/faster
solution?

Regards,

// Rolf Lampa

Rolf Lampa [RIL]

unread,
Jul 3, 2008, 12:43:02 AM7/3/08
to

> FirstCharCode := (Ord(sUcs4[1]) shl 24);
> FirstCharCode := (Ord(sUcs4[2]) shl 16) or FirstCharCode;
> FirstCharCode := (Ord(sUcs4[3]) shl 8 ) or FirstCharCode;
> FirstCharCode := Ord(sUcs4[1]) or FirstCharCode;

Correction to the last row ( [1]-->[4] ):

FirstCharCode := Ord(sUcs4[4]) or FirstCharCode;

Rolf Lampa [RIL]

unread,
Jul 3, 2008, 2:02:49 AM7/3/08
to
Rolf Lampa [RIL] wrote:

> if GetFoldCodeForUCS4Char(FirstCharCode, FoldCode) then
> begin
> // Byte-wise mask the first UCS-4 character to uppercase
> sUcs4[1] := Char( FoldCode and $F000 );
> sUcs4[2] := Char( FoldCode and $0F00 );
> sUcs4[3] := Char( FoldCode and $00F0 );
> sUcs4[4] := Char( FoldCode and $000F );
> sUtf8 := ConvertUCS4ToUtf8(sUcs4); // Using DIUnicode
> end {
> else
> sUtf8 := not modified; }
> end;


Oops, forgot to shift right (and the masks must cover a full byte):

if GetFoldCodeForUCS4Char(FirstCharCode, FoldCode) then
begin
  // Byte-wise extract each byte of the folded UCS-4 character

  B := (FoldCode shr 24) and $FF;
  sUcs4[1] := Char( B );
  B := (FoldCode shr 16) and $FF;
  sUcs4[2] := Char( B );
  B := (FoldCode shr 8) and $FF;
  sUcs4[3] := Char( B );

  sUcs4[4] := Char( FoldCode and $FF );
  sUtf8 := ConvertUCS4ToUtf8(sUcs4); // Using DIUnicode
end

BTW, why doesn't it work to shift inside the Char() parenthesis, like so?

sUcs4[1] := Char((FoldCode shr 24) and $FF);

Adem

unread,
Jul 3, 2008, 3:41:28 AM7/3/08
to
Rolf Lampa [RIL] wrote:

> Adem wrote:


>
> If I understood the table correctly I can use the info from a row in
> the file CaseFolding.txt to go both ways, that is, from (0041-->0061)
> and back again (0061-->0041), and for codes not in this list I can
> leave them as is. ( Row example: "0041; C; 0061; # LATIN CAPITAL
> LETTER A" )

That is my understanding too. Except for the chars that have combining
chars.

By 'combining chars', I am referring to things like this (from
CaseFolding.txt) :

0390; F; 03B9 0308 0301; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS

Here, in this string '03B9 0308 0301', '03B9' is the base character and
all the rest are glyphs that are added to it.

[[
This is what I mean: a + ~ --> 'ã' and here it is represented as
BASE_CHAR + TILDE.

Anyway, you could Google for this confusion but it is not that
important.
]]

What is important is, for normalization purposes, I suggest you take
the base char (and ignore the combining chars).

This makes life a lot easier.
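Taking the base char can be sketched with Python's `unicodedata` module (decompose, then drop the combining marks); the caveat about decomposing in languages like Turkish applies here too:

```python
import unicodedata

def base_char(ch):
    # NFD decomposition, then keep only the non-combining codepoints.
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(base_char("\u00e3"))   # 'ã' -> 'a'
print(base_char("\u0390"))   # iota with dialytika and tonos -> plain iota
```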

[Having said that, I would not recommend decomposing --it will screw up
stuff in langs like Turkish (if that's important).]

> Is this all I need to regard?

To the best of my knowledge, yes.

But, you might also wish to use 'UnicodeData.txt' (from the same
address) just to be sure --as well as
http://unicode.org/Public/UNIDATA/SpecialCasing.txt

> Btw, if I convert from Utf8 to UCS-4, and put the result into a
> regular string (a "convenient byte array"), can't I just mask this
> string byte-wise on the first four bytes using the hex code from the
> code table?
> Like so:

Why do you do that?

Why not simply use an array such as this for lookup?

var
  CaseFoldingArray: array [$41..$10427] of Cardinal;

and, initially fill it with $FFFFFFFF meaning that case folding for
that cell does not exist.

Then, fill it with the data from CaseFolding.txt.

Afterwards, all you need is something like this: {compiled with
XanaNews :) }

function GetCaseFolding(const AInput: Cardinal; out AOutput: Cardinal): Boolean;
begin
  Result := False;
  AOutput := $FFFFFFFF;
  case AInput of
    $41..$10427:
      if CaseFoldingArray[AInput] <> $FFFFFFFF then
      begin
        AOutput := CaseFoldingArray[AInput];
        Result := True;
      end;
  end;
end;

I don't see why something like this shouldn't be faster.

Rolf Lampa [RIL]

unread,
Jul 3, 2008, 8:42:38 AM7/3/08
to
Adem wrote:
> Rolf Lampa [RIL] wrote:

>> Adem wrote:
> 0390; F; 03B9 0308 0301; # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND
> TONOS
>
> Here, in this string '03B9 0308 0301', '03B9' is the base character and
> all the rest are glyphs that are added to it.
>
> [[
> This is what I mean: a + ~ --> 'ã' and here it is represented as
> BASE_CHAR + TILDE.

OK.

I assume that the additional numbers are OR'ed together. But $0000 is
4+4+4+4 (16) bits (UCS-2), not 32 as in UCS-4.

Thus, should the "additions" to the base_char be added to higher or
lower half of a 32 bit UCS-4? Like so?:

03B9
0308
0301 or
---- ----
03B9

Why I ask is because it's obvious that the "additions" don't add
anything to the base char in this case (or if they do, they were
already added to the base char $03B9 from the beginning).

Or is it meant to be added on top, like so?:

0000 03B9
0308
0301 or
---- ----
0309 03B9

I doubt it. To me it seems like the additional numbers are there only
for information, they don't add anything to the base_char. But if the
base_char is two bytes (WordChar, 2 bytes or 16 bits), then what is
UCS-4 good for?


> Anyway, you could Google for this confusion but it is not that
> important.
> ]]


I promise, I don't have to google for confusion to get confused... =)

(pst, somehow it comes natural for me, at least when speaking character
encodings... =)


> [Having said that, I would not recommend decomposing --it will screw up
> stuff in langs like Turkish (if that's important).]

Well, who knows. My app will be used for processing MW data, and it
feels strange to exclude any language. Turkish may well be crunched with
it too although not by me.

> But, you might also wish to use 'UnicodeData.txt' (from the same
> address) just to be sure --as well as
> http://unicode.org/Public/UNIDATA/SpecialCasing.txt

Ouch, the SpecialCasing.txt isn't for glyphs like me... :)


>> Btw, if I convert from Utf8 to UCS-4, and put the result into a
>> regular string (a "convenient byte array"), can't I just mask this

...
> Why not simply use an array such as this for lookup? ...


> Then, fill it with the data from CaseFolding.txt.

> ...


> function GetCaseFolding(const AInput: Cardinal; out AOutput: Cardinal):

...


> I don't see why something like this shouldn't be faster.

That's OK for the code table, I meant "masking in" the new base_char
code (retrieved by, yes, why not this GetCaseFolding method) into the
string to be capitalized.

In any case, I'll start out with 16-bit WordChar, and if it happens that
my app ever loses any bits above the first 16 I'll tell those picky
users to google for them... (until the next major release, or so) :)

Regards,

// Rolf Lampa

Adem

unread,
Jul 3, 2008, 11:45:09 AM7/3/08
to
Rolf Lampa [RIL] wrote:

> I doubt it. To me it seems like the additional numbers are there only
> for information, they don't add anything to the base_char.

They do. But, not arithmetically.

The rendering engine, when it encounters one of those chars, superposes
them onto the base char.

Yeah, I know, it is rather silly and most confusing.

Unicode is supposed to be a canonical system for codepoints, but it's
not there yet.

So, what it does is, for some very rare chars, instead of assigning a
codepoint, it offers a lego-like mechanism to construct (for visual
purposes) the character glyph.

In any case, we are assured that these odd characters number only
about 100.

> But if the base_char is two bytes (WordChar, 2 bytes or 16 bits),
> then what is UCS-4 good for?

Simple: 2 bytes can cover only 64K chars, and that is just not enough if
you wish to cover the Asian/Eastern and ancient languages. 3 bytes
would probably be enough, but would you really wish to work with 3-byte
data types?

> (pst, somehow it comes natural for me, at least when speaking
> character encodings... =)

Don't be fooled by my tone of voice either --I am definitely no expert
myself. I too am trying to make my way around the Byzantine dungeons,
and the concepts (or, rather the reasons for the workarounds) are only
recently sinking in --if at all.

> > [Having said that, I would not recommend decomposing --it will
> > screw up stuff in langs like Turkish (if that's important).]
>
> Well, who knows. My app will be used for processing MW data, and it
> feels strange to exclude any language. Turkish may well be crunched
> with it too although not by me.

Turkish, Azeri --as well as probably a couple of other-- langs pose a
challenge to the ANSI-accustomed when it comes to case folding, but
that problem is relatively easy. Your main concerns will arise, I
believe, covering the Asian/Eastern and ancient languages. Then I
expect that you will appreciate having started with (the use of) UCS-4.

Now... How do you normalize a Chinese script so that your lookups work
as expected?

Or, do you need to normalize a Chinese script?

I don't know. Someone else should fill in at that point.

[ http://www.flexiguided.de/publications.utf8proc.en.html may help with
this. And, all Delphi community (including me) will be immensely
grateful if you translated it into Pascal. ]

> > But, you might also wish to use 'UnicodeData.txt' (from the same
> > address) just to be sure --as well as
> > http://unicode.org/Public/UNIDATA/SpecialCasing.txt
>
> Ouch, the SpecialCasing.txt isn't for glyphs like me... :)

I too hate to step on these mines.. At every turn of a corner, you're
likely to encounter a new file that modifies 'UnicodeData.txt' in some
cryptic way.

There's a reason for that: Unicode.org needs to keep 'UnicodeData.txt'
backwards compatible --meaning they cannot add new fields to the format.

I don't like it. But, there's no choice: I have to lump it.

> That's OK for the code table, I meant "masking in" the new base_char
> code (retrieved by, yes, why not this GetCaseFolding method) into the
> string to be capitalized.

Masking?

What masking?

Are you sure you're not still thinking ANSI/ASCII?

There's no linear relationship between an uppercase and a lowercase.

These are all mandatory now --in Unicode land. Plus, on top of
uppercase and lowercase, you also have a titlecase..

> In any case, I'll start out with 16 bit WordChar, and if it happens
> that my app ever loses any bits above the first 16 bits I'll tell
> them picky users to google for them... (until next major release, or
> so) :)

Please don't do that. That would be a design mistake which will cause
endless headaches later on.

Instead, stop thinking that chars are chars. Think of them as
'const's of 'Cardinal' type. And, strings are arrays of Cardinals.

Do all your background 'text' processing/calculations with these 4-byte
'consts', and convert them to WideChar only at the moment of displaying
them.

Doing it that way will also mean that your background processing will
be independent of your users' Locale (of the computer) until the moment
of display.

Rolf Lampa [RIL]

unread,
Jul 3, 2008, 1:26:41 PM7/3/08
to
Adem wrote:
> Rolf Lampa [RIL] wrote:
>
>> To me it seems like the additional numbers are there only
>> for information, they don't add anything to the base_char.
>
> They do. But, not aritmetically.

Spooky.


>> But if the base_char is two bytes (WordChar, 2 bytes or 16 bits),
>> then what is UCS-4 good for?
>
> Simple: 2-bytes can cover only 64K chars; and it is just not enough,


OK, this was based on my faulty assumption about only one byte-pair in
the first place.


> Now... How do you normalize a Chinese script so that your lookups work
> as expected?

If you don't ask, then I won't have to answer. :)

>> That's OK for the code table, I meant "masking in" the new base_char
>> code (retrieved by, yes, why not this GetCaseFolding method) into the
>> string to be capitalized.
>

> Masking? ... > There's no linear relationship between an uppercase and
> a lowercase?


Sorry, I meant bit- or byte-masking the new char code into the beginning
of the string, after the correct code has been found. But concatenation
is probably best, since the folded char may grow on folding.


> These are all mandatory now --in Unicode land. Plus, on top of
> uppercase and lowercase, you also have a titlecase..

Title case? Is that different from uppercasing the first char?

(In my case I can entirely disregard any "non-Mediawiki" special cases,
like the Dutch CCapitalize, because in MW one must never touch the
[2..n] chars. Only the first character needs to be uppercased, always,
in order to match the indexed titles.)


> Instead, stop thinking that chars are chars. Think of them as some
> 'const's of 'Cardinal' type. And, strings are array of Cardinals.

I need only to capitalize (temporarily) smaller string fragments picked
up from text, in order to match them against indexed page titles.

I need not touch the Utf8 encoding for other text processing (searches
and replacements can be performed on the raw Utf8 strings; I convert the
search&replace strings to Utf8 instead).


> Do all your background 'text' processing/calculations with these 4-byte
> 'consts', and convert them to WideChar only at the moment of displaying
> them.

Perhaps I didn't say it before, but I actually don't display any MW data
to users, I only process/manipulate it. It's up to the MW parser to
display any MW content properly, when data is on board again on the MW
platform. <sigh of relief>

Thus I have only the temp capitalization to handle.

Regards,

// Rolf Lampa

Adem

unread,
Jul 3, 2008, 2:30:51 PM7/3/08
to
Rolf Lampa [RIL] wrote:

> > Simple: 2-bytes can cover only 64K chars; and it is just not
> > enough,
>
> OK, this was based on my faulty assumption about only one byte-pair
> in the first place.

Yep. And, that --to me-- dictates 4 bytes per codepoint.

> >Masking? ... > There's no linear relationship between an uppercase
> and a lowercase?
>
> Sorry, I meant bit- or byte masking the new char-code into the
> beginning of the string, after the correct code has been found. But
> concatenation is probably the best since the code folded char may
> grow on folding.

I can see that you're still trying to sneak WideChar/WideString
stuff into the game ;)

Oh, well.. At least, I did my best to warn you against that :P

> Title case? Is that different from uppercasing the first char?

There are so many cases.. letter cases :)

Here is a short reading list.

http://en.wikipedia.org/wiki/Letter_case
http://en.wikipedia.org/wiki/Capitalization [briefly explains title
case]
http://en.wikipedia.org/wiki/Sentence_case
http://en.wikipedia.org/wiki/Capitalisation
http://en.wikipedia.org/wiki/Unicase

Here's of course what we Delphi people love to practice:
http://en.wikipedia.org/wiki/CamelCase

And, this would be relevant for when it is relevant :P
http://en.wikipedia.org/wiki/Internet_capitalization_conventions


Title case is changing the first letter of each word (of the selected
text) to uppercase, while the rest of the letters are in lowercase.

But, of course, it is not that simple --is anything?
For instance, if you select the text "this is a test" and then try to
apply the above algo to change the text to title case, you end up with
"This Is A Test."

Yet, common rules of capitalization would dictate that the "short"
words ("is" and "a") should not be capitalized.
So, title casing is not only language-dependent, it also requires
further intelligence in the algo.
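A rough sketch of that difference (Python for brevity; the stop-word list is a made-up English sample, and real rules vary by style guide and language):

```python
# Naive title case vs. a version that skips "short" words.
# Unicode even defines titlecase as a genuine third case for
# digraph characters: 'dz with caron' uppercases to 'DZ' but
# titlecases to 'Dz' (U+01C6 -> U+01C4 vs. U+01C5).
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "or", "in", "to"}

def naive_title(text: str) -> str:
    return " ".join(w.capitalize() for w in text.split())

def smarter_title(text: str) -> str:
    return " ".join(
        w.capitalize() if i == 0 or w not in STOP_WORDS else w
        for i, w in enumerate(text.split())
    )

print(naive_title("this is a test"))    # 'This Is A Test'
print(smarter_title("this is a test"))  # 'This is a Test'
```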

> (in my case I can entirely disregard any "non Mediawiki" special
> cases, like the Dutch capitalization, because in MW one must not touch
> the [2..n] chars, ever. Only the first character needs to be always
> uppercased in order to match the indexed titles.

Hmmm.. I wasn't aware of the 'Dutch capitalization'.. but, I did always
know there was something peculiar about the Dutch --as well as the
Spanish with their inverted exclamation and question marks at the
beginning of sentences :P
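For the record, the Dutch peculiarity is the 'ij' digraph, which capitalizes as a pair: 'ijsselmeer' becomes 'IJsselmeer'. A minimal sketch (Python; `dutch_capitalize` is a hypothetical helper name):

```python
def dutch_capitalize(word: str) -> str:
    # Dutch treats 'ij' as a single letter for capitalization.
    if word[:2].lower() == "ij":
        return "IJ" + word[2:]
    return word[:1].upper() + word[1:]

print(dutch_capitalize("ijsselmeer"))  # 'IJsselmeer'
print("ijsselmeer".capitalize())       # 'Ijsselmeer' -- wrong for Dutch
```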

> > Instead, stop thinking that chars are chars. Think of them as some
> > 'const's of 'Cardinal' type. And, strings are array of Cardinals.
>
> I need only to capitalize (temporarily) smaller string fragments
> picked up from text, in order to match them against indexed page
> titles.".

True. You're luckier than most.

> Perhaps I didn't say it before, but I actually don't display any MW
> data to users, I only process/manipulate it. It's up to the MW parser
> to display any MW content properly, when data is on board again on
> the MW platform. <sigh of relief>
>
> Thus I have only the temp capitalization to handle.

I was under the impression that you may have to display the actual text
for URLs that are found to be missing, etc.

Rolf Lampa [RIL]

unread,
Jul 3, 2008, 3:36:18 PM7/3/08
to
Adem wrote:

> I can see that you're still trying to sneak in WideChar/WideString
> stuff into the game ^)
> Oh, well.. At least, I did my best to warn you against that :P

OK, OK, OK, I got it, I will concatenate cardinals instead, promise!
<crossing fingers> :)


> Title case is changing each word <...> "This Is A Test."

Oh no, it's not allowed to modify MW titles in THAT way!...


> Hmmm.. I wasn't aware of the 'Dutch capitalization'..

That was new also to me until Herre de Jonge told me about it in this
thread.


> I was under the impression that you may have to display the actual text
> for URLs that are found to missing, etc.

Yes, but it's only simple log text in Utf8 format, which many editors
can convert to Win-1252 or the like before pasting it into a live wiki
page. Or, I can optionally add the log as a "log article", the last
xml page in the processed dump, and then the text is in proper Utf8
format, ready for output to the stream.

During the processing with this tool there's no meaningful user
interaction. One can sit and idly watch the progress bar slowly advance
while the sun sets over the county, but that gets boring after a while.
Processing can take hours or more if the xml file is very big.

Regards,

// Rolf Lampa

Rolf Lampa [RIL]

unread,
Jul 4, 2008, 8:19:10 AM7/4/08
to
> Adem wrote:
>> I was under the impression that you may have to display the actual text

BTW, since I'm not making an app for viewing MW content myself (my app
only produces and processes MW data for later import), I have a hint
about an application which does view Mediawiki content on a PC: WikiTaxi.

So, if someone wants a quick and easy way to view the entire English
Wikipedia on their laptop, try Mr Junker's WikiTaxi*.

WikiTaxi will import directly from a Mediawiki dump, for example the
last successful English Wikipedia dump: all articles (only last
revision), except for user & discussion pages:

http://download.wikimedia.org/enwiki/20080524/pages-articles.xml.bz2

WikiTaxi reads this dump directly (no unpacking needed) and stores it in
a SQLite database in less than two hours (~1 hour on my new Dual2 DELL
laptop). Db size after import is ~6.7GB; dump size (.bz2) is ~3.8GB
compressed.

It displays no images (of course) and no LaTeX for math formulas, but it
seems to render most other stuff very well.

Quick and handy. Recommended.

http://www.wikitaxi.org/delphi/doku.php/products/wikitaxi/index

Regards,

// Rolf Lampa


* A more complicated way to view WP offline is to install xampplite and
mediawiki.
- http://www.apachefriends.org/en/xampp-windows.html#646
- http://www.mediawiki.org/wiki/MediaWiki
