On Tue, Aug 17, 2010 at 12:40 PM, Dominic Mazzoni <dmaz...@chromium.org> wrote:
> What's the proper way to work with arbitrary unicode text in Chrome?
> It looks like most code uses wstring or string16 and then assumes that each
> wchar_t / char16 is equal to one character.

We shouldn't be assuming this. What code makes this assumption? We
should be filing bugs on those places where you find it.

> But don't we use utf16, where
> there are unicode characters that don't fit into 16 bits?

Correct.

> What I want to do: get the actual length of string, iterate over each of the
> actual characters.

What do you need to do this for? Generally when people want to do
this, they're doing the wrong thing. Even if you iterate over the
Unicode "code points", there are combining accents, vowel points,
ligatures, etc that can make this measure fairly meaningless.

Brett
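
To make the distinction above concrete, here is a minimal sketch of how code units, code points, and what a user sees as one character can all differ. It assumes std::u16string as a stand-in for string16, and CountCodePoints is a hypothetical helper written for illustration, not an existing Chromium or ICU function.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-16 string.
// A lead surrogate (0xD800-0xDBFF) immediately followed by a trail surrogate
// (0xDC00-0xDFFF) counts as one code point. Unpaired surrogates are counted
// individually rather than rejected.
std::size_t CountCodePoints(const std::u16string& s) {
  std::size_t count = 0;
  for (std::size_t i = 0; i < s.size(); ++i) {
    const char16_t c = s[i];
    const bool is_lead = c >= 0xD800 && c <= 0xDBFF;
    if (is_lead && i + 1 < s.size() &&
        s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
      ++i;  // Consume the trail surrogate together with its lead.
    }
    ++count;
  }
  return count;
}
```

U+1D11E (musical G clef) is stored as the two code units 0xD834 0xDD1E, so size() reports 2 while the helper reports 1. In the other direction, "e" followed by U+0301 (combining acute accent) is two code points but is displayed as a single accented letter, which is why even a code point count can be fairly meaningless.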
> Some of the methods in app/text_elider.cc, like CutString, seem unsafe to
> me. Specifically, it looks like it could cut right in the middle of a
> character, creating either an unusual character or worse making an invalid
> UTF-16 string.

I agree. Can you file a bug on that?

> On the flip side, could you point me to some code that does it correctly
> that I could follow as an example?
>
>> > But don't we use utf16, where
>> > there are unicode characters that don't fit into 16 bits?
>>
>> Correct.
>>
>> > What I want to do: get the actual length of string, iterate over each of
>> > the actual characters.
>>
>> What do you need to do this for? Generally when people want to do
>> this, they're doing the wrong thing. Even if you iterate over the
>> Unicode "code points", there are combining accents, vowel points,
>> ligatures, etc that can make this measure fairly meaningless.
>
> I'm working on accessibility for text controls. I've got a string that comes
> from a native text control and the start and end selection bounds (in
> characters, not bytes). I'd like to pull the substring corresponding to the
> selection out from the middle of the string. I also want to know the length
> of that substring and the length of the original string.

I assume you're working on GTK that has some weird behavior here?

If you want to iterate over UTF8 or UTF16 characters, the best thing
is the ICU macros. You can see U8_NEXT being used in
browser/history/snippet.cc. I'd love to have some nicer wrapper class
that would iterate over UTF8 or UTF16 characters and return code
points if you feel like writing such a thing.

Brett
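
For the selection use case described above, a rough sketch of the kind of wrapper being asked for might look like the following. It hand-rolls the surrogate decoding so that it stands alone; real code would more likely build on ICU's U16_NEXT. NextCodePoint and SubstringByCodePoints are hypothetical names, std::u16string stands in for string16, and the sketch assumes the selection bounds count code points.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not ICU or Chromium API. Decodes the code point that
// starts at code unit index *i and advances *i past it, in the spirit of
// ICU's U16_NEXT. An unpaired surrogate is returned unchanged.
char32_t NextCodePoint(const std::u16string& s, std::size_t* i) {
  assert(*i < s.size());
  const char16_t lead = s[(*i)++];
  if (lead >= 0xD800 && lead <= 0xDBFF && *i < s.size()) {
    const char16_t trail = s[*i];
    if (trail >= 0xDC00 && trail <= 0xDFFF) {
      ++*i;  // The pair forms one supplementary-plane code point.
      return 0x10000 + ((char32_t(lead) - 0xD800) << 10) +
             (char32_t(trail) - 0xDC00);
    }
  }
  return lead;
}

// Hypothetical helper: returns the substring covering code points
// [start, end), where both bounds count code points, not code units.
// Bounds past the end of the string are clamped, and the cut never lands
// between the two halves of a well-formed surrogate pair.
std::u16string SubstringByCodePoints(const std::u16string& s,
                                     std::size_t start, std::size_t end) {
  std::size_t i = 0;   // Current code unit index.
  std::size_t cp = 0;  // Number of code points consumed so far.
  while (cp < start && i < s.size()) {
    NextCodePoint(s, &i);
    ++cp;
  }
  const std::size_t begin_unit = i;
  while (cp < end && i < s.size()) {
    NextCodePoint(s, &i);
    ++cp;
  }
  return s.substr(begin_unit, i - begin_unit);
}
```

Walking the string this way is linear in its length, which is the price of variable-width encoding. It also only protects surrogate pairs: a selection bound can still fall between a base letter and its combining accent, so the earlier caveat about code points still applies.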
GTK text controls implement the GtkEditable interface, which provides an
API to get the proper substring from the selection bounds directly.
http://library.gnome.org/devel/gtk/unstable/GtkEditable.html#gtk-editable-get-chars
Whoops, note by the way that I linked you to the GTK 3 docs, which
include new functions that aren't usable for us. (I noticed this
because I saw something else there that seemed very useful...) I'm
sure this particular function has been around forever, but don't trust
other docs you find there.
We currently target 2.12:
http://library.gnome.org/devel/gtk/2.12/
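
As background to the GtkEditable suggestion: GTK strings are UTF-8, and as far as I understand the docs, gtk_editable_get_chars takes its positions in characters rather than bytes. If you ever need to map such character offsets onto the UTF-8 bytes yourself, a minimal sketch could look like this. Both function names are hypothetical, the input is assumed to be well-formed UTF-8, and inside GTK code you would more likely reach for GLib's own UTF-8 helpers or ICU's U8_NEXT.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the byte index at which the
// char_offset-th code point of a UTF-8 string starts. Assumes well-formed
// UTF-8. Offsets past the end of the string clamp to s.size().
std::size_t Utf8OffsetToByteIndex(const std::string& s,
                                  std::size_t char_offset) {
  std::size_t i = 0;
  while (char_offset > 0 && i < s.size()) {
    ++i;  // Step past the lead byte.
    // Skip continuation bytes, which all have the bit pattern 10xxxxxx.
    while (i < s.size() &&
           (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80) {
      ++i;
    }
    --char_offset;
  }
  return i;
}

// Hypothetical helper: substring covering characters [start, end) of a
// UTF-8 string, with both bounds given in code points.
std::string Utf8SubstringByChars(const std::string& s, std::size_t start,
                                 std::size_t end) {
  assert(start <= end);
  const std::size_t begin_byte = Utf8OffsetToByteIndex(s, start);
  const std::size_t end_byte = Utf8OffsetToByteIndex(s, end);
  return s.substr(begin_byte, end_byte - begin_byte);
}
```

For the three-character string "a", U+00E9, "z" (four bytes in UTF-8), the character range [1, 2) maps to the byte range [1, 3), so the accented letter comes out whole instead of being split in half.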