Working with unicode text

Dominic Mazzoni

Aug 17, 2010, 3:40:32 PM
to chromium-dev
What's the proper way to work with arbitrary unicode text in Chrome?

It looks like most code uses wstring or string16 and then assumes that each wchar_t / char16 is equal to one character. But don't we use utf16, where there are unicode characters that don't fit into 16 bits?

What I want to do: get the actual length of string, iterate over each of the actual characters.

Thanks,
- Dominic

Brett Wilson

Aug 17, 2010, 3:57:56 PM
to dmaz...@google.com, chromium-dev
On Tue, Aug 17, 2010 at 12:40 PM, Dominic Mazzoni <dmaz...@chromium.org> wrote:
> What's the proper way to work with arbitrary unicode text in Chrome?
> It looks like most code uses wstring or string16 and then assumes that each
> wchar_t / char16 is equal to one character.

We shouldn't be assuming this. What code makes this assumption? We
should be filing bugs on those places where you find it.

> But don't we use utf16, where
> there are unicode characters that don't fit into 16 bits?

Correct.

> What I want to do: get the actual length of string, iterate over each of the
> actual characters.

What do you need to do this for? Generally when people want to do
this, they're doing the wrong thing. Even if you iterate over the
Unicode "code points", there are combining accents, vowel points,
ligatures, etc that can make this measure fairly meaningless.

Brett
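
To make the distinction above concrete, here is a minimal sketch (not code from the Chromium tree) that counts code points in a UTF-16 string. It assumes std::u16string as a stand-in for string16, and the helper names IsLeadSurrogate, IsTrailSurrogate, and CountCodePoints are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if `unit` is a UTF-16 lead (high) surrogate,
// i.e. in the range U+D800..U+DBFF.
inline bool IsLeadSurrogate(char16_t unit) {
  return unit >= 0xD800 && unit <= 0xDBFF;
}

// Hypothetical helper: true if `unit` is a UTF-16 trail (low) surrogate,
// i.e. in the range U+DC00..U+DFFF.
inline bool IsTrailSurrogate(char16_t unit) {
  return unit >= 0xDC00 && unit <= 0xDFFF;
}

// Counts code points in `text`. A well-formed surrogate pair counts once;
// an unpaired surrogate counts as one code point on its own.
std::size_t CountCodePoints(const std::u16string& text) {
  std::size_t count = 0;
  for (std::size_t i = 0; i < text.size(); ++i) {
    if (IsLeadSurrogate(text[i]) && i + 1 < text.size() &&
        IsTrailSurrogate(text[i + 1])) {
      ++i;  // Skip the trail half of the pair.
    }
    ++count;
  }
  return count;
}
```

For U+1D11E (a character outside the 16-bit range), size() reports two code units while this function reports one code point. For "e" followed by the combining acute accent U+0301, both report two, even though a reader sees a single accented letter, which is the limitation Brett describes.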

Dominic Mazzoni

Aug 17, 2010, 4:46:21 PM
to Brett Wilson, chromium-dev
On Tue, Aug 17, 2010 at 12:57 PM, Brett Wilson <bre...@chromium.org> wrote:
> On Tue, Aug 17, 2010 at 12:40 PM, Dominic Mazzoni <dmaz...@chromium.org> wrote:
>> What's the proper way to work with arbitrary unicode text in Chrome?
>> It looks like most code uses wstring or string16 and then assumes that each
>> wchar_t / char16 is equal to one character.
>
> We shouldn't be assuming this. What code makes this assumption? We
> should be filing bugs on those places where you find it.

Some of the methods in app/text_elider.cc, like CutString, seem unsafe to me. Specifically, it looks like it could cut right in the middle of a character, creating either an unusual character or worse making an invalid UTF-16 string.

On the flip side, could you point me to some code that does it correctly that I could follow as an example?

>> But don't we use utf16, where
>> there are unicode characters that don't fit into 16 bits?
>
> Correct.
>
>> What I want to do: get the actual length of string, iterate over each of the
>> actual characters.
>
> What do you need to do this for? Generally when people want to do
> this, they're doing the wrong thing. Even if you iterate over the
> Unicode "code points", there are combining accents, vowel points,
> ligatures, etc that can make this measure fairly meaningless.

I'm working on accessibility for text controls. I've got a string that comes from a native text control and the start and end selection bounds (in characters, not bytes). I'd like to pull the substring corresponding to the selection out from the middle of the string. I also want to know the length of that substring and the length of the original string.

- Dominic
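
The concern about cutting in the middle of a character can be illustrated with a small sketch. This is not the actual CutString from app/text_elider.cc; the function name CutWithoutSplittingPair is hypothetical, and std::u16string is assumed as a stand-in for string16. It only protects surrogate pairs and does nothing about combining marks.

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch: returns a prefix of `text` holding at most `max_units`
// UTF-16 code units. If the cut would fall between the two halves of a
// surrogate pair, it backs off by one unit so the result stays valid UTF-16.
std::u16string CutWithoutSplittingPair(const std::u16string& text,
                                       std::size_t max_units) {
  if (max_units >= text.size()) {
    return text;  // Nothing to cut.
  }
  std::size_t cut = max_units;
  if (cut > 0) {
    const char16_t before = text[cut - 1];
    const char16_t after = text[cut];
    const bool splits_pair = before >= 0xD800 && before <= 0xDBFF &&
                             after >= 0xDC00 && after <= 0xDFFF;
    if (splits_pair) {
      --cut;  // Drop the lead surrogate rather than keep half a character.
    }
  }
  return text.substr(0, cut);
}
```

A naive substr(0, 2) on "a" followed by U+1D11E would keep the lead surrogate and drop the trail one; the sketch returns just "a" in that case.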

Brett Wilson

Aug 17, 2010, 5:21:57 PM
to Dominic Mazzoni, chromium-dev
On Tue, Aug 17, 2010 at 1:46 PM, Dominic Mazzoni <dmaz...@chromium.org> wrote:
> On Tue, Aug 17, 2010 at 12:57 PM, Brett Wilson <bre...@chromium.org> wrote:
>>
>> On Tue, Aug 17, 2010 at 12:40 PM, Dominic Mazzoni <dmaz...@chromium.org>
>> wrote:
>> > What's the proper way to work with arbitrary unicode text in Chrome?
>> > It looks like most code uses wstring or string16 and then assumes that
>> > each
>> > wchar_t / char16 is equal to one character.
>>
>> We shouldn't be assuming this. What code makes this assumption? We
>> should be filing bugs on those places where you find it.
>
> Some of the methods in app/text_elider.cc, like CutString, seem unsafe to
> me. Specifically, it looks like it could cut right in the middle of a
> character, creating either an unusual character or worse making an invalid
> UTF-16 string.

I agree. Can you file a bug on that?

> On the flip side, could you point me to some code that does it correctly
> that I could follow as an example?
>
>>
>> > But don't we use utf16, where
>> > there are unicode characters that don't fit into 16 bits?
>>
>> Correct.
>>
>> > What I want to do: get the actual length of string, iterate over each of
>> > the
>> > actual characters.
>>
>> What do you need to do this for? Generally when people want to do
>> this, they're doing the wrong thing. Even if you iterate over the
>> Unicode "code points", there are combining accents, vowel points,
>> ligatures, etc that can make this measure fairly meaningless.
>
> I'm working on accessibility for text controls. I've got a string that comes
> from a native text control and the start and end selection bounds (in
> characters, not bytes). I'd like to pull the substring corresponding to the
> selection out from the middle of the string. I also want to know the length
> of that substring and the length of the original string.

I assume you're working on GTK that has some weird behavior here?

If you want to iterate over UTF8 or UTF16 characters, the best thing
is the ICU macros. You can see U8_NEXT being used in
browser/history/snippet.cc. I'd love to have some nicer wrapper class
that would iterate over UTF8 or UTF16 characters and return code
points if you feel like writing such a thing.

Brett
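
As a rough illustration of the kind of iteration described above, here is a hand-written sketch in standard C++ that walks a UTF-16 string and yields code points. It does not use ICU; it is only similar in spirit to what a macro such as U16_NEXT does. The names NextCodePoint and ToCodePoints are hypothetical, and std::u16string is assumed in place of string16.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decodes the code point starting at `*index` and
// advances `*index` past it. An unpaired surrogate is returned as-is.
char32_t NextCodePoint(const std::u16string& text, std::size_t* index) {
  assert(*index < text.size());
  const char16_t lead = text[(*index)++];
  if (lead >= 0xD800 && lead <= 0xDBFF && *index < text.size()) {
    const char16_t trail = text[*index];
    if (trail >= 0xDC00 && trail <= 0xDFFF) {
      ++*index;  // Consume the trail surrogate too.
      return 0x10000 + ((static_cast<char32_t>(lead) - 0xD800) << 10) +
             (static_cast<char32_t>(trail) - 0xDC00);
    }
  }
  return lead;
}

// Hypothetical helper: collects every code point of `text` in order.
std::vector<char32_t> ToCodePoints(const std::u16string& text) {
  std::vector<char32_t> out;
  std::size_t i = 0;
  while (i < text.size()) {
    out.push_back(NextCodePoint(text, &i));
  }
  return out;
}
```

In real Chromium code the ICU macros are the better choice, since they are already tested and cover UTF-8 as well; the sketch only shows the surrogate arithmetic involved.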

Dominic Mazzoni

Aug 17, 2010, 5:40:37 PM
to bre...@chromium.org, chromium-dev
On Tue, Aug 17, 2010 at 2:21 PM, Brett Wilson <bre...@chromium.org> wrote:
>> Some of the methods in app/text_elider.cc, like CutString, seem unsafe to
>> me. Specifically, it looks like it could cut right in the middle of a
>> character, creating either an unusual character or worse making an invalid
>> UTF-16 string.
>
> I agree. Can you file a bug on that?
>
> If you want to iterate over UTF8 or UTF16 characters, the best thing
> is the ICU macros. You can see U8_NEXT being used in
> browser/history/snippet.cc. I'd love to have some nicer wrapper class
> that would iterate over UTF8 or UTF16 characters and return code
> points if you feel like writing such a thing.

Thanks, this is exactly what I was looking for.

I'm not sure what exactly would be needed for a wrapper class, but I'd be happy to add some utility functions to base/string_util.h - perhaps just Length and Substring for UTF strings.

- Dominic
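
A Substring utility of the kind proposed above might look roughly like the following sketch. It is not an existing function in base/string_util.h; the names CodeUnitOffset and SubstringByCodePoints are hypothetical, std::u16string stands in for string16, and the selection bounds are assumed to be counted in code points.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: converts a code point index into a code unit index.
// Indexes past the end of the string are clamped to text.size().
std::size_t CodeUnitOffset(const std::u16string& text,
                           std::size_t code_point_index) {
  std::size_t unit = 0;
  while (unit < text.size() && code_point_index > 0) {
    const bool is_pair = text[unit] >= 0xD800 && text[unit] <= 0xDBFF &&
                         unit + 1 < text.size() &&
                         text[unit + 1] >= 0xDC00 && text[unit + 1] <= 0xDFFF;
    unit += is_pair ? 2 : 1;  // A surrogate pair is one code point.
    --code_point_index;
  }
  return unit;
}

// Hypothetical helper: returns the code points in [begin, end) of `text`,
// where both bounds are code point indexes rather than code unit indexes.
std::u16string SubstringByCodePoints(const std::u16string& text,
                                     std::size_t begin, std::size_t end) {
  assert(begin <= end);
  const std::size_t first = CodeUnitOffset(text, begin);
  const std::size_t last = CodeUnitOffset(text, end);
  return text.substr(first, last - first);
}
```

Each call scans from the start of the string, so this is linear in the offset; that is usually acceptable for text-control contents but would be a poor fit for repeated lookups in long documents.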

Evan Martin

Aug 17, 2010, 6:01:13 PM
to bre...@chromium.org, Dominic Mazzoni, chromium-dev
On Tue, Aug 17, 2010 at 2:21 PM, Brett Wilson <bre...@chromium.org> wrote:
>> I'm working on accessibility for text controls. I've got a string that comes
>> from a native text control and the start and end selection bounds (in
>> characters, not bytes). I'd like to pull the substring corresponding to the
>> selection out from the middle of the string. I also want to know the length
>> of that substring and the length of the original string.
>
> I assume you're working on GTK that has some weird behavior here?

GTK text controls implement the GtkEditable interface, which provides
API to get the proper substring from the selection bounds directly.
http://library.gnome.org/devel/gtk/unstable/GtkEditable.html#gtk-editable-get-chars

Dominic Mazzoni

Aug 17, 2010, 6:04:31 PM
to Evan Martin, bre...@chromium.org, chromium-dev
Thanks! I may still need a couple of things beyond what it provides, but I'll try to use GtkEditable where possible, thanks!

- Dominic

Evan Martin

Aug 17, 2010, 6:10:23 PM
to Dominic Mazzoni, bre...@chromium.org, chromium-dev
On Tue, Aug 17, 2010 at 3:04 PM, Dominic Mazzoni <dmaz...@chromium.org> wrote:
>> GTK text controls implement the GtkEditable interface, which provides
>> API to get the proper substring from the selection bounds directly.
>>
>>  http://library.gnome.org/devel/gtk/unstable/GtkEditable.html#gtk-editable-get-chars
>
> Thanks! I may still need a couple of things beyond what it provides, but
> I'll try to use GtkEditable where possible, thanks!

Whoops, note by the way that I linked you to the GTK 3 docs, which
include new functions that aren't usable for us. (I noticed this
because I saw something else there that seemed very useful...) I'm
sure this particular function has been around forever, but don't trust
other docs you find there.

We currently target 2.12:
http://library.gnome.org/devel/gtk/2.12/
