Extract index in unicode String


Ke Tran Manh

May 11, 2012, 7:58:03 PM
to juli...@googlegroups.com
Hi,

Can anyone help me with extracting a substring of a Unicode string?

s = "Česká"
s[1:2] gives "Č"
s[2:2] gives ""

I expected to get s[1:2] == "Če" and s[2:2] == "e".

K.

Stefan Karpinski

May 11, 2012, 8:09:08 PM
to juli...@googlegroups.com
You may want to read the manual section on strings. Basically, indices into UTF8Strings are byte indices, not character indices. The characters 'Č' and 'á' are both two bytes wide:

julia> length("Č")
2

julia> length("á")
2

Thus, they both take up two bytes and contain two byte indices.

A substring slice "gets" a character if and only if it includes the character's initial byte. Thus, s[2:2] is an empty string because the second byte of s is not the initial byte of any character — it's a continuation byte of 'Č'.
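The byte layout behind this is the same in any language; here is a quick, purely illustrative sketch in Python showing why byte 2 of "Česká" belongs to no character's start:

```python
# The raw UTF-8 bytes of the example string.
s = "Česká"
b = s.encode("utf-8")
print(len(b))  # 7 bytes for 5 characters

# 'Č' (U+010C) is encoded as the two bytes 0xC4 0x8C.
# A UTF-8 continuation byte has the bit pattern 10xxxxxx (0x80..0xBF),
# so byte 2 (0x8C) can never begin a character.
print(hex(b[0]), hex(b[1]))
print(0x80 <= b[1] <= 0xBF)  # True: byte 2 is a continuation byte
```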

The chr2ind and ind2chr functions can translate between character and byte indices for a given string:

julia> chr2ind(s,1)
1

julia> chr2ind(s,2)
3

julia> chr2ind(s,3)
4

Keep in mind that these operations are O(n), however, so they should be used sparingly in performance-sensitive string algorithms.
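To see why the mapping is O(n): the only way to find a character's byte offset in a variable-width encoding is a linear scan. A minimal Python sketch of what a chr2ind-style lookup has to compute (1-based byte indices, to match the Julia output above):

```python
s = "Česká"

# Byte index (1-based) of the first byte of each character,
# found by scanning and accumulating each character's encoded width.
offsets, pos = [], 1
for ch in s:
    offsets.append(pos)
    pos += len(ch.encode("utf-8"))

print(offsets)  # [1, 3, 4, 5, 6] — matches chr2ind(s, 1), chr2ind(s, 2), ...
```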

Kevin Squire

May 11, 2012, 8:13:10 PM
to juli...@googlegroups.com
Hi K.,

This is described in good detail in the manual:


Although it is strange that s[2:2] doesn't give you an error:

julia> s = "Česká" 
"Česká"

julia> s[2]
invalid UTF-8 character index
 in next at utf8.jl:36
 in ref at string.jl:22

julia> s[2:2]
""


Kevin

Stefan Karpinski

May 11, 2012, 8:22:01 PM
to juli...@googlegroups.com
That's intentional. The expression s[2] is asking for the character at index 2, which is not a valid character index, so it's an error. The expression s[2:2] is asking for all the characters from index 2 to 2, which is zero characters, a perfectly valid substring and answer. If you have disjoint index ranges, you will always get disjoint substrings; if the index ranges cover the entire original string, you will get the entire original string.

The only thing that would make that an error and still make sense would be making s[1:1] an illegal range and requiring that index ranges include all of the bytes of a character or none of them, requiring the substring including the first character to be written as s[1:2]. That seems pointlessly strict to me though. As it stands, you can always split a string in half without worrying about where the valid byte offsets are by writing something like s[1:k] and s[k+1:end] — it doesn't matter if k is a valid character index or not. Similarly, we allow slices like s[1:0] and 0 is never a valid character index.
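The slicing rule being defended here — a character is included iff the byte range covers its initial byte — can be sketched in a few lines of Python. The helper name `byte_slice` is hypothetical, not part of any real API; it just reimplements the rule for illustration:

```python
s = "Česká"

def byte_slice(s, i, j):
    """Julia-style byte-range slice, 1-based and inclusive: a character
    is included iff the range covers its initial byte."""
    out, pos = [], 1
    for ch in s:
        if i <= pos <= j:
            out.append(ch)
        pos += len(ch.encode("utf-8"))
    return "".join(out)

print(byte_slice(s, 1, 2))  # "Č"  — byte 1 is Č's initial byte
print(byte_slice(s, 2, 2))  # ""   — byte 2 is a continuation byte
print(byte_slice(s, 1, 7))  # "Česká" — the full byte range gives the full string
```

Note how disjoint byte ranges always yield disjoint substrings under this rule, which is the invariant the message above appeals to.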

Kế Trần Mạnh

May 11, 2012, 8:40:35 PM
to juli...@googlegroups.com
It seems like the chars function works in this case.
Here is my solution, although it's not elegant.

julia> x = chars(s)
julia> join(map(c->string(c),x[1:2]),"")
"Če"

K.
--
Ke Tran
Department of Linguistics
University of Groningen


Stefan Karpinski

May 11, 2012, 8:53:55 PM
to juli...@googlegroups.com
Basically, the chars function explodes a string into a 32-bit, fixed-width, native byte-order representation. If you're going to be doing a lot of character indexing into a string, doing that once up-front is a reasonable thing to do. After that character indexing is cheap because the representation is fixed-width. The tradeoff is that you have to decode the entire string first and potentially blow up its storage by a factor of four. In some cases it will be the right thing to do, in most it won't.
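The trade-off can be made concrete with a small Python sketch (illustrative only — it mimics what chars does, decoding once into fixed-width code points):

```python
s = "Česká"

# Decode once, up front, into one code point per character
# (conceptually a 32-bit fixed-width array).
cp = [ord(c) for c in s]

# Character indexing is now O(1):
print("".join(chr(c) for c in cp[:2]))  # "Če"

# The storage cost: 7 UTF-8 bytes become 4 bytes per character.
print(len(s.encode("utf-8")), "bytes as UTF-8 vs", 4 * len(cp), "bytes fixed-width")
```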

Stefan Karpinski

May 11, 2012, 8:57:25 PM
to juli...@googlegroups.com
This could be facilitated by reinstating the now-retired (for lack of maintenance) UTF32String type. By converting a string to UTF32String, you can do simple character indexing and slicing without having to join the individual characters back into strings at the end. It would be relatively easy to get the UTF32String type back into working order. There are some complications though, such as the fact that UTF-32-encoded strings coming from external sources can be in either little-endian or big-endian byte-orders.

Stefan Karpinski

May 11, 2012, 9:24:36 PM
to juli...@googlegroups.com
Mostly for my own record, I think the correct thing to do with UTF32String and UTF16String is to make them abstract types, with concrete implementations for the little- and big-endian variants:

abstract UTF16String <: String
type UTF16BEString <: UTF16String ... end
type UTF16LEString <: UTF16String ... end

abstract UTF32String <: DirectIndexString
type UTF32BEString <: UTF32String ... end
type UTF32LEString <: UTF32String ... end

Most code for these types doesn't care about the endianness of the encoding and can be written to the abstract type, but a few core functions do. Conversion to UTF32String and UTF16String can convert to the type corresponding to the native endianness.

Jeff Bezanson

May 12, 2012, 4:54:47 AM
to juli...@googlegroups.com
c->string(c) => string

or join(map(c->string(c),x[1:2]),"") => string(x[1:2]...)


Jeff Bezanson

May 12, 2012, 4:56:23 AM
to juli...@googlegroups.com
Why would you want to keep data in memory in a non-native endianness?
That's just silly. Byteswap on I/O.

Ke Tran Manh

May 12, 2012, 5:09:14 AM
to juli...@googlegroups.com
thanks, this looks nicer.

Stefan Karpinski

May 12, 2012, 8:31:08 AM
to juli...@googlegroups.com
Because, e.g., your string is actually an mmapped 100GB file.

John Cowan

May 12, 2012, 11:36:31 AM
to juli...@googlegroups.com
On Sat, May 12, 2012 at 4:56 AM, Jeff Bezanson <jeff.b...@gmail.com> wrote:

> Why would you want to keep data in memory in a non-native endianness?
> That's just silly. Byteswap on I/O.

Because it comes from the net (which is big-endian) and goes back to
the net, but your hardware is little-endian (as almost all of it is
nowadays).

--
GMail doesn't have rotating .sigs, but you can see mine at
http://www.ccil.org/~cowan/signatures

Patrick O'Leary

May 12, 2012, 1:25:31 PM
to juli...@googlegroups.com
Ooh. That's an ugly leak in the mmap abstraction. Hate it when that happens. Sort of breaks http://commandcenter.blogspot.com/2012/04/byte-order-fallacy.html, doesn't it?

Stefan Karpinski

May 12, 2012, 5:08:15 PM
to juli...@googlegroups.com
It's not that bad. You just need to call the appropriate hton-ish function upon reading bytes in/out of the string. The key point of Rob Pike's argument still holds: you need to worry about data byte order but not machine byte order.
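The point — worry about the data's byte order, never the machine's — can be illustrated in Python with the struct module (an illustrative sketch; the helper name `code_unit` is made up):

```python
import struct

# UTF-32-BE data as it might arrive off the wire. The data's byte order
# is fixed by the format; the host's byte order never appears in the code.
data = "Če".encode("utf-32-be")

def code_unit(buf, i):
    # ">I" always reads a big-endian 32-bit unit, on any host.
    (cp,) = struct.unpack_from(">I", buf, 4 * i)
    return chr(cp)

print(code_unit(data, 0), code_unit(data, 1))  # Č e
```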

Jeff Bezanson

May 12, 2012, 5:42:52 PM
to juli...@googlegroups.com
Can't we all just agree to stop using network byte order? Was that
some kind of conspiracy by Sun or something? :)

Stefan Karpinski

May 12, 2012, 6:30:50 PM
to juli...@googlegroups.com
If the machine and the data are in the same byte order, the conversions are no-ops, so there's no harm done except for the annoying additional machinery. I agree that it sucks, but shit happens. We could just say that we only support little-endian everything.

John Cowan

May 12, 2012, 6:43:49 PM
to juli...@googlegroups.com
On Sat, May 12, 2012 at 5:42 PM, Jeff Bezanson <jeff.b...@gmail.com> wrote:

> Can't we all just agree to stop using network byte order? Was that
> some kind of conspiracy by Sun or something? :)

Definitely not. RFC 791 (September 1981) is the first documentation
of the IP protocol to explain network order, and it predates the
founding of Sun Microsystems by five months.

It probably goes right back to the big-endian PDP-10, actually.

Jeff Bezanson

May 12, 2012, 7:01:31 PM
to juli...@googlegroups.com
Well then DEC would fall under the "or something" category :)
Sorry, I tend to be a bit irreverent. Or a lot. I appreciate the
accurate history though.

John Cowan

May 12, 2012, 7:46:23 PM
to juli...@googlegroups.com
On Sat, May 12, 2012 at 7:01 PM, Jeff Bezanson <jeff.b...@gmail.com> wrote:

> Well then DEC would fall under the "or something" category :)

Yeah. The PDP-11 was the first little-endian computer; before that,
including earlier DECs, it was all big-endian.

Patrick O'Leary

May 12, 2012, 8:10:37 PM
to juli...@googlegroups.com
You wouldn't have to do that if the array weren't mmapped, though. That's the leak.

Stefan Karpinski

May 13, 2012, 12:47:38 AM
to juli...@googlegroups.com
Sure you would. You have to do the translation one way or the other. Rob Pike's code just does it implicitly.

Patrick O'Leary

May 13, 2012, 8:18:15 AM
to juli...@googlegroups.com
If I have an array in memory, the translation has already been taken care of, and nothing special is required to use the array. With the mmapped array, however, you have to translate on every read/write operation even though it otherwise looks like it's really in memory. I'm trying to figure out what I'm saying that's unclear, because this seems totally obvious to me. It has nothing to do with Pike's code other than having to apply it to something that looks like it's in memory, which means you're thinking about endianness when you shouldn't have to. The mmap means you have potentially "foreign" I/O on every access to the array. That can be moved into the ref() and assign() methods as long as no one uses primitive assignment.
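The idea of burying the swap inside the indexing methods can be sketched in Python (the class name and design here are illustrative, not a real API — Python's dunder methods stand in for ref() and assign()):

```python
import struct

class BigEndianU32Array:
    """Wrap a raw big-endian buffer (e.g. an mmapped file) and byteswap
    lazily inside each indexing operation, so callers never see the
    foreign byte order."""
    def __init__(self, raw: bytearray):
        self.raw = raw

    def __getitem__(self, i):
        (v,) = struct.unpack_from(">I", self.raw, 4 * i)  # swap on read
        return v

    def __setitem__(self, i, v):
        struct.pack_into(">I", self.raw, 4 * i, v)  # swap on write

buf = bytearray("Če".encode("utf-32-be"))
xs = BigEndianU32Array(buf)
print(chr(xs[0]), chr(xs[1]))  # Č e
xs[1] = ord("é")               # write goes back out in big-endian order
print(buf == "Čé".encode("utf-32-be"))  # True
```

The cost Patrick describes is visible here: every single access pays for a (de)serialization, which a plain in-memory native-order array would not.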

Stefan Karpinski

May 13, 2012, 1:10:48 PM
to juli...@googlegroups.com
Oh, I see what you're saying. Yeah, that's definitely a huge leak in the mmap abstraction. It's not exactly like having a regular array in memory.