No LastIndexRune?


John Nagle

Apr 5, 2013, 1:23:31 AM
to golan...@googlegroups.com

http://golang.org/pkg/strings/

has

func Index(s, sep string) int
func IndexAny(s, chars string) int
func IndexFunc(s string, f func(rune) bool) int
func IndexRune(s string, r rune) int

and

func LastIndex(s, sep string) int
func LastIndexAny(s, chars string) int
func LastIndexFunc(s string, f func(rune) bool) int

For completeness, there should be LastIndexRune as well.

(Use case: word wrap)

John Nagle

Kyle Lemons

Apr 5, 2013, 2:15:55 AM
to John Nagle, golang-nuts
It turns out to be comparatively expensive to iterate backward through a string rune-wise.  You could convert to a []rune, perhaps.
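
A minimal sketch of the []rune suggestion, assuming the caller wants a rune index rather than a byte offset. The name `lastIndexRuneInRunes` is hypothetical and not part of the standard library. The conversion decodes and copies the whole string, so the cost is O(n) plus an allocation before the search even starts.

```go
package main

import "fmt"

// lastIndexRuneInRunes is a hypothetical helper, not a standard-library
// function. It decodes s into a []rune once and scans backward, returning
// the rune index (not the byte offset) of the last r, or -1 if r is absent.
func lastIndexRuneInRunes(s string, r rune) int {
	rs := []rune(s) // O(n) decode plus an allocation of 4 bytes per rune
	for i := len(rs) - 1; i >= 0; i-- {
		if rs[i] == r {
			return i
		}
	}
	return -1
}

func main() {
	s := "na\u00efve caf\u00e9" // "naïve café"
	fmt.Println(lastIndexRuneInRunes(s, '\u00e9')) // 9: rune index of the final 'é'
	fmt.Println(lastIndexRuneInRunes(s, 'z'))      // -1: not present
}
```

Note that the result cannot be used to slice the original string directly, because string slicing in Go works on byte offsets.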




--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.



John Nagle

Apr 5, 2013, 2:48:57 AM
to golan...@googlegroups.com
On 4/4/2013 11:15 PM, Kyle Lemons wrote:
> It turns out to be comparatively expensive to iterate backward through a
> string rune-wise. You could convert to a []rune, perhaps.
>
>
> On Thu, Apr 4, 2013 at 10:23 PM, John Nagle <na...@animats.com> wrote:
>...
>>
>> For completeness, there should be LastIndexRune as well.

There's LastIndex. That has the same backwards-indexing problems.

John Nagle

David Symonds

Apr 5, 2013, 2:53:03 AM
to John Nagle, golang-nuts
On Fri, Apr 5, 2013 at 5:48 PM, John Nagle <na...@animats.com> wrote:

> There's LastIndex. That has the same backwards-indexing problems.

It doesn't. LastIndex is not UTF-8 aware.

John Nagle

Apr 5, 2013, 3:09:48 AM
to golan...@googlegroups.com
My mistake. I didn't realize how lame Go's Unicode implementation is.

This feels like moving back to Python 2.3.

John Nagle



chris dollin

Apr 5, 2013, 3:32:48 AM
to John Nagle, golang-nuts
I don't think LastIndex needs to be UTF8-aware.

Looking for a UTF8-encoded substring inside a UTF8-encoded subject
string doesn't need to do any UTF8-decoding -- it can just work off the
bytes. Isn't that part of the point of UTF8?
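
A small sketch of this point, assuming both the subject and the separator are valid UTF-8. `lastOffset` is a hypothetical wrapper, not a library function: it calls `strings.LastIndex`, which compares bytes only, and then uses `utf8.RuneStart` to check that the match begins on a rune boundary.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// lastOffset is a hypothetical wrapper around strings.LastIndex.
// It returns the byte offset of the last occurrence of sep in s and
// reports whether that offset falls on a rune boundary.
func lastOffset(s, sep string) (offset int, aligned bool) {
	offset = strings.LastIndex(s, sep) // pure byte comparison, no decoding
	if offset < 0 {
		return -1, false
	}
	// The end of the string counts as a boundary; otherwise the byte at
	// offset must not be a UTF-8 continuation byte.
	aligned = offset == len(s) || utf8.RuneStart(s[offset])
	return offset, aligned
}

func main() {
	s := "日本語 日本"
	off, ok := lastOffset(s, "本")
	fmt.Println(off, ok) // 13 true: a byte offset, not a rune index
	fmt.Println(s[off:]) // slicing at that offset yields valid UTF-8
}
```

The offset is 13 because each of these CJK characters takes three bytes and the space takes one. The same match is at rune index 5.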

Chris



--
Chris "allusive" Dollin

roger peppe

Apr 5, 2013, 4:04:28 AM
to John Nagle, golang-nuts
func LastIndexRune(s string, r rune) int {
	return strings.LastIndexAny(s, string(r))
}

or if you want to save the allocation:

func LastIndexRune(s string, r rune) int {
	for i := len(s); i > 0; {
		c, size := utf8.DecodeLastRuneInString(s[0:i])
		i -= size
		if c == r {
			return i
		}
	}
	return -1
}

(you're probably right that for completeness we could have
a LastIndexRune, but I don't believe you're justified
in calling the unicode implementation "lame"; chris dollin's
remark is spot on)
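
As a usage sketch for the word-wrap case mentioned at the top of the thread, the allocation-free version can be exercised like this. `splitAtLastSpace` is a hypothetical helper added only for illustration. Note that the function returns a byte offset, as the other `strings` index functions do.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// LastIndexRune is the allocation-free version from the post, repeated
// here so the example is self-contained. It returns the byte offset of
// the last occurrence of r in s, or -1.
func LastIndexRune(s string, r rune) int {
	for i := len(s); i > 0; {
		c, size := utf8.DecodeLastRuneInString(s[:i])
		i -= size
		if c == r {
			return i
		}
	}
	return -1
}

// splitAtLastSpace is a hypothetical word-wrap helper: it breaks line at
// its last space and drops that space.
func splitAtLastSpace(line string) (head, tail string) {
	i := LastIndexRune(line, ' ')
	if i < 0 {
		return line, ""
	}
	return line[:i], line[i+1:] // ' ' is one byte in UTF-8, so i+1 is safe here
}

func main() {
	s := "h\u00e9llo w\u00f6rld" // "héllo wörld"
	fmt.Println(LastIndexRune(s, '\u00f6')) // 8: byte offset of 'ö'
	fmt.Println(LastIndexRune(s, 'z'))      // -1
	head, tail := splitAtLastSpace(s)
	fmt.Printf("%q %q\n", head, tail)
}
```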

chris dollin

Apr 5, 2013, 4:19:22 AM
to Paul Samways, golang-nuts
On 5 April 2013 08:41, Paul Samways <pa...@paulsamways.com> wrote:
>> I don't think LastIndex needs to be UTF8-aware.
>
> It does if you're wanting the rune index and not the byte offset of
> the sub-string.

That applies to Index too.

If one wants rune offsets then one can convert to []rune and back
(taking care not to do it repeatedly ...).
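
For a single lookup there is an alternative to the []rune conversion, sketched here under the assumption that only the rune index of the first match is needed: count the runes in the prefix before the byte offset. `runeIndex` is a hypothetical helper name. It avoids the allocation but is still O(n), so the same caution about not doing it repeatedly applies.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// runeIndex is a hypothetical helper: it returns the rune index of the
// first occurrence of sep in s, or -1 if sep is absent.
func runeIndex(s, sep string) int {
	i := strings.Index(s, sep) // byte offset of the match
	if i < 0 {
		return -1
	}
	return utf8.RuneCountInString(s[:i]) // number of runes before the match
}

func main() {
	s := "日本語 日本"
	fmt.Println(strings.Index(s, "語")) // 6: byte offset
	fmt.Println(runeIndex(s, "語"))     // 2: rune index
}
```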

Chris

--
Chris "allusive" Dollin

Roberto Waltman

Apr 5, 2013, 10:50:21 AM
to golan...@googlegroups.com
John Nagle wrote:
>David Symonds wrote:
...
>> It doesn't. LastIndex is not UTF-8 aware.
> My mistake. I didn't realize how lame Go's Unicode implementation is.
> This feels like moving back to Python 2.3.

Or support UCS-2 internally? Convert UTF-8 to UCS-2 on reading etc.
(Even if that implies we won't be able to match regular expressions on
text from ancient Babylonian clay tablets.)

Sorry if this puts me in the "I am new to Go, so this is what must be
changed" group, but after years of "C" conditioning I feel that having
the size (memory req.) and the length of a string being a linear
function of each other, and/or being able to directly access the "nth"
character in the string index (linear f again) are too big a convenience
to give up easily.

--
Roberto Waltman

Jan Mercl

Apr 5, 2013, 10:53:49 AM
to Roberto Waltman, golang-nuts
On Fri, Apr 5, 2013 at 4:50 PM, Roberto Waltman <ggr...@rwaltman.com> wrote:
> having the
> size (memory req.) and the length of a string being a linear function of
> each other, and/or being able to directly access the "nth" character in the
> string index (linear f again) are too big a convenience to give up easily.

Actually, in general not a single one of the quoted above properties
hold for UCS-2.

-j

Roberto Waltman

Apr 5, 2013, 11:06:40 AM
to golang-nuts
Jan Mercl wrote:
> Actually, in general not a single one of the quoted above properties
> hold for UCS-2.

You are right, of course. What I meant (and did not write) is
UCS-2/UTF-16 supporting only U+0000..U+D7FF + U+E000..U+FFFF


--
Roberto Waltman

Matt Kane's Brain

Apr 5, 2013, 11:07:37 AM
to Jan Mercl, Roberto Waltman, golang-nuts
On Fri, Apr 5, 2013 at 10:53 AM, Jan Mercl <0xj...@gmail.com> wrote:
> Actually, in general not a single one of the quoted above properties
> hold for UCS-2.


It MIGHT hold for UTF-32. At least for now. We may start assimilating alien scripts someday so you never know.

--
matt kane's brain
twitter: the_real_mkb / nynexrepublic
http://hydrogenproject.com

Jan Mercl

Apr 5, 2013, 11:12:46 AM
to Roberto Waltman, golang-nuts
On Fri, Apr 5, 2013 at 5:06 PM, Roberto Waltman <ggr...@rwaltman.com> wrote:
> You are right, of course. What I meant, (and did not write,) is UCS-2/UTF-16
> supporting only U+0000..U+D7FF + U+E000..U+FFFF

That range includes: http://en.wikipedia.org/wiki/Combining_character

-j

Ian Lance Taylor

Apr 5, 2013, 12:26:36 PM
to Matt Kane's Brain, Jan Mercl, Roberto Waltman, golang-nuts
On Fri, Apr 5, 2013 at 8:07 AM, Matt Kane's Brain
<mkb-...@hydrogenproject.com> wrote:
> On Fri, Apr 5, 2013 at 10:53 AM, Jan Mercl <0xj...@gmail.com> wrote:
>>
>> Actually, in general not a single one of the quoted above properties
>> hold for UCS-2.
>>
>
> It MIGHT hold for UTF-32. At least for now. We may start assimilating alien
> scripts someday so you never know.

And to work easily with UTF-32 in Go, use []rune, not string. For
text, the string type gives you UTF-8, but it's not the only option.

Ian

John Nagle

Apr 5, 2013, 7:59:37 PM
to golan...@googlegroups.com
On 4/5/2013 12:32 AM, chris dollin wrote:
> I don't think LastIndex needs to be UTF8-aware.
>
> Looking for a UTF8-encoded substring inside a UTF8-encoded subject
> string doesn't need to do any UTF8-decoding -- it can just work off the
> bytes. Isn't that part of the point of UTF8?
>
> Chris

Right. The problem is the same in both forward and reverse modes.
Fortunately, a valid UTF-8 sequence can never match at a misaligned
position inside another valid UTF-8 sequence.

So there's no reason not to have LastIndexRune.

Go programmers need to know the details of the UTF-8 representation. See
"http://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/"
for an essay on why this is undesirable and leads to obscure bugs.
Go's representation is prominently mentioned.

John Nagle

Rémy Oudompheng

Apr 6, 2013, 3:36:51 PM
to John Nagle, golang-nuts
And you can have exactly the opposite arguments:
* nobody needs to access the n-th rune in a string. I don't know
anybody who gave a single convincing argument that doing so had any
meaning w.r.t. human languages.
* the length of a string as a number of runes doesn't make any sense
and is totally disconnected from the reality of human languages, just
like the length as a number of bytes. Except that the length as a
number of bytes has a technical reality, which makes sense in a
low-level-capable language like Go.
* []rune is not a better representation than string/[]byte for Unicode
strings: it allows just as many invalid sequences, and there is no
strictly conformant type system that can allow this except if you
introduce range types like in Ada.
* Python's approach to Unicode strings has performance issues too, and
the new approach used in Python 3.3 (variable rune size) introduces
issues of its own, and an extremely complex API.

The blog post you quote simply criticizes all existing approaches and
doesn't even suggest an acceptable approach exists.

It also suggests that "having a built-in Unicode aware datatype mostly
solves encoding issues", which is certainly not true. The fact that we
need canonicalization of characters, notably w.r.t. diacritics and
combining characters, shows that we have not yet reached a proper
representation of human writing systems, and we are actually very far
from doing so.
Even the belief that writing systems can be accurately represented as
linear strings of anything (be it bytes or runes or whatever) is
disrupted by non-linear writing systems like hieroglyphs.

I am a Unicode admirer though, and yet I think that Go has the most
convenient compromise we can find to handle text. Every time I tried
to manipulate text in Python was an awful pain.

Rémy.

Jan Mercl

Apr 6, 2013, 3:38:06 PM
to John Nagle, golang-nuts
On Sat, Apr 6, 2013 at 1:59 AM, John Nagle <na...@animats.com> wrote:

> Go programmers need to know the details of the UTF-8 representation.

Breaking news: Go programmers wanting to add numbers need to know the
details of arithmetic addition.

Otherwise: Not going to discuss FUD posts. Hope they're paid for, as
otherwise it makes little sense to me.

-j

"What can be asserted without evidence can also be dismissed
without evidence." -- Christopher Hitchens

John Nagle

Apr 6, 2013, 4:36:12 PM
to golan...@googlegroups.com
On 4/6/2013 12:36 PM, Rémy Oudompheng wrote:
> And you can have exactly the opposite arguments:
> * nobody needs to access the n-th rune in a string.

There's an argument that string types should not
be subscriptable. This was considered and rejected for
Python 3. It's possible to have a non-subscriptable
string representation. The operations provided are
those that don't return string indices. Split, join,
regular expressions, and replacement don't need indices.

Go sort of has that. Strings appear to be subscriptable,
but string elements are only useful if you're decoding UTF-8.
What's useful is the ability to extract a substring from a
string. The various "index" operations can be thought of
as returning a substring position marker rather than an integer.
Those values are always at a rune boundary, provided that the
input to functions like "Index" is valid UTF-8. Arithmetic
on substring position markers is not meaningful.

The Go strings package has roughly the same set of
string operations as Python, Javascript, etc., but those aren't
quite the right set given the string representation.
If you want to advance in a string, you need a "next" operation
to advance one rune.

func Next(s string, ix int) int

The strings package probably needs Next and Prev, and users
need to know to use them. If a user uses Index to find
something, and they want to take the rest of the string
after what they found, adding 1 to the result of Index is
wrong. They need to use a Next operation.
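
A sketch of what the proposed operations could look like on top of `unicode/utf8`. `Next` and `Prev` are the hypothetical functions from this proposal, not existing library functions, and the clamping behaviour at the ends of the string is an assumption made for this example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// Next returns the byte offset just past the rune that starts at ix.
// Offsets are clamped to [0, len(s)]; an invalid byte counts as one rune.
func Next(s string, ix int) int {
	if ix < 0 {
		ix = 0
	}
	if ix >= len(s) {
		return len(s)
	}
	_, size := utf8.DecodeRuneInString(s[ix:])
	return ix + size
}

// Prev returns the byte offset of the rune that ends just before ix.
func Prev(s string, ix int) int {
	if ix > len(s) {
		ix = len(s)
	}
	if ix <= 0 {
		return 0
	}
	_, size := utf8.DecodeLastRuneInString(s[:ix])
	return ix - size
}

func main() {
	s := "a\u00e9b" // "aéb": 'é' occupies bytes 1 and 2
	i := strings.IndexRune(s, '\u00e9')
	fmt.Printf("%q\n", s[Next(s, i):]) // "b"
	fmt.Printf("%q\n", s[i+1:])        // "\xa9b": adding 1 splits the rune
}
```

When the rune being skipped is already known, `utf8.RuneLen(r)` gives the same step size without decoding again.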

The documentation at "http://golang.org/pkg/strings/"
ignores these issues. People rewriting code from other
languages in Go are likely to create string code that
only works right for ASCII.

> * the length of a string as a number of runes doesn't make any sense
> and is totally disconnected from the reality of human languages,

Line length. Doh.

> Ian Lance Taylor:
> And to work easily with UTF-32 in Go, use []rune, not string. For
> text, the string type gives you UTF-8, but it's not the only option.

The Go string operations are defined for "string" not []rune. They
could be duplicated for []rune, but that wasn't done.

John Nagle

Rémy Oudompheng

Apr 6, 2013, 4:44:29 PM
to John Nagle, golang-nuts
On 2013/4/6 John Nagle <na...@animats.com> wrote:
> On 4/6/2013 12:36 PM, Rémy Oudompheng wrote:
>> * the length of a string as a number of runes doesn't make any sense
>> and is totally disconnected from the reality of human languages,
>
> Line length. Doh.

Line length is not determined by the number of runes. In either a []rune
or a []byte-like representation, determining line length is not a
constant-time operation.

Also, many people use non-constant width glyphs, so things are rather
more subtle than just counting things.
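
A short illustration of why neither count is the displayed width. `sizes` is a hypothetical helper. The three sample strings are assumptions chosen for the example: a precomposed "é", an "e" followed by a combining acute accent (which renders as one character), and a fullwidth "A" (which typically occupies two columns in a monospace terminal).

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// sizes reports the byte length and the rune count of s. Neither value
// is the number of characters or columns a reader would perceive.
func sizes(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	samples := []string{
		"\u00e9",  // precomposed é: 2 bytes, 1 rune
		"e\u0301", // e + combining acute: 3 bytes, 2 runes, one visible character
		"\uff21",  // fullwidth A: 3 bytes, 1 rune, usually two columns wide
	}
	for _, s := range samples {
		b, r := sizes(s)
		fmt.Printf("%+q: %d bytes, %d runes\n", s, b, r)
	}
}
```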

Rémy.

Ugorji Nwoke

Apr 6, 2013, 4:48:36 PM
to golan...@googlegroups.com, na...@animats.com
range clause returns runes. That's your Next.
conversion returns a slice of runes. That's your indexing, and you 
can then use all the slice operations.
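
A minimal sketch of the range behaviour: ranging over a string yields the byte offset at which each rune starts, together with the decoded rune. The helper names `runeOffsets` and `restAfter` are hypothetical. The second one ranges over a suffix `s[i:]`, as one way of stepping past a rune found with `strings.IndexRune`.

```go
package main

import (
	"fmt"
	"strings"
)

// runeOffsets returns the byte offset at which each rune of s starts.
func runeOffsets(s string) []int {
	var offs []int
	for i := range s { // i advances by the encoded size of each rune
		offs = append(offs, i)
	}
	return offs
}

// restAfter returns the part of s after the first occurrence of r,
// or "" if r is absent or is the last rune.
func restAfter(s string, r rune) string {
	i := strings.IndexRune(s, r)
	if i < 0 {
		return ""
	}
	for j := range s[i:] {
		if j > 0 {
			return s[i+j:] // second iteration: start of the next rune
		}
	}
	return ""
}

func main() {
	s := "a\u00e9b" // "aéb"
	fmt.Println(runeOffsets(s))                 // [0 1 3]
	fmt.Printf("%q\n", restAfter(s, '\u00e9')) // "b"
}
```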

I think there's a fallacy that most code needs to index unicode strings,
and so it should be a core type, and have everyone pay the price for it.

When you need that functionality, the libraries available are there for you.
unicode, unicode/utf8, unicode/utf16

The Go documentation is really really good, and easy to read, 
and is at all our fingertips. From the spec to the package library.


John Nagle

Apr 6, 2013, 5:22:03 PM
to Ugorji Nwoke, golan...@googlegroups.com
On 4/6/2013 1:48 PM, Ugorji Nwoke wrote:
> range clause returns runes. That's your Next.

Range clauses are only valid in "for" statements. They're
not useful on the output of "strings.Index()", the case being
discussed here.

> conversion returns a slice of runes. That's your indexing, and you
> can then use all the slice operations.

But not the string operations. The regular expression package
is also string, not rune array, oriented. Few operations are
provided on slices of runes. The clear intent of the language
designers, as expressed by the library design, is that one does
string operations on strings, not slices of runes.

> I think there's a fallacy that most code needs to index unicode strings,
> and so it should be a core type, and have everyone pay the price for it.
> When you need that functionality, the libraries available are there for you.
> unicode, unicode/utf8, unicode/utf16

"unicode" has only per-rune properties. utf8 and utf16 are all about
conversion to other formats.

> The Go documentation is really really good, and easy to read,

The Go documentation tends to gloss over the design
defects of the language. Fanboy enthusiasm is not helping
fix this mess.

John Nagle

Ugorji Nwoke

Apr 6, 2013, 5:26:33 PM
to golan...@googlegroups.com, Ugorji Nwoke, na...@animats.com




>> The Go documentation is really really good, and easy to read,
>
> The Go documentation tends to gloss over the design
> defects of the language. Fanboy enthusiasm is not helping
> fix this mess.

It's not fanboy enthusiasm. It's more just irritation. Your postings, tone, etc.
are starting to become the one sore point I have looking at golang-nuts.
A lot of the time, your postings strike me as not thought out, and set up to
provoke a negative reaction. Tact is important. For pointers, look
at Russ's posts.

alco

Apr 6, 2013, 5:54:32 PM
to golan...@googlegroups.com, na...@animats.com
Such a heated discussion. I'd just like to put here, that Objective-C's NSString seems to be using UTF-8 too. If you want a char at a certain index, either convert the string to an array of unichar (typedef'd shorts), or use the slow `characterAtIndex:` method.

So imagine you were programming for iOS and tried to pitch the idea of going away from UTF-8 to Apple :)

There are tradeoffs in any string implementation. Just embrace the Go way and move along.

Dan Kortschak

Apr 6, 2013, 6:37:09 PM
to na...@animats.com, Ugorji Nwoke, golan...@googlegroups.com
On Sat, 2013-04-06 at 14:22 -0700, John Nagle wrote:
> Range clauses are only valid in "for" statements. They're
> not useful on the output of "strings.Index()", the case being
> discussed here.

How can you index into an array of variable length items without walking
through them?

Dan Kortschak

Apr 6, 2013, 7:01:23 PM
to <nagle@animats.com>, golan...@googlegroups.com
OK, so that's an O(n) operation which is perfectly suited to a range loop.

On 07/04/2013, at 8:08 AM, "John Nagle" <na...@animats.com> wrote:
> With search functions such as "strings.IndexRune(s string, r rune) int".
>
> John Nagle
>
>

Andy Balholm

Apr 6, 2013, 7:05:11 PM
to golan...@googlegroups.com, na...@animats.com
On Saturday, April 6, 2013 2:54:32 PM UTC-7, alco wrote:
> Such a heated discussion. I'd just like to put here, that Objective-C's NSString seems to be using UTF-8 too.

Actually, NSString uses UTF-16 (which has many of the disadvantages of UTF-8, without the advantages). 

Andrew Gerrand

Apr 7, 2013, 3:29:14 AM
to John Nagle, Ugorji Nwoke, golang-nuts

John, you are very focused on the shortcomings of Go without seeming to acknowledge that *any* approach to character encoding has its tradeoffs.

Go bakes utf8 deep into the language, so Go programmers need to understand utf8. (Fortunately, utf8 is simple: it was designed on a napkin in a diner!) This is in keeping with the kind of level at which Go programmers are expected to operate, which should be unsurprising to anyone who has spent time with the language.

Nobody here claims Go is perfection, but a great deal of thought was put into the design of Go's strings. You are ignoring that, and that makes you seem rude.

Andrew

John Nagle

Apr 7, 2013, 2:15:53 PM
to golan...@googlegroups.com
There was a period in the 1990s during which Unicode was defined as
being 16 bit characters (UCS-2). Java happened to lock in their
approach to Unicode during that period. So Java got stuck with
UTF-16, which means Java programs tend to work right for most
of the common glyphs, including Han for the Asian languages,
but are iffy for characters outside the 2-byte range.

At least Go didn't get caught that way.
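
To make the "2-byte range" concrete, here is a small sketch using the standard `unicode/utf16` and `unicode/utf8` packages. `codeUnits` is a hypothetical helper, and the two sample characters (a Han character inside the 16-bit range and a musical symbol outside it) are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// codeUnits reports how many UTF-16 code units and how many UTF-8 bytes
// are needed to encode r.
func codeUnits(r rune) (utf16Units, utf8Bytes int) {
	return len(utf16.Encode([]rune{r})), utf8.RuneLen(r)
}

func main() {
	for _, r := range []rune{'\u4e2d', '\U0001D11E'} {
		u16, u8 := codeUnits(r)
		fmt.Printf("%U: %d UTF-16 unit(s), %d UTF-8 byte(s)\n", r, u16, u8)
	}
}
```

U+4E2D fits in one 16-bit unit, while U+1D11E needs a surrogate pair. Code that treats one 16-bit unit as one character goes wrong only for the second kind.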

John Nagle




Andrew Gerrand

Apr 7, 2013, 5:32:14 PM
to John Nagle, golang-nuts

Never attribute to marketing that which is adequately explained by oversight.

If you think there are omissions in our docs, please file an issue.

Andrew

On 8 Apr 2013 03:40, "John Nagle" <na...@animats.com> wrote:
> On 4/7/2013 12:29 AM, Andrew Gerrand wrote:
>> John, you are very focused on the shortcomings of Go without seeming to
>> acknowledge that *any* approach to character encoding has its tradeoffs.
>
> That's because the documents for Go gloss over the shortcomings.
> They read like marketing materials. There are some design decisions
> in Go which can induce hard to find bugs, and possible security
> holes, in Go programs. Shared data between parallel tasks is one
> area in which that is the case. String representation is another.
>
> You write "Go bakes utf8 deep into the language, Go programmers need
> to understand utf8." I agree. "Effective Go" does not make
> such a statement. Nor does the documentation for the "strings"
> package. That's what I'm talking about.
>
> John Nagle


