DirectIndexString considered harmful

Páll Haraldsson

Jun 9, 2015, 11:49:02 AM
to juli...@googlegroups.com
I remember a similar issue (and a recent post by Scott, that I thought was a bug) and now this one:

https://github.com/JuliaLang/julia/issues/7811

There is no such thing as a DirectIndexString, not even for UTF-32, if you consider that direct indexing should address a grapheme cluster.

UTF-16 is not a DirectIndexString either, because of surrogate pairs.
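
To make the distinction concrete, here is a small sketch that counts the same text in bytes, code points, grapheme clusters, and UTF-16 code units. It assumes a current Julia 1.x session, where `graphemes` lives in the `Unicode` standard library; the API at the time of this thread (0.3/0.4) was different, so treat it as an illustration rather than what the issue refers to.

```julia
using Unicode  # standard library in Julia 1.x; provides `graphemes`

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: one user-perceived character
s = "e\u0301"
nbytes      = ncodeunits(s)          # 3 UTF-8 code units (bytes)
ncodepoints = length(s)              # 2 code points, even in UTF-32
nclusters   = length(graphemes(s))   # 1 grapheme cluster

# A code point outside the Basic Multilingual Plane needs a surrogate pair
t = "\U1F600"
utf16units = length(transcode(UInt16, t))   # 2 UTF-16 code units
```

So even a fixed-width code point encoding gives O(1) access only to code points, not to what a reader would call a character.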

ASCIIString may be one, but having some literals produce that type, where direct indexing works, while other literals produce types where byte addressing gives incorrect results, is a catastrophe waiting to happen.

I propose just one Unicode string type, without indexing, that uses UTF-8 (or UCS-2, not UTF-16) internally. Maybe more types if needed, or even just UTF-8.

Whatever the encoding, there is more than one way of indexing, so indexing is ambiguous. Instead of this one error:

julia> a="Páll"[3]
ERROR: invalid UTF-8 character index
 in next at ./utf8.jl:69
 in getindex at string.jl:57

I propose something like:

julia> a="Páll"[3] #not just for this one index..
ERROR: Indexing is not supported, do CharIndexing(your_string), or GraphemeIndexing(your_string) or WordIndexing(your_string).

Each of these would return a type that points to your_string. That can be made fast for some strings (if we choose to keep ASCII behind the scenes), and in other cases it would build an index into your string. That was one of my possible plans for a new string type.
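
A minimal sketch of what such wrappers might look like, written against current Julia 1.x. The type names `CharIndexing` and `GraphemeIndexing` come from the proposal above and are hypothetical; nothing like them exists in Base. This version does no caching and builds no index, so both lookups are linear in the position.

```julia
using Unicode  # standard library in Julia 1.x; provides `graphemes`

# Hypothetical wrapper types named after the proposal; not part of Julia.
struct CharIndexing{S<:AbstractString}
    s::S
end

struct GraphemeIndexing{S<:AbstractString}
    s::S
end

# i-th code point. nextind(s, 0, i) walks to the byte index of the i-th
# character, so this is O(i) for a UTF-8 string.
Base.getindex(c::CharIndexing, i::Integer) = c.s[nextind(c.s, 0, i)]

# i-th grapheme cluster, returned as a substring of the wrapped string.
function Base.getindex(g::GraphemeIndexing, i::Integer)
    for (k, cluster) in enumerate(graphemes(g.s))
        k == i && return cluster
    end
    throw(BoundsError(g, i))
end

third_char     = CharIndexing("P\u00e1ll")[3]        # 'l'
second_cluster = GraphemeIndexing("Pa\u0301ll")[2]   # "á" as 'a' + U+0301
```

The caller has to name the unit being indexed, which is the point of the proposal; a faster variant could precompute a table of byte offsets, at the cost of extra memory.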

The sooner we get rid of this misfeature, in the 0.4 release, the better, I think. We could also make a ByteIndexing type for "compatibility" with the buggy behaviour, or for those who know what they are doing. Can Compat.jl handle this, that is, taking features away?

If the world settles on grapheme cluster indexing some day (or char indexing), we could reintroduce whichever default string indexing we like or that ends up being chosen.


Do we really need indexing? Or just iterators and appending? Strings do NOT behave as arrays.
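
As a rough illustration of how far iteration and appending go without integer indexing, here is a sketch in current Julia 1.x (the API in 0.4 differed). It also shows why index 3 failed in the example above: the valid byte indices of "Páll" skip the second byte of the two-byte 'á'.

```julia
s = "P\u00e1ll"   # "Páll", with á as the single code point U+00E1

# Iterating yields characters without any index arithmetic.
chars = collect(s)              # ['P', 'á', 'l', 'l']

# The valid byte indices skip 3, the continuation byte of 'á'.
valid = collect(eachindex(s))   # [1, 2, 4, 5]

# Building a new string by appending needs no indices either.
buf = IOBuffer()
for c in s
    print(buf, uppercase(c))
end
shouted = String(take!(buf))    # "PÁLL"
```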

-- 
Palli.



Páll Haraldsson

Jun 9, 2015, 11:51:20 AM
to juli...@googlegroups.com
I was also going to say:

It's not as if byte indexing is the lowest level:

dna"ACGT"[3]

But this is outside the scope of Unicode. :)

Milan Bouchet-Valat

Jun 9, 2015, 1:34:04 PM
to juli...@googlegroups.com