DirectIndexString considered harmful..


Páll Haraldsson

2015/06/09 11:49:02
To: juli...@googlegroups.com
I remember a similar issue (and a recent post by Scott, that I thought was a bug) and now this one:

https://github.com/JuliaLang/julia/issues/7811

There is no such thing as a DirectIndex string, not even for UTF-32, if you consider that direct indexing should address a grapheme cluster.

UTF-16 is also not a DirectIndex string, because of surrogates.
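
To make the distinction concrete, here is a minimal sketch that counts the same text in UTF-8 code units, code points, grapheme clusters and UTF-16 code units. It assumes a current Julia (1.x) with the `Unicode` standard library, not the 0.3/0.4 versions discussed in this thread; the variable names are only illustrative.

```julia
using Unicode  # standard library; provides `graphemes`

accented = "e\u0301"   # 'e' + U+0301 COMBINING ACUTE ACCENT, displayed as one character
emoji = "\U1F600"      # U+1F600, outside the Basic Multilingual Plane

utf8_units  = ncodeunits(accented)                  # UTF-8 code units (bytes)
code_points = length(accented)                      # code points, i.e. UTF-32 code units
clusters    = length(collect(graphemes(accented)))  # grapheme clusters
utf16_units = length(transcode(UInt16, emoji))      # UTF-16 code units for one code point

println((utf8_units, code_points, clusters, utf16_units))  # (3, 2, 1, 2)
```

Even the code-point count (2) differs from the grapheme count (1), which is the sense in which UTF-32 is not directly indexable either; the emoji needs a surrogate pair in UTF-16.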

ASCIIString may be, but having some literals produce that type, where direct indexing works, while others get incorrect byte addressing, is a catastrophe waiting to happen.

I propose just one Unicode string type that doesn't have indexing and that uses UTF-8 (or UCS-2, not UTF-16) internally. Maybe more types if needed, or even just UTF-8.

Whatever the encoding, there is not just one way of indexing; it is ambiguous. Instead of this one error:

julia> a="Páll"[3]
ERROR: invalid UTF-8 character index
 in next at ./utf8.jl:69
 in getindex at string.jl:57

I propose something like:

julia> a="Páll"[3] #not just for this one index..
ERROR: Indexing is not supported, do CharIndexing(your_string), or GraphemeIndexing(your_string) or WordIndexing(your_string).

Returning a type that points to your_string can be made fast for some strings (if we choose to have ASCII behind the scenes); in other cases it would build an index into your string. That was one of my possible plans for a new string type.
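
A rough sketch of what such wrapper types could look like follows. `CharIndexing` and `GraphemeIndexing` are the hypothetical names from the proposal, not an existing Julia API, and the code assumes a current Julia (1.x) with the `Unicode` standard library. Both lookups are plain linear scans; no index is cached and there is no ASCII fast path.

```julia
using Unicode  # standard library; provides `graphemes`

# Hypothetical wrapper types; the names follow the proposal, not an existing API.
struct CharIndexing
    s::String
end

struct GraphemeIndexing
    s::String
end

# i-th code point: `nextind(s, 0, i)` walks i characters forward from before the
# start, so this is an O(i) scan for UTF-8 rather than a constant-time lookup.
Base.getindex(v::CharIndexing, i::Integer) = v.s[nextind(v.s, 0, i)]

# i-th grapheme cluster, returned as a substring of the wrapped string.
function Base.getindex(v::GraphemeIndexing, i::Integer)
    for (k, g) in enumerate(graphemes(v.s))
        k == i && return g
    end
    throw(BoundsError(v, i))
end

third_char = CharIndexing("Páll")[3]                 # third code point
second_cluster = GraphemeIndexing("e\u0301x")[2]     # second grapheme cluster
println((third_char, second_cluster))
```

With these definitions the caller has to say which unit is meant, so `CharIndexing("Páll")[3]` yields the third character, whereas plain byte index 3 lands in the middle of `á`.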

The sooner we get rid of this misfeature, ideally in the 0.4 release, the better, I think. We can also make a ByteIndexing type, for "compatibility" with the buggy behaviour, or for those who know what they are doing. Can Compat.jl handle this, i.e. taking features away?

If the world settles on grapheme cluster indexing some day (or char indexing), we could reintroduce whichever default indexing into a string ends up being chosen.


Do we really need indexing? Or just iterators and appending? Strings do NOT behave as arrays..
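
For comparison, a small sketch of working with a string through iteration only, without integer index arithmetic. It assumes a current Julia (1.x); behaviour in 0.3/0.4 may differ.

```julia
s = "Páll"

chars = collect(s)                      # iterate by character
valid = collect(eachindex(s))           # the valid byte indices, for code that needs them
upper = join(uppercase(c) for c in s)   # build a new string from an iterator

println(chars)
println(valid)
println(upper)
```

The valid indices come out as 1, 2, 4, 5: index 3 is skipped because `á` occupies two bytes, which is exactly the index that fails in the example above.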

-- 
Palli.



Páll Haraldsson

2015/06/09 11:51:20
To: juli...@googlegroups.com
I was also going to say:

It's not like byte-indexing is the lowest level..

dna"ACGT"[3]

But this is outside of Unicode scope.. :)

Milan Bouchet-Valat

2015/06/09 13:34:04
To: juli...@googlegroups.com