Promotion rules and strings


Michael Louwrens

Jul 26, 2015, 5:11:14
to julia-users
On 0.3.10 I was playing with strings for amusement and found it strange that while ASCII * UTF8 = UTF8, ASCII * UTF16/32 = ASCII.

Is this intended, or should an issue be opened?

Scott Jones

Jul 26, 2015, 6:18:40
to julia-users
Could you give an example?
I've been trying to fix any Unicode related bugs I've found in 0.4

Michael Louwrens

Jul 26, 2015, 6:27:52
to julia-users, scott.pa...@gmail.com


On Sunday, 26 July 2015 12:18:40 UTC+2, Scott Jones wrote:
Could you give an example?
I've been trying to fix any Unicode related bugs I've found in 0.4

julia> x =utf16("string 1")
"string 1"

julia> y = " string2"
" string2"

julia> println(typeof(x), "\t", typeof(y),"\t" ,typeof(x*y))
UTF16String     ASCIIString     ASCIIString

The same occurs for UTF32String. It seems the promotion rules are just a bit strange.

Michael Louwrens

Jul 26, 2015, 6:37:01
to julia-users, scott.pa...@gmail.com, michael.w...@outlook.com
I noticed that this also happens:
julia> x =utf16("string 1")
"string 1"

julia> y = utf8(" string2")
" string2"

julia> println(typeof(x), "\t", typeof(y),"\t" ,typeof(y*x))
UTF16String     UTF8String      ASCIIString

It seems Julia always promotes to the simplest type that can represent the string, and that this was an intentional choice.

Scott Jones

Jul 26, 2015, 6:39:15
to julia-users, michael.w...@outlook.com
I really don't think this has anything to do with Julia's promotion rules, but rather with the way the `string` function works (and `x*y`, where x and y are strings, is really just `string(x,y)`).
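One way to check that claim is to run both spellings of the concatenation on the same inputs and compare the results. This is only a minimal sketch, assuming the same Julia 0.3.x REPL as in the examples above, where `utf16` is still available; the variable names `a` and `b` are just for illustration.

```julia
# Assumes Julia 0.3.x, as used earlier in this thread.
x = utf16("string 1")
y = " string2"

# Two spellings of the same concatenation.
a = x * y
b = string(x, y)

# If `*` on strings simply forwards to `string`, both the contents
# and the result types should match.
println(typeof(a), "\t", typeof(b))
println(a == b)
```

If the two types printed are the same, the result type is being decided inside `string` rather than by any promotion rule for `*`.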

Michael Louwrens

Jul 26, 2015, 6:53:50
to julia-users, scott.pa...@gmail.com
Aah... Well at least it gives utf8 when you have a non-ascii symbol in the string. Thanks for pointing me to why this is happening!

Scott Jones

Jul 26, 2015, 7:01:37
to julia-users, michael.w...@outlook.com
Yes, I don't believe it should be returning `UTF8String` or `ASCIIString` if the inputs were `UTF16String` or `UTF32String`; that is something I'd like to work on in the future.

Steven G. Johnson

Jul 27, 2015, 16:53:23
to julia-users, michael.w...@outlook.com, scott.pa...@gmail.com
Basically, the situation is that a lot of the string operations in Julia are currently optimized for UTF8 (and ASCII), so performing those operations (like concatenation) on another type like UTF16 first converts to UTF8 or ASCII.

In the future, it will probably make sense to expand the set of optimized operations that work on UTF16 and UTF32 without conversion, but since UTF8 is Julia's default string type it will probably always have more code.
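The convert-first behaviour described above can be written out by hand. The following is only a sketch, assuming Julia 0.3.x with the `utf8` and `utf16` conversion functions used earlier in the thread; `concat_via_utf8` is a hypothetical helper name for illustration, not a function in Base.

```julia
# Assumes Julia 0.3.x, where utf8, utf16 and the *String types exist.
x = utf16("string 1")
y = " string2"

# Hypothetical helper (not part of Base): convert both operands to
# UTF-8 first, then concatenate the converted strings.
concat_via_utf8(a, b) = string(utf8(a), utf8(b))

z = concat_via_utf8(x, y)
println(typeof(z))

# Getting a UTF16String back is a separate, explicit conversion.
println(typeof(utf16(z)))
```

Written this way, it is easier to see why the result of a concatenation does not keep the UTF-16 type of its input: the work happens on the converted strings.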

Scott Jones

Jul 27, 2015, 21:22:22
to julia-users, michael.w...@outlook.com, steve...@gmail.com
Well, UTF8String/ASCIIString are currently Julia's default string types, but I still have hopes that in the future something based on Jacob Quinn's (@quinnj) String.jl, Stefan's bytevec, with traits-based Encodings, would supplant the current set of ASCIIString/UTF8String/UTF16String/UTF32String.
PCRE2 has support for 8, 16, and 32-bit code units, with run-time selection of UTF encoding (i.e. you could use ASCII, Latin1, or UTF-8 for 8-bit code units, or UCS2 or UTF-16 for 16-bit code units), instead of always converting to UTF-8.
ICU is always UTF-16, and a lot of Unicode APIs use UTF-16, so I hope that anything that isn't currently optimized for UTF-16 (or UCS2) will be (if nobody else does, I'll definitely be adding more optimizations, as I've done for simple conversions).