Promotion rules and strings

Michael Louwrens

Jul 26, 2015, 5:11:14 AM
to julia-users
On 0.3.10 I was playing with strings for amusement and found it strange that while ASCII * UTF8 = UTF8, ASCII * UTF16/UTF32 = ASCII.

Is this intended, or should an issue be opened?

Scott Jones

Jul 26, 2015, 6:18:40 AM
to julia-users
Could you give an example?
I've been trying to fix any Unicode-related bugs I've found in 0.4.

Michael Louwrens

Jul 26, 2015, 6:27:52 AM
to julia-users, scott.pa...@gmail.com


On Sunday, 26 July 2015 12:18:40 UTC+2, Scott Jones wrote:
Could you give an example?
I've been trying to fix any Unicode-related bugs I've found in 0.4.

julia> x = utf16("string 1")
"string 1"

julia> y = " string2"
" string2"

julia> println(typeof(x), "\t", typeof(y), "\t", typeof(x*y))
UTF16String     ASCIIString     ASCIIString

The same occurs for UTF32String. It seems the promotion rules are just a bit strange.

Michael Louwrens

Jul 26, 2015, 6:37:01 AM
to julia-users, scott.pa...@gmail.com, michael.w...@outlook.com
I noticed that this also happens:
julia> x = utf16("string 1")
"string 1"

julia> y = utf8(" string2")
" string2"

julia> println(typeof(x), "\t", typeof(y), "\t", typeof(y*x))
UTF16String     UTF8String      ASCIIString

It seems Julia always promotes to the simplest type that can represent the string, and that this was an intentional choice.

Scott Jones

Jul 26, 2015, 6:39:15 AM
to julia-users, michael.w...@outlook.com
I really don't think this has anything to do with Julia's promotion rules, but rather with the way the `string` function works (`x*y`, where x and y are strings, is really just `string(x,y)`).
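
A minimal sketch of this point, assuming Julia 0.3.x (the version used earlier in the thread, where `utf16` and the `ASCIIString`/`UTF16String` types exist). The variable names `a` and `b` are only illustrative:

```julia
# Assumes Julia 0.3.x, where utf16() and the ASCIIString/UTF16String types exist.
x = utf16("string 1")   # UTF16String
y = " string2"          # ASCIIString

a = x * y               # concatenation via the * operator
b = string(x, y)        # the call that * forwards to for strings

println(a == b)                   # compares the contents of the two results
println(typeof(a) == typeof(b))   # compares the result types of the two forms
println(typeof(a))                # the thread reports ASCIIString for x * y
```

Since both forms go through `string`, the result type depends on what `string` produces rather than on the types of the inputs.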

Michael Louwrens

Jul 26, 2015, 6:53:50 AM
to julia-users, scott.pa...@gmail.com
Aah... Well, at least it gives UTF8 when you have a non-ASCII symbol in the string. Thanks for pointing me to why this is happening!

Scott Jones

Jul 26, 2015, 7:01:37 AM
to julia-users, michael.w...@outlook.com
Yes, I don't believe it should be returning `UTF8String` or `ASCIIString` if the inputs were `UTF16String` or `UTF32String`. That's something I'd like to work on in the future.

Steven G. Johnson

Jul 27, 2015, 4:53:23 PM
to julia-users, michael.w...@outlook.com, scott.pa...@gmail.com
Basically, the situation is that a lot of the string operations in Julia are currently optimized for UTF8 (and ASCII), so performing those operations (like concatenation) on another type like UTF16 first converts to UTF8 or ASCII.

In the future, it will probably make sense to expand the set of optimized operations that work on UTF16 and UTF32 without conversion, but since UTF8 is Julia's default string type it will probably always have more code.
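
If the UTF-16 type needs to survive a concatenation in the meantime, one option is to convert the result back explicitly. A minimal sketch, assuming Julia 0.3.x as in the examples earlier in the thread; the name `z` is only illustrative, and the extra conversion means another pass over the data:

```julia
# Assumes Julia 0.3.x, where utf16() and UTF16String exist.
x = utf16("string 1")   # UTF16String
y = " string2"          # ASCIIString

# x * y is first built as an ASCII/UTF-8 string, then converted back to UTF-16.
z = utf16(x * y)
println(typeof(z))      # UTF16String
```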

Scott Jones

Jul 27, 2015, 9:22:22 PM
to julia-users, michael.w...@outlook.com, steve...@gmail.com
Well, UTF8String/ASCIIString are currently Julia's default string types, but I still have hopes that in the future, something based on Jacob Quinn's (@quinnj) String.jl, Stefan's bytevec, with traits-based Encodings, would supplant the current set of ASCIIString/UTF8String/UTF16String/UTF32String.
PCRE2 has support for 8-, 16-, and 32-bit code units, with run-time selection of UTF encoding (i.e. you could use ASCII, Latin1, or UTF-8 for 8-bit code units, or UCS2 or UTF-16 for 16-bit code units), instead of always converting to UTF-8.
ICU is always UTF-16, and a lot of Unicode APIs use UTF-16, so I hope that anything that isn't currently optimized for UTF-16 (or UCS2) will be (if nobody else does, I'll definitely be adding more optimizations, as I've done for simple conversions).