Documentation issue


Scott Jones

May 21, 2015, 8:57:29 AM
to juli...@googlegroups.com
I think the documentation is incorrect for utf16 and utf32:
  If you have a ``UInt32`` array
   ``A`` that is already NUL-terminated UTF-32 data, then you
   can instead use ``UTF32String(A)`` to construct the string without
   making a copy of the data and treating the NUL as a terminator
   rather than as part of the string.
The copy is always made (because the string is immutable, and the array is not).
Should I make a PR and fix this?
 

Stefan Karpinski

May 21, 2015, 9:12:47 AM
to juli...@googlegroups.com
julia> a = ['h','e','l','l','o','\0']
6-element Array{Char,1}:
 'h'
 'e'
 'l'
 'l'
 'o'
 '\0'

julia> s = UTF32String(a)
"hello"

julia> a[1] = 'j'
'j'

julia> s
"jello"

Scott Jones

May 21, 2015, 9:22:54 AM
to juli...@googlegroups.com
You need to do what it says in the documentation... make an array of UInt32, *not* an array of Char.

julia> a = UInt32[1,2,3,0]
4-element Array{UInt32,1}:
 0x00000001
 0x00000002
 0x00000003
 0x00000000

julia> j = UTF32String(a)
"\x01\x02\x03\0"

julia> a[1] = 65
65

julia> j
"\x01\x02\x03\0"

Stefan Karpinski

May 21, 2015, 9:39:52 AM
to juli...@googlegroups.com
I think the phrase "NUL-terminated UTF-32 data" is somewhat ambiguous. If that means Vector{Char}, which is how I interpreted it, then the documentation is correct; if that means Vector{UInt32}, which seems to be how you interpreted it, then the documentation is incorrect. It could stand some clarification in any case, so a PR would be good.

Scott Jones

May 21, 2015, 9:52:17 AM
to juli...@googlegroups.com
It's not ambiguous at all... it says a `UInt32` array, not a `Char` array...
The documentation should make clear that a copy is made only if it is a `UInt32` array, and not if it is a `Char` array...
[or the behavior should be changed, to do a `reinterpret(Char,arr)`, so that it doesn't copy...]

julia> j = UTF32String(reinterpret(Char,a))
"A\x02\x03"

julia> a[2] = 66
66

julia> j
"AB\x03"

Which should be changed?  One *is* wrong.

Scott Jones

May 21, 2015, 10:03:09 AM
to juli...@googlegroups.com
Thinking about it, since `UTF8String` on a `Vector{UInt8}` and `UTF16String` on a `Vector{UInt16}` both create a mutable string, it would seem that `UTF32String` should be changed to accept `Vector{UInt32}` as well as `Vector{Char}` without copying.

*However*, it seems that none of these constructors should work without copying (yes, I'm aware that a lot of code that merrily builds a vector and then "converts" it to a string would have to change). Leaving them as they are means that people can easily create mutable strings without even knowing it, and that code operating on strings can't depend on strings being immutable. Since this is a common way of constructing strings, it seems rather dangerous.

Is there some way to have an "unsafe" constructor that could be used in the places where a vector is built and then returned as a string, so that there aren't any aliases?
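
To make the question concrete, here is a minimal sketch of the copy-versus-alias distinction. `CodeUnits32`, `safe_string`, and `unsafe_string` are hypothetical names invented for this illustration, written in current Julia `struct` syntax; they are not part of Base or of the `UTF32String` API, and this is only one possible shape for such a pair of constructors:

```julia
# Hypothetical wrapper type, for illustration only (not Base's UTF32String).
struct CodeUnits32
    data::Vector{UInt32}
end

# Safe constructor: copies the input, so later writes to `a`
# cannot change the resulting value.
safe_string(a::Vector{UInt32}) = CodeUnits32(copy(a))

# "Unsafe" constructor: wraps `a` directly without copying. The caller
# promises not to keep or mutate `a` afterwards.
unsafe_string(a::Vector{UInt32}) = CodeUnits32(a)

a = UInt32[0x68, 0x69]
s = safe_string(a)
u = unsafe_string(a)

a[1] = 0x6a                  # mutate the original array

@assert s.data[1] == 0x68    # the copy is unaffected
@assert u.data[1] == 0x6a    # the alias sees the mutation
```

With this split, code that builds a fresh vector and immediately returns it as a string could opt into the non-copying path, while the default constructor would keep the immutability guarantee.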