Reinterpret Array{Uint8,2} as Array{UTF8String}?

Jacob Quinn

Dec 16, 2013, 12:58:21 AM
to juli...@googlegroups.com
So I need a very performant way to convert a block of Array{Uint8,2} data, where each column is a separate string, into an Array{UTF8String}.

I'm using unsafe_copy! for bitstypes, and I know it works on other types too, but I need the additional step of converting the Uint8 arrays to strings as well. Has anybody dealt with this, or does anyone have any ideas? This comes up when working with a C API where the data is returned into a pre-allocated block of memory and I'm copying the block into allocated Julia arrays.

This is what I'm using right now, as a translation of the unsafe_copy! code from Base, but I just wanted to see if anyone else has dealt with this or knows of a better solution here.

function my_unsafe_copy!(dest::Array{UTF8String},dsto,src::Array{Uint8,2},n,ind)
    for i=1:n
        @inbounds arrayset(dest, utf8(bytestring(src[:,i])), i+dsto-1)
    end
end


Pierre-Yves Gérardy

Jan 3, 2014, 6:55:58 AM
to juli...@googlegroups.com
You can do it at no expense by passing the array to the UTF8String constructor, which doesn't copy the array. Be careful not to mutate the array afterwards, because that will modify the string's contents.

    julia> a = [0x1,0x2]
    2-element Uint8 Array:
     0x01
     0x02

    julia> s = UTF8String(a)
    "\x01\x02"

    julia> s.data === a
    true

—Pierre-Yves

Pierre-Yves Gérardy

Jan 3, 2014, 8:16:28 AM
to juli...@googlegroups.com
Whoops, array slices are actually copies, anyway.

Jacob Quinn

Jan 10, 2014, 11:45:14 PM
to juli...@googlegroups.com
Yeah, the problem is I DO need to copy the data. I'm just wondering if there's a more efficient way to copy the Uint8 bytes and convert them to Julia strings.

-Jacob

John Myles White

Jan 10, 2014, 11:52:33 PM
to juli...@googlegroups.com
I don’t have much context, but do you know about CharString?

julia> CharString([0x61, 0x62, 0x63])
"abc"

— John

Pierre-Yves Gérardy

Jan 11, 2014, 9:58:15 AM
to juli...@googlegroups.com, quinn....@gmail.com
The solution I suggested only makes one copy (the array slice), whereas yours makes two (`bytestring()` makes its own copy). It doesn't check the UTF-8 validity, but, actually, neither does your code. It would if you didn't call `bytestring()` (but it would still make a copy).

I don't know if it's possible to create the 1-d Uint8 array faster than with the slice operator, but I'd be surprised if it were.

How does this perform compared to your version?

function another_copy!(dest::Array{UTF8String},dsto,src::Array{Uint8,2},n,ind)
    for i=1:n
        @inbounds arrayset(dest, UTF8String(src[:,i]), i+dsto-1)
    end
end
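
For anyone reading this on current Julia (1.x), where `Uint8`, `UTF8String`, `bytestring` and `arrayset` are no longer available, the same column-by-column copy could be sketched roughly like this. It is only a sketch under assumptions: `columns_to_strings!` is a made-up name, the unused `ind` argument is dropped, and `unsafe_string` is used to copy each column's bytes straight out of the matrix instead of going through a slice. Like the versions above, it does not check UTF-8 validity, and whether it is actually faster would need benchmarking.

```julia
# Sketch for Julia 1.x; `columns_to_strings!` is a hypothetical helper name.
function columns_to_strings!(dest::Vector{String}, dsto::Integer,
                             src::Matrix{UInt8}, n::Integer)
    nrows = size(src, 1)
    # Keep `src` rooted while raw pointers into it are in use.
    GC.@preserve src begin
        for i in 1:n
            # Column i starts at linear index (i - 1) * nrows + 1 (column-major).
            p = pointer(src, (i - 1) * nrows + 1)
            # unsafe_string copies exactly `nrows` bytes into a new String.
            dest[i + dsto - 1] = unsafe_string(p, nrows)
        end
    end
    return dest
end

src = UInt8[0x61 0x64; 0x62 0x65; 0x63 0x66]  # columns hold "abc" and "def"
dest = Vector{String}(undef, 2)
columns_to_strings!(dest, 1, src, 2)
```

If the slice-based form is preferred, `String(src[:, i])` in the loop body should produce the same strings.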

Pierre-Yves Gérardy

Jan 11, 2014, 10:52:54 AM
to juli...@googlegroups.com
CharStrings are actually UTF-32-ish strings (four bytes per character, "-ish", because they can contain invalid code points).
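
To make the size difference concrete, here is a small sketch in current Julia (1.x), where `CharString` no longer exists. As an assumption for illustration, a `Vector{UInt32}` of code points stands in for the UTF-32-style storage, and `codepoints` is just a local variable name.

```julia
s = "aé"

# UTF-8 storage: 'a' takes one byte, 'é' takes two.
utf8_bytes = ncodeunits(s)

# UTF-32-style storage: one 32-bit code point per character.
codepoints = UInt32[UInt32(c) for c in s]
utf32_bytes = sizeof(codepoints)
```

Here `utf8_bytes` is 3 and `utf32_bytes` is 8, so the fixed-width form costs four bytes per character regardless of content.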

— Pierre-Yves