Creating Strings from UTF-8 in [UInt8]

chris...@gmail.com

Jun 16, 2015, 2:54:56 PM
to swift-l...@googlegroups.com
Hi,

I have a sequence of bytes in an array that might be a valid UTF-8 sequence. I can of course dip down into Foundation to convert these into a String, however I'm looking for a way to do this in "pure" Swift.

Looking at the standard library docs just makes my head spin! I can easily get UTF8 out of a String, but going in the other direction seems extraordinarily complicated.

Has anyone figured out the magic syntax for doing something like this?

Thanks,

Chris


Jens Alfke

Jun 16, 2015, 6:51:06 PM
to chris...@gmail.com, swift-l...@googlegroups.com

> On Jun 16, 2015, at 11:54 AM, chris...@gmail.com wrote:
>
> I have a sequence of bytes in an array that might be a valid UTF-8 sequence. I can of course dip down into Foundation to convert these into a String, however I'm looking for a way to do this in "pure" Swift.

Honestly I don’t know if there is one. They may not have added this yet since it’s easily done using the bridging to NSString. If so, this is something they’ll need to fix for the open-source release since Foundation won’t be available.

—Jens

Brent Royal-Gordon

Jun 16, 2015, 11:58:49 PM
to Jens Alfke, chris...@gmail.com, swift-l...@googlegroups.com
There's a struct called UTF8 that encodes and decodes UTF-8 sequences into UnicodeScalars. Using it is a little cumbersome, but very doable:

func decode<UnicodeCodec: UnicodeCodecType>(codeUnits: [UnicodeCodec.CodeUnit], var decoder: UnicodeCodec) -> String? {
    var string = ""

    var codeUnitGenerator = codeUnits.generate()
    while true {
        switch decoder.decode(&codeUnitGenerator) {
        case .Result(let scalar):
            string.append(scalar)
        case .EmptyInput:
            return string
        case .Error:
            return nil
        }
    }
}
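
For anyone reading this with Swift 3 or later: the standard library names used above were renamed (`generate()` became `makeIterator()`, and the result cases became `.scalarValue`, `.emptyInput` and `.error`). A rough sketch of the same loop in the newer spelling follows. The function name `decodeUTF8` and the sample bytes are made up for illustration and are not from the thread.

```swift
// Sketch only: `decodeUTF8` is a hypothetical name, written for Swift 3+.
// Returns nil as soon as the decoder reports malformed input.
func decodeUTF8(_ bytes: [UInt8]) -> String? {
    var decoder = UTF8()
    var iterator = bytes.makeIterator()
    var result = ""
    while true {
        switch decoder.decode(&iterator) {
        case .scalarValue(let scalar):
            // Append one decoded Unicode scalar at a time.
            result.unicodeScalars.append(scalar)
        case .emptyInput:
            // All bytes consumed without error.
            return result
        case .error:
            return nil
        }
    }
}

let validBytes: [UInt8] = [0x63, 0x61, 0x66, 0xC3, 0xA9]  // "caf" plus U+00E9
let invalidBytes: [UInt8] = [0x63, 0xFF]                  // 0xFF never occurs in UTF-8
print(decodeUTF8(validBytes) ?? "nil")
print(decodeUTF8(invalidBytes) ?? "nil")
```

The strict nil-on-error behaviour matches the original post; callers that prefer repair over failure would handle the `.error` case differently.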



--
You received this message because you are subscribed to the Google Groups "Swift Language" group.
To unsubscribe from this group and stop receiving emails from it, send an email to swift-languag...@googlegroups.com.
To post to this group, send email to swift-l...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/swift-language/595462CD-AC45-4900-9F86-A05585D03A74%40mooseyard.com.
For more options, visit https://groups.google.com/d/optout.

chris...@gmail.com

Jun 17, 2015, 8:56:08 AM
to swift-l...@googlegroups.com, chris...@gmail.com, je...@mooseyard.com
On Wednesday, June 17, 2015 at 4:58:49 AM UTC+1, Brent Royal-Gordon wrote:
> There's a struct called UTF8 that encodes and decodes UTF-8 sequences into UnicodeScalars. Using it is a little cumbersome, but very doable:
>
> func decode<UnicodeCodec: UnicodeCodecType>(codeUnits: [UnicodeCodec.CodeUnit], var decoder: UnicodeCodec) -> String? {
>     var string = ""
>
>     var codeUnitGenerator = codeUnits.generate()
>     while true {
>         switch decoder.decode(&codeUnitGenerator) {
>         case .Result(let scalar):
>             string.append(scalar)
>         case .EmptyInput:
>             return string
>         case .Error:
>             return nil
>         }
>     }
> }

Brent, that's fantastically helpful - thank you very much!

Chris

Jens Alfke

Jun 17, 2015, 1:29:17 PM
to Brent Royal-Gordon, chris...@gmail.com, swift-l...@googlegroups.com

On Jun 16, 2015, at 8:58 PM, Brent Royal-Gordon <br...@brentdax.com> wrote:

> There's a struct called UTF8 that encodes and decodes UTF-8 sequences into UnicodeScalars. Using it is a little cumbersome, but very doable:

That’s way too cumbersome to be the way we’re expected to convert UTF-8 to a String. I assume that the current preferred way is to use the NSString-bridged method, and that they’ll add a pure-Swift method in time for the open source release.

—Jens

Jens Alfke

Nov 5, 2015, 2:34:01 PM
to Swift Language, chris...@gmail.com, je...@mooseyard.com
I just ran into this issue — I've got raw UTF-8, not nul-terminated, and I need to create a String from it. This is in some code that needs to run as fast as possible, and may eventually be cross-platform, so I want to avoid bridging through NSString. I'm skeptical of the approach below because it looks slow (appending characters one at a time) and complex.

It looks like if I could create a String.UTF8View from a [UInt8], then I could use that to initialize a String. UTF8View doesn't show any initialization methods of its own, but it inherits from protocols like Collection. My Swift-fu is still pretty weak, however, so I'm not sure how to use that to create an instance from an array.

—Jens

Kevin Ballard

Nov 5, 2015, 5:48:14 PM
to swift-l...@googlegroups.com
CollectionType doesn't have any initializers. There's no (public) way to construct a String.UTF8View, short of getting one from an existing String, and there's no way to mutate a String.UTF8View once you have it. In addition, a String.UTF8View is really just a transformation on top of the String's native buffer, which turns out to be UTF-16 (even for native Strings, which I think is awful but probably exists to get the expected O(1) behavior when bridged to NSString), so even if you could construct a String.UTF8View it would just end up doing the same decoding routine.
 
The only way to do better than starting with an empty string and appending to it is to either predict how many unicode scalars the string will actually have, or actually decode the UTF-8 once to count it and then again to create the string. If you want to try and predict how many unicode scalars there are in a UTF-8 sequence of length N, it's anywhere from N/4 to N (as the maximal encoding size for one scalar is 4 UTF-8 code units). If you're ok with over-estimating and expect your input to likely be ASCII, you can pick N. If you're not sure what your input is, N/2 is a reasonable approximation. If you really don't want to overestimate at all (e.g. memory usage is a concern), go with N/4, although once you append any scalars past this point, the string will end up overestimating anyway.
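
To make the N/4-to-N range concrete, here is a small sketch in Swift 3+ spelling that reserves capacity from an estimate before decoding. The helper name `decodeReserving` and the sample inputs are assumptions for illustration. What `reserveCapacity` actually counts depends on how String stores its contents in the Swift version in use; in Swift 5 and later native strings are stored as UTF-8, so the byte count itself is a natural figure to reserve there.

```swift
// Hypothetical helper: decode UTF-8 after reserving space from an estimate.
func decodeReserving(_ bytes: [UInt8], estimate: Int) -> String? {
    var result = ""
    // A hint only; the string still grows if the estimate is too small.
    result.reserveCapacity(estimate)
    var decoder = UTF8()
    var iterator = bytes.makeIterator()
    while true {
        switch decoder.decode(&iterator) {
        case .scalarValue(let scalar):
            result.unicodeScalars.append(scalar)
        case .emptyInput:
            return result
        case .error:
            return nil
        }
    }
}

// Upper end of the range: pure ASCII, one scalar per byte (N bytes, N scalars).
let asciiBytes = Array("hello".utf8)
// Lower end of the range: four bytes per scalar (N bytes, N/4 scalars).
let fourByteBytes = Array("\u{1F600}\u{1F600}".utf8)

let asciiText = decodeReserving(asciiBytes, estimate: asciiBytes.count)
let fourByteText = decodeReserving(fourByteBytes, estimate: fourByteBytes.count / 4)
print(asciiBytes.count, asciiText?.unicodeScalars.count ?? -1)
print(fourByteBytes.count, fourByteText?.unicodeScalars.count ?? -1)
```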
 
As for actually decoding, the quoted code is as good as you're going to get. If it's inlined (and if the UTF8 struct is implemented efficiently), then it should be plenty fast.
 
Also note that this returns nil on error. Another common way to handle decoding errors is to replace the bad sequence with U+FFFD instead, so you could alter the function to use `string.append("\u{FFFD}")` upon .Error if you like that idea.
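
As a rough illustration of the replacement-character variant, written in Swift 3+ spelling rather than the 2015 syntax used in this thread: the function name `decodeUTF8Lossy` and the sample bytes are hypothetical. Swift 4 and later also provide `String(decoding:as:)`, which repairs malformed input in a single call.

```swift
// Sketch only: substitute U+FFFD for malformed input instead of failing.
func decodeUTF8Lossy(_ bytes: [UInt8]) -> String {
    var decoder = UTF8()
    var iterator = bytes.makeIterator()
    var result = ""
    while true {
        switch decoder.decode(&iterator) {
        case .scalarValue(let scalar):
            result.unicodeScalars.append(scalar)
        case .emptyInput:
            return result
        case .error:
            // The decoder has already skipped the bad bytes; mark the spot.
            result.unicodeScalars.append("\u{FFFD}")
        }
    }
}

let damagedBytes: [UInt8] = [0x48, 0x69, 0xFF, 0x21]  // "Hi", a stray 0xFF, "!"
let repaired = decodeUTF8Lossy(damagedBytes)
let builtIn = String(decoding: damagedBytes, as: UTF8.self)
print(repaired == builtIn)
```

Whether to repair or reject is a policy choice: repair suits display text, while rejecting is safer for identifiers or anything compared byte-for-byte later.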
 
Incidentally, there's a global function transcode() that can convert between any two UnicodeCodecs, so you could say something like
 
var s = ""
transcode(UTF8.self, UTF32.self, inputSeq, { c in s.append(UnicodeScalar(c)) }, stopOnError: yesOrNo)
 
although that has to convert from a UInt32 into a UnicodeScalar, which probably does a bounds check, and so it might actually be slower than the explicit decode() function.
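
In Swift 3 and later the same global function is spelled `transcode(_:from:to:stoppingOnError:into:)`, takes an iterator rather than a sequence, and returns true if it detected an encoding error. A minimal sketch under those assumptions follows; the wrapper name `transcodeUTF8` and its tuple result are invented here for illustration.

```swift
// Hypothetical wrapper around the standard library's transcode function.
func transcodeUTF8(_ bytes: [UInt8], stopOnError: Bool) -> (string: String, hadError: Bool) {
    var result = ""
    let hadError = transcode(
        bytes.makeIterator(),
        from: UTF8.self,
        to: UTF32.self,
        stoppingOnError: stopOnError
    ) { codeUnit in
        // UTF-32 code units are UInt32; the failable initializer does the
        // range check mentioned in the post above.
        if let scalar = Unicode.Scalar(codeUnit) {
            result.unicodeScalars.append(scalar)
        }
    }
    return (result, hadError)
}

let clean = transcodeUTF8([0x68, 0x69], stopOnError: true)
let broken = transcodeUTF8([0x68, 0xFF, 0x69], stopOnError: true)
print(clean.string, clean.hadError)
print(broken.string, broken.hadError)
```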
 
-Kevin Ballard

Jens Alfke

Nov 5, 2015, 6:17:21 PM
to Kevin Ballard, swift-l...@googlegroups.com
Thanks for the info, Kevin. It still seems quite wrong to me that there’s no simple way to do this, given how fundamental it is to read strings from external data. For instance, the Obj-C project I work on, which does a lot of database access and networking, has 39 calls to NSString’s -initWithBytes:length:encoding: and -initWithData:encoding: methods in its source code (all of which specify UTF-8.)

As I said before, I’m guessing that’s because people can use the NSString methods as a crutch. When the cross-platform release of Swift comes out, there had better be a clean way to do this in pure Swift.

(There does exist String.fromCString, but it requires that the UTF-8 data be nul-terminated, which mine isn’t.)

—Jens

Kevin Ballard

Nov 5, 2015, 6:21:02 PM
to swift-l...@googlegroups.com
There are a _lot_ of areas where the Swift stdlib doesn't really provide
what you need and assumes you'll have Foundation to back you up. For
example, there are tons of APIs on String itself that are only present
once you import Foundation, including `init?<S: SequenceType where
S.Generator.Element == UInt8>(bytes: S, encoding: NSStringEncoding)`,
which would be the simplest way to do the decoding you want except that
it's not cross-platform.
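
For reference, a short sketch of that Foundation initializer as it is spelled in Swift 3 and later, where the encoding argument is `String.Encoding.utf8` rather than `NSUTF8StringEncoding`. The variable names and sample bytes are illustrative only.

```swift
import Foundation

// Raw UTF-8 bytes with no nul terminator, e.g. as read from a socket or file.
let payloadBytes: [UInt8] = [0x53, 0x77, 0x69, 0x66, 0x74]  // "Swift"

// Failable: the result is optional because the bytes may not be valid
// in the requested encoding.
let payloadText = String(bytes: payloadBytes, encoding: .utf8)
print(payloadText ?? "could not decode")
```

The initializer accepts any sequence of `UInt8`, so an array slice or a `Data` value can be passed without copying into an array first.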

-Kevin

Chris Ridd

Nov 6, 2015, 2:40:28 AM
to Kevin Ballard, swift-l...@googlegroups.com

> On 5 Nov 2015, at 23:20, Kevin Ballard <ke...@sb.org> wrote:
>
> There's a _lot_ of areas where the Swift stdlib doesn't really provide
> what you need and assumes you'll have Foundation to back you up. For
> example, there's tons of APIs on String itself that are only present
> once you import Foundation, including `init?<S: SequenceType where
> S.Generator.Element == UInt8>(bytes: S, encoding: NSStringEncoding)`
> that would be the simplest way to do the decoding you want except that
> it's not cross-platform.

Is it worth raising radars for each dependency on Foundation for “basic” functionality that should be in the stdlib?

Or should we just wait and submit bugs/patches for the open source stdlib?

Chris