Processing Strings as Arrays of Unichars


Tom Wetmore

Oct 9, 2015, 11:51:50 AM
to Swift Language
I write a lot of lexing and parsing code. As much as I have hopped onto the Swift bandwagon, I still cringe whenever I have to do any kind of even mildly complex string processing with Swift. I have experimented with a number of different approaches, including NSScanners (using Nate Cook’s great Swift extension to NSScanner) and iterating through String values directly, but every approach I’ve tried ends up making me feel constipated.

In a recent project involving natural language processing of English, being ported from a system I wrote years ago in Obj-C, I was faced with the need to port a reasonably complicated, hand-crafted lexer that converts raw strings into sequences of properly tokenized and properly recognized English sentences, which is not a trivial task.

Attempting to follow the existing Obj-C approach as closely as possible, I decided to try to deal with UTF16 values, which is essentially the Obj-C and Java approach. So, the first thing I did with the input String was to convert it to an array of UTF16/unichar/UInt16 values as follows:

let chars: [unichar] = Array(string.utf16) // where string is a String value.

That was pretty easy. Now the array chars is a nice, neat and clean array of Unicode characters (ALL Unicode characters I will ever process with this app are encodable as single UTF16 values, so this representation is perfect for me). Having characters in this form makes it very simple to write hand-crafted lexers and parsers.
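The single-code-unit assumption can be checked up front rather than taken on faith. This is a minimal sketch, assuming a current Swift toolchain; the helper name singleUnitUTF16 is hypothetical and not part of the original code, and UInt16 stands in for unichar (which is an alias for it):

```swift
// Hypothetical helper, not from the original post.
// Returns the UTF-16 code units of `string`, or nil if any character
// needs a surrogate pair (that is, lies outside the Basic Multilingual Plane).
func singleUnitUTF16(_ string: String) -> [UInt16]? {
    let units = Array(string.utf16)
    for unit in units {
        // A lead or trail surrogate means some character takes two code units.
        if UTF16.isLeadSurrogate(unit) || UTF16.isTrailSurrogate(unit) {
            return nil
        }
    }
    return units
}
```

A lexer built on the array representation could call this once on its input and refuse (or fall back to another path) when it returns nil.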

My current port lexes through this array of unichars, creating tokens whose text values I then convert back into Strings. Which brings me to my direct question. The way I do that conversion is with the following (to me ugly and obscure and esoteric) function:

func utf16ArrayToString (chars: [unichar]) -> String {
    var string = ""
    for char in chars { string.append(Character(UnicodeScalar(char))) }
    return string
}

Have any of you found an easier way to convert a unichar value or an array of unichar values to a Swift String? I would think this would be a very obvious API. Maybe I have missed something.

I also find it very aggravating that I don’t have any good way to compare a literally represented Unicode character to a specific UTF16 value using the good ‘ole single quoted character approach of C and Obj-C and most other languages. I now create unichar constants for key characters, e.g.,

let HyphenCharacter: unichar = 45 // '-'


Would it be difficult to add to Swift the single quote delimiter to mean the UTF16 or UTF32 value of a character, as in ‘3’ or ‘V’ or ‘\n’ and so on?

Tom Wetmore

Justin Kolb

Oct 9, 2015, 7:23:57 PM
to Swift Language
I would guess this is by design to start to move away from UTF16. Using the characters or unicodeScalars properties allows you to append Character or UnicodeScalar directly to a mutable String which makes them seem more supported. Having to convert UTF16 to UnicodeScalar doesn't seem like that much of a burden. Most likely the code to convert between the various encodings is centralized there.

http://programmers.stackexchange.com/questions/102205/should-utf-16-be-considered-harmful
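The appending mentioned above can be sketched as follows. This assumes a recent Swift toolchain, where the failable UInt16 initializer of Unicode.Scalar returns nil for surrogate code units; the function name appendExamples is made up for illustration:

```swift
// Hypothetical illustration of appending a Character and Unicode scalars
// directly to a mutable String.
func appendExamples() -> String {
    var text = ""

    // Append a Character to the string itself.
    text.append(Character("a"))

    // Append a Unicode scalar through the unicodeScalars view.
    let b: Unicode.Scalar = "b"
    text.unicodeScalars.append(b)

    // Convert a single UTF-16 code unit to a scalar, then append it.
    // The initializer yields nil when the unit is a surrogate.
    let unit: UInt16 = 45 // '-'
    if let scalar = Unicode.Scalar(unit) {
        text.unicodeScalars.append(scalar)
    }
    return text
}
```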

Brent Royal-Gordon

Oct 10, 2015, 5:49:23 AM
to Tom Wetmore, Swift Language
> func utf16ArrayToString (chars: [unichar]) -> String {
> var string = ""
> for char in chars { string.append(Character(UnicodeScalar(char))) }
> return string
> }
>
> Have any of you found any easier way to convert a unichar value or arrays of unichar values to Swift Strings?

The simplest way I’ve found that the type checker seems to guarantee will work is to (lazily) convert the array to UnicodeScalars, construct a UnicodeScalarView, and then build a string from it:

func utf16ArrayToString (chars: [unichar]) -> String {
    let scalars = chars.lazy.map(UnicodeScalar.init)
    let scalarView = String.UnicodeScalarView() + scalars
    return String(scalarView)
}

// one-liner style
func utf16ArrayToString (chars: [unichar]) -> String {
    return String(String.UnicodeScalarView() + chars.lazy.map(UnicodeScalar.init))
}
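On later toolchains (Swift 4 and up, which is an assumption about what the reader is running, not something this thread covers) the standard library can decode the code units directly, so the conversion needs no intermediate scalars. A minimal sketch, using UInt16 in place of its alias unichar so that Foundation is not required:

```swift
// Sketch for newer Swift versions; not the approach discussed in this thread.
func utf16ArrayToString(_ chars: [UInt16]) -> String {
    // Decodes the array as UTF-16. Ill-formed input, such as an unpaired
    // surrogate, is repaired with U+FFFD rather than causing a trap.
    return String(decoding: chars, as: UTF16.self)
}
```

Unlike the scalar-by-scalar versions, this also handles surrogate pairs, which does not matter for input known to be single-unit but costs nothing either.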

> I also find it very aggravating that I don’t have any good way to compare a literally represented Unicode character to a specific UTF16 value using the good ‘ole single quoted character approach of C and Obj-C and most other languages. I now create unichar constants for key characters, e.g.,
>
> let HyphenCharacter: unichar = 45 // '-'
>
>
> Would it be difficult to add to Swift the single quote delimiter to mean the UTF16 or UTF32 value of a character, as in ‘3’ or ‘V’ or ‘\n’ and so on?

You can actually add something like this to Swift yourself. unichar is just an alias for UInt16, so you can make UInt16 support character literals by making it conform to the UnicodeScalarLiteralConvertible protocol:

extension UInt16: UnicodeScalarLiteralConvertible {
    public init(unicodeScalarLiteral value: UnicodeScalar) {
        self.init(value.value)
    }
}

Now anywhere you were previously able to write a UInt16/unichar, you can instead write a single-character string literal.

utf16ArrayToString(["a", "b", "c", "d", "e"])
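If extending UInt16 itself feels too invasive, a small helper function gets most of the same readability. This is only a sketch: codeUnit is a hypothetical name, and it assumes a Swift version where the scalar type is spelled Unicode.Scalar (in those versions the protocol above is named ExpressibleByUnicodeScalarLiteral).

```swift
// Hypothetical helper: converts a scalar literal to a single UTF-16 code unit.
// Stops the program if the scalar lies outside the Basic Multilingual Plane,
// since such a scalar cannot be stored in one UInt16.
func codeUnit(_ scalar: Unicode.Scalar) -> UInt16 {
    guard let unit = UInt16(exactly: scalar.value) else {
        preconditionFailure("scalar \(scalar) does not fit in one UTF-16 code unit")
    }
    return unit
}

// Named constants for key characters, written with literals instead of numbers.
let hyphen = codeUnit("-")
let newline = codeUnit("\n")
```

The trade-off is that call sites read codeUnit("-") rather than a bare "-", but no standard-library type gains a new conformance.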

--
Brent Royal-Gordon
Architechies

Tom Wetmore

Oct 10, 2015, 12:19:15 PM
to Swift Language
Justin, Brent,

Thanks for the responses and suggestions.

Tom Wetmore