Efficiently Retrieving UTF8 from a Character in a String

I currently have code like this:

for character in text {
    if condition(character) {
        continue
    }

    let utf8ForCharacter = String(character).utf8
    doSomethingWith(utf8ForCharacter)
}

I am familiar-ish with how Unicode works, but by no means an expert, so maybe this question is stupid ... Is there a more efficient way of getting the UTF8 sequence for that particular character?

Wrapping the character into a new String only to immediately request the utf8 out of it somehow feels like I'm treading down the wrong path.

I could take the text.utf8 view and then analyze the stream of bytes in turn, but then I'm reimplementing the Unicode decoder so it seems like that's not the answer. The other thing I considered was enumerating text.indices and then using that to fetch each character, and finding the UTF8 slice using characterIndex.samePosition(in: text.utf8) but that intuitively didn't feel efficient (or nice to write either).
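For reference, the index-based alternative described above could look roughly like the sketch below. It assumes that character indices can be used directly to slice text.utf8 (String and its views share String.Index), so samePosition(in:) is not needed here. The names text and slices are made up for illustration.

```swift
// Illustrative input: "a" followed by U+00E9 (é), which is two bytes in UTF-8.
let text = "a\u{E9}"

var slices: [[UInt8]] = []
for start in text.indices {
    // Each character spans from its own index up to the next character's index.
    let end = text.index(after: start)
    // Slice the UTF-8 view with the character boundaries and copy the bytes out.
    slices.append(Array(text.utf8[start..<end]))
}
```

This avoids creating a String per character, but it is more to write, and whether it is actually faster would need measuring.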

(This code is not causing a performance bottleneck, so there's no pressing problem to solve. I'm merely interested if there was something more idiomatic.)

For now, you can use String(myCharacter).utf8, which (currently in Swift 5) will not perform a deep copy if the character is large. Swift 5 has small-form optimizations for Characters of up to 15 UTF-8 code units in length, so in practice you often wouldn't even care about the deep copy.
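A minimal sketch of that approach, mirroring the loop from the question. The skip condition and the collected array are placeholders invented for the example, standing in for condition and doSomethingWith.

```swift
// Illustrative input: "h", U+00E9 (é), "l", "l", "o".
let text = "h\u{E9}llo"

// Hypothetical stand-in for the question's `condition(character)`.
func shouldSkip(_ character: Character) -> Bool {
    return character == "l"
}

var encoded: [[UInt8]] = []
for character in text {
    if shouldSkip(character) {
        continue
    }
    // Wrap the character in a String and read its UTF-8 code units.
    let utf8ForCharacter = String(character).utf8
    encoded.append(Array(utf8ForCharacter))
}
```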

After Swift 5.0 wraps up, we should add all of these missing views to the next release, such as UTF8View on Character.

Also, you can try String(myCharacter).utf8.withContiguousStorageIfAvailable { utf8 in ... } if you want the lowest level of access to the content. Another thing we should add next release is something like this directly on String/Character/Unicode.Scalar.
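A rough sketch of how that might be wrapped up. The helper name utf8Bytes(of:) is made up for this example. Since withContiguousStorageIfAvailable returns nil when no contiguous storage is available, the sketch falls back to copying from the view.

```swift
// Hypothetical helper: returns the UTF-8 code units of a single Character.
func utf8Bytes(of character: Character) -> [UInt8] {
    let utf8 = String(character).utf8
    // Fast path: read straight from the contiguous buffer when one is available.
    // The buffer pointer is only valid inside the closure, so copy the bytes out.
    if let bytes = utf8.withContiguousStorageIfAvailable({ Array($0) }) {
        return bytes
    }
    // Fallback: iterate the view.
    return Array(utf8)
}

let bytes = utf8Bytes(of: "\u{E9}")
```

In real code you would probably do the work inside the closure instead of copying into an Array; the copy here just keeps the example easy to check.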

Awesome. Thanks.

Remember also that a Character is not the same as a code point (Unicode scalar). This is one character, but at least three code points: x̂̌ ('x' + COMBINING CIRCUMFLEX ACCENT + COMBINING CARON). It's possible that iterating over text.unicodeScalars is closer to what you want…

(since you didn't specify what doSomethingWith is going to do)

…except that Unicode.Scalar doesn't have a good way to get the UTF-8 out of it either, right now. There's an accessor for UTF-16 but not UTF-8.
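To make the difference concrete, here is a small sketch using the example above. Going through Character and String to get the UTF-8 of each scalar is just a workaround chosen for the example, not a recommended API.

```swift
// 'x' + COMBINING CIRCUMFLEX ACCENT (U+0302) + COMBINING CARON (U+030C).
let text = "x\u{302}\u{30C}"

let characterCount = text.count             // grapheme clusters
let scalarCount = text.unicodeScalars.count // Unicode scalars
let utf8Count = text.utf8.count             // UTF-8 code units

// Workaround for the missing UTF-8 accessor on Unicode.Scalar:
// wrap each scalar in a Character, then a String, and read its utf8 view.
let perScalarUTF8 = text.unicodeScalars.map { scalar in
    Array(String(Character(scalar)).utf8)
}
```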

Yeah, rounding this all out is part of my list of things to attack right after 5.0 wraps up. Here's what it should be:

| View | Read | Mutate |
| --- | --- | --- |
| String (Default) | Bidirectional | RangeReplaceable |
| String.UnicodeScalarView | Bidirectional | RangeReplaceable |
| String.UTF16View | Bidirectional | no |
| String.UTF8View | Bidirectional | no |
| Substring (Default) | Bidirectional | RangeReplaceable |
| Substring.UnicodeScalarView | Bidirectional | RangeReplaceable |
| Substring.UTF16View | Bidirectional | no |
| Substring.UTF8View | Bidirectional | no |
| Character.UnicodeScalarView | Bidirectional | no |
| Unicode.Scalar.UTF16View | RandomAccess | no |
| (to add) Unicode.Scalar.UTF8View | RandomAccess | no |
| (to add) Character.UTF16View | Bidirectional | no |
| (to add) Character.UTF8View | Bidirectional | no |

(mutation below scalar boundaries introduces the need for encoding validity checking after a batch of operations, which is a whole other area for the future)
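As a small illustration of the Read and Mutate columns for views that already exist: the sketch below appends through the unicode scalar view (RangeReplaceable) and walks the UTF-8 view backwards (Bidirectional, read-only). The variable names are made up for the example.

```swift
var text = "abc"

// String.UnicodeScalarView is RangeReplaceable, so it can be mutated in place.
let d: Unicode.Scalar = "d"
text.unicodeScalars.append(d)

// String.UTF8View is Bidirectional, so it can be traversed from the end,
// but it offers no mutating operations.
let reversedUTF8 = Array(text.utf8.reversed())
```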

That would explain a (different) problem I was having. Thanks for pointing this out.

I thought a Character contains a String.

So it should be fairly straightforward to add the views to Character by just accessing them from _str.

Scroll down a few lines and you’ll see it already has them. Again, right after 5.0 ;-)

Oops :stuck_out_tongue:

@Michael_Ilseman, does the internal _str: String stored by Character have to be UTF-8 encoded?

The initializer that creates a character from a single-character string uses init(unchecked:) with an _internalInvariant(_str._guts.isFastUTF8).

Will there be an issue when a single-character NSString is lazily bridged as a String without copying?

When you get a Character from a String, it has its own storage, in contrast to Substring, which shares storage. I.e. Character is the Element type, which is copied; Substring is the SubSequence, which shares. This is also a big win for performance because the vast majority of Characters will fit in small form, so we don't need to retain the outer String.
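A small sketch of the Element versus SubSequence distinction. It does not observe storage directly (that is not visible through public API); it only shows that a Substring keeps using the index space of its base String, while a Character is a standalone value. The names are made up for the example.

```swift
let text = "h\u{E9}llo"

// Element: a standalone Character value.
let first: Character = text[text.startIndex]

// SubSequence: a Substring that is a slice of `text`.
let rest: Substring = text.dropFirst()

// The slice's indices are indices into the base string.
let secondIndex = text.index(after: text.startIndex)
let sliceStartsAtSecond = (rest.startIndex == secondIndex)
let characterAtSliceStart = text[rest.startIndex]
```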

There is no NSExtendedGraphemeCluster to lazily-bridge in, and there's likely no benefit in any future foreign Character concept. So, there should be no situations where a Character is not native.

I was thinking of the following situation:

import Foundation

// Create a single-character string (UTF-16).
let ns: NSString = "A"

// Create a character (UTF-16).
let ch = Character(ns as String)

Edit: The example NSString would need to be non-ASCII, to ensure UTF-16 storage internally.

That... would exhibit a bug if non-tagged :sweat_smile:. Could you file a JIRA?

@Michael_Ilseman, I've assigned SR-9935 to you.

I pitched a UTF8View on Character here if you're interested.