for character in text {
    if condition(character) {
        continue
    }
    let utf8ForCharacter = String(character).utf8
    doSomethingWith(utf8ForCharacter)
}
I am familiar-ish with how Unicode works, but by no means an expert, so maybe this question is stupid... Is there a more efficient way of getting the UTF-8 sequence for that particular character?
Wrapping the character into a new String only to immediately request the utf8 out of it somehow feels like I'm treading down the wrong path.
I could take the text.utf8 view and analyze the stream of bytes myself, but then I'm reimplementing the Unicode decoder, so that doesn't seem like the answer. The other thing I considered was enumerating text.indices, using each index to fetch the character, and finding the UTF-8 slice with characterIndex.samePosition(in: text.utf8), but that intuitively didn't feel efficient (or nice to write either).
(This code is not causing a performance bottleneck, so there's no pressing problem to solve. I'm merely interested if there was something more idiomatic.)
For now, you can use String(myCharacter).utf8, which (currently in Swift 5) will not perform a deep copy if the character is large. Swift 5 has small-form optimizations for Characters of up to 15 UTF-8 code units in length, so in practice you often wouldn't even care about the deep copy.
After Swift 5.0 wraps up, we should add all of these missing views to the next release, such as UTF8View on Character.
Also, you can try String(myCharacter).utf8.withContiguousStorageIfAvailable { utf8 in ... } if you want the lowest level of access to the content. Another thing we should add next release is something like this directly on String/Character/Unicode.Scalar.
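A minimal sketch of both suggestions, assuming a toolchain in which String.UTF8View implements withContiguousStorageIfAvailable. That call returns nil when no contiguous UTF-8 storage is exposed, so the sketch falls back to copying the view. The helper name utf8Bytes(of:) is hypothetical and not part of the standard library.

```swift
// Hypothetical helper: returns the UTF-8 code units of a single Character.
func utf8Bytes(of character: Character) -> [UInt8] {
    let utf8 = String(character).utf8
    // Lowest-level access: borrow the contiguous buffer if one is available.
    if let bytes = utf8.withContiguousStorageIfAvailable({ Array($0) }) {
        return bytes
    }
    // Fallback: iterate the view when no contiguous storage is exposed.
    return Array(utf8)
}

// LATIN SMALL LETTER E WITH ACUTE, written as an escape to avoid ambiguity.
let eAcuteBytes = utf8Bytes(of: "\u{E9}")
```

Copying into an Array is only for illustration; inside the closure you could work on the buffer directly and avoid the copy.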
Remember also that a Character is not the same as a code point (Unicode scalar). This is one character, but at least three code points: x̂̌ ('x' + COMBINING CIRCUMFLEX ACCENT + COMBINING CARON). It's possible that iterating over text.unicodeScalars is closer to what you want…
(since you didn't specify what doSomethingWith is going to do)
…except that Unicode.Scalar doesn't have a good way to get the UTF-8 out of it either, right now. There's an accessor for UTF-16 but not UTF-8.
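To make the Character versus scalar distinction concrete, here is a small sketch. It counts the same text three ways and then encodes each scalar to UTF-8 through the standard library's Unicode.UTF8.encode(_:into:). The helper name utf8CodeUnits(of:) is hypothetical, and this is one possible workaround rather than a recommended API.

```swift
// "x" + COMBINING CIRCUMFLEX ACCENT + COMBINING CARON, written with escapes.
let cluster = "x\u{0302}\u{030C}"

let characterCount = cluster.count               // grapheme clusters
let scalarCount = cluster.unicodeScalars.count   // code points
let utf8Count = cluster.utf8.count               // UTF-8 code units

// Hypothetical helper: encode one scalar to UTF-8 with the standard codec.
func utf8CodeUnits(of scalar: Unicode.Scalar) -> [UInt8] {
    var units: [UInt8] = []
    Unicode.UTF8.encode(scalar, into: { units.append($0) })
    return units
}

// One array of UTF-8 code units per scalar in the cluster.
let perScalar = cluster.unicodeScalars.map { utf8CodeUnits(of: $0) }
```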
Yeah, rounding this all out is part of my list of things to attack right after 5.0 wraps up. Here's what it should be:
| View | Read | Mutate |
|---|---|---|
| String (Default) | Bidirectional | RangeReplaceable |
| String.UnicodeScalarView | Bidirectional | RangeReplaceable |
| String.UTF16View | Bidirectional | no |
| String.UTF8View | Bidirectional | no |
| Substring (Default) | Bidirectional | RangeReplaceable |
| Substring.UnicodeScalarView | Bidirectional | RangeReplaceable |
| Substring.UTF16View | Bidirectional | no |
| Substring.UTF8View | Bidirectional | no |
| Character.UnicodeScalarView | Bidirectional | no |
| Unicode.Scalar.UTF16View | RandomAccess | no |
| (to add) Unicode.Scalar.UTF8View | RandomAccess | no |
| (to add) Character.UTF16View | Bidirectional | no |
| (to add) Character.UTF8View | Bidirectional | no |
(mutation below scalar boundaries introduces the need for encoding validity checking after a batch of operations, which is a whole other area for the future)
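A short sketch of what a few of the existing rows mean in practice. It only touches views that already exist (none of the "to add" ones), and the variable names are just for illustration.

```swift
var word = "cafe"

// String.UnicodeScalarView is range-replaceable: append a combining accent.
let acute: Unicode.Scalar = "\u{0301}"   // COMBINING ACUTE ACCENT
word.unicodeScalars.append(acute)

let graphemeCount = word.count                  // the accent joins the "e"
let codePointCount = word.unicodeScalars.count  // but it is its own scalar

// String.UTF8View is bidirectional and read-only: take bytes from the back.
let lastTwoBytes = Array(word.utf8.suffix(2))

// Unicode.Scalar.UTF16View is random-access and indexed by Int.
let emoji: Unicode.Scalar = "\u{1F600}"
let utf16Units = emoji.utf16
let lowSurrogate = utf16Units[1]
```

Mutating through the scalar view keeps the string valid by construction, which is the reason the table stops offering mutation below scalar boundaries.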
When you get a Character from a String, it has its own storage, in contrast to Substring, which shares storage with the original. I.e. Character is the Element type, which is copied; Substring is the SubSequence, which shares. This is also a big win for performance because the vast majority of Characters will fit in small form, so we don't need to retain the outer String.
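A rough illustration of the Element versus SubSequence distinction. Storage sharing itself is not directly observable from user code, so this sketch only shows the visible side of it: a Substring keeps a reference to its base String and uses the same indices, while a Character is a standalone value.

```swift
let sentence = "hello world"

// Element: a Character is copied out and carries its own storage.
let firstCharacter: Character = sentence[sentence.startIndex]

// SubSequence: a Substring is a view into the original String.
let firstWord: Substring = sentence.prefix(5)

// The Substring still refers to its base string and shares its indices.
let sharesBase = firstWord.base == sentence
let sameStart = firstWord.startIndex == sentence.startIndex

// Converting to String copies the contents and drops the tie to the base.
let ownedWord = String(firstWord)
```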
There is no NSExtendedGraphemeCluster to lazily-bridge in, and there's likely no benefit in any future foreign Character concept. So, there should be no situations where a Character is not native.
import Foundation
// Create a single-character string (UTF-16).
let ns: NSString = "A"
// Create a character (UTF-16).
let ch = Character(ns as String)
Edit: The example NSString would need to be non-ASCII, to ensure UTF-16 storage internally.