bzamayo
(Benjamin Mayo)
1
I currently have code like this:
for character in text {
    if condition(character) {
        continue
    }
    let utf8ForCharacter = String(character).utf8
    doSomethingWith(utf8ForCharacter)
}
I am familiar-ish with how Unicode works, but by no means an expert, so maybe this question is stupid ... Is there a more efficient way of getting the UTF8 sequence for that particular character?
Wrapping the character into a new String only to immediately request the utf8 out of it somehow feels like I'm treading down the wrong path.
I could take the text.utf8 view and then analyze the stream of bytes in turn, but then I'm reimplementing the Unicode decoder so it seems like that's not the answer. The other thing I considered was enumerating text.indices and then using that to fetch each character, and finding the UTF8 slice using characterIndex.samePosition(in: text.utf8) but that intuitively didn't feel efficient (or nice to write either).
(This code is not causing a performance bottleneck, so there's no pressing problem to solve. I'm merely interested if there was something more idiomatic.)
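For reference, the index-based alternative described above can be sketched without `samePosition(in:)`, on the assumption that `String.Index` values are shared between a string and its views (true since Swift 4, as far as I know). The sample text and variable names here are hypothetical.

```swift
// Hypothetical sample text: "a", U+00E9 (precomposed é), U+1F600 (emoji).
let sampleText = "a\u{E9}\u{1F600}"

// Walk Character boundaries and slice the UTF-8 view with the same indices.
var bytesPerCharacter: [[UInt8]] = []
var index = sampleText.startIndex
while index != sampleText.endIndex {
    let next = sampleText.index(after: index)
    // The slice covers exactly the code units of one Character.
    bytesPerCharacter.append(Array(sampleText.utf8[index..<next]))
    index = next
}
```

This avoids creating a new String per character, though whether it is actually faster than `String(character).utf8` would need measuring.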
For now, you can use String(myCharacter).utf8, which (currently in Swift 5) will not perform a deep copy if the character is large. Swift 5 has small-form optimizations for Characters of up to 15 UTF-8 code units in length, so in practice you often wouldn't even care about the deep copy.
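A minimal sketch of that suggestion, wrapped in a helper so the call site stays tidy. The helper name `utf8Bytes(of:)` and the sample string are made up for illustration; the only API relied on is `String(character).utf8`.

```swift
// Hypothetical helper: copies the UTF-8 code units of one Character into an array.
func utf8Bytes(of character: Character) -> [UInt8] {
    return Array(String(character).utf8)
}

// Sample input: "h", U+00E9 (precomposed é), "y".
let perCharacter = "h\u{E9}y".map { utf8Bytes(of: $0) }
```

If the bytes are only iterated once, passing `String(character).utf8` along directly avoids the extra array.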
After Swift 5.0 wraps up, we should add all of these missing views to the next release, such as UTF8View on Character.
Also, you can try String(myCharacter).utf8.withContiguousStorageIfAvailable { utf8 in ... } if you want the lowest level of access to the content. Another thing we should add next release is something like this directly on String/Character/Unicode.Scalar.
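A rough sketch of the contiguous-storage route, assuming a toolchain where `String.UTF8View` implements `withContiguousStorageIfAvailable`. The call returns `nil` when no contiguous buffer can be offered, so a fallback path is kept. The function name `utf8ByteSum(of:)` is hypothetical and only there to give the closure something to compute.

```swift
// Hypothetical example: sum the UTF-8 code units of a Character,
// using the contiguous buffer when one is available.
func utf8ByteSum(of character: Character) -> Int {
    let utf8 = String(character).utf8
    // Fast path: direct access to an UnsafeBufferPointer<UInt8>.
    let fast = utf8.withContiguousStorageIfAvailable { buffer in
        buffer.reduce(0) { $0 + Int($1) }
    }
    if let sum = fast {
        return sum
    }
    // Fallback: iterate the view normally.
    return utf8.reduce(0) { $0 + Int($1) }
}
```

The buffer pointer must not escape the closure; copy out whatever is needed before returning.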
jrose
(Jordan Rose)
5
Remember also that a Character is not the same as a code point (Unicode scalar). This is one character, but at least three code points: x̂̌ ('x' + COMBINING CIRCUMFLEX ACCENT + COMBINING CARON). It's possible that iterating over text.unicodeScalars is closer to what you want…
(since you didn't specify what doSomethingWith is going to do)
…except that Unicode.Scalar doesn't have a good way to get the UTF-8 out of it either, right now. There's an accessor for UTF-16 but not UTF-8.
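To make the distinction concrete, here is a small sketch using the example above. It gets per-scalar UTF-8 by going through `String(scalar).utf8`, which is a workaround rather than a dedicated accessor; the variable names are made up.

```swift
// 'x' + COMBINING CIRCUMFLEX ACCENT (U+0302) + COMBINING CARON (U+030C).
let accented: Character = "x\u{302}\u{30C}"

// One Character, but three Unicode scalars.
let scalarCount = accented.unicodeScalars.count

// Workaround: wrap each scalar in a String to reach its UTF-8 code units.
let utf8PerScalar = accented.unicodeScalars.map { scalar in
    Array(String(scalar).utf8)
}
```

Whether to iterate Characters or scalars depends on what the consumer of the bytes expects.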
Yeah, rounding this all out is part of my list of things to attack right after 5.0 wraps up. Here's what it should be:
| View | Read | Mutate |
| --- | --- | --- |
| String (Default) | Bidirectional | RangeReplaceable |
| String.UnicodeScalarView | Bidirectional | RangeReplaceable |
| String.UTF16View | Bidirectional | no |
| String.UTF8View | Bidirectional | no |
| Substring (Default) | Bidirectional | RangeReplaceable |
| Substring.UnicodeScalarView | Bidirectional | RangeReplaceable |
| Substring.UTF16View | Bidirectional | no |
| Substring.UTF8View | Bidirectional | no |
| Character.UnicodeScalarView | Bidirectional | no |
| Unicode.Scalar.UTF16View | RandomAccess | no |
| (to add) Unicode.Scalar.UTF8View | RandomAccess | no |
| (to add) Character.UTF16View | Bidirectional | no |
| (to add) Character.UTF8View | Bidirectional | no |
(mutation below scalar boundaries introduces the need for encoding validity checking after a batch of operations, which is a whole other area for the future)
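As an illustration of the Read/Mutate columns, here is a short sketch, assuming current behavior of the existing `String` views: the scalar view accepts range-replaceable mutation, while the UTF-8 view is read-only but can be traversed in both directions. The variable names are hypothetical.

```swift
var word = "cafe"

// Mutate at scalar granularity: append COMBINING ACUTE ACCENT (U+0301).
let acute: Unicode.Scalar = "\u{301}"
word.unicodeScalars.append(acute)

// The accent merges with the final "e", so the Character count is unchanged.
let characterCount = word.count

// The UTF-8 view is read-only, but it is bidirectional.
let utf8Count = word.utf8.count
let lastByte = word.utf8.last
```

Here the mutation stays on a scalar boundary, so no encoding validity check is needed afterwards.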
bzamayo
(Benjamin Mayo)
7
That would explain a (different) problem I was having. Thanks for pointing this out.
I thought a Character contains a String.
So it should be fairly straightforward to add the views to Character by just accessing them from _str.
Scroll down a few lines and you’ll see it already has them. Again, right after 5.0 ;-)
@Michael_Ilseman, does the internal _str: String stored by Character have to be UTF-8 encoded?
The initializer that creates a character from a single-character string uses init(unchecked:) with an _internalInvariant(_str._guts.isFastUTF8).
Will there be an issue when a single-character NSString is lazily bridged as a String without copying?
When you get a Character from a String, it has its own storage, in contrast to Substring, which shares storage. That is, Character is the Element type, which is copied; Substring is the SubSequence, which shares. This is also a big win for performance because the vast majority of Characters will fit in small form, so we don't need to retain the outer String.
There is no NSExtendedGraphemeCluster to lazily-bridge in, and there's likely no benefit in any future foreign Character concept. So, there should be no situations where a Character is not native.
I was thinking of the following situation:
import Foundation
// Create a single-character string (UTF-16).
let ns: NSString = "A"
// Create a character (UTF-16).
let ch = Character(ns as String)
Edit: The example NSString would need to be non-ASCII, to ensure UTF-16 storage internally.
That... would exhibit a bug if non-tagged. Could you file a JIRA?
@Michael_Ilseman, I've assigned SR-9935 to you.
I pitched a UTF8View on Character here if you're interested.