Efficiently Retrieving UTF8 from a Character in a String

bzamayo · January 28, 2019, 7:27pm

I currently have code like this:

for character in text {
    if condition(character) {
        continue
    }

    let utf8ForCharacter = String(character).utf8
    doSomethingWith(utf8ForCharacter)
}

I am familiar-ish with how Unicode works, but by no means an expert, so maybe this question is stupid ... Is there a more efficient way of getting the UTF8 sequence for that particular character?

Wrapping the character into a new String only to immediately request the utf8 out of it somehow feels like I'm treading down the wrong path.

I could take the text.utf8 view and then analyze the stream of bytes in turn, but then I'm reimplementing the Unicode decoder so it seems like that's not the answer. The other thing I considered was enumerating text.indices and then using that to fetch each character, and finding the UTF8 slice using characterIndex.samePosition(in: text.utf8) but that intuitively didn't feel efficient (or nice to write either).
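For reference, the index-based alternative described above can be sketched as follows. This is a minimal sketch that assumes Swift 4 or later, where String.Index values are shared between a string's views, so the Character boundaries from text.indices can subscript text.utf8 directly without samePosition(in:). The sample text and the utf8PerCharacter name are illustrative only.

```swift
// Sample input (illustrative): "h", precomposed "é" (U+00E9), "llo", a space,
// and U+1F44B WAVING HAND SIGN.
let text = "h\u{E9}llo \u{1F44B}"

// String.Index is shared across a string's views, so Character boundaries
// can slice the UTF-8 view without decoding anything by hand.
var utf8PerCharacter: [[UInt8]] = []
for start in text.indices {
    let end = text.index(after: start)
    // The slice is a Substring.UTF8View; copy it only to keep the example simple.
    utf8PerCharacter.append(Array(text.utf8[start..<end]))
}
```

Whether this is faster than String(character).utf8 is not something this sketch measures; it only shows that the decoder does not need to be reimplemented.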

(This code is not causing a performance bottleneck, so there's no pressing problem to solve. I'm merely interested if there was something more idiomatic.)

Michael_Ilseman · January 28, 2019, 7:56pm

For now, you can use String(myCharacter).utf8, which (currently in Swift 5) will not perform a deep copy if the character is large. Swift 5 has small-form optimizations for Characters of up to 15 UTF-8 code units in length, so in practice you often wouldn't even care about the deep copy.

After Swift 5.0 wraps up, we should add all of these missing views to the next release, such as UTF8View on Character.

Michael_Ilseman · January 28, 2019, 7:59pm

Also, you can try String(myCharacter).utf8.withContiguousStorageIfAvailable { utf8 in ... } if you want the lowest level of access to the content. Another thing we should add next release is something like this directly on String/Character/Unicode.Scalar.
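A minimal sketch of the two suggestions combined, assuming Swift 5. The helper name utf8Bytes(of:) and the sample text are hypothetical, and the space check stands in for the original condition. Because withContiguousStorageIfAvailable returns nil when a sequence cannot expose contiguous storage, the sketch falls back to a plain copy of the view.

```swift
// Hypothetical helper: the UTF-8 code units of a single Character.
func utf8Bytes(of character: Character) -> [UInt8] {
    let utf8 = String(character).utf8
    // Use the contiguous buffer when the view exposes one; otherwise
    // iterate the view and copy its code units.
    return utf8.withContiguousStorageIfAvailable { Array($0) } ?? Array(utf8)
}

// Sample input (illustrative): "a", a space, and precomposed "é" (U+00E9).
let sampleText = "a \u{E9}"

var bytesPerCharacter: [[UInt8]] = []
for character in sampleText {
    // Stand-in for the original `condition(character)` check.
    if character == " " {
        continue
    }
    bytesPerCharacter.append(utf8Bytes(of: character))
}
```

Copying into an Array is only for illustration; inside the closure the buffer can be read in place, which is the point of the lower-level access.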

bzamayo · January 28, 2019, 8:00pm

Awesome. Thanks.

jrose · January 28, 2019, 10:57pm

Remember also that a Character is not the same as a code point (Unicode scalar). This is one character, but at least three code points: x̂̌ ('x' + COMBINING CIRCUMFLEX ACCENT + COMBINING CARON). It's possible that iterating over text.unicodeScalars is closer to what you want…

(since you didn't specify what doSomethingWith is going to do)

…except that Unicode.Scalar doesn't have a good way to get the UTF-8 out of it either, right now. There's an accessor for UTF-16 but not UTF-8.
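The distinction can be sketched with the example character above. Since Unicode.Scalar has no UTF-8 view here, this sketch encodes each scalar with the standard library's UTF8.encode(_:into:); the helper name utf8Bytes(ofScalar:) is hypothetical.

```swift
// One Character built from three scalars:
// "x" + COMBINING CIRCUMFLEX ACCENT (U+0302) + COMBINING CARON (U+030C).
let combined: Character = "x\u{0302}\u{030C}"
let scalarCount = combined.unicodeScalars.count

// Hypothetical helper: the UTF-8 code units of a single Unicode.Scalar.
func utf8Bytes(ofScalar scalar: Unicode.Scalar) -> [UInt8] {
    var bytes: [UInt8] = []
    // UTF8.encode(_:into:) calls the closure once per code unit.
    UTF8.encode(scalar, into: { bytes.append($0) })
    return bytes
}

// Per-scalar UTF-8, as opposed to the UTF-8 of the whole Character.
let bytesPerScalar = combined.unicodeScalars.map { utf8Bytes(ofScalar: $0) }
```

Iterating per scalar yields three byte groups for this one Character, so which granularity is right depends on what doSomethingWith needs.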

Michael_Ilseman · January 29, 2019, 12:14am

Yeah, rounding this all out is part of my list of things to attack right after 5.0 wraps up. Here's what it should be:

View                                Read            Mutate
String (Default)                    Bidirectional   RangeReplaceable
String.UnicodeScalarView            Bidirectional   RangeReplaceable
String.UTF16View                    Bidirectional   no
String.UTF8View                     Bidirectional   no
Substring (Default)                 Bidirectional   RangeReplaceable
Substring.UnicodeScalarView         Bidirectional   RangeReplaceable
Substring.UTF16View                 Bidirectional   no
Substring.UTF8View                  Bidirectional   no
Character.UnicodeScalarView         Bidirectional   no
Unicode.Scalar.UTF16View            RandomAccess    no
(to add) Unicode.Scalar.UTF8View    RandomAccess    no
(to add) Character.UTF16View        Bidirectional   no
(to add) Character.UTF8View         Bidirectional   no

(mutation below scalar boundaries introduces the need for encoding validity checking after a batch of operations, which is a whole other area for the future)

bzamayo · January 29, 2019, 1:32am

That would explain a (different) problem I was having. Thanks for pointing this out.

ASwiftUser · January 29, 2019, 2:56am

I thought a Character contains a String.

github.com

apple/swift/blob/78dda7b6717333ec8e29201c8b637bb4a9e767ba/stdlib/public/core/Character.swift#L67

/// the [Unicode.org glossary][glossary]. In particular, this discussion
/// mentions [extended grapheme clusters][clusters] and [Unicode scalar
/// values][scalars].
///
/// [glossary]: http://www.unicode.org/glossary/
/// [clusters]: http://www.unicode.org/glossary/#extended_grapheme_cluster
/// [scalars]: http://www.unicode.org/glossary/#unicode_scalar_value
@_fixed_layout
public struct Character {
  @usableFromInline
  internal var _str: String

  @inlinable @inline(__always)
  internal init(unchecked str: String) {
    self._str = str
    _invariantCheck()
  }
}

extension Character {
  #if !INTERNAL_CHECKS_ENABLED

So it should be fairly straightforward to add the views to Character by just accessing them from _str.

Michael_Ilseman · January 29, 2019, 3:15am

Scroll down a few lines and you’ll see it already has them. Again, right after 5.0 ;-)

ASwiftUser · January 29, 2019, 3:36pm

Oops

benrimmington · February 15, 2019, 8:12pm

@Michael_Ilseman, does the internal _str: String stored by Character have to be UTF-8 encoded?

The initializer that creates a character from a single-character string uses init(unchecked:) with an _internalInvariant(_str._guts.isFastUTF8).

Will there be an issue when a single-character NSString is lazily bridged as a String without copying?

Michael_Ilseman · February 15, 2019, 8:18pm

When you get a Character from a String, it has its own storage, in contrast to Substring, which shares storage. I.e. Character is the Element type, which is copied; Substring is the SubSequence, which shares. This is also a big win for performance because the vast majority of Characters will fit in small form, so we don't need to retain the outer String.

There is no NSExtendedGraphemeCluster to lazily-bridge in, and there's likely no benefit in any future foreign Character concept. So, there should be no situations where a Character is not native.

benrimmington · February 15, 2019, 8:33pm

I was thinking of the following situation:

import Foundation

// Create a single-character string (UTF-16).
let ns: NSString = "A"

// Create a character (UTF-16).
let ch = Character(ns as String)

Edit: The example NSString would need to be non-ASCII, to ensure UTF-16 storage internally.

Michael_Ilseman · February 15, 2019, 8:39pm

That... would exhibit a bug if non-tagged. Could you file a JIRA?

benrimmington · February 15, 2019, 9:05pm

@Michael_Ilseman, I've assigned SR-9935 to you.

Michael_Ilseman · February 28, 2019, 12:40am

I pitched a UTF8View on Character here if you're interested.