Minor String enhancements

I'd like to pitch a couple of hopefully uncontroversial additions to the String APIs (at least I don't think the concepts are too controversial, although I expect and welcome bikeshedding):

  1. Add utf8Offset getter and initializer to String.Index.

    SE-0241 was a late proposal which deprecated encodedOffset and introduced utf16Offset getters and initializers for String.Index. However, UTF-8, String's native encoding, isn't included. This makes it more difficult for low-level string processing working at the UTF-8 level to accept and return String.Index values in their APIs.
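
    As a rough illustration of the shape this could take, the sketch below layers UTF-8 counterparts of the SE-0241 API on top of the existing UTF8View index arithmetic. The names `init(utf8Offset:in:)` and `utf8Offset(in:)` are assumptions modelled on the UTF-16 versions, not an existing standard library API, and a real implementation would presumably read the offset straight out of the index rather than measuring a distance.

    ```swift
    extension String.Index {
      // Hypothetical: mirrors the existing `init(utf16Offset:in:)`.
      init(utf8Offset: Int, in string: String) {
        self = string.utf8.index(string.utf8.startIndex, offsetBy: utf8Offset)
      }

      // Hypothetical: mirrors the existing `utf16Offset(in:)`.
      func utf8Offset(in string: String) -> Int {
        string.utf8.distance(from: string.utf8.startIndex, to: self)
      }
    }

    let s = "caf\u{E9}!"                 // U+00E9 is 2 UTF-8 code units, 1 UTF-16 code unit
    let bang = s.firstIndex(of: "!")!
    print(bang.utf16Offset(in: s))       // 4 (existing API)
    print(bang.utf8Offset(in: s))        // 5 (sketch above)
    print(s[String.Index(utf8Offset: 5, in: s)])  // "!"
    ```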

  2. Add known normalization flags.

    String often knows quite a bit of useful information regarding its contents, which it does not expose to developers. For instance, it will often check if it is ASCII, or if its scalars are normalized in Normalization Form C (NFC). This can be really important for consumers of the String's UTF-8 bytes: ASCII strings are essentially random-access at the code-unit level, and both ASCII and NFC code units can be trivially compared with memcmp.

    I would like to expose these flags to developers, as a set of "known normalization" flags. When String already knows this data (which is almost always), it can return it immediately; otherwise it will be calculated (IIUC, this basically only applies to strings bridged from Objective-C, and even then they can almost always say whether they're ASCII).

    In the future, this could be expanded with utilities to compute the normalization flags, or to verify an expected set of flags, for generic collections of bytes - i.e. checking if it's safe to claim that some bytes are valid UTF8, or that they are ASCII or NFC.

    extension String {
      public struct NormalizationKinds: OptionSet {
        public static var isNFC: Self { get }
        public static var isASCII: Self { get }
      }
    
      public var knownNormalization: NormalizationKinds { get }
    }
    
    // To use:
    processString(
     someString,
     isASCII: someString.knownNormalization.contains(.isASCII)
    )
    

    Note that isASCII implies isNFC.

    Open question: should these be available on Substring? For non-ASCII/non-NFC strings, the contents in the Substring may still be ASCII/NFC, so it would almost always need to be calculated.
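
    To make the intended semantics a bit more concrete, here is a minimal user-space sketch. Everything in it is an assumption for illustration: `NormalizationKinds` is a stand-alone type, `knownNormalizationSketch` and `bytewiseOrFullEqual` are made-up names, and because the standard library's internal flags are not reachable from outside, the sketch scans the bytes for ASCII and otherwise conservatively reports nothing as known.

    ```swift
    // Hypothetical stand-in for the pitched String.NormalizationKinds.
    struct NormalizationKinds: OptionSet {
      let rawValue: UInt8
      static let isNFC   = NormalizationKinds(rawValue: 1 << 0)
      static let isASCII = NormalizationKinds(rawValue: 1 << 1)
    }

    extension String {
      // Conservative approximation: an O(n) ASCII scan instead of stored flags.
      // ASCII implies NFC; for anything else we claim no knowledge.
      var knownNormalizationSketch: NormalizationKinds {
        utf8.allSatisfy { $0 < 0x80 } ? [.isASCII, .isNFC] : []
      }
    }

    // A consumer can compare code units directly when both sides are known NFC,
    // and fall back to String's canonical-equivalence comparison otherwise.
    func bytewiseOrFullEqual(_ a: String, _ b: String) -> Bool {
      if a.knownNormalizationSketch.contains(.isNFC),
         b.knownNormalizationSketch.contains(.isNFC) {
        return a.utf8.elementsEqual(b.utf8)
      }
      return a == b
    }

    print("hello".knownNormalizationSketch.contains(.isASCII))  // true
    print(bytewiseOrFullEqual("caf\u{E9}", "cafe\u{301}"))      // true, via the fallback
    ```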

  3. Add withUTF8 to StringProtocol. It was apparently left out to save witness table entries, but it's quite important for low-level text processing. The "proper" alternative is to use the UTF8View and write your algorithms to use withContiguousStorageIfAvailable, but many users instead just downcast their StringProtocol to either String or Substring, trapping if the type isn't one of those two.
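
    For reference, the "proper" alternative can be sketched roughly as follows. The helper name `withUTF8Bytes` is made up for illustration; the copy into an Array is just one possible fallback for conformers that cannot vend contiguous storage.

    ```swift
    // Generic over any StringProtocol conformer, with no downcasting to String/Substring.
    func withUTF8Bytes<S: StringProtocol, R>(
      _ string: S,
      _ body: (UnsafeBufferPointer<UInt8>) throws -> R
    ) rethrows -> R {
      // Fast path: the UTF-8 view exposes its storage directly.
      if let result = try string.utf8.withContiguousStorageIfAvailable(body) {
        return result
      }
      // Slow path: copy the code units into a temporary contiguous buffer.
      return try Array(string.utf8).withUnsafeBufferPointer(body)
    }

    // Works for both String and Substring arguments.
    let spaces = withUTF8Bytes("a b c".dropFirst()) { bytes in
      bytes.filter { $0 == 0x20 }.count
    }
    print(spaces)  // 2
    ```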


Having .isASCII would be very useful for string parsing and converting to other character sets, as done by swift-corelibs-foundation.

Another useful function to add would be the ability to append an ASCII byte to a string, e.g.

func append(ascii: UInt8)

as an alternative to using

.append(Character(Unicode.Scalar(value)))

assuming the implementation would be faster.
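
As a sketch of the API shape only, the method could be written today as a thin wrapper over the existing append. The precondition on non-ASCII input is an assumption about the desired behaviour, and this version has no speed advantage; a standard library implementation would presumably write the byte into String's storage directly.

```swift
extension String {
  // Hypothetical API: appends a single ASCII code unit.
  mutating func append(ascii byte: UInt8) {
    precondition(byte < 0x80, "append(ascii:) requires a byte in 0x00...0x7F")
    append(Character(Unicode.Scalar(byte)))
  }
}

var line = "key"
line.append(ascii: 0x3D)  // "="
line.append(ascii: 0x31)  // "1"
print(line)               // key=1
```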
