Pitch: String Gaps and Missing APIs

Michael_Ilseman · February 27, 2019, 11:08pm

edit: First revision removes isLeadByte and isContinuationByte, as explained below. Second revision removes attributes from "Detailed Description", as the information is already covered in "Effect on API resilience", and cleans up the appearance.

String Gaps and Missing APIs

Proposal: SE-NNNN
Authors: Michael Ilseman
Review Manager: TBD
Status: Awaiting review
Implementation: apple/swift#22869
Bugs: SR-9955

Introduction

String and related types are missing trivial and obvious functionality, much of which currently exists internally but has not been made API. We propose adding 9 new methods/properties and 3 new code unit views.

Swift-evolution thread: TBD

Motivation

These additions address commonly encountered gaps in String and its related types. Today those gaps often lead developers to reinvent the same trivial definitions.

Proposed solution

We propose:

6 simple APIs on Unicode’s various encodings
2 generic initializers for string indices and ranges of indices
Substring.base, equivalent to Slice.base
Make Character.UTF8View and Character.UTF16View public
Add Unicode.Scalar.UTF8View

Detailed design

1. Unicode obvious/trivial additions

This functionality already exists internally as helpers and, simple as it is, is generally useful for anyone working with Unicode.


extension Unicode.ASCII {
  /// Returns whether the given code unit represents an ASCII scalar
  public static func isASCII(_ x: CodeUnit) -> Bool
}

extension Unicode.UTF8 {
  /// Returns the number of code units required to encode the given Unicode
  /// scalar.
  ///
  /// Because a Unicode scalar value can require up to 21 bits to store its
  /// value, some Unicode scalars are represented in UTF-8 by a sequence of up
  /// to 4 code units. The first code unit is designated a *lead* byte and the
  /// rest are *continuation* bytes.
  ///
  ///     let anA: Unicode.Scalar = "A"
  ///     print(anA.value)
  ///     // Prints "65"
  ///     print(UTF8.width(anA))
  ///     // Prints "1"
  ///
  ///     let anApple: Unicode.Scalar = "🍎"
  ///     print(anApple.value)
  ///     // Prints "127822"
  ///     print(UTF8.width(anApple))
  ///     // Prints "4"
  ///
  /// - Parameter x: A Unicode scalar value.
  /// - Returns: The width of `x` when encoded in UTF-8, from `1` to `4`.
  public static func width(_ x: Unicode.Scalar) -> Int

  /// Returns whether the given code unit represents an ASCII scalar
  public static func isASCII(_ x: CodeUnit) -> Bool
}

extension Unicode.UTF16 {
  /// Returns a Boolean value indicating whether the specified code unit is a
  /// high or low surrogate code unit.
  public static func isSurrogate(_ x: CodeUnit) -> Bool

  /// Returns whether the given code unit represents an ASCII scalar
  public static func isASCII(_ x: CodeUnit) -> Bool
}

extension Unicode.UTF32 {
  /// Returns whether the given code unit represents an ASCII scalar
  public static func isASCII(_ x: CodeUnit) -> Bool
}
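
As a rough illustration of what these helpers compute, here is a minimal sketch written as free functions. The names `sketchUTF8Width` and `sketchIsASCII` are hypothetical stand-ins, not the proposed API, and the width is cross-checked against the UTF-8 view of a one-scalar string rather than asserted from memory.

```swift
// Hypothetical stand-ins for the proposed UTF8.width(_:) and isASCII(_:).
// These names are illustrative only and are not standard library API.
func sketchUTF8Width(_ scalar: Unicode.Scalar) -> Int {
  switch scalar.value {
  case 0x00..<0x80: return 1       // fits in 7 bits: a single byte
  case 0x80..<0x800: return 2      // up to 11 bits
  case 0x800..<0x1_0000: return 3  // up to 16 bits
  default: return 4                // up to 21 bits
  }
}

// For the standard library's Unicode encodings, a code unit is ASCII
// exactly when its value is at most 0x7F.
func sketchIsASCII<CodeUnit: UnsignedInteger>(_ x: CodeUnit) -> Bool {
  return x <= 0x7F
}

let anA: Unicode.Scalar = "A"
let anApple: Unicode.Scalar = "\u{1F34E}" // 🍎

// Cross-check the widths against an actual UTF-8 encoding.
print(sketchUTF8Width(anA) == String(anA).utf8.count)
print(sketchUTF8Width(anApple) == String(anApple).utf8.count)

// 0x41 is "A"; 0xD83C is a UTF-16 lead surrogate and therefore not ASCII.
print(sketchIsASCII(UInt8(0x41)), sketchIsASCII(UInt16(0xD83C)))
```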

2. Generic initializers for String.Index and Range

Concrete versions of these initializers exist for String, but versions generic over StringProtocol are missing.

extension String.Index {
  /// Creates an index in the given string that corresponds exactly to the
  /// specified position.
  ///
  /// If the index passed as `sourcePosition` represents the start of an
  /// extended grapheme cluster---the element type of a string---then the
  /// initializer succeeds.
  ///
  /// The following example converts the position of the Unicode scalar `"e"`
  /// into its corresponding position in the string. The character at that
  /// position is the composed `"é"` character.
  ///
  ///     let cafe = "Cafe\u{0301}"
  ///     print(cafe)
  ///     // Prints "Café"
  ///
  ///     let scalarsIndex = cafe.unicodeScalars.firstIndex(of: "e")!
  ///     let stringIndex = String.Index(scalarsIndex, within: cafe)!
  ///
  ///     print(cafe[...stringIndex])
  ///     // Prints "Café"
  ///
  /// If the index passed as `sourcePosition` doesn't have an exact
  /// corresponding position in `target`, the result of the initializer is
  /// `nil`. For example, an attempt to convert the position of the combining
  /// acute accent (`"\u{0301}"`) fails. Combining Unicode scalars do not have
  /// their own position in a string.
  ///
  ///     let nextScalarsIndex = cafe.unicodeScalars.index(after: scalarsIndex)
  ///     let nextStringIndex = String.Index(nextScalarsIndex, within: cafe)
  ///
  ///     print(nextStringIndex)
  ///     // Prints "nil"
  ///
  /// - Parameters:
  ///   - sourcePosition: A position in a view of the `target` parameter.
  ///     `sourcePosition` must be a valid index of at least one of the views
  ///     of `target`.
  ///   - target: The string referenced by the resulting index.
  public init?<S: StringProtocol>(
    _ sourcePosition: String.Index, within target: S
  )
}

extension Range where Bound == String.Index {
    public init?<S: StringProtocol>(_ range: NSRange, in string: __shared S)
}
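
The practical gain is that the same conversion works on a Substring. The following sketch assumes a toolchain in which the generic `String.Index` initializer proposed here is available; with only the concrete `String` overload, the two `within: word` calls would not compile.

```swift
let cafe = "Cafe\u{0301} au lait"
let word: Substring = cafe.prefix(4)   // "Café", still a Substring

// Positions in the substring's Unicode scalar view.
let eIndex = word.unicodeScalars.firstIndex(of: "e")!
let accentIndex = word.unicodeScalars.index(after: eIndex)

// Assumes the proposed generic initializer: `word` is a Substring, not a String.
let onBoundary = String.Index(eIndex, within: word)
let insideCluster = String.Index(accentIndex, within: word)

// "e" starts the grapheme cluster "é", so the conversion succeeds.
print(onBoundary != nil)
// The combining accent has no position of its own in the character view.
print(insideCluster == nil)
```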

3. Substring provides access to its base

Slice, the default SubSequence type, provides base for accessing the original Collection. Substring, String’s SubSequence, should as well.

extension Substring {
  /// Returns the underlying string from which this Substring was derived.
  public var base: String { get }
}
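
A short usage sketch, assuming the proposed `base` property is available: it recovers the whole original string from a slice, much as `Slice.base` does for other collections.

```swift
let sentence = "Hello, world"
let tail = sentence.dropFirst(7)   // Substring "world"

// `base` is the property proposed above (assumed available here).
let original: String = tail.base
print(original == sentence)

// A substring shares indices with its base, so its bounds can be used
// to slice the original string directly.
print(sentence[tail.startIndex...] == tail)
```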

4. Add in missing views on Character

Character’s UTF8View and UTF16View have existed internally, but we should make them public.


extension Character {
  /// A view of a character's contents as a collection of UTF-8 code units. See
  /// String.UTF8View for more information
  public typealias UTF8View = String.UTF8View

  /// A UTF-8 encoding of `self`.
  public var utf8: UTF8View { get }

  /// A view of a character's contents as a collection of UTF-16 code units. See
  /// String.UTF16View for more information
  public typealias UTF16View = String.UTF16View

  /// A UTF-16 encoding of `self`.
  public var utf16: UTF16View { get }
}
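
For comparison, the sketch below shows the spelling that works without this proposal (going through a temporary String) next to the proposed one. It assumes `Character.utf8` and `Character.utf16` are public as proposed.

```swift
let eAcute: Character = "\u{E9}"   // precomposed "é"

// Without the proposal: convert to a String first.
let viaString = Array(String(eAcute).utf8)

// Proposed spelling (assumes the views on Character are public).
let direct = Array(eAcute.utf8)

// Both spellings yield the same code units.
print(direct == viaString)
print(Array(eAcute.utf16) == Array(String(eAcute).utf16))
```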

5. Add in a RandomAccessCollection UTF8View on Unicode.Scalar

Unicode.Scalar has a UTF16View, which is a RandomAccessCollection, but no UTF8View.

extension Unicode.Scalar {
  public struct UTF8View {
    internal init(value: Unicode.Scalar)
    internal var value: Unicode.Scalar
  }

  public var utf8: UTF8View { get }
}

extension Unicode.Scalar.UTF8View : RandomAccessCollection {
  public typealias Indices = Range<Int>

  /// The position of the first code unit.
  public var startIndex: Int { get }

  /// The "past the end" position---that is, the position one
  /// greater than the last valid subscript argument.
  ///
  /// If the collection is empty, `endIndex` is equal to `startIndex`.
  public var endIndex: Int { get }

  /// Accesses the code unit at the specified position.
  ///
  /// - Parameter position: The position of the element to access. `position`
  ///   must be a valid index of the collection that is not equal to the
  ///   `endIndex` property.
  public subscript(position: Int) -> UTF8.CodeUnit
}
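
To make the shape of such a view concrete, here is a minimal sketch of a random-access UTF-8 view over a single scalar. `ScalarUTF8Sketch` is a hypothetical type written for illustration; it is not the proposed implementation, and it computes each byte on demand from the scalar value.

```swift
// Hypothetical illustration, not the standard library's implementation.
struct ScalarUTF8Sketch: RandomAccessCollection {
  let scalar: Unicode.Scalar

  var startIndex: Int { return 0 }

  // The number of UTF-8 code units, from 1 to 4.
  var endIndex: Int {
    switch scalar.value {
    case 0x00..<0x80: return 1
    case 0x80..<0x800: return 2
    case 0x800..<0x1_0000: return 3
    default: return 4
    }
  }

  subscript(position: Int) -> UInt8 {
    precondition(position >= startIndex && position < endIndex,
                 "index out of range")
    let value = scalar.value
    let count = endIndex
    if count == 1 { return UInt8(value) }

    // Each continuation byte carries 6 payload bits; the lead byte carries
    // whatever remains above them.
    let shift = UInt32(6 * (count - 1 - position))
    if position == 0 {
      let prefix: UInt8 = count == 2 ? 0xC0 : (count == 3 ? 0xE0 : 0xF0)
      return prefix | UInt8(value >> shift)
    }
    return 0x80 | UInt8((value >> shift) & 0x3F)
  }
}

let apple: Unicode.Scalar = "\u{1F34E}" // 🍎
let sketchBytes = Array(ScalarUTF8Sketch(scalar: apple))

// Cross-check against String's own UTF-8 view.
print(sketchBytes == Array(String(apple).utf8))
```

Because the index type is `Int`, the remaining RandomAccessCollection requirements come from default implementations.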

Source compatibility

All changes are additive.

Effect on ABI stability

All changes are additive. ABI-relevant details are covered in “Effect on API resilience”.

Effect on API resilience

Unicode encoding additions and Substring.base are trivial and can never change in definition, so their implementations are exposed.
String.Index initializers are resilient and versioned.
Character’s views already exist as inlinable in 5.0; we just replace internal with public.
Unicode.Scalar.UTF8View's implementation is fully exposed (for performance), but is versioned.

Alternatives considered

Do Nothing

Various flavors of “do nothing” include stating that a given API is not useful or waiting for a rethink of some core concept. Each of these API gaps comes up frequently on the forums, in bug reports, and in developer usage in the wild. Rethinks are unlikely to happen anytime soon. We believe these gaps should be closed immediately.

Do More

This proposal is meant to fill holes and provide some simple additions, keeping the scope narrow for Swift 5.1. We could certainly do more in all of these areas, but that would require more design iteration and could depend on other missing functionality.

allevato · February 27, 2019, 11:29pm

These all look like welcome additions to me!

I don't want to risk distracting from these proposed changes to ask a "what about this other thing?" question (let me know if I should start a separate thread on this!), but do you have any thoughts on APIs to create Strings from external byte buffers without a copy (e.g., by taking ownership of an existing byte buffer, or borrowing an unowned pointer, though the latter would produce a pretty unsafe value)?

Michael_Ilseman · February 27, 2019, 11:34pm

Yes! That's the shared string concept which we were able to fit into the 5.0 ABI but isn't currently exposed. I'm very interested in exposing that, definitely as a separate pitch. The 5.1 schedule is pretty tight and that's a new concept that didn't previously exist (i.e. there might be some debate/churn).

There are other pressing changes I want to get into 5.1, and I don't think I'd have time to primarily drive or champion it. Are you interested in helping to drive this?

Nevin · February 27, 2019, 11:35pm

I am curious why the boolean tests (isASCII, etc.) are defined as static functions, rather than instance properties.

I would expect to be able to write if x.isASCII { … }, rather than jumping through hoops with static functions.

Michael_Ilseman · February 27, 2019, 11:45pm

The issue is that many of these queries are specific to the encoding. Even isASCII can be specific to an encoding; it just so happens that all of the stdlib's encodings are idempotent for scalars <= 0x7F. But you could imagine an InvertedBitsUTF32, or EBCDIC, or some such encoding in which ASCII-ness requires some interpretation.

We could also add isASCII to all (unsigned?) integers under the interpretation that we're talking about the integer as a Unicode scalar value. While I find that appealing, it's unprecedented to stick Stringy/Unicody queries into integer namespaces and completion lists.
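
A small sketch of why the query hangs off the encoding rather than the integer. Both types below are hypothetical and exist only for illustration; `InvertedBitsUTF32` is assumed to store the bitwise complement of each scalar value.

```swift
// Hypothetical encodings for illustration; neither is in the standard library.
enum PlainUTF32 {
  static func isASCII(_ x: UInt32) -> Bool { return x <= 0x7F }
}

enum InvertedBitsUTF32 {
  // Assumption: every code unit is the bitwise complement of the scalar value.
  static func encode(_ scalar: Unicode.Scalar) -> UInt32 {
    return ~scalar.value
  }
  // ASCII-ness requires undoing the encoding before comparing.
  static func isASCII(_ x: UInt32) -> Bool { return ~x <= 0x7F }
}

let letterA: Unicode.Scalar = "A"
let invertedA = InvertedBitsUTF32.encode(letterA)

// The same integer gets a different answer depending on the encoding asked.
print(PlainUTF32.isASCII(letterA.value))
print(PlainUTF32.isASCII(invertedA))
print(InvertedBitsUTF32.isASCII(invertedA))
```

An `isASCII` property on `UInt32` itself could only give one of these answers.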

Avi · February 28, 2019, 2:58am

Nitpick: shouldn't that be UTF8.width(anApple)?

Nevin · February 28, 2019, 3:33am

Redacted

Right, you’d have, eg:

extension Unicode.UTF8.CodeUnit {
  var isASCII: Bool { … }
}

Instead of:

extension Unicode.UTF8 {
  static func isASCII(_ x: CodeUnit) -> Bool { … }
}

So at the point of use, with x an instance of Unicode.UTF8.CodeUnit, it would be:

if x.isASCII { … }

Instead of:

if Unicode.UTF8.isASCII(x) { … }

Edit:

And now, having written all that out, I see that UTF8.CodeUnit is a typealias for UInt8. So never mind, I understand now. We don’t want UInt8 to have an isASCII method.

woolsweater · February 28, 2019, 4:15am

Michael_Ilseman:

///     let apple = "🍎"
///     for unit in apple.utf8 {
///         print(UTF8.isLeadByte(unit))
///     }
///     // Prints "false"
///     // Prints "true"
///     // Prints "true"
///     // Prints "true"

Typo on the third line of this example; should be UTF8.isContinuationByte(unit)

bobergj · February 28, 2019, 4:16am

Wouldn't a caller of isLeadByte usually be interested in how many continuation bytes it indicates?

Alternative to isLeadByte, isContinuationByte functions:

extension Unicode.UTF8 {
   public enum ByteType {
       case leading(continuationByteCount: Int)
       case continuation
   }
   
   public static func byteType(_ x: CodeUnit) -> ByteType { ... }
}

let apple = "🍎"
for unit in apple.utf8 {
    print(UTF8.byteType(unit))
}

// Prints "leading(continuationByteCount: 3)"
// Prints "continuation"
// Prints "continuation"
// Prints "continuation"

The ByteType enum could also have isLeading and isContinuation properties for convenience.
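
One way the elided body could be filled in is sketched below as a free function. The names `SketchByteType` and `sketchByteType` are hypothetical. Treating ASCII as a lead byte with zero continuation bytes and returning `nil` for bytes that never occur in valid UTF-8 are assumptions of this sketch, not part of the suggestion above.

```swift
// Hypothetical sketch; not proposed or existing standard library API.
enum SketchByteType: Equatable {
  case leading(continuationByteCount: Int)
  case continuation
}

// Returns nil for bytes that cannot appear in well-formed UTF-8
// (0xC0, 0xC1, and 0xF5...0xFF). That choice is an assumption of this sketch.
func sketchByteType(_ x: UInt8) -> SketchByteType? {
  switch x {
  case 0x00...0x7F: return .leading(continuationByteCount: 0)  // ASCII
  case 0x80...0xBF: return .continuation                       // 0b10xxxxxx
  case 0xC2...0xDF: return .leading(continuationByteCount: 1)
  case 0xE0...0xEF: return .leading(continuationByteCount: 2)
  case 0xF0...0xF4: return .leading(continuationByteCount: 3)
  default: return nil
  }
}

let appleString = "\u{1F34E}" // 🍎
let byteTypes = appleString.utf8.compactMap(sketchByteType)
for type in byteTypes {
  print(type)
}
```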

duan · February 28, 2019, 4:25am

These look great!

Would be great if we can get some guarantee for UTF8View random access being O(1). But I understand NSString is a thing. Alas.

Michael_Ilseman · February 28, 2019, 6:08pm

Whoops, good catch

Definitely, and this is something we use internally in the implementation of String. An enum is one approach, or we could have an overload for width for UTF16 and UTF8 that takes a leading code unit and tells you how long the rest of the scalar is.

Yeah, but this might get better over time and my next pitch is "Contiguous Strings" which can help give you some more assurances.

bobergj · March 1, 2019, 1:13am

Ok. One doubt I have is whether it is a good idea at all to expose (and thus encourage) one-scalar-at-a-time UTF-8 decoding functions. Is the standard library implementation likely to move to SIMD-accelerated UTF-8 validation and decoding now that the Swift SIMD API is being standardised?

benrimmington · March 1, 2019, 5:14pm

@Michael_Ilseman Unicode.UTF8.isContinuationByte(_:) already exists with a different name (and without @_alwaysEmitIntoClient).

github.com/apple/swift/blob/003850fb0d886493305f7f7a0b23dbadc2bbfd07/stdlib/public/core/Unicode.swift#L280-L300

/// Returns a Boolean value indicating whether the specified code unit is a
/// UTF-8 continuation byte.
///
/// Continuation bytes take the form `0b10xxxxxx`. For example, a lowercase
/// "e" with an acute accent above it (`"é"`) uses 2 bytes for its UTF-8
/// representation: `0b11000011` (195) and `0b10101001` (169). The second
/// byte is a continuation byte.
///
///     let eAcute = "é"
///     for codeUnit in eAcute.utf8 {
///         print(codeUnit, UTF8.isContinuation(codeUnit))
///     }
///     // Prints "195 false"
///     // Prints "169 true"
///
/// - Parameter byte: A UTF-8 code unit.
/// - Returns: `true` if `byte` is a continuation byte; otherwise, `false`.
@inlinable
public static func isContinuation(_ byte: CodeUnit) -> Bool {
  return byte & 0b11_00__0000 == 0b10_00__0000
}

Michael_Ilseman · March 1, 2019, 11:55pm

We could do something like this, but I don't think it's worth exposing at this point:

extension UTF16 {
  enum CodeUnitClassification {
    case scalar(Unicode.Scalar)
    case leadingSurrogate(payload: UInt16)
    case trailingSurrogate(payload: UInt16)
  }
}
extension UTF8 {
  enum CodeUnitClassification {
    case ascii(Unicode.Scalar)
    case leadingByte(payload: UInt8, width: Int)
    case continuationByte(payload: UInt8)
    case invalid
  }
}

But, this isn't really how one would want to use the result for decoding or analysis, though everything else could be built on top of it (assuming it all gets optimized to something reasonable).

We don't have enough from SIMD yet. We want to pack a 4-element 4-bit lookup table into a 16 byte register for scalar width, but IIUC we don't have access to that. We'd also want to figure out the aligned load model and what the behavior is for dangling (but unread) bytes. @scanon knows more details.

Michael_Ilseman · March 1, 2019, 11:57pm

Hah, I should have scrolled down further. I think we should still add isASCII. Whether a byte is the leading byte of a multi-code-unit sequence can be determined with !isASCII && !isContinuation. I don't know if we'd still want an isLeadByte and, if so, what we should name it. isLead makes it feel like it's made out of Pb; isLeading is much better but at odds with UTF16.isLeadSurrogate; and as @bobergj pointed out, it might not be worth adding at this point.
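
That combination can be sketched as follows. `sketchIsMultiByteLead` is a hypothetical helper, and the ASCII test is spelled as a plain comparison so that the sketch relies only on the existing `UTF8.isContinuation(_:)`.

```swift
// Hypothetical helper illustrating !isASCII && !isContinuation.
// Note: this is a classification of well-formed input only; bytes that never
// occur in valid UTF-8 (such as 0xFF) would also be reported as lead bytes.
func sketchIsMultiByteLead(_ byte: UInt8) -> Bool {
  let isASCII = byte <= 0x7F  // stand-in for the proposed UTF8.isASCII(_:)
  return !isASCII && !UTF8.isContinuation(byte)
}

// "A" is one byte (0x41); "é" (U+00E9) is two bytes (0xC3, 0xA9).
let leadFlags = "A\u{E9}".utf8.map(sketchIsMultiByteLead)
print(leadFlags)
```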

dhoepfl · March 6, 2019, 8:10am

First of all: Unicode.ASCII is currently undocumented. It should be documented and it should be documented to mean ISO/IEC 646:1991. Otherwise, ASCII could mean ASA X3.4-1965 where \u{00AC} was part of ASCII (at Code 124). This would break isASCII’s documented and real behaviour of being a short form of (x <= CodeUnit(127)).

That being said: I really like @Michael_Ilseman’s enums but would rename ascii to scalar (better: scalar in both UTF8/UTF16 to selfcontained/singlecodeunit/whatever). Because that’s what it is about: Is the given code unit selfcontained or not.

My thought behind it: Why bless ASCII over e.g. ISO 8859? Unicode’s first 128 code points are ASCII. That makes it somewhat special. But maybe there is a way to support all encodings equally? What about having func availableIn(_ encoding: String.Encoding) -> Bool on Unicode.Scalar? Maybe even better: Add func encodesLosslessly(_ scalar: Unicode.Scalar) -> Bool to String.Encoding.

Michael_Ilseman · March 6, 2019, 11:42pm

Hmm, could you help me understand this situation? I'm not familiar with all of the history here.

The ASCII that is supported in Unicode.ASCII is the ASCII which is a subset of Unicode. From the Unicode standard:

Unicode follows ISO/IEC 8859-1 in the layout of Latin letters up to U+00FF. ISO/IEC 8859-1, in turn, is based on older standards—among others, ASCII (ANSI X3.4), which is identical to ISO/IEC 646:1991-IRV.

This reads to me that specifying any particular standard would be redundant with saying Unicode.

Those enums are deferred as future work (and yeah, I just chose the name ascii quickly as a somewhat more specific variant of scalar).

Looking at code in the wild, I see checking ASCII to be very frequent and checking ISO 8859 to be relatively rare. ASCII is also special in that it is trivially-encoded in all Unicode encodings provided by the standard library, is normalization-invariant, etc.

I do like your idea of availableIn on Unicode.Scalar, but what does "available" mean exactly? Do you mean that an encoding which is a subset of Unicode (such as ASCII) would make the check, and that all other encodings always return true? Or is the idea that the scalar is representable by a single code unit? And if so, would it be further confined to being trivially-encoded?

For encodesLosslessly what is lost?

bzamayo · March 8, 2019, 4:58pm

Dropping by to +1 the view additions to Character, whose absence I have noticed before.

The static functions being added to the Unicode codecs make sense to me too.

dhoepfl · March 12, 2019, 11:49am

Currently, Unicode.ASCII is undocumented. (Aside from being in the namespace of Unicode)

Saying it follows the “Unicode standard” is not enough either: If Swift supports Unicode 11, "\u{1e91f}\u{1e94b}".count will return 2 but starting with Unicode 12, it will be 1. (I may be wrong here but you get the idea: The Unicode version supported might change the result of an API call. Thus the supported version of the standard needs to be mentioned (fixed) in the documentation. Changing the supported version of the standard is a potentially API breaking change.)

After having tried to explain it, I must admit that those are very good questions.

What I wanted to have are functions that answer the following two questions:

Can I use the scalar in the given target encoding: available(in:)
Can I convert a string containing the given scalar into the target encoding and back, getting back the input: encodesLosslessly

Examples:

let tm = Unicode.Scalar(0x2122)

tm.available(in: String.Encoding.macOSRoman) // false
String.Encoding.macOSRoman.encodesLosslessly(tm) // true

let oe = Unicode.Scalar(0x00f6)
oe.available(in: String.Encoding.isoLatin1) // true
oe.available(in: String.Encoding.ascii) // false
oe.available(in: String.Encoding.nonLossyASCII) // false

String.Encoding.isoLatin1.encodesLosslessly(oe) // true
String.Encoding.ascii.encodesLosslessly(oe) // false
String.Encoding.nonLossyASCII.encodesLosslessly(oe) // true

All of this works on Scalar, not on CodeUnits. To see if a CodeUnit is a Scalar, the CodeUnitClassification enums would help.

Michael_Ilseman · March 13, 2019, 12:30am

You should not bake in knowledge of grapheme breaking statically in your code, as it is a run-time concept.

This is actually why the standard library must not state the version of Unicode supported in documentation. The version of the standard library in the SDK that you build with and read the documentation for is not the same as the version that you will link with at run time. The version of Unicode supported is a run time concept.

There has been some desire for something like Unicode.version or similar as a static variable, so that you can guard against it at run time if necessary.

These read the same to me. Take the example of .nonLossyASCII or punycode, either of which is capable of encoding and decoding all of Unicode. What does available(in:) then mean, if we can encode and decode to and from Unicode scalar values? Do you mean that the value is trivially encoded, i.e. its integer value corresponds directly to (a truncation of) the Unicode scalar value?

A simpler example: what is the result of available(in:) for UTF-16 on a non-BMP scalar?
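
To make that question concrete, the sketch below uses existing `UTF16` helpers to show that a non-BMP scalar is encodable in UTF-16, but only as a surrogate pair rather than as a single code unit equal to its value.

```swift
let nonBMP: Unicode.Scalar = "\u{1F34E}" // 🍎, outside the Basic Multilingual Plane

// Number of UTF-16 code units needed for this scalar.
print(UTF16.width(nonBMP))

// The two code units are a lead surrogate followed by a trail surrogate;
// neither is the scalar value itself.
let units = Array(String(nonBMP).utf16)
print(UTF16.isLeadSurrogate(units[0]), UTF16.isTrailSurrogate(units[1]))
```

Whether such a scalar counts as "available" therefore depends on whether the query means representable at all or representable in one code unit.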