Conform `Unicode.Scalar` to `Strideable`

Introduction

This proposal extends Unicode.Scalar so that it conforms to Strideable.

Motivation

In the standard library, only fixed-width integers (and unsafe pointers) can be used in "countable" ranges.

let codePoints: ClosedRange<UInt32> = 0 ... 0x10FFFF
let unicodeScalars: ClosedRange<Unicode.Scalar> = "\0" ... "\u{10FFFF}"

codePoints.count      //-> 1_114_112
unicodeScalars.count  //-> ERROR:
// referencing property 'count' on 'ClosedRange' requires
// that 'Unicode.Scalar' conform to 'Strideable'; and also
// that 'Unicode.Scalar.Stride' conform to 'SignedInteger'

There are generic CountableRange and CountableClosedRange type aliases with the same constraints.

Proposed solution

If the standard library extends Unicode.Scalar so that it conforms to Strideable, then ranges of Unicode scalar values will have access to sequence and collection APIs.

let codePoints: ClosedRange<UInt32> = 0 ... 0x10FFFF
let unicodeScalars: ClosedRange<Unicode.Scalar> = "\0" ... "\u{10FFFF}"

codePoints.count      //-> 1_114_112
unicodeScalars.count  //-> 1_112_064

Notice that there are 2048 fewer Unicode scalar values, because the Surrogates Area is excluded.
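The expected count can be cross-checked with today's standard library, without the proposed conformance. This is only a sanity check of the arithmetic; the names `surrogates`, `scalarCount`, and `validatedCount` are illustrative and not part of the proposal.

```swift
// All code points, and the Surrogates Area (U+D800 ... U+DFFF).
let codePoints: ClosedRange<UInt32> = 0 ... 0x10FFFF
let surrogates: ClosedRange<UInt32> = 0xD800 ... 0xDFFF

// Unicode scalar values are the code points that are not surrogates.
let scalarCount = codePoints.count - surrogates.count

// Independent check: the failable Unicode.Scalar(_: UInt32) initializer
// returns nil exactly for the surrogate code points in this range.
let validatedCount = codePoints.lazy.filter { Unicode.Scalar($0) != nil }.count

print(surrogates.count)  // 2048
print(scalarCount)       // 1112064
print(validatedCount)    // 1112064
```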

Detailed design

The standard library will implement the following Strideable requirements (with suitable @available attributes).

extension Unicode.Scalar: Strideable {

  public typealias Stride = Int

  public func distance(to other: Self) -> Stride

  public func advanced(by n: Stride) -> Self

  public static func _step(
    after current: (index: Int?, value: Self),
    from start: Self,
    by distance: Stride
  ) -> (index: Int?, value: Self)
}

The distance between "\u{D7FF}" and "\u{E000}" will be ±1, because the Surrogates Area is excluded.
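One way to get that behavior is to map each scalar to its offset among valid scalar values, closing the surrogate gap. The following is a minimal sketch under that assumption, not the proposed implementation. It uses free functions with hypothetical names (`scalarOffset`, `scalarDistance`, `scalarAdvanced`) rather than the real `Strideable` requirements, and it returns `nil` on out-of-range results where a real conformance would presumably trap.

```swift
// Hypothetical helpers; none of these names appear in the proposal.
let surrogateStart = 0xD800   // first surrogate code point
let surrogateCount = 0x0800   // U+D800 ... U+DFFF, 2048 code points
let maxOffset = 0x10FFFF - surrogateCount

// Offset of a scalar among valid scalar values (surrogate gap removed).
func scalarOffset(_ scalar: Unicode.Scalar) -> Int {
  let value = Int(scalar.value)
  return value < surrogateStart ? value : value - surrogateCount
}

func scalarDistance(from start: Unicode.Scalar, to end: Unicode.Scalar) -> Int {
  scalarOffset(end) - scalarOffset(start)
}

func scalarAdvanced(_ scalar: Unicode.Scalar, by n: Int) -> Unicode.Scalar? {
  let (offset, overflow) = scalarOffset(scalar).addingReportingOverflow(n)
  guard !overflow, offset >= 0, offset <= maxOffset else { return nil }
  // Reopen the gap when converting back to a code point.
  let value = offset < surrogateStart ? offset : offset + surrogateCount
  return Unicode.Scalar(UInt32(value))
}

let expected: Unicode.Scalar = "\u{E000}"
print(scalarDistance(from: "\u{D7FF}", to: "\u{E000}"))  // 1
print(scalarDistance(from: "\u{E000}", to: "\u{D7FF}"))  // -1
print(scalarAdvanced("\u{D7FF}", by: 1) == expected)     // true
```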

The hidden _step(after:from:by:) requirement is used by the stride(from:to:by:) and stride(from:through:by:) APIs.

Source compatibility

Existing clients might have incompatible retroactive conformances.

ABI compatibility

This proposal is purely an extension of the ABI of the standard library.

Implications on adoption

A new version of the standard library will be required.

Hmm. This makes me pretty uncomfortable. Strideable is documented as

A type representing continuous, one-dimensional values that can be offset and measured.

I suppose the argument could be made that mathematically, all the integer types that conform to Strideable aren't actually continuous either. But those integer types don't have gaps within their defined bounds.

The key difference for me is that the Comparable conformance that allows one to define a ClosedRange<Unicode.Scalar> is defined in terms of its underlying numeric representation, and it's very common for folks who are doing work directly with ASCII or Unicode code points to switch back and forth between the scalar representation and the numeric. (Although perhaps this is less so for Unicode scalars; I'm thinking more of tricks to switch between upper- and lowercase ASCII characters via math or bit manipulation, but those don't usually hold for the haphazard assignment of Unicode code points.)

Still, I worry about the consequences of introducing a relationship where someScalar + n is not the same as Unicode.Scalar(someScalar.value + n) for some values of those two operands.
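To make the mismatch concrete, here is what the numeric route does today at the edge of the Surrogates Area, next to an ASCII trick of the kind mentioned above. This is only an illustration using existing APIs; the variable names are mine.

```swift
// ASCII case trick: numeric math on the scalar value round-trips fine.
let lowerA: Unicode.Scalar = "a"
let upperA: Unicode.Scalar = "A"
print(Unicode.Scalar(lowerA.value - 0x20) == upperA)  // true

// At the edge of the Surrogates Area the numeric route fails:
// U+D7FF + 1 is U+D800, a surrogate, so the failable initializer returns nil.
let lastBeforeGap: Unicode.Scalar = "\u{D7FF}"
let numericNext = Unicode.Scalar(lastBeforeGap.value + 1)
print(numericNext == nil)  // true

// Under the proposal, advancing U+D7FF by 1 would instead yield U+E000.
```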

Being heavily involved in a project dealing with character sets and ranges of characters, the lack of Strideable has been a little bit of a nuisance for me.

I don't see the same problem with this. If you need evenly spaced numbers, then use Int; isn't the purpose of Strideable to bring "increment" and "range" semantics to data types that are irregular or non-Euclidean?

Me too. Is there anything to be gained if we try wrapping Int in a custom, non-initializable stride type?


Nope, you're giving this protocol too much credit. Strideable is the protocol to which types conform for use with stride(from:to:by:) and its ilk. The protocol's design is workable for integer types with bit width less than or equal to that of Int, sort of workable for larger integer types, just barely so for floating-point types using the underscored customization points I retrofitted, and poorly with anything that deviates from these types.

I agree that it would be great if the standard library offered a way to iterate over `UnicodeScalar` values. Having recently written a Unicode-capable parsing virtual machine and several text processing algorithms, I would like to suggest an alternative approach: adding a dedicated `ContiguousClosedUnicodeScalarRange` type.

A `ContiguousClosedUnicodeScalarRange` would trivialize iteration by checking the surrogate range once at initialization. This approach also simplifies other data structures. A hypothetical `UnicodeScalarRangeSet` can safely insert `ContiguousClosedUnicodeScalarRange`(s) without repeated surrogate range checks. A `ContiguousClosedUnicodeScalarRangeIterator` can run faster than a `ClosedRange` because the incrementing addition cannot overflow, for example.
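A rough sketch of what such a type might look like, assuming a failable initializer that rejects bounds on opposite sides of the surrogate gap. The initializer shape and the use of `AnyIterator` are my assumptions for illustration. A standard-library version could skip per-element validation, which this sketch cannot do because it only has the public failable `Unicode.Scalar(_: UInt32)` initializer.

```swift
// Hypothetical design; only the type name comes from the suggestion above.
struct ContiguousClosedUnicodeScalarRange: Sequence {
  let lowerBound: Unicode.Scalar
  let upperBound: Unicode.Scalar

  // The surrogate check happens once, here: the bounds must be ordered
  // and must lie on the same side of U+D800 ... U+DFFF.
  init?(_ lowerBound: Unicode.Scalar, _ upperBound: Unicode.Scalar) {
    guard lowerBound <= upperBound else { return nil }
    let lowerIsBelowGap = lowerBound.value < 0xD800
    let upperIsBelowGap = upperBound.value < 0xD800
    guard lowerIsBelowGap == upperIsBelowGap else { return nil }
    self.lowerBound = lowerBound
    self.upperBound = upperBound
  }

  func makeIterator() -> AnyIterator<Unicode.Scalar> {
    var next = lowerBound.value
    let last = upperBound.value  // at most 0x10FFFF, so next += 1 cannot overflow
    return AnyIterator {
      guard next <= last else { return nil }
      defer { next += 1 }
      return Unicode.Scalar(next)  // never a surrogate, by the init check
    }
  }
}

if let range = ContiguousClosedUnicodeScalarRange("a", "e") {
  print(range.map(String.init).joined())  // abcde
}
print(ContiguousClosedUnicodeScalarRange("\u{D7FF}", "\u{E000}") == nil)  // true
```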

I don't think someScalar + n is possible (except as pseudocode) because the operators are only available for pointer arithmetic. However, I understand your general concern.


The conditional conformances require a signed-integer stride. A wrapper type would need to implement many BinaryInteger requirements — including the initializers.


A new range type (ideally with a shorter name) is a possibility.


Another option is to implement Strideable with preconditions, so that some ranges may incur a runtime error (only when used as a sequence or collection).