[pitch] Character.unicodeScalars


(Ben Cohen) #1

Hi swift-evolution,

A short string-related pitch for you.

Add unicodeScalars property to Character

Proposal: SE-NNNN <file:///Users/ben/src/evolution/proposals/NNNN-character-unicode-view.md>
Authors: Ben Cohen <https://github.com/airspeedswift>
Review Manager: TBD
Status: Awaiting review

Introduction

This proposal adds a unicodeScalars view to Character, similar to that on String.

Motivation

The Character element type of String is currently a black box that provides little functionality beyond comparison, literal construction, and use as an argument to String.init.

Many operations on String could be implemented neatly and readably as operations on each character in the string, if Character exposed its scalars more directly. Many useful things can be determined by examining the scalars in a grapheme (for example, is this an ASCII character?).

For example, in Swift 4 you can write this:

let s = "one two three"
s.index(of: " ")
But you cannot write this:

let ws = CharacterSet.whitespacesAndNewlines
s.index { $0.unicodeScalars.contains(where: ws.contains) }
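For reference, here is a minimal sketch of how the second snippet reads once Character exposes unicodeScalars. It assumes a toolchain where that property is available, and it uses the later spelling firstIndex(where:) in place of the Swift 4 index(where:). The name firstWhitespace is only illustrative.

```swift
import Foundation

let s = "one two three"
let ws = CharacterSet.whitespacesAndNewlines

// A Character counts as whitespace here if any of its scalars is in the set.
let firstWhitespace = s.firstIndex { character in
    character.unicodeScalars.contains { ws.contains($0) }
}

if let i = firstWhitespace {
    print(s[..<i])   // prints "one"
}
```

The same predicate could be applied to any other CharacterSet, which is the kind of per-character query the pitch has in mind.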

Proposed solution

Add a unicodeScalars property to Character, presenting a lazy view of the scalars in the character, along similar lines to the one on String.

Unlike the view on String, this will not be a mutable view – it will be read-only. The preferred method for creating and manipulating non-literal Character values will be through String. While there may be some good use cases for manipulating a Character directly, these are outweighed by the complexity of ensuring the invariant that it contain exactly one grapheme.
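As a rough illustration of the "go through String" route, the sketch below builds the scalars in a String.UnicodeScalarView and only converts to Character after checking that the result is a single grapheme. The variable names are illustrative, and the count check is one possible way to guard the invariant, not something the proposal prescribes.

```swift
// Build up scalars in a mutable String view rather than in a Character.
let base: Unicode.Scalar = "e"            // U+0065
let accent: Unicode.Scalar = "\u{301}"    // U+0301 COMBINING ACUTE ACCENT

var scalars = String.UnicodeScalarView()
scalars.append(base)
scalars.append(accent)

let text = String(scalars)

// Character(_:) requires exactly one grapheme, so check before converting.
let combined: Character? = text.count == 1 ? Character(text) : nil
```

Here the two scalars combine into one grapheme, so combined is non-nil. Appending an unrelated scalar such as "x" would give a two-character string and a nil result.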

Detailed design

Add the following nested type to Character:

extension Character {
  public struct UnicodeScalarView : BidirectionalCollection {
    public struct Index
    public var startIndex: Index
    public var endIndex: Index
    public func index(after i: Index) -> Index
    public func index(before i: Index) -> Index
    public subscript(i: Index) -> UnicodeScalar
  }
}
Additionally, this type will conform to appropriate convenience protocols such as CustomStringConvertible.

All initializers will be declared internal, as unlike the String equivalent, this type will only ever be vended by Character.
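To make the shape of this API concrete, here is a small sketch that walks the view by index in both directions. It assumes a toolchain where Character.unicodeScalars is available with the BidirectionalCollection interface listed above. The variable names are illustrative.

```swift
let ch: Character = "e\u{301}"   // "e" followed by a combining acute accent
let view = ch.unicodeScalars

// Forward traversal using startIndex, endIndex and index(after:).
var values: [UInt32] = []
var i = view.startIndex
while i != view.endIndex {
    values.append(view[i].value)
    i = view.index(after: i)
}

// Backward step using index(before:), available because the view is bidirectional.
let last = view[view.index(before: view.endIndex)]

// Generic collection algorithms such as reversed() come along with the conformance.
let reversedValues = view.reversed().map { $0.value }
```

There is no public initializer involved. The view is only obtained from an existing Character.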

Source compatibility

Purely additive, so no impact.

Effect on ABI stability

Purely additive, so no impact.

Effect on API resilience

Purely additive, so no impact.

Alternatives considered

Adding other views, such as utf8 or utf16, was considered but not deemed useful enough compared to using these operations on String instead.

In future, this feature could be used to implement convenience methods such as isASCII on Character. This could be done additively, given this building block, and is outside the scope of this initial proposal.
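A sketch of what such a convenience might look like when layered on the scalar view. The property name hasOnlyASCIIScalars is hypothetical and chosen to avoid clashing with any standard library member. The definition used here (every scalar is ASCII) is one possible reading of "is ASCII", not necessarily the one a future proposal would pick.

```swift
extension Character {
    /// Hypothetical helper: true when every scalar in the grapheme is ASCII.
    var hasOnlyASCIIScalars: Bool {
        return unicodeScalars.allSatisfy { $0.isASCII }
    }
}

let plain: Character = "a"
let accented: Character = "\u{E9}"   // precomposed "é"
let crlf: Character = "\r\n"         // a single grapheme made of two ASCII scalars

print(plain.hasOnlyASCIIScalars)     // prints "true"
print(accented.hasOnlyASCIIScalars)  // prints "false"
print(crlf.hasOnlyASCIIScalars)      // prints "true"
```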


(Félix Cloutier) #2

+1.

···

On May 10, 2017, at 16:51, Ben Cohen via swift-evolution <swift-evolution@swift.org> wrote:

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution


(Brent Royal-Gordon) #3

Might it make sense to conform `Character` itself to `Collection`, rather than using a view?

Otherwise, I'm in favor. (Though it'd be nice to have *some* way to manipulate the `UnicodeScalar`s inside a `Character`, even if `RangeReplaceableCollection`'s interface would make preserving its invariants too difficult. That could wait, though.)

···

On May 10, 2017, at 4:51 PM, Ben Cohen via swift-evolution <swift-evolution@swift.org> wrote:

Add a unicodeScalars property to Character, presenting a lazy view of the scalars in the character, along similar lines to the one on String.

--
Brent Royal-Gordon
Architechies


(Ben Cohen) #4

Add a unicodeScalars property to Character, presenting a lazy view of the scalars in the character, along similar lines to the one on String.

Might it make sense to conform `Character` itself to `Collection`, rather than using a view?

Hmm. I don’t think this would be right. The composition of Character is not fundamental to its very being (unlike String’s composition, where being of element type Character is an important principle for Swift) – it’s a lower-level thing that a user can poke at for specific purposes.

Also, one of the discoveries we’ve made while making String a Collection is that it has some unfortunate effects on code that uses flatMap inappropriately. You can use flatMap with a function (Element)->T, and it has the same effect as map because the function is implicitly converted to (Element)->T? and then the elements are unwrapped again by the flatMap. But if you were doing this on String, and then String becomes a Collection, suddenly you get the more appropriate flatMap that flattens nested collections, and you get a [Character] back instead of the expected [String]. We’ve been able to put in compatibility shims to detect this specific case so people can be warned in Swift 3 compatibility mode, but I fear making Character a collection too may itself introduce even more problems, possibly ones we can’t work around without compiler features. This reason alone might not be enough to rule out making Character a collection in the future, but it probably rules it out for Swift 4.
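The change in meaning can be sketched on a toolchain where String is already a Collection. The Swift 3 behavior itself cannot be reproduced there, so the snippet only shows the "after" side next to the map call that the old code was effectively equivalent to. The variable names are illustrative.

```swift
let words = ["one", "two"]

// What the Swift 3 code effectively computed: one String per element.
let mapped = words.map { $0.uppercased() }

// With String as a Collection of Character, the flattening flatMap applies
// and the per-element strings are concatenated into their characters.
let flattened: [Character] = words.flatMap { $0.uppercased() }

print(mapped.count)      // prints "2"
print(flattened.count)   // prints "6"
```

If Character were itself a collection of scalars, the same kind of silent change in result type could plausibly occur one level further down, which is the concern raised above.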

···

On May 10, 2017, at 11:13 PM, Brent Royal-Gordon <brent@architechies.com> wrote:

On May 10, 2017, at 4:51 PM, Ben Cohen via swift-evolution <swift-evolution@swift.org> wrote:

Otherwise, I'm in favor. (Though it'd be nice to have *some* way to manipulate the `UnicodeScalar`s inside a `Character`, even if `RangeReplaceableCollection`'s interface would make preserving its invariants too difficult. That could wait, though.)

--
Brent Royal-Gordon
Architechies


(Michael Ilseman) #5

Add a unicodeScalars property to Character, presenting a lazy view of the scalars in the character, along similar lines to the one on String.

Might it make sense to conform `Character` itself to `Collection`, rather than using a view?

Hmm. I don’t think this would be right. The composition of Character is not fundamental to its very being (unlike String’s composition, where being of element type Character is an important principle for Swift) – it’s a lower-level thing that a user can poke at for specific purposes.

Also, one of the discoveries we’ve made while making String a Collection is that it has some unfortunate effects on code that uses flatMap inappropriately. You can use flatMap with a function (Element)->T, and it has the same effect as map because the function is implicitly converted to (Element)->T? and then the elements are unwrapped again by the flatMap. But if you were doing this on String, and then String becomes a Collection, suddenly you get the more appropriate flatMap that flattens nested collections, and you get a [Character] back instead of the expected [String]. We’ve been able to put in compatibility shims to detect this specific case so people can be warned in Swift 3 compatibility mode, but I fear making Character a collection too may itself introduce even more problems, possibly ones we can’t work around without compiler features. This reason alone might not be enough to rule out making Character a collection in the future, but it probably rules it out for Swift 4.

Also, it’s not clear which unicode scalars should be exposed if Character were a collection. The unicode scalar view inside of Character reflects the scalars that just so happened to comprise the Character in the original String from which it came (*view* being an important word here). If Character were a proper collection, perhaps it makes more sense for it to be a collection of (insert-your-favorite-form) normalized scalars instead. This should be evaluated later, of course.
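A short sketch of the distinction being drawn here: two Character values that compare equal can still expose different scalars through the view, because the view reflects how each value happened to be written. It assumes a toolchain where Character.unicodeScalars is available. The variable names are illustrative.

```swift
let precomposed: Character = "\u{E9}"    // "é" as a single scalar
let decomposed: Character = "e\u{301}"   // "e" plus a combining acute accent

// Character equality is based on canonical equivalence.
let sameCharacter = (precomposed == decomposed)

// The scalar views still differ, since no normalization is applied.
let precomposedScalars = precomposed.unicodeScalars.map { $0.value }
let decomposedScalars = decomposed.unicodeScalars.map { $0.value }

print(sameCharacter)                             // prints "true"
print(precomposedScalars == decomposedScalars)   // prints "false"
```

A Character that was itself a Collection would have to decide which of these two sequences is "the" content of the value, which is the open question raised above.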

···

On May 11, 2017, at 12:56 PM, Ben Cohen via swift-evolution <swift-evolution@swift.org> wrote:

On May 10, 2017, at 11:13 PM, Brent Royal-Gordon <brent@architechies.com> wrote:

On May 10, 2017, at 4:51 PM, Ben Cohen via swift-evolution <swift-evolution@swift.org> wrote:

Otherwise, I'm in favor. (Though it'd be nice to have *some* way to manipulate the `UnicodeScalar`s inside a `Character`, even if `RangeReplaceableCollection`'s interface would make preserving its invariants too difficult. That could wait, though.)

--
Brent Royal-Gordon
Architechies


(Guillaume Lessard) #6

[aside]

The implicit lifting of A to Optional<A> is great, but why is it that this case lifts from ((Element)->T) to ((Element)->Optional<T>) rather than Optional<(Element)->T>?
Is that generally more desirable?

Guillaume Lessard

···

On May 11, 2017, at 13:56, Ben Cohen via swift-evolution <swift-evolution@swift.org> wrote:

You can use flatMap with a function (Element)->T, and it has the same effect as map because the function is implicitly converted to (Element)->T? and then the elements are unwrapped again by the flatMap.
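The two liftings in question can be written out explicitly, as in the sketch below. It does not settle which one is generally more desirable. It only shows that both conversions exist, and that an API whose parameter is declared as a non-optional function returning an optional (compactMap is used here as the current spelling of that flatMap overload) leaves only the first as a candidate. All names are illustrative.

```swift
let describe: (Int) -> String = { "#\($0)" }

// Lifting the result type: (Int) -> String becomes (Int) -> String?
let liftedResult: (Int) -> String? = describe

// Lifting the whole function value: ((Int) -> String)?
let liftedFunction: ((Int) -> String)? = describe

// The parameter here is a function returning an optional,
// so the first form is the one that matches.
let described = [1, 2].compactMap(liftedResult)
```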