Pitch: Renaming CharacterSet to UnicodeScalarSet


(Erica Sadun) #1

Chris: "Also, it is worth saying that any source breaking change still has to have an
ultra-compelling reason to be worth considering. Despite having a framework to
support some source breaking changes, we still want to minimize them where
ever possible."

Since it seems to be open season on introducing a few, highly focused
breaking changes, let me throw this one out there.

Pitch: Renaming CharacterSet to UnicodeScalarSet

In Swift, String is defined as "a Unicode string value." and a "CharacterSet"
represents a set of Unicode-compliant characters.

A CharacterSet's initializers are:
init()
init<S>(S)
init(arrayLiteral: UnicodeScalar...)
init(bitmapRepresentation: Data)
init(charactersIn: ClosedRange<UnicodeScalar>)
init(charactersIn: String)
init(charactersIn: Range<UnicodeScalar>)
init?(contentsOfFile: String)

Why not rename `CharacterSet` to `UnicodeScalarSet`, and update the initializers
to reflect that they're initialized from the Unicode scalars in strings and ranges?
I think the few places where the word `character` is still mentioned (in convenience
properties) could be better named: `punctuationCharacters` could become `punctuation`,
`controlCharacters` could become `controlAndFormat`, etc.
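
For readers following along, here is a minimal sketch of the string and range initializers listed above, assuming a current Swift toolchain with Foundation available. The variable names are illustrative only. The point is that membership is tested one Unicode scalar at a time.

```swift
import Foundation

// Build one set from the scalars of a string and one from a scalar range.
let vowels = CharacterSet(charactersIn: "aeiou")
let lowercaseRange: ClosedRange<UnicodeScalar> = "a"..."z"
let lowercase = CharacterSet(charactersIn: lowercaseRange)

// contains(_:) takes a UnicodeScalar, not a Character.
let e: UnicodeScalar = "e"
let q: UnicodeScalar = "q"

print(vowels.contains(e))              // true
print(vowels.contains(q))              // false
print(lowercase.contains(q))           // true
print(vowels.isSubset(of: lowercase))  // true
```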

-- E


(Xiaodi Wu) #2

Is this really correct? Character and UnicodeScalar are not synonyms. The
Character type represents a character made up of one or more Unicode
scalars (i.e. an extended grapheme cluster). Is a CharacterSet a set of
Unicode-compliant characters that happens to be restricted to those
characters each made up of only a single Unicode scalar, or is it meant to
be a set of Unicode scalars? My read of the Foundation documentation is
that it is the former.
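
A small sketch of the Character versus UnicodeScalar distinction, using only the standard library and assuming Swift 4 or later (where `String` has `count`). The example value is an arbitrary choice: "é" written as a base letter plus a combining accent.

```swift
// One Character (extended grapheme cluster) built from two Unicode scalars.
let accented: Character = "e\u{301}"
let asString = String(accented)

print(asString.count)                 // 1 (characters)
print(asString.unicodeScalars.count)  // 2 (Unicode scalars)

// List the scalar values that make up the single character.
for scalar in asString.unicodeScalars {
    print("U+" + String(scalar.value, radix: 16, uppercase: true))
}
// U+65
// U+301
```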

···

On Wed, Sep 28, 2016 at 4:27 PM, Erica Sadun via swift-evolution < swift-evolution@swift.org> wrote:


_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution


(Ben Rimmington) #3

I agree, but `UnicodeScalarSet` was rejected during the SE-0069 discussion:

<https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160425/015685.html>

-- Ben

···

On 28 Sep 2016, at 22:27, Erica Sadun wrote:

Why not rename `CharacterSet` to `UnicodeScalarSet`, and update the initializers
to reflect that they're initialized from the Unicode scalars in strings and ranges?


(Dave Abrahams) #4

Chris: "Also, it is worth saying that any source breaking change still has to have an
ultra-compelling reason to be worth considering. Despite having a framework to
support some source breaking changes, we still want to minimize them where
ever possible."

Since it seems to be open season on introducing a few, highly focused
breaking changes, let me throw this one out there.

Pitch: Renaming CharacterSet to UnicodeScalarSet

Hi Erica,

This is out-of-scope for this list, because CharacterSet is part of
corelibs-foundation. I suggest posting on
https://lists.swift.org/mailman/listinfo/swift-corelibs-dev, where the
Foundation people hang.

···

on Wed Sep 28 2016, Erica Sadun <swift-evolution@swift.org> wrote:


--
-Dave


(Erica Sadun) #5

http://i.imgur.com/h6W5kYc.jpg

http://i.imgur.com/q50PSld.jpg

-- E

···

On Sep 28, 2016, at 3:58 PM, Xiaodi Wu <xiaodi.wu@gmail.com> wrote:

Is this really correct? Character and UnicodeScalar are not synonyms. The Character type represents a character made up of one or more Unicode scalars (i.e. an extended grapheme cluster). Is a CharacterSet a set of Unicode-compliant characters that happens to be restricted to those characters each made up of only a single Unicode scalar, or is it meant to be a set of Unicode scalars? My read of the Foundation documentation is that it is the former.


(Erica Sadun) #6

Why not rename `CharacterSet` to `UnicodeScalarSet`, and update the initializers
to reflect that they're initialized from the Unicode scalars in strings and ranges?

I agree, but `UnicodeScalarSet` was rejected during the SE-0069 discussion:

<https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160425/015685.html>

-- Ben

D'erp. I missed that. And that's an unambiguous answer.

So let me move on to part B of the pitch: I think CharacterSets are broken.

Xiaodi Wu: "isn't the problem you're presenting really an argument that the type should be fleshed out to handle characters (grapheme clusters) containing more than one Unicode scalar?"

-- E

···

On Sep 28, 2016, at 6:14 PM, Ben Rimmington <me@benrimmington.com> wrote:

On 28 Sep 2016, at 22:27, Erica Sadun wrote:


(Charles Srstka) #7

It seems that it already does handle such characters:

(done in Objective-C so we can log the length of the range as a count of UTF-16 code units)

#import <Foundation/Foundation.h>

int main(int argc, char *argv[]) {
    @autoreleasepool {
        NSCharacterSet *bikeSet = [NSCharacterSet characterSetWithCharactersInString:@"🚲"];
        NSString *str = @"foo🚲bar";
        
        NSRange range = [str rangeOfCharacterFromSet:bikeSet];
        
        NSLog(@"location: %lu length: %lu", range.location, range.length);
    }
}

- - - - - - -

2016-09-28 22:20:00.622471 test[15577:2433912] location: 3 length: 2
Program ended with exit code: 0

- - - - - - -

As we can see, the character from the set is recognized as consisting of two code units. There are a few bugs in the system, though. See the cocoa-dev thread “Where is my bicycle?” from about a year ago: http://prod.lists.apple.com/archives/cocoa-dev/2015/Apr/msg00074.html

Charles

···

On Sep 28, 2016, at 9:57 PM, Erica Sadun via swift-evolution <swift-evolution@swift.org> wrote:



(Xiaodi Wu) #8

The bike emoji might be two code units, but it is one Unicode scalar
(U+1F6B2). However, the Canadian flag emoji, for instance, is two Unicode
scalars (U+1F1E8 U+1F1E6) but nonetheless one character.
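
The three levels can be compared directly. This is a minimal sketch using only the standard library, assuming Swift 4 or later. The emoji are written as escapes so the scalar values are visible.

```swift
let bike = "\u{1F6B2}"             // 🚲
let canada = "\u{1F1E8}\u{1F1E6}"  // 🇨🇦

// UTF-16 code units, Unicode scalars, and characters, in that order.
print(bike.utf16.count, bike.unicodeScalars.count, bike.count)        // 2 1 1
print(canada.utf16.count, canada.unicodeScalars.count, canada.count)  // 4 2 1
```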

···

On Wed, Sep 28, 2016 at 10:23 PM, Charles Srstka via swift-evolution < swift-evolution@swift.org> wrote:



(Xiaodi Wu) #9

To illustrate in code how CharacterSet doesn't actually handle characters
made up of multiple Unicode scalars:

import Foundation

let str1 = "🇦🇩"
let first = CharacterSet(charactersIn: str1) // this actually crashes corelibs-foundation
let str2 = "🇦🇺"
let second = CharacterSet(charactersIn: str2)
let intersection = first.intersection(second)
print(intersection.isEmpty)
// actual output: false
// obviously, if we were really dealing with characters, the intersection should be empty

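
Why the intersection is non-empty can be seen at the scalar level. The sketch below models the two sets with a plain `Set<UnicodeScalar>` rather than `CharacterSet` (an assumption made to keep it standard-library only) and prints the scalar the two flags share.

```swift
let andorra = "\u{1F1E6}\u{1F1E9}"    // 🇦🇩 = regional indicators A + D
let australia = "\u{1F1E6}\u{1F1FA}"  // 🇦🇺 = regional indicators A + U

// Treat each flag as a bag of scalars, as a scalar-based set would.
let shared = Set(andorra.unicodeScalars)
    .intersection(Set(australia.unicodeScalars))

for scalar in shared {
    print("U+" + String(scalar.value, radix: 16, uppercase: true))
}
// U+1F1E6 (REGIONAL INDICATOR SYMBOL LETTER A)
```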
···

On Wed, Sep 28, 2016 at 10:34 PM, Xiaodi Wu <xiaodi.wu@gmail.com> wrote:



(Jay) #10

Yes - this is totally confusing. CharacterSet and Set<Character> are
completely different things with different semantics.

I don't know the history, but is CharacterSet simply to have a Swift
equivalent of NSCharacterSet? That seems to be what it is, but since Swift
redefined characters in a better way, this should be removed or called
something else to avoid confusion. You shouldn't have to qualify what you
mean by 'character' in a type name because it diverges from the definition
in the rest of the language.
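
To make the difference in semantics concrete, here is a minimal sketch of the same two flags stored in a `Set<Character>`, assuming Swift 4 or later. Each flag is a single element, so the intersection is empty.

```swift
// Each string holds exactly one Character (one flag).
let first: Set<Character> = Set("\u{1F1E6}\u{1F1E9}")   // 🇦🇩
let second: Set<Character> = Set("\u{1F1E6}\u{1F1FA}")  // 🇦🇺

print(first.count)                         // 1
print(first.intersection(second).isEmpty)  // true
```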

···

On Thu, 29 Sep 2016 at 04:48 Xiaodi Wu via swift-evolution < swift-evolution@swift.org> wrote:



(Xiaodi Wu) #11

Afaik, every Unicode scalar can be its own character, so IMO it's not
bothersome that there are overloads that take Unicode scalar arguments.
However, since the stated purpose of the type is to be a set of characters,
isn't the problem you're presenting really an argument that the type should
be fleshed out to handle characters (grapheme clusters) containing more
than one Unicode scalar?

···

On Wed, Sep 28, 2016 at 18:43 Erica Sadun <erica@ericasadun.com> wrote:

On Sep 28, 2016, at 3:58 PM, Xiaodi Wu <xiaodi.wu@gmail.com> wrote:

Is this really correct? Character and UnicodeScalar are not synonyms. The
Character type represents a character made up of one or more Unicode
scalars (i.e. an extended grapheme cluster). Is a CharacterSet a set of
Unicode-compliant characters that happens to be restricted to those
characters each made up of only a single Unicode scalar, or is it meant to
be a set of Unicode scalars? My read of the Foundation documentation is
that it is the former.

http://i.imgur.com/h6W5kYc.jpg

http://i.imgur.com/q50PSld.jpg

-- E


(Xiaodi Wu) #12

CharacterSet is a Foundation value type. It was a subject of the following
proposal:

https://github.com/apple/swift-evolution/blob/master/proposals/0069-swift-mutability-for-foundation.md

We might be able to improve on the implementation, but I don't think
re-arguing the name is an option.

···

On Wed, Sep 28, 2016 at 11:59 PM Jay Abbott <jay@abbott.me.uk> wrote:



(David Sweeris) #13

IIUC, Jay wasn't arguing for renaming CharacterSet, but replacing it with Swift's existing Set mechanism. If/when generics get to the point that we can say 'extension Set<Character> {...}', I think the transition could simply be putting 'typealias CharacterSet = Set<Character>' somewhere in the framework (although I don't know how Obj-C interop would be affected by such a change).
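
Later Swift releases do allow a constrained extension of this kind, spelled `extension Set where Element == Character`. The sketch below is only an illustration of that idea, assuming Swift 4.2 or later. `GraphemeSet` and `containsAllCharacters(in:)` are hypothetical names, chosen so as not to clash with Foundation's `CharacterSet`, and nothing here addresses Obj-C interop.

```swift
// Hypothetical alias standing in for the suggested typealias.
typealias GraphemeSet = Set<Character>

extension Set where Element == Character {
    /// Hypothetical helper: true if every character of `string` is a member.
    func containsAllCharacters(in string: String) -> Bool {
        return string.allSatisfy { contains($0) }
    }
}

let vowels: GraphemeSet = ["a", "e", "i", "o", "u"]
print(vowels.containsAllCharacters(in: "aeiou"))  // true
print(vowels.containsAllCharacters(in: "audio"))  // false ("d" is not a member)
```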

- Dave Sweeris

···

On Sep 29, 2016, at 00:30, Xiaodi Wu via swift-evolution <swift-evolution@swift.org> wrote:
