Pitch: Character and String properties

Michael_Ilseman · April 7, 2018, 11:18pm

Welcome to String!

Thank you for bringing this concern up. I think a good, pragmatic approach is to reserve the ability to alter and tweak behavior surrounding unanticipated corner cases and changes. Your concern convinces me that these properties should definitely be resilient, at least for now.

However, I do want to make sure we don’t fall victim to Unicode-FUD. As written, it sounds like your concern is that we're defining our own semantics and that Unicode might later somehow change our own semantics. I assume you’re either concerned that Unicode might take a dramatic turn in the semantics of the underlying scalar properties, or you’re concerned that Unicode might decide to define properties on graphemes and Swift will be left in its dust.

For the former, the scalar properties used are among the more stable aspects of Unicode. Everything else in String is far more prone to breakage if Unicode suddenly reinvented itself by abandoning its key principles. This is even more hazardous for things inside the Unicode namespace. If the unthinkable happened, then yes, when we roll out a total API breakage of String we will also have to consider these properties.

For the latter, keeping these Character properties resilient would allow us significant leeway in adapting to a new world without rebuilding old code.

(By resilient I mean that we might still have inlinable fast paths for ASCII, unless we fear the semantics of ASCII changing too.)
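To make the fast-path idea concrete, here is a minimal sketch, not the standard library's implementation. The function name `sketchIsWhitespace` is hypothetical, and the slow path simply consults the first scalar's `White_Space` property, which is an assumption chosen for illustration. In a real resilient design the slow path would live behind a non-inlinable library entry point so that its behavior could change without recompiling clients.

```swift
// Hypothetical helper illustrating an ASCII fast path in front of a
// slower, Unicode-aware check. Not the standard library's code.
func sketchIsWhitespace(_ c: Character) -> Bool {
  if let ascii = c.asciiValue {
    // Fast path: U+0009...U+000D and U+0020 are the ASCII code points
    // that carry the Unicode White_Space property.
    return ascii == 0x20 || (0x09...0x0D).contains(ascii)
  }
  // Slow path (assumption for this sketch): ask the first scalar.
  // A Character is never empty, so `first!` is safe here.
  return c.unicodeScalars.first!.properties.isWhitespace
}
```

Only the fast path would need to be frozen; since the semantics of ASCII are not expected to change, that seems like a low-risk commitment.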

I want to be clear that I am not generalizing scalar properties to graphemes. I am defining specific semantics on specific graphemes, utilizing specific scalar properties to drive them. Scalar properties cannot, in general, be generalized to graphemes.
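A small illustration of why scalar properties do not generalize: a single Character can be made of several scalars that disagree with each other. The sketch below assumes a toolchain that provides `Unicode.Scalar.Properties.generalCategory` (Swift 5 and later); the variable names are only for illustration.

```swift
// One Character, two scalars: "e" followed by U+0301 COMBINING ACUTE ACCENT.
let eAcute: Character = "e\u{301}"

// Collect the general category of each scalar in the grapheme.
let categories = eAcute.unicodeScalars.map { $0.properties.generalCategory }

// The scalars disagree (a lowercase letter and a nonspacing mark), so there
// is no single "general category of the grapheme" to report without
// choosing specific semantics.
for category in categories {
  print(category)
}
```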

Sure, why not? (more serious answer below :-)

I'm not trying to be glib. There is no right answer, so we give the best answer we can. The same goes for String being a collection of graphemes, comparison honoring canonical equivalence, etc. None of these is the right answer, but each is the best answer we can give. Demanding that we provide either the "right" answer or no answer at all is what landed us with Swift 3-era String, and no one wants to go back to that for obvious reasons. If you make a type too obnoxious to use, people will misuse it in far worse ways.

Making String be a collection of Characters, as was done in Swift 4, violates the purity of an algebra of collections. I can construct two strings such that a.count + b.count > (a+b).count. String has append(_:Character), yet I can craft a non-identity Character such that appending it does not alter the count, but both modifies the last element and invalidates the last index. Nevertheless, String should be a collection of Characters, even though there are situations that can cause it to violate concatenation theory.
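Both violations can be reproduced with a combining mark. This is a minimal sketch; the variable names are illustrative.

```swift
// Two strings whose counts do not add up under concatenation.
let a = "e"
let b = "\u{301}"        // a lone COMBINING ACUTE ACCENT is still one Character
let joined = a + b       // the accent attaches to the "e"
print(a.count + b.count) // 2
print(joined.count)      // 1

// Appending a Character that leaves the count unchanged but
// changes the last element.
var s = "e"
let lastBefore = s.last
s.append(Character("\u{301}"))
print(s.count)              // 1
print(s.last == lastBefore) // false
```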

I view properties on Character as the next logical progression in String’s long march towards being ergonomic. When someone is new to Swift, whether experienced from other platforms or new to programming, String is the first type they encounter. If the response to the question “what can a String tell me” involves a deep dive into Unicode as it did in Swift 3, then we’re doing something wrong. Now it’s a collection of Character, which is nice. If our response to “what can a Character tell me” is to point to the expert-use Unicode.Scalar.Properties and give them a stern here-be-dragons warning, then we’re doing something wrong.

(Not that there’s anything wrong with deep dives into Unicode. I just wouldn’t wish it upon my ~~enemies~~ users).

Coming back to the specific question of whether exotic graphemes starting with whitespace should be considered whitespace: since there is no perfect answer, I think a good answer would be that a String containing whitespace also returns true for myStr.contains { $0.isWhitespace }.
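As a sketch of the check in question, assuming a toolchain where `Character.isWhitespace` is available (Swift 5 and later): the second string contains an "exotic" grapheme, a space followed by a combining accent, which Swift treats as a single Character. Whether that grapheme counts as whitespace is exactly the semantic choice being discussed, so the sketch computes the result without asserting it.

```swift
// Ordinary case: the string contains a plain U+0020 space.
let plain = "a b"
let plainHasWhitespace = plain.contains { $0.isWhitespace }
print(plainHasWhitespace) // true

// Exotic case: space + U+0301 forms one grapheme, so the string
// has three Characters rather than four.
let exotic = "a \u{301}b"
print(exotic.count) // 3

// The goal stated above is that this should also be true, because
// the string does contain a whitespace scalar.
let exoticHasWhitespace = exotic.contains { $0.isWhitespace }
print(exoticHasWhitespace)
```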

itaiferber:

Similarly, too, goes the discussion regarding casing. I think there’s a danger here of being correct in a way that many would find unexpected. Consider some naïve code which attempts to loop through a string to find the first uppercased/lowercased letter (if any):

var firstUppercase: Character? = nil
var firstLowercase: Character? = nil
for char in str {
  if char.isUppercased {
    if firstUppercase == nil { firstUppercase = char }
  } else {
    if firstLowercase == nil { firstLowercase = char }
  }
}

Ignoring the inefficiency of the above, having characters which are considered both uppercased and lowercased would be surprising to many. I think many people assume upper- and lower-caseness to be mutually exclusive, so the above code (and variants of it) runs rampant in the wild. Given str = "Hello", (firstUppercase, firstLowercase) == ("H", "e"); given str = "ₕᵢ", (firstUppercase, firstLowercase) == ("ₕ", nil). This might seem like a really weird edge case, but this would actually be much more common; including digits in this definition leads to more confusion: str = "abc123" results in ("1", "a"). (For many, too, the concept of casedness is tied to “is this thing a letter or not?”, which is a complex question in and of itself, but things which deviate from this definition feel wrong.)

This is an excellent example, thank you for pointing it out. This example, and the confusion it causes even sophisticated users like @nicklockwood, demonstrate that I proposed the wrong semantics. I think a better approach would be one that came up in discussion with @allevato’s point #2 (ignore the other points, we discovered flaws therein).

The semantics I proposed were easy to specify in terms of Unicode constructs, but are not intuitive. Since these are not in the Unicode namespace, I think it should instead be:

extension Character {
  var isUppercase: Bool { return String(self) == self.uppercased() && self.isCased }
  var isCased: Bool { return String(self) != self.uppercased() || String(self) != self.lowercased() || String(self) != self.titlecased() }
}

That is, a Character is uppercase if it is invariant under case mapping to upper and it varies under some other case mapping.
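Here is a runnable approximation of those semantics, as a sketch only. The names `sketchIsCased` and `sketchIsUppercase` are hypothetical, chosen to avoid clashing with any real API, and the titlecase comparison is left out because this sketch sticks to `String.uppercased()` and `String.lowercased()` from the standard library.

```swift
extension Character {
  /// Hypothetical: true if the upper or lower case mapping changes this
  /// character. (The titlecase mapping from the definition is omitted.)
  var sketchIsCased: Bool {
    let s = String(self)
    return s != s.uppercased() || s != s.lowercased()
  }

  /// Hypothetical: invariant under uppercasing, but cased at all.
  var sketchIsUppercase: Bool {
    let s = String(self)
    return s == s.uppercased() && sketchIsCased
  }
}

/// Finds the first Character that is uppercase under the sketch semantics.
func firstUppercase(in str: String) -> Character? {
  return str.first(where: { $0.sketchIsUppercase })
}

print(firstUppercase(in: "Hello") as Any)  // Optional("H")
print(firstUppercase(in: "abc123") as Any) // nil
```

Under this definition digits are not cased, so they are neither uppercase nor lowercase, and "abc123" no longer reports "1" as its first uppercase character.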

Actually, it is this very tendency that I find to be an argument for Character properties. Users define these properties themselves, poorly, but we can give a more robust answer efficiently.