Pitch: Character and String properties

itaiferber · April 9, 2018, 5:01pm

Oh, I know the pain

To clarify: my concern is the latter — it's about defining certain properties on graphemes which may later contradict a definition provided by Unicode itself. This sentence shapes the majority of my concern on the issue:

If the definitions of these properties were to somehow change (and I doubt they would! but still), we would either:

Be out of sync with the Unicode spec, or
Cause a lot of implicit code breakage should we choose to change the semantics to match Unicode

I am not so concerned with this from a resilience perspective, but from the perspective of the fact that these semantics can't really be captured by any sort of type system. Which means that your old code (which didn't even need to be rebuilt!) now behaves differently in potentially subtle ways, and no type system could warn you. Of course, this problem is nothing new — any framework you link against can change out from underneath you in an incompatible way; but the concern with strings specifically is that they are:

Extremely integral to how the vast majority of written code in the wild behaves (both in terms of their importance/prevalence, and in terms of how integral strings are to various programming languages)
Informed by, and inform, a "world view" of how things work. Considering the loop code we wrote below — say that we decide to follow the newly proposed semantics of "isUppercased returns true if the character could change under case translation, but currently does not"; in this sense, the naïve code would work pretty much correctly. It would be reasonable for a developer to assume these semantics because they make intuitive sense. However, if Unicode then later mandated that grapheme clusters must follow the semantics of the isUppercased implementation as originally proposed here, the existing code would return nonsensical results inconsistent with the developer's original world view of how strings behave

Yes, to clarify — the concerns detailed apply to the specific properties applied to the specific graphemes, and...

Michael_Ilseman:

I'm not trying to be glib. There is no right answer, so we give the best answer we can. This is the same for String being a collection of graphemes, comparison honoring canonical equivalence, etc. All of which are not the right answer, but are the best answer we can give. Demanding that we provide either the "right" answer or no answer at all is what landed us with Swift 3-era String, and no one wants to go back to that for obvious reasons. If you make a type so obnoxious to use, people will misuse it in far worse ways.

Making String be a collection of Characters, as was done in Swift 4, violates the purity of an algebra of collections. I can construct two strings such that a.count + b.count > (a+b).count. String has append(_:Character), yet I can craft a non-identity Character such that appending it does not alter the count, but both modifies the last element and invalidates the last index. Nevertheless, String should be a collection of Characters, even though there are situations that can cause it to violate concatenation theory.
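Both observations in the quoted paragraph can be reproduced with a combining mark. This is a minimal sketch, assuming U+0301 (COMBINING ACUTE ACCENT) as the illustrative "non-identity" Character; the variable names are made up for the example.

```swift
// A lone combining accent is still a one-element String.
let a = "e"
let b = "\u{0301}"

// Concatenation merges the two into a single grapheme cluster,
// so a.count + b.count (2) is greater than joined.count (1).
let joined = a + b

// Appending the same scalar as a Character leaves the count unchanged
// but changes what the last element is.
var s = "e"
let accent: Character = "\u{0301}"
s.append(accent)
```

Here `s` ends up with a single element that compares equal to "é", because String comparison honors canonical equivalence.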

... to be clear, I am not in disagreement that we should offer better solutions than we have today! I think we can and should do better (even at the cost of some amount of "correctness", for some definition of "correctness"), but I think there are many ways we can go about it. I am in full support of doing this.

And I completely agree. One of the more infuriating messages from Swift 3 was the unavailability statement on String.count, even for someone who is aware of what might be going on under the hood.

I agree that there is no perfect answer here, but I think we can markedly do better. Naming has a lot of power here, considering how complex the situation can be. Even something as innocuous as isWhitespace can lead someone to believe things which may or may not be true. Two concerns with the name:

"is": I think that the word "is" is dangerous, and potentially ambiguous here. "Is" "\u{0020}\u{0301}" whitespace? Yes. "Is" it also not whitespace? Yes! "Is" can be a reductive word, like "just", and I think that therein lies a danger. "Is it only whitespace?" No. "Does it have whitespace in it?" Yes. A clear delineation here would improve things leaps and bounds, I think; a simple change in terminology could be sufficient to get around the danger. If we choose "is" to mean "is exclusively", then, for instance "\u{0020}\u{0301}".isWhitespace == false, while "\u{0020}\u{0301}".hasWhitespace == true. If we are concerned about an API explosion, then it might be sufficient to offer has<Property>(exclusively: Bool = false): $0.hasWhitespace() == true, $0.hasWhitespace(exclusively: true) == false
This is a much smaller concern than the above, but "whitespace" here can have different implications for different use cases. "Whitespace" to someone might mean "a character used to control spacing which does not draw anything on the screen", while someone else might be more concerned with it being "a character used to delineate words as typed by a human being". What does it mean for a string to be delineated by "\u{0020}\u{0301}"s? ¯\_(ツ)_/¯ If I split the string on $0.isWhitespace, would I get what I expect? ¯\_(ツ)_/¯ But there are different expectations here.
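To make the "has" versus "is exclusively" distinction concrete, here is a rough sketch of the straw-man hasWhitespace(exclusively:) from the first concern. It assumes the check is done per Unicode scalar using Unicode.Scalar.Properties (Swift 5 and later); the method is hypothetical and not part of the standard library.

```swift
extension Character {
    /// Hypothetical straw-man API: asks whether any scalar in this
    /// Character is whitespace, or (with `exclusively: true`) whether
    /// every scalar is whitespace.
    func hasWhitespace(exclusively: Bool = false) -> Bool {
        if exclusively {
            return unicodeScalars.allSatisfy { $0.properties.isWhitespace }
        }
        return unicodeScalars.contains { $0.properties.isWhitespace }
    }
}

// SPACE followed by COMBINING ACUTE ACCENT forms a single Character.
let spaceWithAccent = Character("\u{0020}\u{0301}")
let plainSpace = Character("\u{0020}")
```

With this sketch, `spaceWithAccent.hasWhitespace()` is true while `spaceWithAccent.hasWhitespace(exclusively: true)` is false, which is the split described above.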

I think that if, instead of defining what "is" "whitespace", we start offering "does this have whitespace" and "does this contain exclusively things which are whitespace", we also avoid the risk of further influencing developers' views on what "is" and "is not". I think a lot of misconceptions exist today because of poor API naming, and I think that we should not only offer answers here, but do better than our predecessors.

Note also that the above is totally straw-man naming. IIRC, the other thread somewhat covered the concept of making these properties an OptionSet; it's also conceivable that we would pivot this to a somewhat simpler $0.has(.whitespace, exclusively: true), or something, but coming up with the specifics is a separate discussion. I just want to express my concern with defining what "is" and "is not", rather than saying what "has" or "does not have".
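For illustration only, the OptionSet flavor could look something like the following. Everything here is an assumption: the type name SketchCharacterProperties, the two example options, and the choice that passing several options means "any of these". It leans on Unicode.Scalar.Properties for the per-scalar checks.

```swift
/// Hypothetical option set; only two properties are modeled.
struct SketchCharacterProperties: OptionSet {
    let rawValue: Int
    static let whitespace = SketchCharacterProperties(rawValue: 1 << 0)
    static let alphabetic = SketchCharacterProperties(rawValue: 1 << 1)
}

extension Unicode.Scalar {
    /// True if this scalar has at least one of the requested properties.
    func sketchMatches(_ wanted: SketchCharacterProperties) -> Bool {
        if wanted.contains(.whitespace) && properties.isWhitespace { return true }
        if wanted.contains(.alphabetic) && properties.isAlphabetic { return true }
        return false
    }
}

extension Character {
    /// Hypothetical API: "has" by default, "has only" when `exclusively` is true.
    func has(_ wanted: SketchCharacterProperties, exclusively: Bool = false) -> Bool {
        if exclusively {
            return unicodeScalars.allSatisfy { $0.sketchMatches(wanted) }
        }
        return unicodeScalars.contains { $0.sketchMatches(wanted) }
    }
}

let accentedSpace = Character("\u{0020}\u{0301}")
```

The call sites then read as `accentedSpace.has(.whitespace)` and `accentedSpace.has(.whitespace, exclusively: true)`, keeping the word "is" out of the API entirely.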

[Besides the fact that delineating these two types of properties can help developers make potentially more informed decisions about the types of operations that they want to perform.]

Michael_Ilseman:

This is an excellent example, thank you for pointing it out. This example, and the confusion it causes even sophisticated users like @nicklockwood, demonstrate that I proposed the wrong semantics. I think a better approach would be one that came up in discussion with @allevato’s point #2 (ignore the other points, we discovered flaws therein).

The semantics I proposed were easy to specify in terms of Unicode constructs, but are not intuitive. Since these are not in the Unicode namespace, I think it should instead be:
extension Character {
  var isUppercase: Bool { return String(self) == self.uppercased() && self.isCased }
  var isCased: Bool { return String(self) != self.uppercased() || String(self) != self.lowercased() || String(self) != self.titlecased() }
}
That is, a Character is uppercase if it is invariant under case mapping to upper and it varies under some other case mapping.
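A version of the quoted semantics that can be tried in a playground might look like this. It is a sketch under two assumptions: the properties are renamed (sketchIsCased, sketchIsUppercase, sketchIsLowercase) to avoid clashing with anything the standard library provides, and the titlecase comparison is left out because String has no titlecased() in the standard library as far as I know.

```swift
extension Character {
    /// Cased: the character changes under at least one case mapping.
    /// (Titlecase is deliberately omitted in this sketch.)
    var sketchIsCased: Bool {
        let s = String(self)
        return s != s.uppercased() || s != s.lowercased()
    }

    /// Uppercase: invariant under uppercasing, but cased.
    var sketchIsUppercase: Bool {
        let s = String(self)
        return s == s.uppercased() && sketchIsCased
    }

    /// Lowercase: invariant under lowercasing, but cased.
    var sketchIsLowercase: Bool {
        let s = String(self)
        return s == s.lowercased() && sketchIsCased
    }
}
```

Under these definitions a digit such as "1" is neither uppercase nor lowercase, since it does not vary under any case mapping.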

Yes, this is, I think, both a useful and, importantly, an intuitive solution. isUppercase/isLowercase returning true for characters which are not cased to begin with is confusing.

As an aside, though — these aren't the actual proposed implementations, but just examples of how they can be implemented today with public extensions, right? As written, these seem like very expensive operations. Considering the following (somewhat reasonable) code:

for char in str {
    if char.isUppercase {
        // Do uppercase thing
    } else if char.isLowercase {
        // Do lowercase thing
    } else {
        // Do no case thing
    }
}

Hitting the else will have created at least 16 intermediate Strings in the intervening checks. I am assuming that these are just stand-in suggestions, and we'll have significantly more optimized implementations, yeah?
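One way to cut the allocation count, short of a real optimized implementation, is to compute each case mapping once and classify from that. This is only a sketch: SketchCase, sketchCase, and sketchTally are hypothetical names, titlecase is again ignored, and a character that differs from both its upper and lower mappings is lumped in with uncased ones as "neither".

```swift
/// Hypothetical three-way classification for the loop above.
enum SketchCase {
    case upper, lower, neither
}

extension Character {
    /// Builds three Strings per character (the original plus two mappings)
    /// instead of repeating String(self) and the mappings in every check.
    var sketchCase: SketchCase {
        let s = String(self)
        let upper = s.uppercased()
        let lower = s.lowercased()
        if s == upper && s != lower { return .upper }
        if s == lower && s != upper { return .lower }
        return .neither
    }
}

/// Counts characters per class, mirroring the if/else chain in the loop.
func sketchTally(_ str: String) -> (upper: Int, lower: Int, neither: Int) {
    var counts = (upper: 0, lower: 0, neither: 0)
    for char in str {
        switch char.sketchCase {
        case .upper: counts.upper += 1
        case .lower: counts.lower += 1
        case .neither: counts.neither += 1
        }
    }
    return counts
}
```

A switch over one computed value also makes the "no case" branch explicit rather than something reached after two failed checks.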

So, to sum up: I agree with you! I think we need to offer the best solutions for developers we can, because not doing something for the sake of pure pedantic correctness leads to both frustration and incorrect code. I think we can do this in a way that leads not only to more intuitive API, but that suggests a world of complexity without diving into the details, and while still being useful. Terminology here is powerful, and I think that without declaring how things "are", we can both achieve what we want to do, and avoid influencing developers' views of what "is" and "is not".