Oh, I know the pain
To clarify: my concern is the latter — it's about defining certain properties on graphemes which may later contradict a definition provided by Unicode itself. This sentence shapes the majority of my concern on the issue:
If the definitions of these properties were to somehow change (and I doubt they would! but still), we would either:
- Be out of sync with the Unicode spec, or
- Cause a lot of implicit code breakage should we choose to change the semantics to match Unicode
I am not so concerned with this from a resilience perspective; my concern is that these semantics can't really be captured by any sort of type system. That means your old code (which didn't even need to be rebuilt!) now behaves differently in potentially subtle ways, and no type system could warn you. Of course, this problem is nothing new: any framework you link against can change out from underneath you in an incompatible way. But the concern with strings specifically is that they are:
- Extremely integral to how the vast majority of written code in the wild behaves (both in terms of their importance/prevalence, and in terms of how integral strings are to various programming languages)
- Informed by, and inform, a "world view" of how things work. Considering the loop code we wrote below, say that we decide to follow the newly proposed semantics of "`isUppercased` returns `true` if the character could change under case translation, but currently does not"; in this sense, the naïve code would work pretty much correctly. It would be reasonable for a developer to assume these semantics because they make intuitive sense. However, if Unicode then later mandated that grapheme clusters must follow the semantics of the `isUppercased` implementation as originally proposed here, the existing code would return nonsensical results inconsistent with the developer's original world view of how strings behave.
Yes, to clarify — the concerns detailed apply to the specific properties applied to the specific graphemes, and...
... to be clear, I am not in disagreement that we should offer better solutions than we have today! I think we can and should do better (even at the cost of some amount of "correctness", for some definition of "correctness"), but I think there are many ways we can go about it. I am in full support of doing this.
And I completely agree. One of the more infuriating messages from Swift 3 was the unavailability statement on `String.count`, even for someone who is aware of what might be going on under the hood.
I agree that there is no perfect answer here, but I think we can markedly do better. Naming has a lot of power here, considering how complex the situation can be. Even something as innocuous as `isWhitespace` can lead someone to believe things which may or may not be true. Two concerns with the name:
- "is": I think that the word "is" is dangerous, and potentially ambiguous here. "Is" `"\u{0020}\u{0301}"` whitespace? Yes. "Is" it also not whitespace? Yes! "Is" can be a reductive word, like "just", and I think that therein lies a danger. "Is it only whitespace?" No. "Does it have whitespace in it?" Yes. A clear delineation here would improve things leaps and bounds, I think; a simple change in terminology could be sufficient to get around the danger. If we choose "is" to mean "is exclusively", then, for instance, `"\u{0020}\u{0301}".isWhitespace == false`, while `"\u{0020}\u{0301}".hasWhitespace == true`. If we are concerned about an API explosion, then it might be sufficient to offer `has<Property>(exclusively: Bool = false)`: `$0.hasWhitespace() == true`, `$0.hasWhitespace(exclusively: true) == false`.
- This is a much smaller concern than the above, but "whitespace" here can have different implications for different use cases. "Whitespace" to someone might mean "a character used to control spacing which does not draw anything on the screen", while someone else might be more concerned with it being "a character used to delineate words as typed by a human being". What does it mean for a string to be delineated by `"\u{0020}\u{0301}"`s? ¯\_(ツ)_/¯ If I split the string on `$0.isWhitespace`, would I get what I expect? ¯\_(ツ)_/¯ But there are different expectations here.
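To make the "has" versus "is exclusively" distinction concrete, here is a minimal sketch of the straw-man `hasWhitespace(exclusively:)` spelling. The method name is the straw-man from above, not a real standard library API; the sketch leans on the existing `Unicode.Scalar.Properties.isWhitespace` and assumes that "has" means "any scalar in the grapheme" while "exclusively" means "every scalar in the grapheme".

```swift
extension Character {
    /// Hypothetical straw-man API, not part of the standard library.
    /// - exclusively == false: at least one scalar is whitespace.
    /// - exclusively == true: every scalar is whitespace.
    func hasWhitespace(exclusively: Bool = false) -> Bool {
        if exclusively {
            return unicodeScalars.allSatisfy { $0.properties.isWhitespace }
        }
        return unicodeScalars.contains { $0.properties.isWhitespace }
    }
}

// SPACE followed by COMBINING ACUTE ACCENT forms a single grapheme cluster.
let spaceWithAccent: Character = "\u{0020}\u{0301}"
let hasAny = spaceWithAccent.hasWhitespace()                    // true
let hasOnly = spaceWithAccent.hasWhitespace(exclusively: true)  // false
```

With this spelling the caller has to say which question they are asking, which is the whole point of avoiding a bare "is".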
I think that if, instead of defining what "is" "whitespace", we start offering "does this have whitespace" and "does this contain exclusively things which are whitespace", we also avoid the risk of further influencing developers' views on what "is" and "is not". I think a lot of misconceptions exist today because of poor API naming, and I think that we should not only offer answers here, but do better than our predecessors.
Note also that the above is totally straw-man naming. IIRC, the other thread somewhat covered the concept of making these properties an `OptionSet`; it's also conceivable that we would pivot this to a somewhat simpler `$0.has(.whitespace, exclusively: true)`, or something, but coming up with the specifics is a separate discussion. I just want to express my concern with defining what "is" and "is not", rather than saying what "has" or "does not have".
[Besides the fact that delineating these two types of properties can help developers make potentially more informed decisions about the types of operations that they want to perform.]
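For illustration only, the `OptionSet` flavor could look roughly like the sketch below. `ScalarTraits`, the `traits` property, and `has(_:exclusively:)` are all hypothetical names invented here; only `Unicode.Scalar.Properties` (`isWhitespace`, `isUppercase`, `isLowercase`) is existing API, and the choice of which properties to include is an assumption.

```swift
/// Hypothetical option set; the members chosen here are an assumption.
struct ScalarTraits: OptionSet {
    let rawValue: Int
    static let whitespace = ScalarTraits(rawValue: 1 << 0)
    static let uppercase = ScalarTraits(rawValue: 1 << 1)
    static let lowercase = ScalarTraits(rawValue: 1 << 2)
}

extension Unicode.Scalar {
    /// Hypothetical helper collecting the traits of a single scalar.
    var traits: ScalarTraits {
        var result: ScalarTraits = []
        if properties.isWhitespace { result.insert(.whitespace) }
        if properties.isUppercase { result.insert(.uppercase) }
        if properties.isLowercase { result.insert(.lowercase) }
        return result
    }
}

extension Character {
    /// Hypothetical: "has" asks about any scalar, "exclusively" about all of them.
    func has(_ wanted: ScalarTraits, exclusively: Bool = false) -> Bool {
        if exclusively {
            return unicodeScalars.allSatisfy { $0.traits.isSuperset(of: wanted) }
        }
        return unicodeScalars.contains { $0.traits.isSuperset(of: wanted) }
    }
}
```

A call site would then read `$0.has(.whitespace, exclusively: true)`, keeping a single entry point instead of one method per property.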
Yes, I think this is both a useful and, importantly, intuitive solution. `isUppercase`/`isLowercase` returning `true` for characters which are not cased to begin with is confusing.
As an aside, though — these aren't the actual proposed implementations, but just examples of how they can be implemented today with public extensions, right? As written, these seem like very expensive operations. Considering the following (somewhat reasonable) code:
    for char in str {
        if char.isUppercase {
            // Do uppercase thing
        } else if char.isLowercase {
            // Do lowercase thing
        } else {
            // Do no case thing
        }
    }
Hitting the `else` will have created at least 16 intermediate `String`s in the intervening checks. I am assuming that these are just stand-in suggestions, and we'll have significantly more optimized implementations, yeah?
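As a rough idea of what a cheaper check might look like, the sketch below classifies a character in a single pass over its scalars without building any intermediate `String`. `CaseClass` and `caseClass(of:)` are hypothetical names, and the rule used (uppercase only if some scalar is uppercase and none is lowercase, and vice versa) is my own simplifying assumption, not the semantics proposed in this thread.

```swift
/// Hypothetical classification; `.neither` covers uncased and mixed clusters.
enum CaseClass {
    case uppercase, lowercase, neither
}

/// Single pass over the scalars; no intermediate String is created.
func caseClass(of character: Character) -> CaseClass {
    var sawUpper = false
    var sawLower = false
    for scalar in character.unicodeScalars {
        if scalar.properties.isUppercase { sawUpper = true }
        if scalar.properties.isLowercase { sawLower = true }
    }
    switch (sawUpper, sawLower) {
    case (true, false): return .uppercase
    case (false, true): return .lowercase
    default: return .neither
    }
}

// The loop from above, rewritten against the hypothetical helper.
for char in "Ab 1" {
    switch caseClass(of: char) {
    case .uppercase: break // Do uppercase thing
    case .lowercase: break // Do lowercase thing
    case .neither: break   // Do no case thing
    }
}
```

Each character is inspected once, and the uncased branch no longer pays for two failed checks first.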
So, to sum up: I agree with you! I think we need to offer the best solutions for developers we can, because not doing something for the sake of pure pedantic correctness leads to both frustration and incorrect code. I think we can do this in a way that leads not only to more intuitive API, but that suggests a world of complexity without diving into the details, while still being useful. Terminology here is powerful, and I think that without declaring how things "are", we can both achieve what we want to do and avoid influencing developers' views of what "is" and "is not".