Pitch: Character and String properties


(Xiaodi Wu) #63

Well put. I share the same concern and only wish I could have expressed it so cogently.

I wonder if a large proportion of desired use cases would be served by a much more limited set of ASCII properties only.


(Michael Ilseman) #64

Welcome to String!

Thank you for bringing this concern up. I think a good, pragmatic approach is to reserve ourselves the ability to alter and tweak behavior surrounding unanticipated corner cases and changes. Your concern convinces me that these properties should definitely be resilient, at least for now.

However, I do want to make sure we don’t fall victim to Unicode-FUD. As written, it sounds like your concern is that we’re defining our own semantics and that Unicode might later somehow change our own semantics. I assume you’re either concerned that Unicode might take a dramatic turn in the semantics of the underlying scalar properties, or you’re concerned that Unicode might decide to define properties on graphemes and Swift will be left in its dust.

For the former, the scalar properties used are one of the more stable aspects of Unicode. Everything else in String is far more prone to breakage if Unicode suddenly reinvented itself by abandoning its key principles. This is even more hazardous for things inside the Unicode namespace. If the unthinkable happened, then yes, when we roll out a total API breakage of String we will also have to consider these properties.

For the latter, keeping these Character properties resilient would allow us significant leeway in adapting to a new world without rebuilding old code.

(When I say resilient, we still might have inlineable fast-paths for ASCII, unless we fear the semantics of ASCII changing too).

I want to be clear that I am not generalizing scalar properties to graphemes. I am defining specific semantics on specific graphemes, utilizing specific scalar properties to drive them. Scalar properties cannot, in general, be generalized to graphemes.

Sure, why not? (more serious answer below :-)

I’m not trying to be glib. There is no right answer, so we give the best answer we can. This is the same for String being a collection of graphemes, comparison honoring canonical equivalence, etc. All of which are not the right answer, but are the best answer we can give. Demanding that we provide either the “right” answer or no answer at all is what landed us with Swift 3-era String, and no one wants to go back to that for obvious reasons. If you make a type so obnoxious to use, people will misuse it in far worse ways.

Making String be a collection of Characters, as was done in Swift 4, violates the purity of an algebra of collections. I can construct two strings such that a.count + b.count > (a+b).count. String has append(_:Character), yet I can craft a non-identity Character such that appending it does not alter the count, but both modifies the last element and invalidates the last index. Nevertheless, String should be a collection of Characters, even though there are situations that can cause it to violate concatenation theory.
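Both violations are easy to reproduce. A minimal sketch, assuming a toolchain where String is a collection of Character (Swift 4 or later); the variable names are illustrative only:

```swift
// A base letter and a lone combining acute accent (U+0301).
let a = "e"
let b = "\u{0301}"

// Each string holds one Character on its own, but concatenation merges
// the accent into the preceding grapheme cluster.
let joined = a + b
let countsBefore = a.count + b.count   // counts of the two separate strings
let countAfter = joined.count          // count of the concatenation

// Appending a Character can likewise leave the count unchanged
// while replacing the last element.
var s = "e"
let lastBefore = s.last
s.append(Character("\u{0301}"))
let lastAfter = s.last
```

Here the separate counts sum to 2 while the concatenation has a count of 1, and the append changes `s.last` without changing `s.count`.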

I view properties on Character as the next logical progression in String’s long march towards being ergonomic. When someone is new to Swift, whether experienced from other platforms or new to programming, String is the first type they encounter. If the response to the question “what can a String tell me” involves a deep dive into Unicode as it did in Swift 3, then we’re doing something wrong. Now it’s a collection of Character, which is nice. If our response to “what can a Character tell me” is to point to the expert-use Unicode.Scalar.Properties and give them a stern here-be-dragons warning, then we’re doing something wrong.

(Not that there’s anything wrong with deep dives into Unicode. I just wouldn’t wish it upon my ~~enemies~~ users).

Coming back to the specific question of whether or not exotic graphemes starting with whitespace should be considered whitespace. Since there is no perfect answer, I think a good answer would be that a String containing whitespace also returns true for myStr.contains { $0.isWhitespace }.
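A minimal sketch of that consistency goal, assuming Swift 5 or later for `Unicode.Scalar.Properties`. The property name `leadsWithWhitespace` is hypothetical and only stands in for the permissive (leading-scalar) rule; it is not the pitched implementation:

```swift
extension Character {
    // Hypothetical name. Permissive rule: only the leading scalar is inspected.
    // A Character always contains at least one scalar, so `first!` is safe.
    var leadsWithWhitespace: Bool {
        return unicodeScalars.first!.properties.isWhitespace
    }
}

// SPACE followed by COMBINING ACUTE ACCENT forms a single grapheme cluster,
// so this string has three Characters.
let myStr = "a\u{0020}\u{0301}b"

// Scalar-level view: the string contains a White_Space scalar.
let scalarLevel = myStr.unicodeScalars.contains { $0.properties.isWhitespace }

// Character-level view under the permissive rule agrees with it.
let characterLevel = myStr.contains { $0.leadsWithWhitespace }
```

Under a strict single-scalar rule, the character-level query would instead report no whitespace for this string, which is the mismatch the permissive definition avoids.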

This is an excellent example, thank you for pointing it out. This example, and the confusion it causes even sophisticated users like @nicklockwood, demonstrate that I proposed the wrong semantics. I think a better approach would be one that came up in discussion with @allevato’s point #2 (ignore the other points, we discovered flaws therein).

The semantics I proposed were easy to specify in terms of Unicode constructs, but are not intuitive. Since these are not in the Unicode namespace, I think it should instead be:

extension Character {
  var isUppercase: Bool { return String(self) == self.uppercased() && self.isCased }
  var isCased: Bool { return String(self) != self.uppercased() || String(self) != self.lowercased() || String(self) != self.titlecased() }
}

That is, a Character is uppercase if it is invariant under case mapping to upper and it varies under some other case mapping.
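For anyone who wants to experiment, here is a runnable approximation of those semantics. It is only a sketch: the `sketch`-prefixed names are hypothetical (chosen to avoid clashing with anything the standard library ships), and the titlecase comparison is left out so that the code only relies on String's `uppercased()` and `lowercased()`:

```swift
extension Character {
    // Cased: the character changes under at least one case mapping.
    // (Titlecase is deliberately omitted in this sketch.)
    var sketchIsCased: Bool {
        let s = String(self)
        return s != s.uppercased() || s != s.lowercased()
    }

    // Uppercase: invariant under mapping to upper, and cased at all.
    var sketchIsUppercase: Bool {
        let s = String(self)
        return s == s.uppercased() && sketchIsCased
    }
}

let upperA: Character = "A"
let lowerA: Character = "a"
let digit: Character = "1"
```

With this definition an uncased character such as a digit is neither uppercase nor cased, which is the point of the revised semantics.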

Actually, it is this very tendency that I find to be an argument for Character properties. Users define these properties themselves, poorly, but we can give a more robust answer efficiently.


(Michael Ilseman) #65

This seems like a similar design space that @jrose was mentioning in another thread:

Similarly, @Ben_Cohen’s blog post identifies a starter pitch for a lazy split collection.

It seems like there’s a general need for a (configurable) lazy split collection. The proposed String.Lines likewise would be inventing a (less general) collection for this purpose. It may make sense to spin off a discussion to design this general construct.


Corner-cases in `Character` classification of whitespace
(Itai Ferber) #66

Oh, I know the pain :upside_down_face:

To clarify: my concern is the latter — it’s about defining certain properties on graphemes which may later contradict a definition provided by Unicode itself. This sentence shapes the majority of my concern on the issue:

If the definitions of these properties were to somehow change (and I doubt they would! but still), we would either:

  1. Be out of sync with the Unicode spec, or
  2. Cause a lot of implicit code breakage should we choose to change the semantics to match Unicode

I am not so concerned with this from a resilience perspective, but from the perspective of the fact that these semantics can’t really be captured by any sort of type system. Which means that your old code (which didn’t even need to be rebuilt!) now behaves differently in potentially subtle ways, and no type system could warn you. Of course, this problem is nothing new — any framework you link against can change out from underneath you in an incompatible way; but the concern with strings specifically is that they are:

  1. Extremely integral to how the vast majority of written code in the wild behaves (both in terms of their importance/prevalence, and in terms of how integral strings are to various programming languages)
  2. Informed by, and inform, a “world view” of how things work. Considering the loop code we wrote below — say that we decide to follow the newly proposed semantics of "isUppercased returns true if the character could change under case translation, but currently does not"; in this sense, the naïve code would work pretty much correctly. It would be reasonable for a developer to assume these semantics because they make intuitive sense. However, if Unicode then later mandated that grapheme clusters must follow the semantics of the isUppercased implementation as originally proposed here, the existing code would return nonsensical results inconsistent with the developer’s original world view of how strings behave

Yes, to clarify — the concerns detailed apply to the specific properties applied to the specific graphemes, and…

… to be clear, I am not in disagreement that we should offer better solutions than we have today! I think we can and should do better (even at the cost of some amount of “correctness”, for some definition of “correctness”), but I think there are many ways we can go about it. I am in full support of doing this.

And I completely agree. One of the more infuriating messages from Swift 3 was the unavailability statement on String.count, even for someone who is aware of what might be going on under the hood. :slight_smile:

I agree that there is no perfect answer here, but I think we can markedly do better. Naming has a lot of power here, considering how complex the situation can be. Even something as innocuous as isWhitespace can lead someone to believe things which may or may not be true. Two concerns with the name:

  1. “is”: I think that the word “is” is dangerous, and potentially ambiguous here. “Is” "\u{0020}\u{0301}" whitespace? Yes. “Is” it also not whitespace? Yes! “Is” can be a reductive word, like “just”, and I think that therein lies a danger. “Is it only whitespace?” No. “Does it have whitespace in it?” Yes. A clear delineation here would improve things leaps and bounds, I think; a simple change in terminology could be sufficient to get around the danger. If we choose “is” to mean “is exclusively”, then, for instance "\u{0020}\u{0301}".isWhitespace == false, while "\u{0020}\u{0301}".hasWhitespace == true. If we are concerned about an API explosion, then it might be sufficient to offer has<Property>(exclusively: Bool = false): $0.hasWhitespace() == true, $0.hasWhitespace(exclusively: true) == false
  2. This is a much smaller concern than the above, but “whitespace” here can have different implications for different use cases. “Whitespace” to someone might mean “a character used to control spacing which does not draw anything on the screen”, while someone else might be more concerned with it being “a character used to delineate words as typed by a human being”. What does it mean for a string to be delineated by "\u{0020}\u{0301}"s? ¯\_(ツ)_/¯ If I split the string on $0.isWhitespace, would I get what I expect? ¯\_(ツ)_/¯ But there are different expectations here.
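To make the straw-man from point 1 concrete, here is a minimal sketch, assuming Swift 5 or later. `hasWhitespace(exclusively:)` is the hypothetical spelling from the post, not an existing standard library method:

```swift
extension Character {
    // Straw-man API: "has" asks whether any scalar is whitespace,
    // "exclusively" asks whether every scalar is whitespace.
    func hasWhitespace(exclusively: Bool = false) -> Bool {
        if exclusively {
            return unicodeScalars.allSatisfy { $0.properties.isWhitespace }
        }
        return unicodeScalars.contains { $0.properties.isWhitespace }
    }
}

let spaceWithAccent: Character = "\u{0020}\u{0301}"
let plainSpace: Character = " "
```

For the accented space, `hasWhitespace()` is true while `hasWhitespace(exclusively: true)` is false; for a plain space both are true.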

I think that if, instead of defining what “is” “whitespace”, we start offering “does this have whitespace” and “does this contain exclusively things which are whitespace”, we also avoid the risk of further influencing developers’ views on what “is” and “is not”. I think a lot of misconceptions exist today because of poor API naming, and I think that we should not only offer answers here, but do better than our predecessors. :slightly_smiling_face:

Note also that the above is totally straw-man naming. IIRC, the other thread somewhat covered the concept of making these properties an OptionSet; it’s also conceivable that we would pivot this to a somewhat simpler $0.has(.whitespace, exclusively: true), or something, but coming up with the specifics is a separate discussion. I just want to express my concern with defining what “is” and “is not”, rather than saying what “has” or “does not have”.

[Besides the fact that delineating these two types of properties can help developers make potentially more informed decisions about the types of operations that they want to perform.]

Yes, I think this is both a useful and, importantly, intuitive solution. isUppercase/isLowercase returning true for characters which are not cased to begin with is confusing.

As an aside, though — these aren’t the actual proposed implementations, but just examples of how they can be implemented today with public extensions, right? As written, these seem like very expensive operations. Considering the following (somewhat reasonable) code:

for char in str {
    if char.isUppercase {
        // Do uppercase thing
    } else if char.isLowercase {
        // Do lowercase thing
    } else {
        // Do no case thing
    }
}

Hitting the else will have created at least 16 intermediate Strings in the intervening checks. I am assuming that these are just stand-in suggestions, and we’ll have significantly more optimized implementations, yeah?
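One way the cost could be kept down is an ASCII fast path in front of the general check. This is only a sketch of that idea, assuming Swift 5 or later for `Character.asciiValue`; the property name is hypothetical and the slow path ignores titlecase:

```swift
extension Character {
    // Hypothetical name. ASCII characters are answered with a range check
    // and never allocate; only non-ASCII characters build Strings.
    var sketchIsUppercaseFast: Bool {
        if let ascii = asciiValue {
            return (UInt8(ascii: "A")...UInt8(ascii: "Z")).contains(ascii)
        }
        // General path: invariant under upper mapping, varies under lower.
        let s = String(self)
        return s == s.uppercased() && s != s.lowercased()
    }
}

let asciiUpper: Character = "Q"
let asciiLower: Character = "q"
let accentedUpper: Character = "\u{00C9}"   // LATIN CAPITAL LETTER E WITH ACUTE
```

The loop above would then only pay for intermediate Strings on non-ASCII input.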

So, to sum up: I agree with you! I think we need to offer the best solutions for developers we can, because not doing something for pure pedantic correctness leads to both frustration and incorrect code. I think we can do this in a way that leads not only to more intuitive API, but that suggests a world of complexity without diving into the details, and while still being useful. Terminology here is powerful, and I think that without declaring how things “are”, we can both achieve what we want to do, and avoid influencing developers’ views of what “is” and “is not”.


(Michael Ilseman) #67

Could you provide more detail behind this concern? Are you afraid of a change that cannot be accommodated through resilience? What could this look like?

Again, the rest of String is at far greater risk. String.count, for instance, is known to vary significantly from version to version of Unicode.

What do you mean by capturing semantics in the type system? Swift does not have dependent types, and a String’s length is not captured by the type system. A type system cannot reject all faulty programs, much less read the developer’s mind about how they intend behavior to evolve over time without recompilation.

Is there something about these properties that is cause for greater concern than other functionality in String? The standard library will adapt to an ever-changing Unicode to the best of its ability.

:+1:

For naming, hasWhitespace is a good alternative name. Character.has(OptionSet<...>, exclusively: ...) is a valid alternative design.
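A rough sketch of what the OptionSet-based alternative might look like, assuming Swift 5 or later. Everything here is hypothetical: the `CharacterTraits` type, its two members, the `traits` helper, and the exact meaning of `exclusively` are illustrative choices, not a proposed design:

```swift
// Hypothetical option set; a real design would carry many more members.
struct CharacterTraits: OptionSet {
    let rawValue: Int
    static let whitespace = CharacterTraits(rawValue: 1 << 0)
    static let alphabetic = CharacterTraits(rawValue: 1 << 1)
}

extension Unicode.Scalar {
    // Hypothetical helper that collects the traits of one scalar.
    var traits: CharacterTraits {
        var result: CharacterTraits = []
        if properties.isWhitespace { result.insert(.whitespace) }
        if properties.isAlphabetic { result.insert(.alphabetic) }
        return result
    }
}

extension Character {
    // `exclusively: true` requires every scalar to carry the traits;
    // otherwise a single matching scalar is enough.
    func has(_ wanted: CharacterTraits, exclusively: Bool = false) -> Bool {
        if exclusively {
            return unicodeScalars.allSatisfy { $0.traits.isSuperset(of: wanted) }
        }
        return unicodeScalars.contains { $0.traits.isSuperset(of: wanted) }
    }
}

let accentedSpace: Character = "\u{0020}\u{0301}"
```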

Of course


(Michael Ilseman) #68

The feedback from this thread has been excellent and highly valuable, especially from those concerned with the approach. I’m spinning off all segmentation (e.g. lines/words) into a future pitch to align with an approach to lazy splitting. I’m focusing on Character properties, and expanding the properties to include queries that I think we can reasonably give good answers for while maintaining flexibility of evolution. All definitions provided are for illustration purposes and will likely land as some combination of inlineable ASCII fast paths and resilient fall-backs.

(Originally, I tried to keep these properties confined to “programmer concepts” such as whitespace and newlines. This was largely out of a fear of nailing down any semantics in the age of Unicode, though I now feel this fear was unfounded.)

Things in code quotes are pitched, things in “word quotes” are tentative, alternatives, or rejects. Queries are presented in my formulation; see alternatives below. A list of pitched Character properties and a brief description of their semantics:

  • isASCII: CR-LF or single-scalar <= 0x7F
  • asciiValue: UInt8? (with a comment explaining CR-LF value-normalizes to LF)
  • isWhitespace: Permissive: leading scalar with property White_Space
    • Alternative name: “hasWhitespace”
    • Alternative definition: Strict: CR-LF or single-scalar with property White_Space
  • isNewline: Permissive: leading scalar is either CR, LF, NEL, LS, or PS
  • isNumber: Permissive: leading numeric scalar (Numeric_Type != None)
    • isDecimalDigit: Strict: single-scalar with Numeric_Type=Decimal
    • isHexadecimalDigit: Strict: single-scalar with property Hex_Digit
    • decimalValue: Int?, hexadecimalValue: Int?
      • Alternative: “numericValue” with some kind of grapheme evaluation logic
      • Alternative: just “hexadecimalValue” (with a more general name), as “decimalValue” is a subset
    • Rejected: “isDigit”. Modern Unicode advises against the distinction between a non-decimal digit and a number.
  • isLetter: Permissive: leading scalar with derived property Alphabetic
    • isUppercase, isLowercase, isTitlecase: varies under other case conversion, but invariant to upper/lower/title.
      • Alternative: “isUppercased” that just means invariant under conversion to upper
    • isCased: varies under at least one case conversion
      • Alternative name: hasCase
    • deferred: “isIdeographic”
      • Deferring reasoning about ideographic description sequences for now (if ever), especially directly on Character.
  • isSymbol: Permissive: leading scalar General Category S*
    • isMathSymbol : Permissive: leading scalar has derived property Math (not a strict subset of isSymbol)
    • isCurrencySymbol: Permissive, leading scalar has General Category Sc
    • Alternative definition of “isSymbol”: leading scalar has General Category S* or derived property Math
  • isPunctuation: Permissive: leading scalar General Category P*
    • deferred: “isDash”, “isQuotationMark”, “isTerminalPunctuation”
      • Deferred out of fear of wading too far into linguistic analysis without a proper framework or toolset.
  • isEmoji (pending investigation)
    • deferred: further queries: “isFlagEmoji”, “isUndeadEmoji”, “has/getFooModifier”, …
  • deferred: categorization of abstract graphemes, e.g. “isGraphic”, “isFormat”, “isControl”
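To illustrate how the first few definitions in the list read in code, here is a sketch of the ASCII and newline queries as described above. The `sketch`-prefixed names are hypothetical, and the newline set is exactly the one listed in this pitch (CR, LF, NEL, LS, PS), which may differ from whatever eventually ships:

```swift
extension Character {
    // CR-LF value-normalizes to LF; otherwise a single scalar <= 0x7F.
    var sketchASCIIValue: UInt8? {
        if self == "\r\n" { return 0x0A }
        guard unicodeScalars.count == 1,
              let scalar = unicodeScalars.first,
              scalar.isASCII else { return nil }
        return UInt8(scalar.value)
    }

    var sketchIsASCII: Bool { return sketchASCIIValue != nil }

    // Permissive: only the leading scalar is inspected.
    var sketchIsNewline: Bool {
        switch unicodeScalars.first!.value {
        case 0x0A, 0x0D, 0x85, 0x2028, 0x2029: return true
        default: return false
        }
    }
}

let crlf: Character = "\r\n"
let eAcute: Character = "\u{00E9}"
```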

Alternative design: Character.has(_: OptionSet<…>, exclusively: Bool, …)

Alternative to this pitch: redesign CharacterSet to solve this, either through some new concept or incremental evolution to support graphemes.

(Note: Of course, definitions may evolve to match unanticipated changes in Unicode, potentially through a source-compatibility-preserving deprecation process).

edit: Added isNewline, which I initially forgot. Dropped one alternative, as isHexadecimalDigit is not a superset of isDecimalDigit.


(TJ Usiyan) #69

Thank you for this!
I’m a bit reluctant to request it but isOctalDigit seems like a good fit here. Just thinking about programming as a domain and looking at the swift grammar, octal is as relevant as hex.


(Michael Ilseman) #70

Octal wouldn’t really be that different as octal numerals are a subset of decimal numerals. For example, you can check for decimal and use decimalValue < 8. Hex is notably different from decimal as it uses letters to denote additional digits.

I’m not opposed to adding isOctalDigit, it just currently feels a little more out there than the others. The other pitched properties are not easily derivable from each other.


(Karl) #71

This looks good.

I’m wondering about the decimalValue and hexadecimalValue properties, though. I understand that ICU exposes them that way, but it seems weird for Swift given that the code to parse an entire String as an integer lives as a failable initialiser on Int, rather than an optional property on String. Perhaps Int.init?(Character, radix: Int) would be better for single character conversion?

Note that this is only about the Character->Int conversion, not the isHexadecimalDigit & co. Boolean properties.


(Daryle Walker) #72

I’ve gotten messed up by that too. (“Why isn’t my CR-followed-by-LF parsing code being triggered in my test? Because Swift for some reason reads them together in a single CRLF-valued Character!”)


(Michael Ilseman) #73

Reason: https://www.unicode.org/reports/tr29/#GB3
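A short illustration of the effect of that rule, and of two ways to still detect the pair. The variable names are illustrative only:

```swift
let text = "a\r\nb"

// GB3 keeps CR and LF together, so the pair is one Character.
let characterCount = text.count
let scalarCount = text.unicodeScalars.count

// At the Character level, look for the combined CR-LF Character;
// a lone CR Character is not an element of this string.
let hasCRLFCharacter = text.contains("\r\n" as Character)
let hasLoneCRCharacter = text.contains("\r" as Character)

// At the scalar level, CR followed by LF is still visible.
let scalars = Array(text.unicodeScalars)
let crThenLF = zip(scalars, scalars.dropFirst()).contains { pair in
    pair.0 == "\r" && pair.1 == "\n"
}
```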


(Itai Ferber) #74

I think I phrased my thoughts here poorly. To clarify: there is no cause for concern here greater than with anything else in Swift. :slightly_smiling_face: The consideration is just that if Unicode produced guidelines for treating grapheme clusters in a way that is inconsistent with the design that we’ve chosen (e.g. we decide that "\u{0020}\u{0301}".isWhitespace == true, and Unicode later dictates that "\u{0020}\u{0301}".isWhitespace == false), we would have to either:

  1. Change the behavior implicitly to follow suit (bad), or
  2. Follow our standard process of deprecate-and-replace

The consideration there is that it would be harder to deprecate and replace a name as fundamental as isWhitespace. Is the possibility of this happening likely? Not at all. But given that we will be working to match an external spec, we should aim for API resilience and do our best to avoid painting ourselves into a corner.

Thanks for being receptive to my concerns — I’m happy with the underlying properties we’d be exposing here, but I want to make sure we get the naming of this right as best we can!


(Michael Ilseman) #75

I intentionally did not check how ICU exposes these values. They’re based off of the numeric values defined by the UCD.

There are a variety of ways to implement this. ICU’s dedicated APIs for something this specific tend to be discouraged in modern Unicode usage, so we may base the returned values off the more general Unicode.Scalar.Properties.numericValue.

We could certainly consider FixedWidthInteger.init(_:Character, radix: Int=10), which corresponds to the one for String. This would be restricted to ASCII, just like the String one.

However, Character.decimalValue should properly return 3 for the Thai numeral ๓.
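As a sketch of how that could fall out of the scalar properties, assuming Swift 5 or later for `numericType` and `numericValue`. The name `sketchDecimalValue` is hypothetical, and this follows the strict single-scalar definition from the pitch list:

```swift
extension Character {
    // Strict: a single scalar whose Numeric_Type is Decimal.
    var sketchDecimalValue: Int? {
        guard unicodeScalars.count == 1,
              let scalar = unicodeScalars.first,
              scalar.properties.numericType == Unicode.NumericType.decimal,
              let value = scalar.properties.numericValue else { return nil }
        return Int(value)
    }
}

let thaiThree: Character = "\u{0E53}"   // THAI DIGIT THREE
let asciiSeven: Character = "7"
let letter: Character = "a"
```

Both the Thai digit and the ASCII digit produce a value here, while the letter produces nil.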


(Jordan Rose) #76

Um. What does isTitlecase do on a single Character? What does that mean?


#77

It’s definitely a confusing name, and perhaps we can think of a better one. It’s explained here, e.g.

For example, U+01C7 (LJ) maps to U+01C8 (Lj) rather than to U+01C9 (lj).
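The three mappings for that scalar can be inspected directly, assuming Swift 5 or later for the case mapping properties on `Unicode.Scalar.Properties`:

```swift
let lj: Unicode.Scalar = "\u{01C7}"   // LATIN CAPITAL LETTER LJ

let upper = lj.properties.uppercaseMapping
let title = lj.properties.titlecaseMapping
let lower = lj.properties.lowercaseMapping

// The titlecase form is its own titlecase mapping, but it changes
// under both the upper and the lower mapping.
let titleForm: Unicode.Scalar = "\u{01C8}"
let titleIsStable = titleForm.properties.titlecaseMapping == String(titleForm)
```

So a titlecase Character, in the sense used in the pitch, is one that is invariant under the titlecase mapping while varying under some other case mapping.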


(Michael Ilseman) #78

I opened a PR on top of some of @allevato’s scalar property work to play with. CharacterProperties.swift on that branch has a reference implementation.


(Xiaodi Wu) #79

This is a tricky one. For reasons that still elude me, Unicode recommendations specifically state that whether a character is presented as emoji can differ based on platform, application, etc. And, true to form, whether a character is presented as emoji really does differ based on platform, application, etc.

Therefore, the plain answer to “is this emoji?” will never correspond to any Swift method named isEmoji that isn’t hooked up to text rendering facilities. I could get behind a hasEmojiPresentation property (corresponding to Unicode \p{Emoji_Presentation} + characters that have an emoji presentation selector), but I think that may be the best we can do at the moment. It seems to me that any attempted standard library isEmoji would be actively misleading.
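A rough sketch of what such a property could look like, assuming Swift 5 or later (on Apple platforms `isEmojiPresentation` carries an OS availability requirement). The name `sketchHasEmojiPresentation` is hypothetical, and treating any U+FE0F in the cluster as an emoji presentation request is a simplification:

```swift
extension Character {
    // Hypothetical: leading scalar has Emoji_Presentation, or the cluster
    // contains VARIATION SELECTOR-16 (U+FE0F).
    var sketchHasEmojiPresentation: Bool {
        if unicodeScalars.contains(where: { $0.value == 0xFE0F }) {
            return true
        }
        return unicodeScalars.first!.properties.isEmojiPresentation
    }
}

let grinning: Character = "\u{1F600}"            // emoji presentation by default
let textHeart: Character = "\u{2764}"            // text presentation by default
let emojiHeart: Character = "\u{2764}\u{FE0F}"   // heart plus presentation selector
```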


(Michael Ilseman) #80

Yes, isEmoji is pending investigation into whether we can provide a reasonable answer. If not, it will be dropped. It could also be conditionally available based on whether platforms are capable of answering it.

edit: addendum

For example, an overly inclusive strawman semantics could be (using emoji-data):

  • Character.isEmoji: leading scalar has property Emoji and doesn’t have property Emoji_Component, or first two scalars are regional indicators or keycap sequences.

This would permit even emoji whose default presentation is textual. I.e., this includes both points of ED-20 as well as flags and keycaps. Alternatively, we can be overly-restrictive by going off of default presentation like you mentioned, i.e. the first point of ED-20 only, and then include sequences for flags/keycaps.

Finally, we may decide to defer to the future as emoji are a newer concept and may continue to be in flux in the near term. This is my current leaning :sweat_smile:.


(Karl) #81

Hmm… I never considered it, but shouldn’t Int.init?(String, radix: Int) also be unicode-aware (with an ASCII fast-path)?

I mean, I guess non-10 radixes (radii?) would also only make sense for ASCII. But let’s say I write an App which parses user-input using this function and the user’s language is set to Thai; I would expect that they can write “๓” and I’ll see a 3.

I think the String->Int and Character->Int conversion methods should be consistent.


(Michael Ilseman) #82

I think a better approach would be to rename the existing one to have an explicit argument label (This would also simplify type checking as the init is failable). Example:

- FixedWidthInteger.init?<S: StringProtocol>(_: S, radix: Int = 10)
+ FixedWidthInteger.init?<S: StringProtocol>(ascii: S, radix: Int = 10)

+ FixedWidthInteger.init?(ascii: Character, radix: Int = 10)

The initializer is intentionally simplistic, as the task of constructing a value from a String in a linguistic context deserves a much more involved design.
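A minimal sketch of the Character overload, assuming Swift 5 or later for `Character.isASCII`. It simply forwards to the existing string-parsing initializer, so it is an illustration of the proposed spelling rather than a suggested implementation:

```swift
extension FixedWidthInteger {
    // Hypothetical initializer mirroring the `ascii:` label proposed above.
    init?(ascii character: Character, radix: Int = 10) {
        // Reject anything outside ASCII up front.
        guard character.isASCII else { return nil }
        // Delegate to the existing String-based parser.
        self.init(String(character), radix: radix)
    }
}

let seven = Int(ascii: Character("7"))
let fifteen = Int(ascii: Character("f"), radix: 16)
let thai = Int(ascii: Character("\u{0E53}"))   // non-ASCII digit is rejected
let notADigit = Int(ascii: Character("x"))
```

Note that `UInt8` already has an unrelated `init(ascii:)` taking a `Unicode.Scalar`, so the calls above pass an explicit `Character` to keep the overload choice obvious.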

For Character.decimalDigitValue, however, a single Character which represents a decimal digit in supported writing systems is well defined and simpler than constructing a value from a String in the abstract. There is no radix, as the writing system itself is inherently base-10. Since these are single numbers, there is no further interpretation needed (e.g. writing direction doesn’t matter).

I was thinking it was a stretch to tackle this with this pitch, but I think it improves consistency. Thank you for pointing this out!

edit: grammar, decimalDigitValue