Pitch: Character and String properties

The feedback from this thread has been excellent and highly valuable, especially from those concerned with the approach. I’m spinning off all segmentation (e.g. lines/words) into a future pitch to align with an approach to lazy splitting. I’m focusing on Character properties, and expanding the properties to include queries that I think we can reasonably give good answers for while maintaining flexibility of evolution. All definitions provided are for illustration purposes and will likely land as some combination of inlineable ASCII fast paths and resilient fall-backs.
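To make the "inlineable ASCII fast path plus resilient fall-back" shape concrete, here is a minimal sketch for a whitespace query. It assumes a toolchain where `Unicode.Scalar.Properties` is available (Swift 5 or later). The names `fastPathIsWhitespace` and `_isWhitespaceSlow` are hypothetical and only illustrate the split; they are not the pitched implementation.

```swift
public extension Character {
    /// Hypothetical name. The body is inlinable, so clients can check the
    /// ASCII range without a call across the module boundary.
    @inlinable
    var fastPathIsWhitespace: Bool {
        // A Character always contains at least one scalar.
        let scalar = unicodeScalars.first!
        if scalar.value < 0x80 {
            // ASCII fast path: White_Space covers U+0009...U+000D and U+0020.
            return scalar.value == 0x20 || (0x09...0x0D).contains(scalar.value)
        }
        // Everything else goes through the non-inlined fall-back.
        return _isWhitespaceSlow(scalar)
    }
}

/// Hypothetical fall-back. It is not inlinable, so its definition can change
/// as Unicode evolves without recompiling clients.
@usableFromInline
internal func _isWhitespaceSlow(_ scalar: Unicode.Scalar) -> Bool {
    return scalar.properties.isWhitespace
}
```

Only the ASCII check is baked into client code in this arrangement; the Unicode table lookup stays behind a function call.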

(Originally, I tried to keep these properties confined to “programmer concepts” such as whitespace and newlines. This was largely out of a fear of nailing down any semantics in the age of Unicode, though I now feel this fear was unfounded.)

Things in code quotes are pitched; things in “word quotes” are tentative, alternatives, or rejected. Queries are presented in my formulation; see alternatives below. Here is a list of pitched Character properties with a brief description of their semantics:

  • isASCII: CR-LF or single-scalar <= 0x7F
  • asciiValue: UInt8? (with a comment explaining CR-LF value-normalizes to LF)
  • isWhitespace: Permissive: leading scalar with property White_Space
    • Alternative name: “hasWhitespace”
    • Alternative definition: Strict: CR-LF or single-scalar with property White_Space
  • isNewline: Permissive: leading scalar is either CR, LF, NEL, LS, or PS
  • isNumber: Permissive: leading numeric scalar (Numeric_Type != None)
    • isDecimalDigit: Strict: single-scalar with Numeric_Type=Decimal
    • isHexadecimalDigit: Strict: single-scalar with property Hex_Digit
    • decimalValue: Int?, hexadecimalValue: Int?
      • Alternative: “numericValue” with some kind of grapheme evaluation logic
      • Alternative: just “hexadecimalValue” (with a more general name), as “decimalValue” is a subset
    • Rejected: “isDigit”. Modern Unicode advises against the distinction between a non-decimal digit and a number.
  • isLetter: Permissive: leading scalar with derived property Alphabetic
    • isUppercase, isLowercase, isTitlecase: invariant under conversion to upper/lower/title case, respectively, but varies under another case conversion.
      • Alternative: “isUppercased” that just means invariant under conversion to upper
    • isCased: varies under at least one case conversion
      • Alternative name: hasCase
    • deferred: “isIdeographic”
      • Deferring reasoning about ideographic description sequences for now (if ever), especially directly on Character.
  • isSymbol: Permissive: leading scalar General Category S*
    • isMathSymbol: Permissive: leading scalar has derived property Math (not a strict subset of isSymbol)
    • isCurrencySymbol: Permissive: leading scalar has General Category Sc
    • Alternative definition of “isSymbol”: leading scalar has General Category S* or derived property Math
  • isPunctuation: Permissive: leading scalar General Category P*
    • deferred: “isDash”, “isQuotationMark”, “isTerminalPunctuation”
      • Deferred out of fear of wading too far into linguistic analysis without a proper framework or toolset.
  • isEmoji (pending investigation)
    • deferred: further queries: “isFlagEmoji”, “isUndeadEmoji”, “has/getFooModifier”, …
  • deferred: categorization of abstract graphemes, e.g. “isGraphic”, “isFormat”, “isControl”
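As a rough illustration of the permissive and strict formulations in the list above, here is a minimal sketch of a few of the queries. It assumes a toolchain where `Unicode.Scalar.Properties` is available (Swift 5 or later). The `sketch…` names and the `leadingScalar` / `singleScalar` helpers are hypothetical, chosen so they do not collide with anything in the standard library; the bodies follow the definitions as listed here and are not a final implementation.

```swift
extension Character {
    /// Hypothetical helper: first scalar of the grapheme (a Character is never empty).
    var leadingScalar: Unicode.Scalar { return unicodeScalars.first! }

    /// Hypothetical helper: the scalar if the grapheme has exactly one, else nil.
    var singleScalar: Unicode.Scalar? {
        return unicodeScalars.count == 1 ? unicodeScalars.first : nil
    }

    /// isASCII: CR-LF or single-scalar <= 0x7F.
    var sketchIsASCII: Bool {
        if self == "\r\n" { return true }
        guard let scalar = singleScalar else { return false }
        return scalar.value <= 0x7F
    }

    /// asciiValue: CR-LF value-normalizes to LF.
    var sketchASCIIValue: UInt8? {
        if self == "\r\n" { return 0x0A }
        guard let scalar = singleScalar, scalar.value <= 0x7F else { return nil }
        return UInt8(scalar.value)
    }

    /// isWhitespace, permissive: leading scalar has White_Space.
    var sketchIsWhitespace: Bool {
        return leadingScalar.properties.isWhitespace
    }

    /// isWhitespace, strict alternative: CR-LF or single-scalar with White_Space.
    var sketchIsWhitespaceStrict: Bool {
        if self == "\r\n" { return true }
        return singleScalar?.properties.isWhitespace ?? false
    }

    /// isNewline, permissive: leading scalar is CR, LF, NEL, LS, or PS.
    var sketchIsNewline: Bool {
        switch leadingScalar.value {
        case 0x0A, 0x0D, 0x85, 0x2028, 0x2029: return true
        default: return false
        }
    }

    /// isNumber, permissive: leading scalar has Numeric_Type != None.
    var sketchIsNumber: Bool {
        return leadingScalar.properties.numericType != nil
    }

    /// isDecimalDigit, strict: single-scalar with Numeric_Type=Decimal.
    var sketchIsDecimalDigit: Bool {
        return singleScalar?.properties.numericType == .decimal
    }
}
```

For example, a space followed by a combining acute accent (`" \u{301}"`) is a single grapheme that satisfies the permissive whitespace query but not the strict one, and `"½"` satisfies `sketchIsNumber` but not `sketchIsDecimalDigit`.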

Alternative design: Character.has(_: OptionSet<…>, exclusively: Bool, …)
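One possible reading of this alternative is sketched below. The pitch leaves the option type and the remaining parameters unspecified, so everything here is an assumption: the `CharacterTraits` type, its three members, the `sketchTraits` helper, and in particular the interpretation of `exclusively` as "every scalar in the grapheme must match" rather than "only the leading scalar must match".

```swift
/// Hypothetical option set; the pitch does not name or enumerate it.
struct CharacterTraits: OptionSet {
    let rawValue: UInt8
    static let whitespace = CharacterTraits(rawValue: 1 << 0)
    static let number     = CharacterTraits(rawValue: 1 << 1)
    static let letter     = CharacterTraits(rawValue: 1 << 2)
}

extension Unicode.Scalar {
    /// Hypothetical helper: the traits of a single scalar.
    var sketchTraits: CharacterTraits {
        var traits: CharacterTraits = []
        if properties.isWhitespace { traits.insert(.whitespace) }
        if properties.numericType != nil { traits.insert(.number) }
        if properties.isAlphabetic { traits.insert(.letter) }
        return traits
    }
}

extension Character {
    /// Assumed semantics: permissive by default (leading scalar only);
    /// `exclusively: true` requires every scalar to carry the traits.
    func has(_ traits: CharacterTraits, exclusively: Bool = false) -> Bool {
        if exclusively {
            return unicodeScalars.allSatisfy { $0.sketchTraits.isSuperset(of: traits) }
        }
        return unicodeScalars.first!.sketchTraits.isSuperset(of: traits)
    }
}
```

Under these assumptions, `Character(" \u{301}").has(.whitespace)` is true, while the same call with `exclusively: true` is false because the combining mark is not whitespace.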

Alternative to this pitch: redesign CharacterSet to solve this, either through some new concept or incremental evolution to support graphemes.

(Note: Of course, definitions may evolve to match unanticipated changes in Unicode, potentially through a source-compatibility-preserving deprecation process.)

edit: Added isNewline, which I initially forgot. Dropped one alternative, as isHexadecimalDigit is not a superset of isDecimalDigit.
