Pitch: Character and String properties

(This has been updated with the most recent pitch)

Gist-formatting version: Pitch: Character and String properties · GitHub

A PR with the implementation can be found here

Introduction

@allevato (a co-author here) started a pitch at Adding Unicode properties to UnicodeScalar/Character - Pitches - Swift Forums, which exposes Unicode properties from the Unicode Character Database. These are Unicode expert/enthusiast oriented properties that give a finer granularity of control and can be used to answer many Unicody enquiries.

However, they are not very ergonomic and Swift makes no attempt to clarify their usage, as their meaning and proper interpretation are directly tied to the Unicode Standard and the version of Unicode available at run time.

There's some low-hanging ergo-fruit to be picked by exposing properties directly on Character.

Proposed Approach

(Note that Unicode does not define properties on graphemes in general. Swift is defining its own semantics in terms of Unicode semantics derived from scalar properties, semantics on strings, or both)

Character Properties

extension Character {
  /// Whether this Character is ASCII.
  @inlinable
  public var isASCII: Bool { return asciiValue != nil }

  /// Returns the ASCII encoding value of this Character, if ASCII.
  ///
  /// Note: "\r\n" (CR-LF) is normalized to "\n" (LF), which will return 0x0A
  @inlinable
  public var asciiValue: UInt8? {
    // TODO: Tune for codegen
    if _slowPath(self == ._crlf) { return 0x000A /* LINE FEED (LF) */ }
    if _slowPath(!_isSingleScalar || _firstScalar.value >= 0x80) { return nil }
    return UInt8(_firstScalar.value)
  }

  /// Whether this Character represents whitespace, including newlines.
  ///
  /// Examples:
  ///   * "\t" (U+0009 CHARACTER TABULATION)
  ///   * " " (U+0020 SPACE)
  ///   * U+2029 PARAGRAPH SEPARATOR
  ///   * U+3000 IDEOGRAPHIC SPACE
  ///
  public var isWhitespace: Bool { ... }

  /// Whether this Character represents a newline.
  ///
  /// * "\n" (U+000A): LINE FEED (LF)
  /// * "\r" (U+000D): CARRIAGE RETURN (CR)
  /// * "\r\n" (U+000D U+000A): CR-LF
  /// * U+0085: NEXT LINE (NEL)
  /// * U+2028: LINE SEPARATOR
  /// * U+2029: PARAGRAPH SEPARATOR
  ///
  public var isNewline: Bool { ... }

  /// Whether this Character represents a number.
  ///
  /// Examples:
  ///   * "7" (U+0037 DIGIT SEVEN)
  ///   * "⅚" (U+215A VULGAR FRACTION FIVE SIXTHS)
  ///   * "㊈" (U+3288 CIRCLED IDEOGRAPH NINE)
  ///   * "𝟠" (U+1D7E0 MATHEMATICAL DOUBLE-STRUCK DIGIT EIGHT)
  ///   * "๒" (U+0E52 THAI DIGIT TWO)
  ///
  public var isNumber: Bool { ... }

  /// Whether this Character represents a whole number. See
  /// `Character.wholeNumberValue`
  @inlinable
  public var isWholeNumber: Bool { return wholeNumberValue != nil }

  /// If this Character is a whole number, return the value it represents, else
  /// nil.
  ///
  /// Examples:
  ///   * "1" (U+0031 DIGIT ONE) => 1
  ///   * "५" (U+096B DEVANAGARI DIGIT FIVE) => 5
  ///   * "๙" (U+0E59 THAI DIGIT NINE) => 9
  ///   * "万" (U+4E07 CJK UNIFIED IDEOGRAPH-4E07) => 10_000
  ///
  public var wholeNumberValue: Int? { ... }

  /// Whether this Character represents a hexadecimal digit.
  ///
  /// Hexadecimal digits include 0-9, Latin letters a-f and A-F, and their
  /// fullwidth compatibility forms. To get their value, see
  /// `Character.hexadecimalDigitValue`
  @inlinable
  public var isHexadecimalDigit: Bool { return hexadecimalDigitValue != nil }

  /// If this Character is a hexadecimal digit, returns the value it represents,
  /// else nil.
  ///
  /// Hexadecimal digits include 0-9, Latin letters a-f and A-F, and their
  /// fullwidth compatibility forms.
  public var hexadecimalDigitValue: Int? { ... }

  /// Whether this Character is a letter.
  ///
  /// Examples:
  ///   * "A" (U+0041 LATIN CAPITAL LETTER A)
  ///   * "é" (U+0065 LATIN SMALL LETTER E, U+0301 COMBINING ACUTE ACCENT)
  ///   * "ϴ" (U+03F4 GREEK CAPITAL THETA SYMBOL)
  ///   * "ڈ" (U+0688 ARABIC LETTER DDAL)
  ///   * "日" (U+65E5 CJK UNIFIED IDEOGRAPH-65E5)
  ///   * "ᚨ" (U+16A8 RUNIC LETTER ANSUZ A)
  ///
  public var isLetter: Bool { ... }

  /// Perform case conversion to uppercase
  ///
  /// Examples:
  ///   * "é" (U+0065 LATIN SMALL LETTER E, U+0301 COMBINING ACUTE ACCENT)
  ///     => "É" (U+0045 LATIN CAPITAL LETTER E, U+0301 COMBINING ACUTE ACCENT)
  ///   * "и" (U+0438 CYRILLIC SMALL LETTER I)
  ///     => "И" (U+0418 CYRILLIC CAPITAL LETTER I)
  ///   * "π" (U+03C0 GREEK SMALL LETTER PI)
  ///     => "Π" (U+03A0 GREEK CAPITAL LETTER PI)
  ///   * "ß" (U+00DF LATIN SMALL LETTER SHARP S)
  ///     => "SS" (U+0053 LATIN CAPITAL LETTER S, U+0053 LATIN CAPITAL LETTER S)
  ///
  /// Note: Returns a String as case conversion can result in multiple
  /// Characters.
  @inlinable
  public func uppercased() -> String { return String(self).uppercased() }

  /// Perform case conversion to lowercase
  ///
  /// Examples:
  ///   * "É" (U+0045 LATIN CAPITAL LETTER E, U+0301 COMBINING ACUTE ACCENT)
  ///     => "é" (U+0065 LATIN SMALL LETTER E, U+0301 COMBINING ACUTE ACCENT)
  ///   * "И" (U+0418 CYRILLIC CAPITAL LETTER I)
  ///     => "и" (U+0438 CYRILLIC SMALL LETTER I)
  ///   * "Π" (U+03A0 GREEK CAPITAL LETTER PI)
  ///     => "π" (U+03C0 GREEK SMALL LETTER PI)
  ///
  /// Note: Returns a String as case conversion can result in multiple
  /// Characters.
  @inlinable
  public func lowercased() -> String { return String(self).lowercased() }

  @inlinable
  internal var _isUppercased: Bool { return String(self) == self.uppercased() }
  @inlinable
  internal var _isLowercased: Bool { return String(self) == self.lowercased() }

  /// Whether this Character is considered uppercase.
  ///
  /// Uppercase Characters vary under case-conversion to lowercase, but not when
  /// converted to uppercase.
  ///
  /// Examples:
  ///   * "É" (U+0045 LATIN CAPITAL LETTER E, U+0301 COMBINING ACUTE ACCENT)
  ///   * "И" (U+0418 CYRILLIC CAPITAL LETTER I)
  ///   * "Π" (U+03A0 GREEK CAPITAL LETTER PI)
  ///
  @inlinable
  public var isUppercase: Bool { return _isUppercased && isCased }

  /// Whether this Character is considered lowercase.
  ///
  /// Lowercase Characters vary under case-conversion to uppercase, but not when
  /// converted to lowercase.
  ///
  /// Examples:
  ///   * "é" (U+0065 LATIN SMALL LETTER E, U+0301 COMBINING ACUTE ACCENT)
  ///   * "и" (U+0438 CYRILLIC SMALL LETTER I)
  ///   * "π" (U+03C0 GREEK SMALL LETTER PI)
  ///
  @inlinable
  public var isLowercase: Bool { return _isLowercased && isCased }

  /// Whether this Character changes under any form of case conversion.
  @inlinable
  public var isCased: Bool { return !_isUppercased || !_isLowercased }

  /// Whether this Character represents a symbol
  ///
  /// Examples:
  ///   * "®" (U+00AE REGISTERED SIGN)
  ///   * "⌹" (U+2339 APL FUNCTIONAL SYMBOL QUAD DIVIDE)
  ///   * "⡆" (U+2846 BRAILLE PATTERN DOTS-237)
  ///
  public var isSymbol: Bool { ... }

  /// Whether this Character represents a symbol used in mathematical formulas
  ///
  /// Examples:
  ///   * "+" (U+002B PLUS SIGN)
  ///   * "∫" (U+222B INTEGRAL)
  ///   * "ϰ" (U+03F0 GREEK KAPPA SYMBOL)
  ///
  /// Note: This is not a strict subset of isSymbol. This includes characters
  /// used both as letters and commonly in mathematical formulas. For example,
  /// "ϰ" (U+03F0 GREEK KAPPA SYMBOL) is considered both a mathematical symbol
  /// and a letter.
  ///
  public var isMathSymbol: Bool { ... }

  /// Whether this Character represents a currency symbol
  ///
  /// Examples:
  ///   * "$" (U+0024 DOLLAR SIGN)
  ///   * "¥" (U+00A5 YEN SIGN)
  ///   * "€" (U+20AC EURO SIGN)
  public var isCurrencySymbol: Bool { ... }

  /// Whether this Character represents punctuation
  ///
  /// Examples:
  ///   * "!" (U+0021 EXCLAMATION MARK)
  ///   * "؟" (U+061F ARABIC QUESTION MARK)
  ///   * "…" (U+2026 HORIZONTAL ELLIPSIS)
  ///   * "—" (U+2014 EM DASH)
  ///   * "“" (U+201C LEFT DOUBLE QUOTATION MARK)
  ///
  public var isPunctuation: Bool { ... }
}

// ... helpers
extension Unicode.GeneralCategory {
  @inlinable internal var _isSymbol: Bool { ... }
  @inlinable internal var _isPunctuation: Bool { ... }
}
extension Character {
  @inlinable internal var _firstScalar: Unicode.Scalar { ... }
  @inlinable internal var _isSingleScalar: Bool { ... }

  @inlinable static internal var _crlf: Character { return "\r\n" }
  @inlinable static internal var _lf: Character { return "\n" }
}
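
As a rough illustration of the ASCII queries declared above, the following sketch exercises isASCII and asciiValue. It assumes a Swift 5 or later toolchain whose standard library provides these properties under the same names; the constant names are only illustrative.

```swift
// Assumption: the standard library provides Character.isASCII and
// Character.asciiValue with the semantics documented in the listing above.
let letterA: Character = "a"
let crlf: Character = "\r\n"        // CR-LF is a single Character
let eAcute: Character = "\u{E9}"    // LATIN SMALL LETTER E WITH ACUTE

print(letterA.isASCII, letterA.asciiValue as Any)
print(crlf.isASCII, crlf.asciiValue as Any)   // CR-LF reports the value of LF
print(eAcute.isASCII, eAcute.asciiValue as Any)
```

The CR-LF line is the interesting one: the Character is reported as ASCII, but its value is normalized to that of LINE FEED.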

Detailed Semantics and Rationale

The rules of grapheme breaking allow for semantically meaningless, yet technically valid, Characters. Furthermore, fuzziness is inherent in modeling human writing systems. So, we make the best effort we can and try to discover some principle to follow. Principles are not hard rules that always lead one to a single clear answer, but are useful for evaluating tradeoffs.

The closest concept might be something similar to W3C’s Principle of Tolerance, paraphrased as “Be liberal in what you accept, conservative in what you produce”. Perhaps another phrasing could be “Be permissive in generality, restrictive in specificity”.

Restrictive in specificity

Properties that have a clearly prescribed use, or which the stdlib produces a specific value for, should be restrictive in specification. One example is isWholeNumber and wholeNumberValue, where the fact that the stdlib produces an Int for a Character means that we need to be restrictive in the semantics of what a whole number is.

isWholeNumber returns true only for single-scalar graphemes whose sole scalar has an integral numeric value. Thus, we reject whole numbers that are modified by a subsequent combining character. “7̅” (7 followed by U+0305 COMBINING OVERLINE) is rejected as there is no clear interpretation of the value. Any attempt to produce a specific Int from it would be highly dubious.
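
A small sketch of this restrictive behavior, assuming a Swift 5 or later standard library in which isNumber, isWholeNumber, and wholeNumberValue are available under these names. The constants are illustrative only.

```swift
// Assumption: Character.isNumber, isWholeNumber and wholeNumberValue exist
// with the semantics described in this pitch.
let seven: Character = "7"
let overlinedSeven: Character = "7\u{305}"   // "7" + U+0305 COMBINING OVERLINE
let devanagariFive: Character = "\u{096B}"   // DEVANAGARI DIGIT FIVE
let fiveSixths: Character = "\u{215A}"       // VULGAR FRACTION FIVE SIXTHS

// A single scalar with an integral numeric value yields that value.
print(seven.wholeNumberValue as Any)
print(devanagariFive.wholeNumberValue as Any)

// A digit modified by a combining scalar is rejected by the restrictive
// query, while the permissive isNumber still accepts it.
print(overlinedSeven.isNumber, overlinedSeven.wholeNumberValue as Any)

// A fraction is a number, but has no whole-number value.
print(fiveSixths.isNumber, fiveSixths.isWholeNumber)
```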

Permissive in generality

Where there is no prescribed usage and no specific values to produce, we try to be as permissive as reasonable. For example, isLetter just queries the first scalar to see if it is “letter-like”, and thus handles unforeseeable combinations of a base letter-like scalar with myriad subsequent combining, modifying, or extending scalars. isLetter merely answers a general (fuzzy) question; it neither produces specific values nor prescribes use.
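
A brief sketch of the permissive side, assuming a Swift 5 or later standard library that provides Character.isLetter as pitched. The particular combining sequences are chosen only for illustration.

```swift
// Assumption: Character.isLetter inspects only the first scalar, as the
// pitch describes.
let accentedE: Character = "e\u{301}"   // "e" + COMBINING ACUTE ACCENT
let circledA: Character = "a\u{20DD}"   // "a" + COMBINING ENCLOSING CIRCLE
let digit: Character = "7"

// Both graphemes consist of two scalars but start with a letter-like scalar.
print(accentedE.unicodeScalars.count, accentedE.isLetter)
print(circledA.unicodeScalars.count, circledA.isLetter)

// A digit is not letter-like, with or without modifiers.
print(digit.isLetter)
```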

API Semantics

Below is a grouping of semantics into “permissive”, which means accept/reject based only on properties of the first scalar, and “restrictive” which means accept/reject based on analysis of the entire grapheme.

Restrictive:

  • isASCII / asciiValue
  • isWholeNumber / wholeNumberValue
  • isHexadecimalDigit / hexadecimalDigitValue
  • isUppercase / uppercased(), isLowercase / lowercased(), isCased

Permissive:

  • isNumber
  • isLetter
  • isSymbol / isMathSymbol / isCurrencySymbol
  • isPunctuation
  • isWhitespace (maybe*)
  • isNewline (maybe*)

* Newlines are not just hard line-breaks in traditional written language, but common terminators for programmer strings. If a Character is “\n\u{301}” (a newline with a combining accent over it), is this a newline? Either interpretation can lead to inconsistencies. If true, then a program might skip the first scalar in a new entry (whatever such a combining scalar at the start could mean). If we say false, then a string with newline terminators inside of it would return false for myStr.contains { $0.isNewline }, which is counter-intuitive. This same reasoning may apply to whitespace.

A couple of options:

  1. Permissive, to keep consistency with myStr.contains { $0.isNewline } and grapheme-by-grapheme processing in general
  2. Restrictive, to prevent the programmer from skipping over relevant scalars, at the risk of counter-intuitive string processing behavior
  3. Rename to hasNewline, which is permissive
  4. Drop from this pitch in favor of an eventual String.lines or something similar.

We think choice #1 is arguably less bad than #2, and more transparently reflects the reality of grapheme-by-grapheme processing. We slightly prefer #1 to choice #3 or #4 as #1 is a common sense query. Though it does permit some meaningless graphemes, we don’t see any clearly harmful behavior as a result for realistic inputs, nor anticipate malicious behavior for malicious inputs. But, we could easily be convinced either way (see rejected alternatives).
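
To make the grapheme-by-grapheme argument concrete, here is a minimal sketch of the kind of string processing that choice #1 keeps intuitive. It assumes a Swift 5 or later standard library providing Character.isNewline; the sample text is made up.

```swift
// Assumption: Character.isNewline is available and treats CR-LF, which is a
// single Character, as a newline.
let text = "first\r\nsecond"

// CR-LF counts as one Character, so the string has 5 + 1 + 6 Characters.
print(text.count)

// The "common sense" query from the discussion above.
let hasNewline = text.contains { $0.isNewline }
print(hasNewline)

// Splitting on newline Characters, grapheme by grapheme.
let lines = text.split(whereSeparator: { $0.isNewline }).map(String.init)
print(lines)
```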

@inlinable and non-@inlinable properties

The sweet spot of @inlinable hits APIs that are extremely unlikely to change their semantics and are frequently used or often part of a program’s “hot path”. ASCII-related queries check both of these boxes. For other properties whose semantics can be expressed entirely in terms of other API, @inlinable allows the optimizer to optimize across multiple calls and specific usage without giving up a meaningful degree of library evolution. isWholeNumber checking wholeNumberValue for nil is an example of this, as these two properties are semantically tied and the optimizer could (in theory) reuse the result for subsequent calls. We can always safely supply a new finely-tuned implementation of isWholeNumber in future versions of the stdlib, provided semantics are preserved.

For properties where we may change our strategy (or details of implementation) to accommodate future directions of Unicode and unanticipated corner-cases, these should be kept non-@inlinable.

Rejected Additions and Alternatives

Titlecase

Titlecase can be useful for some legacy scalars (ligatures) as well as for Strings when combined with word-breaking logic. However, it seems pretty obscure to surface on Character directly.

String.Lines, String.Words

These have been deferred from this pitch, to keep focus and await a more generalized lazy split collection.

Rename permissive isFoo to hasFoo

This was mentioned above in discussion of isNewline semantics and could also apply to isWhitespace. However, it would be awkward for isNumber or isLetter. What the behavior should be for exotic whitespace and newlines is heavily debatable. We’re sticking to isNewline/isWhitespace for now, but are definitely open to argument.

Design as Character.has(OptionSet<…>, exclusively: …) or something similar

There could be something valuable to glean from this, but we reject this approach as somewhat un-Swifty with a poor discovery experience, especially for new or casual users of Character. It can, however, clarify the semantic distinctions above by making it explicit to the user.

Add failable FixedWidthInteger/FloatingPoint inits taking Character

In addition to (or perhaps instead of) properties like wholeNumberValue, add Character-based FixedWidthInteger.init?(_:Character). Similarly FloatingPoint.init?(_:Character) which includes vulgar fractions and the like (if single-scalar).

There are practical drawbacks to label-less failable inits for type checking performance; @rudkx has more details here. Furthermore, we feel that Int(Character("๓"))’s semantics is less clear than Character("๓").wholeNumberValue, and we don’t think this is worth surfacing at the top level of Int.

We could consider the following changes:

- FixedWidthInteger.init?<S: StringProtocol>(_: S, radix: Int = 10)
+ FixedWidthInteger.init?<S: StringProtocol>(ascii: S, radix: Int = 10)
+ FixedWidthInteger.init?(ascii: Character, radix: Int = 10)

which would clarify the semantics of the initializers (and help disambiguate overloads for type checking) and allow for construction from an ASCII character. But, we don’t feel a non-ASCII-allowing FixedWidthInteger.init?(_:Character) would carry its weight at this time. We’re also not sure if this change is worth a source-compatibility break (simple rename) at this point, but could be convinced otherwise.

Another alternative could be FixedWidthInteger.init?(wholeNumber: Character), which uses an argument label to clarify. This would allow for future expansion should we ever add a numeric grapheme or string evaluator.
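
For discussion purposes, the labeled alternative could be sketched roughly as follows. This is a hypothetical extension and not part of the pitch; it assumes Character.wholeNumberValue is available and simply forwards to it.

```swift
// Hypothetical sketch of the labeled initializer mentioned above.
// Assumption: Character.wholeNumberValue exists as pitched.
extension FixedWidthInteger {
    /// Creates an integer from a Character that represents a whole number.
    /// Fails if the Character has no whole-number value, or if that value
    /// is not representable in this integer type.
    init?(wholeNumber character: Character) {
        guard let value = character.wholeNumberValue else { return nil }
        self.init(exactly: value)
    }
}

let sevenAsInt = Int(wholeNumber: "7")
let thaiNineAsByte = UInt8(wholeNumber: "\u{0E59}")   // THAI DIGIT NINE
let notANumber = Int(wholeNumber: "x")
print(sevenAsInt as Any, thaiNineAsByte as Any, notANumber as Any)
```

The argument label carries the semantics at the call site, which is the point of this alternative; whether that is worth the extra API is the open question.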

--

Special thanks to @millenomi, @itaiferber, @davedelong, and @jrose for challenging aspects of the pitch and pushing for improvement. Thanks to @Karl, @griotspeak, and everyone else for their suggestions and improvements.

12 Likes

Sounds good! In addition to the isASCII query, would it also be useful to provide an ascii property, that produces the ASCII code or nil as an Int8 or UInt8 (or sequence thereof)? That seems to me like the inevitable next question someone will have after asking about a character’s ASCII-ness.

1 Like

That sounds like a good idea. What do you think the result should be for "\r\n"?

2 Likes

How do you propose to solve the redundancy between this and the CharacterSet class properties that have similar semantics?

AFAICT, CharacterSet's semantics are different (except for newline) from any properties we'd define across graphemes. Changing CharacterSet is outside the scope of this pitch. Do you have any ideas?

1 Like

I have some ideas about fixing CharacterSet here: Pitch: ContainmentSet

crlf ruins everything. I think many clients would be satisfied just receiving .some("\n"), even if that's technically a white lie.

4 Likes

These look like fantastic additions! Can you say a little bit about why the Character casing methods return strings? I'm imagining weird Unicode edge cases where uppercasing a character produces multiple grapheme clusters, but it will surely be a surprise, if not a stumbling block, for users of that API.

2 Likes

That's a good point. I think we should unify documentation with the scalar proposal, e.g. at [SE-0211] Add Unicode properties to Unicode.Scalar by allevato · Pull Request #15593 · apple/swift · GitHub

Using @allevato 's comments, an example, with meta-comments enclosed in <...>:

/// The titlecase mapping of the scalar<likewise for Character>.
///
/// This returns a `String` because some mappings may transform a single
/// scalar<likewise Character> into multiple scalars. For example, the ligature "ﬁ" (U+FB01 LATIN
/// SMALL LIGATURE FI) becomes "Fi" (U+0046 LATIN CAPITAL LETTER F, U+0069
/// LATIN SMALL LETTER I) when converted to titlecase.
///
func titlecased() -> String { ... }

/// The uppercase mapping of the scalar<likewise for Character>.
///
/// This returns a `String` because some mappings may transform a single
/// scalar<likewise for Character> into multiple scalars. For example, the German letter "ß" (U+00DF
/// LATIN SMALL LETTER SHARP S) becomes "SS" (U+0053 LATIN CAPITAL LETTER S,
/// U+0053 LATIN CAPITAL LETTER S) when converted to uppercase.
///
func uppercased() -> String { ... }

/// The lowercase mapping of the scalar<likewise for Character>.
///
/// This returns a `String` because some mappings may transform a single
/// scalar<likewise for Character> into multiple scalars<likewise for Character>. For example, the letter "İ" (U+0130
/// LATIN CAPITAL LETTER I WITH DOT ABOVE) becomes two scalars (U+0069 LATIN
/// SMALL LETTER I, U+0307 COMBINING DOT ABOVE) when converted to lowercase.
///
/// <TODO: is there an example of a grapheme becoming multiple graphemes?>
func lowercased() -> String { ... }

What do you think?

That sounds reasonable, but I wonder if it would be better to have something like:

extension Character {
  /// If this Character is ASCII, return the ASCII values that comprise it, 
  /// otherwise returns nil.
  ///
  /// For most ASCII Characters, there is only one value that comprises the 
  /// grapheme, returned in the first value of the result. The second value is 0.
  ///
  /// For "\r\n" (CR-LF), which is a single Character, this returns (0x0d, 0x0a)
  ///
  var asciiValue: (UInt8, UInt8)? { 
    ... 
  }
}

Or perhaps we have asciiValue return a single value, and also have some query for the CR-LF corner case. WDYT?

To throw an alternative out, even though it's not much of an improvement:

extension Character {
  enum ASCIIValue {
    case single(UInt8)
    case pair(UInt8, UInt8)
  }

  var asciiValue: ASCIIValue? { ... }
}

It's reaaaaaaally unfortunate to design an entire layer of abstraction over a single tricky CR-LF pair and I would hate to use the API that I just wrote above.

2 Likes

Great explanations. I wonder if we need something on the string side to support the common case where you still are getting a single character back. (I don't quite know what kind of tasks would use these APIs, though.)

extension String {
    /// The value of the string if the string consists of a
    /// single character; otherwise, `nil`.
    var asSingleCharacter: Character? {
        if isEmpty { return nil }
        let secondIndex = index(after: startIndex)
        return secondIndex == endIndex ? first : nil
    }
}

let lowercaseA = "a" as Character
let uppercaseAString = lowercaseA.uppercased()
if let uppercaseA = uppercaseAString.asSingleCharacter {
    // keep working with Character
} else {
    // alternate String path
}

I am kinda worried about this for a number of reasons:

  • The definition of these properties is on Character (which is a single grapheme, IIRC). How does it square with wanting to query these properties on multiple graphemes? If they're per-grapheme, what makes this approach worth the additional API surface vs. using/improving CharacterSet and testing for inclusion?

  • .isEmoji's comment belies its multiple issues already.

  • Is there going to be a proliferation of these properties?

I think this is less useful and more prone to accidental misuse. Rather than coming up with an obnoxious name to discourage it, I think it more naturally fits within Unicode.Scalar.Properties.

We originally omitted the single-scalar mapping in Unicode.Scalar.Properties as it's less robust, but if we have general case conversion methods on String/Character/Unicode.Scalar, then the simplistic mapping could have a place in this more specialized namespace.

ICU does expose single scalar mappings, but under the older less-modern APIs and discourages their general use.

Yeah, I think any more technically-correct API would be too cumbersome to be useful.

3 Likes

It looks like there is exactly one grapheme cluster in today's Unicode where changing the case of a single-grapheme-cluster string leads to multiple grapheme clusters in the result: German "ß". While there is an uppercase form of "ß", the traditional way to uppercase it is to write "SS". (Note: I am not a German speaker, but regardless of whether this is correct, it is what locale-independent Unicode does and Unicode has promised not to change it.)

Citation: ftp://ftp.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt
More info: FAQ - Character Properties, Case Mappings and Names

EDIT: not counting compatibility codepoints, like the "ff" ligature, which is definitely going to uppercase to two separate graphemes "FF".
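
To make the "ß" case concrete, here is a minimal sketch, assuming a Swift 5 or later standard library in which Character.uppercased() returns a String as pitched. The constant names are illustrative.

```swift
// Assumption: Character.uppercased() exists and returns a String.
let sharpS: Character = "\u{DF}"   // LATIN SMALL LETTER SHARP S
let smallA: Character = "a"

let upperSharpS = sharpS.uppercased()   // a String, not a Character
let upperA = smallA.uppercased()

// One Character in, two Characters out.
print(upperSharpS, upperSharpS.count)

// The common case still yields a single Character inside the String.
print(upperA, upperA.count)

// Callers who need a Character back have to handle the multi-Character case.
let singleOrNil: Character? = upperSharpS.count == 1 ? upperSharpS.first : nil
print(singleOrNil as Any)
```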

How about:

• The isAscii property answers the question “Does this Character consist of exactly one ASCII code point?”
• The ascii property returns nil whenever isAscii returns false.

Thus:

("\r\n" as Character).isAscii     // false
("\r\n" as Character).ascii       // nil
3 Likes

This makes the most sense to me. Any special logic we try to add on top of that is likely to lead to misuse. Let's keep it simple and straightforward.

the ff ligature (and all other typographic ligature codepoints) are considered deprecated unicode. Modern fonts store ligatures in glyphs, not codepoints; most of them store the ff glyph in its legacy codepoint (instead of a virtual encoding slot) just because it’s convenient. i don’t think we need to support it.

In that case, I would propose the following tweak: isASCII maintains its behavior and asciiValue sanitizes to "\n" with a comment stating that if you care, consider also comparing against "\r\n". I.e. what @Joe_Groff said.

FWIW, there is a distinction between triviality of encoding and triviality of grapheme segmentation, which String will want to eventually expose, e.g. as query-able performance flags. I wouldn't want a future String.isASCII to have to maintain semantic parity with a Character.isASCII which excluded CR-LF.
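
A sketch of what "also comparing against "\r\n"" might look like for a caller who does care about the exact bytes. The helper name asciiBytes(of:) is hypothetical, and the code assumes Character.asciiValue behaves as in the tweak described above.

```swift
/// Hypothetical helper, not part of the pitch: the ASCII bytes of `text`,
/// or nil if any Character is non-ASCII. CR-LF is expanded to both of its
/// bytes instead of being sanitized to LF.
func asciiBytes(of text: String) -> [UInt8]? {
    var bytes: [UInt8] = []
    for character in text {
        if character == "\r\n" {
            // The one multi-scalar ASCII Character: keep both bytes.
            bytes += [0x0D, 0x0A]
        } else if let value = character.asciiValue {
            bytes.append(value)
        } else {
            return nil
        }
    }
    return bytes
}

print(asciiBytes(of: "a\r\nb") as Any)
print(asciiBytes(of: "caf\u{E9}") as Any)
```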

1 Like