(This has been updated with the most recent pitch)
Gist-formatting version: Pitch: Character and String properties · GitHub
Pitch: Character and String properties
- Authors: Michael Ilseman, Tony Allevato
A PR with the implementation can be found here
Introduction
@allevato (a co-author here) started a pitch at Adding Unicode properties to UnicodeScalar/Character - Pitches - Swift Forums, which exposes Unicode properties from the Unicode Character Database. These are Unicode expert/enthusiast oriented properties that give a finer granularity of control and can be used to answer many Unicody enquiries.
However, they are not very ergonomic and Swift makes no attempt to clarify their usage, as their meaning and proper interpretation are directly tied to the Unicode Standard and the version of Unicode available at run time.
There's some low-hanging ergo-fruit to be picked by exposing properties directly on Character.
Proposed Approach
(Note that Unicode does not define properties on graphemes in general. Swift is defining its own semantics in terms of Unicode semantics derived from scalar properties, semantics on strings, or both)
Character Properties
extension Character {
/// Whether this Character is ASCII.
@inlinable
public var isASCII: Bool { return asciiValue != nil }
/// Returns the ASCII encoding value of this Character, if ASCII.
///
/// Note: "\r\n" (CR-LF) is normalized to "\n" (LF), which will return 0x0A
@inlinable
public var asciiValue: UInt8? {
// TODO: Tune for codegen
if _slowPath(self == ._crlf) { return 0x000A /* LINE FEED (LF) */ }
if _slowPath(!_isSingleScalar || _firstScalar.value >= 0x80) { return nil }
return UInt8(_firstScalar.value)
}
/// Whether this Character represents whitespace, including newlines.
///
/// Examples:
/// * "\t" (U+0009 CHARACTER TABULATION)
/// * " " (U+0020 SPACE)
/// * U+2029 PARAGRAPH SEPARATOR
/// * U+3000 IDEOGRAPHIC SPACE
///
public var isWhitespace: Bool { ... }
/// Whether this Character represents a newline.
///
/// * "\n" (U+000A): LINE FEED (LF)
/// * "\r" (U+000D): CARRIAGE RETURN (CR)
/// * "\r\n" (U+000D U+000A): CR-LF
/// * U+0085: NEXT LINE (NEL)
/// * U+2028: LINE SEPARATOR
/// * U+2029: PARAGRAPH SEPARATOR
///
public var isNewline: Bool { ... }
/// Whether this Character represents a number.
///
/// Examples:
/// * "7" (U+0037 DIGIT SEVEN)
/// * "⅚" (U+215A VULGAR FRACTION FIVE SIXTHS)
/// * "㊈" (U+3288 CIRCLED IDEOGRAPH NINE)
/// * "𝟠" (U+1D7E0 MATHEMATICAL DOUBLE-STRUCK DIGIT EIGHT)
/// * "๒" (U+0E52 THAI DIGIT TWO)
///
public var isNumber: Bool { ... }
/// Whether this Character represents a whole number. See
/// `Character.wholeNumberValue`
@inlinable
public var isWholeNumber: Bool { return wholeNumberValue != nil }
/// If this Character is a whole number, return the value it represents, else
/// nil.
///
/// Examples:
/// * "1" (U+0031 DIGIT ONE) => 1
/// * "५" (U+096B DEVANAGARI DIGIT FIVE) => 5
/// * "๙" (U+0E59 THAI DIGIT NINE) => 9
/// * "万" (U+4E07 CJK UNIFIED IDEOGRAPH-4E07) => 10_000
///
public var wholeNumberValue: Int? { ... }
/// Whether this Character represents a hexadecimal digit.
///
/// Hexadecimal digits include 0-9, Latin letters a-f and A-F, and their
/// fullwidth compatibility forms. To get their value, see
/// `Character.hexadecimalDigitValue`
@inlinable
public var isHexadecimalDigit: Bool { return hexadecimalDigitValue != nil }
/// If this Character is a hexadecimal digit, returns the value it represents,
/// else nil.
///
/// Hexadecimal digits include 0-9, Latin letters a-f and A-F, and their
/// fullwidth compatibility forms.
public var hexadecimalDigitValue: Int? { ... }
/// Whether this Character is a letter.
///
/// Examples:
/// * "A" (U+0041 LATIN CAPITAL LETTER A)
/// * "é" (U+0065 LATIN SMALL LETTER E, U+0301 COMBINING ACUTE ACCENT)
/// * "ϴ" (U+03F4 GREEK CAPITAL THETA SYMBOL)
/// * "ڈ" (U+0688 ARABIC LETTER DDAL)
/// * "日" (U+65E5 CJK UNIFIED IDEOGRAPH-65E5)
/// * "ᚨ" (U+16A8 RUNIC LETTER ANSUZ A)
///
public var isLetter: Bool { ... }
/// Perform case conversion to uppercase
///
/// Examples:
/// * "é" (U+0065 LATIN SMALL LETTER E, U+0301 COMBINING ACUTE ACCENT)
/// => "É" (U+0045 LATIN CAPITAL LETTER E, U+0301 COMBINING ACUTE ACCENT)
/// * "и" (U+0438 CYRILLIC SMALL LETTER I)
/// => "И" (U+0418 CYRILLIC CAPITAL LETTER I)
/// * "π" (U+03C0 GREEK SMALL LETTER PI)
/// => "Π" (U+03A0 GREEK CAPITAL LETTER PI)
/// * "ß" (U+00DF LATIN SMALL LETTER SHARP S)
/// => "SS" (U+0053 LATIN CAPITAL LETTER S, U+0053 LATIN CAPITAL LETTER S)
///
/// Note: Returns a String as case conversion can result in multiple
/// Characters.
@inlinable
public func uppercased() -> String { return String(self).uppercased() }
/// Perform case conversion to lowercase
///
/// Examples:
/// * "É" (U+0045 LATIN CAPITAL LETTER E, U+0301 COMBINING ACUTE ACCENT)
/// => "é" (U+0065 LATIN SMALL LETTER E, U+0301 COMBINING ACUTE ACCENT)
/// * "И" (U+0418 CYRILLIC CAPITAL LETTER I)
/// => "и" (U+0438 CYRILLIC SMALL LETTER I)
/// * "Π" (U+03A0 GREEK CAPITAL LETTER PI)
/// => "π" (U+03C0 GREEK SMALL LETTER PI)
///
/// Note: Returns a String as case conversion can result in multiple
/// Characters.
@inlinable
public func lowercased() -> String { return String(self).lowercased() }
@inlinable
internal var _isUppercased: Bool { return String(self) == self.uppercased() }
@inlinable
internal var _isLowercased: Bool { return String(self) == self.lowercased() }
/// Whether this Character is considered uppercase.
///
/// Uppercase Characters vary under case-conversion to lowercase, but not when
/// converted to uppercase.
///
/// Examples:
/// * "É" (U+0045 LATIN CAPITAL LETTER E, U+0301 COMBINING ACUTE ACCENT)
/// * "И" (U+0418 CYRILLIC CAPITAL LETTER I)
/// * "Π" (U+03A0 GREEK CAPITAL LETTER PI)
///
@inlinable
public var isUppercase: Bool { return _isUppercased && isCased }
/// Whether this Character is considered lowercase.
///
/// Lowercase Characters vary under case-conversion to uppercase, but not when
/// converted to lowercase.
///
/// Examples:
/// * "é" (U+0065 LATIN SMALL LETTER E, U+0301 COMBINING ACUTE ACCENT)
/// * "и" (U+0438 CYRILLIC SMALL LETTER I)
/// * "π" (U+03C0 GREEK SMALL LETTER PI)
///
@inlinable
public var isLowercase: Bool { return _isLowercased && isCased }
/// Whether this Character changes under any form of case conversion.
@inlinable
public var isCased: Bool { return !_isUppercased || !_isLowercased }
/// Whether this Character represents a symbol
///
/// Examples:
/// * "®" (U+00AE REGISTERED SIGN)
/// * "⌹" (U+2339 APL FUNCTIONAL SYMBOL QUAD DIVIDE)
/// * "⡆" (U+2846 BRAILLE PATTERN DOTS-237)
///
public var isSymbol: Bool { ... }
/// Whether this Character represents a symbol used in mathematical formulas
///
/// Examples:
/// * "+" (U+002B PLUS SIGN)
/// * "∫" (U+222B INTEGRAL)
/// * "ϰ" (U+03F0 GREEK KAPPA SYMBOL)
///
/// Note: This is not a strict subset of isSymbol. This includes characters
/// used both as letters and commonly in mathematical formulas. For example,
/// "ϰ" (U+03F0 GREEK KAPPA SYMBOL) is considered both a mathematical symbol
/// and a letter.
///
public var isMathSymbol: Bool { ... }
/// Whether this Character represents a currency symbol
///
/// Examples:
/// * "$" (U+0024 DOLLAR SIGN)
/// * "¥" (U+00A5 YEN SIGN)
/// * "€" (U+20AC EURO SIGN)
public var isCurrencySymbol: Bool { ... }
/// Whether this Character represents punctuation
///
/// Examples:
/// * "!" (U+0021 EXCLAMATION MARK)
/// * "؟" (U+061F ARABIC QUESTION MARK)
/// * "…" (U+2026 HORIZONTAL ELLIPSIS)
/// * "—" (U+2014 EM DASH)
/// * "“" (U+201C LEFT DOUBLE QUOTATION MARK)
///
public var isPunctuation: Bool { ... }
}
// ... helpers
extension Unicode.GeneralCategory {
@inlinable internal var _isSymbol: Bool { ... }
@inlinable internal var _isPunctuation: Bool { ... }
}
extension Character {
@inlinable internal var _firstScalar: Unicode.Scalar { ... }
@inlinable internal var _isSingleScalar: Bool { ... }
@inlinable static internal var _crlf: Character { return "\r\n" }
@inlinable static internal var _lf: Character { return "\n" }
}
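As a rough illustration of how a few of the properties listed above read at the call site, here is a minimal sketch. It assumes a Swift 5 or later toolchain in which properties with these names are available on Character; the variable names are ours and purely illustrative.

```swift
// Assumption: a Swift 5+ standard library that provides these Character properties.
let crlf: Character = "\r\n"            // CR-LF is a single grapheme
let accentedE: Character = "e\u{301}"   // "e" followed by COMBINING ACUTE ACCENT

// CR-LF is treated as ASCII and reports the value of LINE FEED.
let crlfValue = crlf.asciiValue

// A multi-scalar grapheme is not ASCII even though its first scalar is.
let accentedIsASCII = accentedE.isASCII

// Case conversion returns a String because the result may be several Characters.
let upperSharpS = Character("ß").uppercased()

// A digit does not change under case conversion, so it is not cased.
let digitIsCased = Character("1").isCased

print(crlfValue as Any, accentedIsASCII, upperSharpS, digitIsCased)
```

The String return type of uppercased() is what lets "ß" map to the two-Character result "SS".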
Detailed Semantics and Rationale
The rules of grapheme breaking allow for semantically meaningless, yet technically valid, Characters. Furthermore, fuzziness is inherent in modeling human writing systems. So, we make the best effort we can and try to discover some principle to follow. Principles are not hard rules that always lead one to a single clear answer, but are useful for evaluating tradeoffs.
The closest concept might be something similar to W3C’s Principle of Tolerance, paraphrased as “Be liberal in what you accept, conservative in what you produce”. Perhaps another phrasing could be “Be permissive in generality, restrictive in specificity”.
Restrictive in specificity
Properties that have a clearly prescribed use, or which the stdlib produces a specific value for, should be restrictive in specification. One example is isWholeNumber and wholeNumberValue, where the fact that the stdlib produces an Int for a Character means that we need to be restrictive in the semantics of what a whole number is.
isWholeNumber returns true, and wholeNumberValue returns a value, only on single-scalar graphemes whose sole scalar has an integral numeric value. Thus, we reject whole numbers that are modified by a subsequent combining character. “7̅” (7 followed by U+0305 COMBINING OVERLINE) is rejected as there is no clear interpretation of the value. Any attempt to produce a specific Int from it would be highly dubious.
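The restrictive rule can be seen in a small sketch. It assumes a Swift 5 or later toolchain where wholeNumberValue and isWholeNumber are available on Character; the constant names are illustrative only.

```swift
// Assumption: a Swift 5+ standard library that provides wholeNumberValue.
let seven: Character = "7"
let devanagariFive: Character = "\u{096B}"   // DEVANAGARI DIGIT FIVE
let fiveSixths: Character = "\u{215A}"       // VULGAR FRACTION FIVE SIXTHS
let overlinedSeven: Character = "7\u{0305}"  // "7" + COMBINING OVERLINE, one grapheme

// Single-scalar graphemes with an integral numeric value produce an Int.
let sevenValue = seven.wholeNumberValue
let fiveValue = devanagariFive.wholeNumberValue

// A fraction is numeric but not integral, so there is no whole number value.
let fractionValue = fiveSixths.wholeNumberValue

// A digit modified by a combining scalar is rejected outright.
let overlinedValue = overlinedSeven.wholeNumberValue

print(sevenValue as Any, fiveValue as Any, fractionValue as Any, overlinedValue as Any)
```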
Permissive in generality
Where there is no prescribed usage and no specific values to produce, we try to be as permissive as reasonable. For example, isLetter just queries the first scalar to see if it is “letter-like”, and thus handles unforeseeable combinations of a base letter-like scalar with myriad subsequent combining, modifying, or extending scalars. isLetter merely answers a general (fuzzy) question; it neither produces specific values nor prescribes use.
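To make the contrast concrete, here is a short sketch of the permissive queries. It assumes a Swift 5 or later toolchain in which isLetter and isNumber are available on Character and consult only the first scalar, as described above; the constant names are illustrative.

```swift
// Assumption: a Swift 5+ standard library where isLetter and isNumber
// are decided by the first scalar of the grapheme.

// "e" followed by three combining scalars: still a single grapheme.
let stackedE: Character = "e\u{301}\u{302}\u{20DD}"
let stackedScalarCount = stackedE.unicodeScalars.count
let stackedIsLetter = stackedE.isLetter

// "7" + COMBINING OVERLINE: number-like in general, but not a whole number.
let overlined: Character = "7\u{0305}"
let overlinedIsNumber = overlined.isNumber
let overlinedIsWholeNumber = overlined.isWholeNumber

print(stackedScalarCount, stackedIsLetter, overlinedIsNumber, overlinedIsWholeNumber)
```

The same grapheme can therefore answer true to the fuzzy question (isNumber) and false to the specific one (isWholeNumber).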
API Semantics
Below is a grouping of semantics into “permissive”, which means accept/reject based only on properties of the first scalar, and “restrictive” which means accept/reject based on analysis of the entire grapheme.
Restrictive:
- isASCII / asciiValue
- isWholeNumber / wholeNumberValue
- isHexadecimalDigit / hexadecimalDigitValue
- isUppercase / uppercased(), isLowercase / lowercased(), isCased
Permissive:
- isNumber
- isLetter
- isSymbol / isMathSymbol / isCurrencySymbol
- isPunctuation
- isWhitespace (maybe*)
- isNewline (maybe*)
* Newlines are not just hard line-breaks in traditional written language, but common terminators for programmer strings. If a Character is “\n\u{301}” (a newline with a combining accent over it), is this a newline? Either interpretation can lead to inconsistencies. If true, then a program might skip the first scalar in a new entry (whatever such a combining scalar at the start could mean). If we say false, then a string with newline terminators inside of it would return false for myStr.contains { $0.isNewline }, which is counter-intuitive. This same reasoning may apply to whitespace.
A few options:
1. Permissive, to keep consistency with myStr.contains { $0.isNewline } and grapheme-by-grapheme processing in general
2. Restrictive, to prevent the programmer from skipping over relevant scalars, at the risk of counter-intuitive string processing behavior
3. Rename to hasNewline, which is permissive
4. Drop from this pitch in favor of an eventual String.lines or something similar.
We think choice #1 is arguably less bad than #2, and more transparently reflects the reality of grapheme-by-grapheme processing. We slightly prefer #1 to choice #3 or #4 as #1 is a common sense query. Though it does permit some meaningless graphemes, we don’t see any clearly harmful behavior as a result for realistic inputs, nor anticipate malicious behavior for malicious inputs. But, we could easily be convinced either way (see rejected alternatives).
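For ordinary inputs, the grapheme-by-grapheme processing that motivates the permissive choice might look like the sketch below. It assumes a Swift 5 or later toolchain where isNewline is available on Character; the sample text and names are made up for illustration, and the example does not attempt to settle the exotic cases discussed above.

```swift
// Assumption: a Swift 5+ standard library that provides Character.isNewline.
let text = "one\r\ntwo\nthree\u{2028}four"

// The query from the discussion above: does the string contain any newline?
let hasNewline = text.contains { $0.isNewline }

// CR-LF is one Character, so it counts as one newline rather than two.
let newlineCount = text.filter { $0.isNewline }.count

// Splitting on newline Characters, one grapheme at a time.
let lines = text.split(whereSeparator: { $0.isNewline }).map { String($0) }

print(hasNewline, newlineCount, lines)
```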
@inlinable and non-@inlinable properties
The sweet spot of @inlinable hits APIs that are extremely unlikely to change their semantics and are frequently used or often part of a program’s “hot path”. ASCII-related queries check both of these boxes. For other properties whose semantics can be expressed entirely in terms of other API, @inlinable allows the optimizer to optimize across multiple calls and specific usage without giving up a meaningful degree of library evolution. isWholeNumber checking wholeNumberValue for nil is an example of this, as these two properties are semantically tied and the optimizer could (in theory) reuse the result for subsequent calls. We can always safely supply a new finely-tuned implementation of isWholeNumber in future versions of the stdlib, provided semantics are preserved.
For properties where we may change our strategy (or details of implementation) to accommodate future directions of Unicode and unanticipated corner-cases, these should be kept non-@inlinable.
Rejected Additions and Alternatives
Titlecase
Titlecase can be useful for some legacy scalars (ligatures) as well as for Strings when combined with word-breaking logic. However, it seems pretty obscure to surface on Character directly.
String.Lines, String.Words
These have been deferred from this pitch, to keep focus and await a more generalized lazy split collection.
Rename permissive isFoo to hasFoo
This was mentioned above in discussion of isNewline semantics and could also apply to isWhitespace. However, it would be awkward for isNumber or isLetter. What the behavior should be for exotic whitespace and newlines is heavily debatable. We’re sticking to isNewline/isWhitespace for now, but are definitely open to argument.
Design as Character.has(OptionSet<…>, exclusively: …) or something similar
There could be something valuable to glean from this, but we reject this approach as somewhat un-Swifty with a poor discovery experience, especially for new or casual users of Character. It can, however, clarify the semantic distinctions above by making it explicit to the user.
Add failable FixedWidthInteger/FloatingPoint inits taking Character
In addition to (or perhaps instead of) properties like wholeNumberValue, add Character-based FixedWidthInteger.init?(_:Character). Similarly FloatingPoint.init?(_:Character) which includes vulgar fractions and the like (if single-scalar).
There are practical drawbacks to label-less failable inits for type checking performance; @rudkx has more details here. Furthermore, we feel that the semantics of Int(Character("๓")) are less clear than those of Character("๓").wholeNumberValue, and we don’t think this is worth surfacing at the top level of Int.
We could consider the following changes:
- FixedWidthInteger.init?<S: StringProtocol>(_: S, radix: Int = 10)
+ FixedWidthInteger.init?<S: StringProtocol>(ascii: S, radix: Int = 10)
+ FixedWidthInteger.init?(ascii: Character, radix: Int = 10)
which would clarify the semantics of the initializers (and help disambiguate overloads for type checking) and allow for construction from an ASCII character. But, we don’t feel a non-ASCII-allowing FixedWidthInteger.init?(_:Character) would carry its weight at this time. We’re also not sure if this change is worth a source-compatibility break (simple rename) at this point, but could be convinced otherwise.
Another alternative could be FixedWidthInteger.init?(wholeNumber: Character), which uses an argument label to clarify. This would allow for future expansion should we ever add a numeric grapheme or string evaluator.
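For the sake of discussion, the labeled alternative could be sketched as a thin wrapper over wholeNumberValue. This initializer is hypothetical: it is not part of the pitch's implementation or of the standard library, and the sketch assumes a Swift 5 or later toolchain where Character.wholeNumberValue is available.

```swift
// Hypothetical sketch only; this initializer does not exist in the standard library.
extension FixedWidthInteger {
    /// Creates an integer from a Character that represents a whole number,
    /// returning nil if it does not or if the value does not fit in Self.
    init?(wholeNumber character: Character) {
        guard let value = character.wholeNumberValue,
              let converted = Self(exactly: value) else { return nil }
        self = converted
    }
}

let thaiThree = Int8(wholeNumber: "๓")    // THAI DIGIT THREE
let notANumber = Int(wholeNumber: "x")    // not numeric at all
let fraction = UInt(wholeNumber: "⅚")     // numeric, but not a whole number

print(thaiThree as Any, notANumber as Any, fraction as Any)
```

The argument label keeps the call site explicit about which interpretation of the Character is being requested.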
--
Special thanks to @millenomi, @itaiferber, @davedelong, and @jrose for challenging aspects of the pitch and pushing for improvement. Thanks to @Karl, @griotspeak, and everyone else for their suggestions and improvements.