Pitch: Character and String properties

allevato · April 3, 2018, 8:10pm

Since those ligatures aren't the only situation where a single scalar becomes multiple graphemes via a case mapping, it's not really a question of whether we support some but not others—we should just be correct w.r.t. what the standard defines.

Even if ß -> SS were the only case where it mattered, we would still need to have the output be a String simply for that reason.
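
To make the point concrete, here is a minimal sketch, assuming a Swift 5 or later toolchain in which Character.uppercased() returns a String. The variable names are only for illustration.

```swift
// Full case mapping can expand a single scalar into several,
// so the result cannot be represented as one Character.
let sharpS: Character = "ß"            // U+00DF LATIN SMALL LETTER SHARP S
let ligature: Character = "\u{FB01}"   // U+FB01 LATIN SMALL LIGATURE FI

let upperSharpS = sharpS.uppercased()       // a String, not a Character
let upperLigature = ligature.uppercased()   // likewise a String

print(upperSharpS, upperSharpS.count)
print(upperLigature, upperLigature.count)
```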

Michael_Ilseman · April 3, 2018, 8:15pm

I don't understand. CharacterSet has little to do with graphemes. How could CharacterSet be used/improved, and how would that be relevant to whether Character should or should not have these properties?

Could you elaborate? If it's something that's not implementable, it will be dropped from the proposal. This is a huge benefit of a review policy that requires a prototype implementation.

This proposal introduces a handful; I don't know whether there will be future proposals. Why is this a cause for worry?

allevato · April 3, 2018, 8:39pm

I had some concerns about whether silently eliding the "\r" in this case would be problematic, but let's consider the following assumptions:

In the majority of text that contains "\r" that most users are going to deal with, it is going to be followed immediately by "\n". Even if it's not, then .ascii at least still returns 13, not 10.
In the majority of processing of such text, most users only care about "is there a line break here", not what kind of line break it is specifically.
Users who need to differentiate between different kinds of ASCII line breaks can do so by comparing them directly. Or, if they care about "\r" in any significant way, they may be working at the UnicodeScalarView already anyway.

If we believe those to hold, then it makes me more comfortable with your suggestion being a more ergonomic API.

Michael_Ilseman · April 3, 2018, 9:55pm

This seems like a reasonable assumption. Out of curiosity, why does the relative frequency of CR vs CR-LF make you more comfortable? They don't compare equal, so they are different Characters.

Right, "\r" as a grapheme has the ASCII value of 0x0d.

Slightly refined to: In the vast majority of uses of asciiValue, users don't care to differentiate between CR-LF and LF.

"Normalizing" (not to be confused with Unicode normal forms) CR-LF to LF is also common practice. E.g. the XML spec requires it.

Correct, and the comment can point out explicit comparison to CR-LF.

Or on Character, as "\r", "\n", and "\r\n" are considered distinct Characters. This does mean that asciiValue, if it does give a result, should not be the basis for strict Character equality.
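
A small sketch of that distinction, assuming the asciiValue property behaves as discussed here (and as it later shipped in Swift 5), with CR-LF reporting the value of LF:

```swift
let cr: Character = "\r"
let lf: Character = "\n"
let crlf: Character = "\r\n"   // one grapheme cluster made of two scalars

// CR-LF is reported as LF (10); a lone CR still reports 13.
let values = [cr, lf, crlf].map { $0.asciiValue }

// Equal asciiValue does not imply equal Characters.
let sameASCII = crlf.asciiValue == lf.asciiValue
let sameCharacter = crlf == lf
```

So code that needs to tell CR-LF apart from LF should compare against "\r\n" directly rather than rely on asciiValue.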

allevato · April 3, 2018, 10:01pm

I suppose I'm just reaching back to my early experience in Swift where I tried writing code that parsed text by iterating over Characters and tried to handle "\r" in the "obvious" way from every other language I've used (by testing if the next character was equal to it, and then skipping it) and then handle the "\n" separately after that. For text that contained "\r\n", this resulted in me never detecting any line breaks, because my character never equalled either "\n" or "\r"; it was "\r\n". It didn't occur to me until I debugged it that "\r\n" was treated as its own cluster.
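
The pitfall can be reproduced in a few lines. This is a sketch, assuming Swift 5's Character.isNewline; the sample text and variable names are made up for illustration.

```swift
let text = "one\r\ntwo\nthree\rfour"

// Naive approach carried over from other languages: compare each
// Character against "\n" and "\r" individually.
var naiveBreaks = 0
for c in text where c == "\n" || c == "\r" {
    naiveBreaks += 1
}

// "\r\n" is a single Character, so the naive loop never sees it.
// isNewline treats CR, LF, and CR-LF alike.
let newlineBreaks = text.filter { $0.isNewline }.count

print(naiveBreaks, newlineBreaks)
```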

So I'm not really opposed to the proposed behavior, so much as I want to make sure that we've considered all the assumptions people might make and make sure that our documentation hammers the right points home.

Michael_Ilseman · April 3, 2018, 11:07pm

Updated to incorporate @Joe_Groff's idea of asciiValue and @nnnnnnnn's request for comments.

millenomi · April 3, 2018, 11:20pm

An obvious way would be to introduce .contains(_ character: Character). So, for instance: CharacterSet.uppercaseLetters.contains("C" as Character).

It is true that it cannot express all properties here (like isEmoji), but it's still existing API that does the same thing as some of this proposal.

My bad: we already have some definition of emoji as graphemes (e.g. "👩‍👩‍👧‍👧".count == 1), so this does feel like a more natural addition.
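
For reference, a sketch of that grapheme behavior, with the family emoji spelled out as escapes so the scalars are visible (assuming a toolchain with up-to-date grapheme breaking):

```swift
// WOMAN, ZWJ, WOMAN, ZWJ, GIRL, ZWJ, GIRL
let family = "\u{1F469}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F467}"

let characterCount = family.count                // graphemes
let scalarCount = family.unicodeScalars.count    // underlying scalars

print(characterCount, scalarCount)
```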

Any new API entry point needs to be maintained, and I expect things tied to internationalization to require some amount of care to avoid regressions.

Michael_Ilseman · April 3, 2018, 11:48pm

Unfortunately, this is not the same semantics as Character.isUppercased. You can see an explanation of the different notions of casing in Unicode in my post on the scalar properties thread. Character.isUppercased follows the 3rd notion of casing in that post, while CharacterSet follows the first.

Sorry for all the questions; I want to cut to the root of your concern. Do you view a proliferation of properties on Character to be a greater cause of concern than elsewhere in the standard library? That is, the first part of the argument seems like a general statement regarding any addition to the Swift standard library, and I'm very interested in why you think properties on Character are different.

As far as being tied to internationalization, these properties are Swift-specific semantics defined in terms of Unicode.Scalar.Properties or operations on String, which are kept up-to-date through use of ICU (as well as needing to be kept up to date anyways).

griotspeak · April 4, 2018, 3:59am

These two could be combined if the ascii code just returned an optional.

Is digit/notDigit too much to ask for here as well? That can get tricky, I know.

allevato · April 4, 2018, 5:02am

I promised Michael that I'd write up a few paragraphs to expand on this part. Here's what I'm thinking—we can refine it and include it in the formal write-up as the thread goes on:

Given the difficulty of word breaking, our best approach here is to expose an API that acts as a simplified version of ICU's word break iterators, as the code comments suggest. It would be the wrong move to take on the burden of implementing and maintaining complex logic like this ourselves.

WBIs have a somewhat unique API. You ask them for the next (or previous) index of a word break (basically the start or end of a word). Each break index is associated with a "rule class" that indicates what kind of word break it is.

These rule classes are defined as integer ranges. For example, word boundaries representing letters have rule classes in the range 200..<300. When you call a WBI API and ask what rule class the current break is, you're supposed to check for containment in that range, not equality against a specific value. This is for future expansion—it lets the ICU API refine those groups in the future while still maintaining the broader categorization for older clients. This makes these rule classes fairly awkward to express in Swift; they can't be enums, because subranges should also match as their superranges. (There aren't actually any such refinements yet, but that could change.)
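
A rough sketch of why ranges fit better than enum cases here. The numeric bounds mirror ICU's UWordBreak constants from ubrk.h; the WordRuleRange namespace and the isWordLike helper are hypothetical names, not part of any proposal.

```swift
// Hypothetical namespace; bounds mirror ICU's UBRK_WORD_* constants.
enum WordRuleRange {
    static let none: Range<Int32>   = 0..<100     // UBRK_WORD_NONE ..< UBRK_WORD_NONE_LIMIT
    static let number: Range<Int32> = 100..<200   // UBRK_WORD_NUMBER ..< UBRK_WORD_NUMBER_LIMIT
    static let letter: Range<Int32> = 200..<300   // UBRK_WORD_LETTER ..< UBRK_WORD_LETTER_LIMIT
    static let kana: Range<Int32>   = 300..<400   // UBRK_WORD_KANA ..< UBRK_WORD_KANA_LIMIT
    static let ideo: Range<Int32>   = 400..<500   // UBRK_WORD_IDEO ..< UBRK_WORD_IDEO_LIMIT
}

// Containment, not equality: a future refined status such as 205
// (made up for this example) would still count as a letter break.
func isWordLike(_ status: Int32) -> Bool {
    return status >= WordRuleRange.none.upperBound
}

let refinedStatus: Int32 = 205
print(WordRuleRange.letter.contains(refinedStatus), isWordLike(refinedStatus))
```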

The details of these word classes are probably more than most users need anyway. We would better serve them by providing a simpler API that just lets users lazily query for the collection of words: that is, substrings that start at an index with rule class UBRK_WORD_NONE..<UBRK_WORD_NONE_LIMIT and end at the next index with rule class >= UBRK_WORD_NONE_LIMIT. This effectively gives us the list of words without any intervening spaces or punctuation. As an example, the following string:

"This is the test, isn't it?"

Would produce the following collection of Substrings based on the rule class logic above:

["This", "is", "the", "test", "isn't", "it"]

Notice that spaces and word-breaking punctuation are excluded, but an apostrophe in the word "isn't" is handled correctly as part of that word.

If users want access to the intervening spaces/punctuation, they can still do so; given a string S and two adjacent words W1 and W2, the content between those words is S[W1.endIndex..<W2.startIndex].
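
To illustrate the shape of the result and the gap recovery, here is a deliberately naive stand-in. It does not use ICU word break iterators at all: it treats letters and numbers as word characters and accepts an apostrophe only when it sits between two letters. The simpleWords name is hypothetical, and the sketch assumes Swift 5's Character.isLetter and Character.isNumber.

```swift
extension String {
    // Hypothetical, simplified stand-in for the pitched Words collection.
    // NOT Unicode word breaking; just enough to show the API shape.
    var simpleWords: [Substring] {
        func isWordCharacter(at i: Index) -> Bool {
            let c = self[i]
            if c.isLetter || c.isNumber { return true }
            // Accept an apostrophe only between two letters ("isn't").
            guard c == "'", i > startIndex else { return false }
            let next = index(after: i)
            guard next < endIndex else { return false }
            return self[index(before: i)].isLetter && self[next].isLetter
        }

        var words: [Substring] = []
        var wordStart: Index? = nil
        var i = startIndex
        while i < endIndex {
            if isWordCharacter(at: i) {
                if wordStart == nil { wordStart = i }
            } else if let start = wordStart {
                words.append(self[start..<i])
                wordStart = nil
            }
            i = index(after: i)
        }
        if let start = wordStart { words.append(self[start..<endIndex]) }
        return words
    }
}

let s = "This is the test, isn't it?"
let words = s.simpleWords

// Substrings share indices with their base string, so the text between
// two adjacent words is s[w1.endIndex..<w2.startIndex].
let gaps = zip(words, words.dropFirst()).map { s[$0.endIndex..<$1.startIndex] }
```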

IMO, this strikes a nice balance between a clean API for the majority of users' needs and correctness. If we find that we need to more completely expose word break iterators in Swift, we can do so later; but designing a complete and ergonomic WBI API for Swift is non-trivial and is likely far more advanced than most users would need.

Implementation-wise, the Words collection can maintain a word break iterator for its string, and have an index type that encapsulates the underlying index of the WBI; this should let us determine the next/previous word from a particular position. Computing the count would be O(n) on string length, though, because it requires scanning the entire string to determine how many breaks there are.

nicklockwood · April 4, 2018, 10:15am

Minor point, but shouldn't it be isUppercase and not isUppercased ?

I assume that the choice to use the verb uppercased for the Swift String conversion function was to indicate that this was an action being performed on the String (as opposed to the adjective form uppercaseString used in Objective-C, which refers to the value being returned). But isUppercased is a property, so it describes the current state of the string (i.e. whether it is uppercase or not); it doesn't refer to a previous uppercasing action that has been performed.

zwaldowski · April 4, 2018, 2:05pm

Re: Lines and Words, would we be seeking to eliminate enumerateSubstrings and getLineStart et al. overlaid from Foundation? If so, would it be able to be migrated in any meaningful way? In the spirit of full parity, would Paragraphs and Sentences make sense too?

Michael_Ilseman · April 4, 2018, 2:12pm

It's certainly not too much to ask for; I'm very interested in your use case. Do you have an example? Is it that you want to skip over the digits, do you want their numeric value (assuming some semantics about what that even means), etc? Would you want this to be strictly restricted to ASCII or also include half-width numerals and numerals with various combining things after?

Michael_Ilseman · April 4, 2018, 2:32pm

I chose "Uppercased" because I thought that name better fit the subtle choice in casing semantics: uppercased means invariant under case conversion to uppercase. This is slightly different from whether a letter is considered an uppercase letter in a traditional bicameral alphabet.

For example, ʰ (U+02B0: MODIFIER LETTER SMALL H) is considered lowercase via its UCD entry (and traditional interpretation in bicameral alphabets). However, Unicode defines case mapping to uppercase on ʰ to result in ʰ, that is, to be invariant. In this sense, ʰ is both lowercase and uppercased.

A simpler example would be the ASCII digit 7, whose scalar is not considered uppercase or lowercase in the UCD, but is uppercased and lowercased in that it is invariant under case conversion.
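
The "invariant under case conversion" notion can be sketched in a couple of lines. The property name isInvariantUnderUppercasing is hypothetical and chosen only to avoid implying any particular proposed spelling; the sketch assumes Swift 5's Character.uppercased() and Unicode.Scalar.Properties.

```swift
extension Character {
    // Hypothetical name: true when uppercasing leaves the Character unchanged.
    var isInvariantUnderUppercasing: Bool {
        return String(self) == self.uppercased()
    }
}

let modifierH: Character = "\u{02B0}"   // ʰ MODIFIER LETTER SMALL H
let seven: Character = "7"

// Scalar-level binary properties, straight from the UCD.
let hIsLowercaseScalar = ("\u{02B0}" as Unicode.Scalar).properties.isLowercase
let sevenIsUppercaseScalar = ("7" as Unicode.Scalar).properties.isUppercase

print(modifierH.isInvariantUnderUppercasing, hIsLowercaseScalar)
print(seven.isInvariantUnderUppercasing, sevenIsUppercaseScalar)
```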

Since binary properties on a single scalar cannot be generalized to a sequence of scalars (e.g. a string or grapheme), and casing is tricky, we went with Unicode's recommendation of casing as applicable to a sequence of scalars (it's basically a concatMap of case conversion). That is, we chose to consider graphemes as more like a sequence of scalars than a scalar itself.

Does this make any sense? I definitely want to clarify this a bit more in the comments, preferably without a deep dive into Unicode casing :-). Any ideas how to more succinctly communicate this distinction?

nicklockwood · April 4, 2018, 2:50pm

Personally I was quite surprised to discover that [NS]CharacterSet's definition of decimalDigits included Unicode characters outside of the ASCII 0-9 range.

It's possible that that is useful to somebody, but my naive assumption is that the common use-case for an isDigit property would be for numeric input validation, or as part of something like a programming language or mathematical expression parser, so the expectation is that it would only match 0-9.
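
A sketch of the difference, assuming Swift 5's Character.isNumber and Character.asciiValue. The isASCIIDigit helper is a hypothetical name. It goes through asciiValue rather than a Character range check, because a range such as "0"..."9" would also admit a digit followed by a combining mark.

```swift
extension Character {
    // Hypothetical helper: matches only the ASCII digits 0 through 9.
    var isASCIIDigit: Bool {
        guard let value = asciiValue else { return false }
        return (48...57).contains(value)   // 0x30 "0" through 0x39 "9"
    }
}

let arabicIndicThree: Character = "\u{0663}"   // ٣ ARABIC-INDIC DIGIT THREE
let accentedOne: Character = "1\u{0301}"       // "1" plus a combining acute accent

print(arabicIndicThree.isNumber, arabicIndicThree.isASCIIDigit)
print(accentedOne.isASCIIDigit)
```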

Michael_Ilseman · April 4, 2018, 2:52pm

Not necessarily. E.g., my understanding is that NSString's methods also accommodate localization.

Any deprecated imports from Foundation will have a corresponding @available(renamed:) entry that tells the compiler/IDE/migrator the new name for a method. It gets a little tricky for remapping APIs that differ, but that could be done with a custom migration rule (IIRC).

I think these are outside the scope of this pitch, though I'm open to argument. The properties in this pitch are not meant to be used for linguistic processing necessarily, though if they happen to do the right thing all the better.

I think Paragraphs and Sentences are pretty far along the diminishing-returns curve of general applicability. They make more sense linguistically, or as part of a UI framework, while something like Character.isWhitespace is useful for processing Strings in a language-agnostic non-UI context (e.g. source code).

"String.Lines" does not do linguistic analysis to determine places to perform word wrapping, but rather maps onto a programmer's notion of a newline as a terminator/separator. Unicode's recommendation:

String.Lines would be a collection of Substrings containing the lines themselves without the terminator (which can always be recovered by accessing the Substring between the two slices).
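
The behavior described for String.Lines can be approximated today with split, as a sketch. This assumes Swift 5's Character.isNewline and is not the pitched implementation; the sample text is made up.

```swift
let text = "alpha\r\nbeta\ngamma"

// Split on newline Characters; "\r\n" counts as one separator because
// it is a single Character.
let lines = text.split(omittingEmptySubsequences: false) { $0.isNewline }

// The terminator is the slice between two adjacent lines.
let firstTerminator = text[lines[0].endIndex..<lines[1].startIndex]
let secondTerminator = text[lines[1].endIndex..<lines[2].startIndex]
```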

String.Words falls in a fuzzy middle ground, but it seems like it could be generally useful and avoid a robustness trap exposed by Character.isWhitespace.

nicklockwood · April 4, 2018, 2:53pm

That seems reasonable.

Michael_Ilseman · April 4, 2018, 2:54pm

I debated adding a whole slew of isASCIIFoo properties onto Character for convenience. It was quite the API surface area, though. Do you have any ideas how to expose this kind of functionality? Perhaps an OptionSet of ASCII-like properties?

edit: My draft for these was something like:

extension Character {
  // <CR+LF, or single scalar <= 0x7F>
  var isASCII: Bool { get }

  // <CR, LF, CR+LF, maybe VT/FF?>
  var isASCIINewline: Bool { get } 

  // <defined to be isASCIINewline, space, tabs, etc.>
  var isASCIIWhitespace: Bool { get }

  var isASCIIUppercase: Bool { get }
  var isASCIILowercase: Bool { get }
  var isASCIILetter: Bool { get }
  var isASCIINumeric: Bool { get }
  var isASCIIHexDigit: Bool { get }
  var isASCIIAlphanumeric: Bool { get }
  var isASCIIControl: Bool { get }
  var isASCIIPunctuation: Bool { get }
  var isASCIIGraphic: Bool { get }
  // ...
}
extension BinaryInteger {
  init?(ascii: Character, radix: Int = 10) { ... }
}
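
For illustration only, one way the drafted initializer could behave. The body below is an assumption, not the draft's actual implementation: it accepts a single ASCII digit or letter, interprets it in the given radix, and fails otherwise.

```swift
extension BinaryInteger {
    // Hypothetical body for the drafted init?(ascii:radix:).
    init?(ascii character: Character, radix: Int = 10) {
        precondition((2...36).contains(radix), "radix must be in 2...36")
        guard let value = character.asciiValue else { return nil }
        let digit: Int
        switch value {
        case 48...57:  digit = Int(value) - 48        // "0"..."9"
        case 65...90:  digit = Int(value) - 65 + 10   // "A"..."Z"
        case 97...122: digit = Int(value) - 97 + 10   // "a"..."z"
        default: return nil
        }
        guard digit < radix else { return nil }
        self.init(digit)
    }
}

let seven = Int(ascii: "7")
let fifteen = Int(ascii: "f", radix: 16)
let notOctal = Int(ascii: "9", radix: 8)
```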

allevato · April 4, 2018, 3:05pm

This seems like a reasonable approach at first glance. There were other issues that prevented us from considering OptionSets for the Unicode.Scalar properties (performance of having to read all the properties up front, and the fact that we'd start running up against the limits of UInt64 with the number of properties we have), but those don't exist here: there's a small, well-defined, and immutable set of ASCII character classes we care about, and we can delay computing the option set until that specific property of the character is requested by the user.

So we'd have something like this:

("A" as Character).asciiClasses == [.uppercase, .letter, .hexDigit, .alphanumeric, .graphic]
("A" as Character).asciiClasses.contains(.letter) == true
("A" as Character).asciiClasses.contains(.numeric) == false

I'm not married to the property name (feel free to improve it), but WDYT?
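
A minimal sketch of how such an option set might be put together, assuming Swift 5's Character.asciiValue. The ASCIIClasses type, its members, and the classification rules are all hypothetical and cover only a subset of the drafted properties.

```swift
// Hypothetical option set of ASCII character classes.
struct ASCIIClasses: OptionSet {
    let rawValue: UInt16

    static let uppercase    = ASCIIClasses(rawValue: 1 << 0)
    static let lowercase    = ASCIIClasses(rawValue: 1 << 1)
    static let letter       = ASCIIClasses(rawValue: 1 << 2)
    static let numeric      = ASCIIClasses(rawValue: 1 << 3)
    static let hexDigit     = ASCIIClasses(rawValue: 1 << 4)
    static let alphanumeric = ASCIIClasses(rawValue: 1 << 5)
    static let graphic      = ASCIIClasses(rawValue: 1 << 6)
    static let whitespace   = ASCIIClasses(rawValue: 1 << 7)
}

extension Character {
    // Computed on demand; non-ASCII Characters get the empty set.
    var asciiClasses: ASCIIClasses {
        guard let value = asciiValue else { return [] }
        var classes: ASCIIClasses = []
        switch value {
        case 65...90:  classes.formUnion([.uppercase, .letter, .alphanumeric])
        case 97...122: classes.formUnion([.lowercase, .letter, .alphanumeric])
        case 48...57:  classes.formUnion([.numeric, .hexDigit, .alphanumeric])
        default: break
        }
        // A-F and a-f are hex digits in addition to being letters.
        if (65...70).contains(value) || (97...102).contains(value) {
            classes.insert(.hexDigit)
        }
        if (33...126).contains(value) { classes.insert(.graphic) }
        if value == 32 || (9...13).contains(value) { classes.insert(.whitespace) }
        return classes
    }
}

let classesOfA = ("A" as Character).asciiClasses
```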

nicklockwood · April 4, 2018, 3:17pm

This seems like it's getting close to a reinvention of Foundation's CharacterSet class, but with an ASCII spin.

Maybe CharacterSet should be given an ASCIICharacterSet counterpart? Then the API would be more like

let isDigit = ASCIICharacterSet.digits.contains(character)