Pitch: Character and String properties

jrose · April 13, 2018, 5:30pm

<pedantry>Well, the writing system isn't necessarily base-10. There are base-60 values in Unicode. But the base only matters if there's more than one digit anyway, and even then that's only true in a numeric system that uses position. "ⅭⅯ" (U+216D U+216F) would have a numeric value of 900, and 「三十万二百」 is 300200.</pedantry>
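
To make the positional point concrete, here is a minimal sketch. It assumes the per-scalar `Unicode.Scalar.Properties.numericValue` from the Swift 5 standard library rather than anything proposed in this thread, and the helper name `scalarValues` is made up for illustration.

```swift
// Hypothetical helper (not part of the pitch): the Unicode numeric value of
// each scalar in a string, skipping scalars that have none.
func scalarValues(_ s: String) -> [Double] {
    return s.unicodeScalars.compactMap { $0.properties.numericValue }
}

// ROMAN NUMERAL ONE HUNDRED followed by ROMAN NUMERAL ONE THOUSAND.
let roman = "\u{216D}\u{216F}"
let values = scalarValues(roman)   // per-scalar values: 100 and 1000

// Folding the scalars as if they were positional base-10 digits gives 2000,
// not the 900 that a reader of Roman numerals would expect.
let positional = values.reduce(0) { $0 * 10 + $1 }
```

Each scalar has a well-defined value on its own; the value of the sequence depends on the rules of the numeric system, which no per-scalar property can capture.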

Michael_Ilseman · April 13, 2018, 5:31pm

<pedantry> Character.decimalDigitValue returns nil for those. That's my point. </pedantry>

edit: it's spelled decimalDigitValue

jrose · April 13, 2018, 6:42pm

Oh. Why? People shouldn't build their own base-10 parsers on top of decimalDigitValue; that would exclude other ways of writing numbers.

Michael_Ilseman · April 13, 2018, 6:47pm

We could do a more general query on graphemes:

If a Character unambiguously represents a decimal digit in a well-defined writing system, it makes sense to be able to access that value.

We also totally should have a more general number parsing solution. I'm interested in this, but it's outside the scope of this pitch.

jrose · April 13, 2018, 10:18pm

I don't understand why "decimal digit" is an interesting case but「万」is not. Did I miss a point in the thread above? (If so, sorry!)

Michael_Ilseman · April 13, 2018, 10:37pm

万 would be considered a number, but not a decimal digit. Are you asking why it is not a decimal digit, or why we are not providing an API that covers it?

jrose · April 13, 2018, 10:39pm

I'm asking why there's a specific API for isDecimalDigit or decimalDigitValue at all. "Hex_Digit" is special, you justified that above, but "decimal digit" doesn't seem to be, even though it's one of the properties Unicode provides.

Michael_Ilseman · April 13, 2018, 10:41pm

How is isDecimalDigit not special but isHexadecimalDigit is? How else can you get the value of a character that represents a decimal digit?

jrose · April 13, 2018, 10:45pm

A numericValue would behave correctly on all values that respond to isNumber. The only reason to ask isDecimalDigit is to build a string-to-numeric converter that handles multiple scripts but not any non-decimal number formats, which is not something you'd want in a production program. (You'd either want to just handle ASCII, or include handling of other number formats.) Hex is special because that takes Characters that are not numbers and gets a value for them, and it's limited specifically to the Roman script representations of hexadecimal (at least today).
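
The distinction between hex digits and numbers can be sketched as follows. This assumes the property names that exist on `Character` in Swift 5 and later (`isNumber`, `hexDigitValue`) and the scalar-level `numericValue`, not the pitch-era spellings used in this thread.

```swift
let a: Character = "a"
let half: Character = "½"

// "a" is not a number in Unicode terms, yet it has a hexadecimal value.
let aIsNumber = a.isNumber        // false
let aHex = a.hexDigitValue        // 10

// "½" is a number but not a hex digit; its value is only available as a
// Double at the scalar level.
let halfIsNumber = half.isNumber  // true
let halfHex = half.hexDigitValue  // nil
let halfNumeric = half.unicodeScalars.first!.properties.numericValue  // 0.5
```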

But eh, maybe I should withdraw this objection. I've been against "simple-but-not-100%-correct" APIs and the tide seems to be moving away from that view in general (cf. String-as-Collection, which I was also against).

Michael_Ilseman · April 13, 2018, 10:49pm

No, not as formulated in this pitch. We would need a grapheme evaluator for that.

We could do an alternative where we have a single-scalar restricted query and an associated numericValue. Would you want that to also cover vulgar fractions?

Michael_Ilseman · April 14, 2018, 12:11am

Here is a Venn diagram of numbers as we're discussing them:

I'm all for broadening the decimal digit API to include single scalar whole numbers, but am shying away from representing vulgar fractions, etc. Semantics would be single-scalar number with a whole number numeric value. What should this be called?

Some options:

isWholeNumber and wholeNumberValue: Int
isDigit and digitValue: Int, noting this is not the same as Unicode's notion of digit and it includes things like 万 and 𐏕, which are definitely numbers, though I'm not sure they colloquially qualify as "digits".
(throws up hands) add numericValue: Double which is only defined on single-scalar numbers, but includes vulgar fractions etc.
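
For reference, a small sketch of the first option's semantics, assuming the `isWholeNumber` / `wholeNumberValue` spellings as they exist on `Character` in Swift 5 and later. The sample characters are chosen for illustration only.

```swift
let samples: [Character] = ["7", "④", "万", "½", "x"]

// Single-scalar characters with a whole-number numeric value yield that
// value; fractions and non-numbers yield nil.
let wholeValues = samples.map { $0.wholeNumberValue }

// "½" is still a number, just not a whole number.
let half: Character = "½"
let halfIsNumberButNotWhole = half.isNumber && !half.isWholeNumber
```

Under these semantics 万 is covered (value 10000) while vulgar fractions are left out, which matches the "single-scalar number with a whole number numeric value" description above.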

Karl · April 16, 2018, 1:11pm

Why not a failable initialiser? One that isn't necessarily limited to ASCII.

So it would be:

// Rename the existing Int.init?(String) to indicate ASCII limitation.
- FixedWidthInteger.init?<S: StringProtocol>(_: S, radix: Int = 10)
+ FixedWidthInteger.init?<S: StringProtocol>(ascii: S, radix: Int = 10)

// Add Int.init?(ASCII/Unicode Char) initialisers
+ FixedWidthInteger.init?(ascii: Character, radix: Int = 10)
+ FixedWidthInteger.init?(_: Character)

This also leaves room for future expansion:

// Int.init?(Unicode String)
+ FixedWidthInteger.init?<S: StringProtocol>(_: S)

// Handling vulgar fractions
+ FloatingPoint.init?(_: Character)
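
A rough sketch of how the non-ASCII `Character` initialiser could behave, assuming it is built on `Character.wholeNumberValue` from Swift 5. The `wholeNumber:` argument label is hypothetical; it is used here instead of the unlabelled form above only so the sketch does not add an overload next to the existing `init?(_:radix:)`.

```swift
extension FixedWidthInteger {
    /// Hypothetical initialiser: succeeds when `character` has a whole-number
    /// value that is representable in `Self`.
    init?(wholeNumber character: Character) {
        guard let value = character.wholeNumberValue else { return nil }
        // Fails (returns nil) when the value does not fit, e.g. 10000 in Int8.
        self.init(exactly: value)
    }
}

let seven = Int(wholeNumber: "7")        // 7
let tenThousand = Int(wholeNumber: "万")  // 10000
let tooBig = Int8(wholeNumber: "万")      // nil, does not fit in Int8
let notANumber = Int(wholeNumber: "a")   // nil
```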

Michael_Ilseman · April 26, 2018, 1:56am

Just posted an updated pitch, where I tried to address this. I think it's a good idea, just have to weigh against other practical concerns such as type checking complexity, source compatibility, and whether we want the non-ASCII visible directly on e.g. Int.

Michael_Ilseman · April 27, 2018, 5:50pm

@rudkx do you think it's worth including the following (rename source-break) in this pitch? What are some of the reasons it improves type checking consistency? I think it's an API improvement by itself, but would like to form more justification:

- FixedWidthInteger.init?<S: StringProtocol>(_: S, radix: Int = 10)
+ FixedWidthInteger.init?<S: StringProtocol>(ascii: S, radix: Int = 10)

+ FixedWidthInteger.init?(ascii: Character, radix: Int = 10)

If we have a good argument, then @Karl's suggestions become more interesting as well!

griotspeak · April 27, 2018, 8:46pm

I, for one, think that this change is the correct move.

dlbuckley · December 20, 2018, 11:54am

I was casually looking through some of the proposals that have been implemented in Swift 5 and saw that SE-221 has gone in, adding the Character properties that were at the core of this pitch.

I saw in the accepted proposal that the additional properties to String were omitted due to:

keep focus and await a more generalized lazy split collection

Since the pitch has been accepted and implemented, the focus could now be moved onto getting the String properties added. I attempted to find the "generalized lazy split collection" discussion but couldn't seem to find it.

So my question is: what's the current status of this? Is it something that simply needs bandwidth to sort out, or are there deeper implications that I'm just not aware of?

(Sorry for bumping up this thread, but the search just doesn't seem to want to return results for SE-221)

Michael_Ilseman · December 20, 2018, 4:12pm

The plan is to revisit Range-based Collection mutations, which includes subsequence find/replace and more generalized splitting. Doing this generically (Collection/OrderedCollection/BidirectionalCollection/whatever) would mean that they apply to String and all its views, so you can e.g. operate against the stream of Unicode.Scalar values that comprise a String.
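
A rough sketch of what "generic" buys here, not the planned API: a naive subsequence search written once against `Collection` works on a String's Character view and on its `unicodeScalars` view alike. The name `sketchFirstRange(of:)` is hypothetical and chosen only for this example.

```swift
extension Collection where Element: Equatable {
    /// Hypothetical helper: range of the first occurrence of `pattern`,
    /// found by naive comparison at each starting position.
    func sketchFirstRange<C: Collection>(of pattern: C) -> Range<Index>?
        where C.Element == Element
    {
        var start = startIndex
        while true {
            var i = start
            var p = pattern.startIndex
            // Advance both cursors while the elements keep matching.
            while p != pattern.endIndex, i != endIndex, self[i] == pattern[p] {
                formIndex(after: &i)
                pattern.formIndex(after: &p)
            }
            if p == pattern.endIndex { return start..<i }
            if start == endIndex { return nil }
            formIndex(after: &start)
        }
    }
}

let s = "cafe\u{301} au lait"   // "e" followed by COMBINING ACUTE ACCENT
let needle = "e"

// Character view: "e" + accent is a single grapheme, so a bare "e" is absent.
let characterHit = s.sketchFirstRange(of: needle)

// Unicode scalar view: the bare "e" scalar is found at offset 3.
let scalars = s.unicodeScalars
let scalarHit = scalars.sketchFirstRange(of: needle.unicodeScalars)
let scalarOffset = scalarHit.map {
    scalars.distance(from: scalars.startIndex, to: $0.lowerBound)
}
```

The same function gives different answers on the two views because the element type differs, which is the kind of choice a generic find/replace or split would leave to the caller.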