Consistent Numeric Representation in Strings and Literals

At the moment the following are valid numeric literals:

123_456_789, 0xf456_123e, and 0b1011_0001

However, the equivalent string representations of these numbers are not accepted.
For example, Int("123_456_789") returns nil.
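Concretely, in a quick playground check (the comments just restate what current Swift does):

let a = 123_456_789          // compiles, value 123456789
let b = 0xf456_123e          // compiles as a hex literal
let c = 0b1011_0001          // compiles as a binary literal
Int("123_456_789")           // nil
Int("0xf456_123e")           // nil
Int("0b1011_0001")           // nil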

This dichotomy of numeric representations presents an unfortunate stumbling block for newcomers to Swift, who expect these similar-looking numeric strings and literals to work in the same way. Can the string conversions be "fixed" to allow the underscores and radix prefixes that are supported in the numeric literals? I know of a number of libraries that support these string variants and return the expected numbers, which results in even more confusion when Swift's built-in types do not accept similar strings.

4 Likes

At first thought, there should at least be an optional format parameter that would allow selecting between the current digits-only behavior, the proposed literal-format behavior, and any other formats we may add support for later, e.g. C literals.
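Something like this, purely as a sketch (the IntegerParseFormat type and the format: parameter don't exist today, and the prefix handling below glosses over signs and malformed separators):

enum IntegerParseFormat {
    case digitsOnly      // today's Int.init(_:) behavior
    case swiftLiteral    // underscores plus 0b/0o/0x prefixes
}

extension Int {
    init?(_ text: String, format: IntegerParseFormat) {
        switch format {
        case .digitsOnly:
            self.init(text)
        case .swiftLiteral:
            // Strip separators, then map a literal prefix to a radix.
            var s = text.filter { $0 != "_" }
            var radix = 10
            if s.hasPrefix("0x") { radix = 16; s.removeFirst(2) }
            else if s.hasPrefix("0o") { radix = 8; s.removeFirst(2) }
            else if s.hasPrefix("0b") { radix = 2; s.removeFirst(2) }
            self.init(s, radix: radix)
        }
    }
}

Int("123_456_789", format: .swiftLiteral)  // 123456789
Int("0b1011_0001", format: .swiftLiteral)  // 177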

2 Likes

So would Int("0x123"). And there is no integer literal matching Int("1234", radix: 5), e.g. one looking like 0₅1234.
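For reference, this is what the existing radix-based initializer gives you today:

Int("0x123")          // nil
Int("123", radix: 16) // 291
Int("1234", radix: 5) // 194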

"Integer", "Integer literal", and "Integer string representation" are just three different things, plus the forth "Integer localised string representation".

1 Like

Support for all literal integer variants was what I wanted. Since the integer literals don't have support for bases other than 2, 8, 10, and 16, neither would the string integer representation. I'm just asking for a consistent number representation in all cases.

Perhaps, but to a novice they seem identical and are easily confused. For example, I wouldn't know what an "Integer localised string representation" would look like.

Sounds a bit like a C-style format argument. Ducking, :wink:

Seriously, that may be one way to handle different numeric variants, as @tera was implying.

you’re essentially asking for (a small subset of) the swift syntax parser to ship with the standard library.

how would such a thing evolve if swift gains support for more kinds of integer literals in the future?

2 Likes

١٢٣ = 123 (Arabic)
४५६ = 456 (Hindi)

78 901 (SI) > 78,901 (French) = 78.901 (English)

6 Likes

I actually don't think this is very desirable.

Strings are messy, and for initialisers which parse runtime strings, I think there is value in limiting the set of accepted inputs instead of being overly permissive. If I was parsing some data, and the string "0_0" parsed as an integer, I would find that unexpected. Same with "0_0____3___". Both are accepted by the compiler as integer literals, though.
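For example (assuming current behavior, where the string initializer rejects underscores):

let a = 0_0           // compiles, value 0
let b = 0_0____3___   // compiles, value 3
Int("0_0")            // nil today
Int("0_0____3___")    // nil today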

Literals are different from runtime strings. The value is right there staring at you, so we can build some conveniences into the language with relatively low risk of surprising people.

I would recommend reading The Harmful Consequences of the Robustness Principle, an informational IETF document which argues that prioritising permissiveness has led to a gradual decline in quality for internet standards. To summarise:

The robustness principle, often phrased as "be conservative in what you send, and liberal in what you accept", has long guided the design and implementation of Internet protocols. The posture this statement advocates promotes interoperability in the short term, but can negatively affect the protocol ecosystem over time.

[...]

Applying the principle defers the effort of dealing with interoperability problems, which prioritizes progress. However, deferral can amplify the ultimate cost of handling interoperability problems.

Divergent implementations of a specification emerge over time. When variations occur in the interpretation or expression of semantic components, implementations cease to be perfectly interoperable.

Implementation bugs are often identified as the cause of variation, though it is often a combination of factors. Application of a protocol to uses that were not anticipated in the original design, or ambiguities and errors in the specification are often confounding factors. Disagreements on the interpretation of specifications should be expected over the lifetime of a protocol.

Even with the best intentions, the pressure to interoperate can be significant. No implementation can hope to avoid having to trade correctness for interoperability indefinitely.

An implementation that reacts to variations in the manner recommended in the robustness principle sets up a feedback cycle. Over time:

  • Implementations progressively add logic to constrain how data is transmitted, or to permit variations in what is received.
  • Errors in implementations or confusion about semantics are permitted or ignored.
  • These errors can become entrenched, forcing other implementations to be tolerant of those errors.

Consider what would happen if Swift applications parsing integers using Int(someString) suddenly started accepting underscores -- they would permit new variations in the data they receive (the first point). If those Swift applications gain some traction, other applications will be pressured to support underscores as well for the sake of interoperability. Even if the authors of that Swift application never intended to support this notation, they now do, and it has spread throughout some unsuspecting ecosystem.

I can point to similar situations elsewhere in computing. For example, when parsing an IPv4 address in a URL, most browsers in the 90s would defer to libc's inet_aton function, which seemed to be the obvious choice (much like Int(String) is the obvious way to parse an integer). Unfortunately, this function accepts a wide variety of inputs -- far more than was actually intended to be supported in URLs.

Decades later, those inputs still need to be supported, for fear that somebody may be depending on them. The result is that https://0xbadf00d/ is a valid URL, and equivalent to https://11.173.240.13/. That adds significant implementation complexity, and some accepted inputs have skirted on the edge of being security vulnerabilities. All to support a feature that nobody ever wanted in the first place.

So yeah - it is good to have a simple, strict parser without these kinds of surprising edge-cases.


As a secondary matter, the added implementation complexity could potentially hurt performance. I would consider Int(String) to be a high-impact function, and I don't think supporting these notations is so important that everybody who uses the function should pay for it.

If you want to parse integer literals like the compiler does when interpreting Swift source code, I think the swift-syntax library should provide that.

20 Likes
I made this simple app

import Foundation

let formatter = NumberFormatter()
formatter.numberStyle = .decimal
let n = NSNumber(value: 123456789123)
var variants: [String: String] = [:]

for id in Locale.availableIdentifiers {
    formatter.locale = Locale(identifier: id)
    let string = formatter.string(from: n)!
    variants[string] = id   // keep one representative locale per distinct string
}
print(variants)

to create a table:

123'456'789'123		en_CH
123 456 789 123		uk_UA // NO-BREAK SPACE
123’456’789’123		it_CH
१,२३,४५,६७,८९,१२३		mr
١٢٣٬٤٥٦٬٧٨٩٬١٢٣		ar_YE
123,456,789,123		am
123 456 789 123		fr_TD // NARROW NO-BREAK SPACE
༡,༢༣,༤༥,༦༧,༨༩,༡༢༣		dz
1,23,45,67,89,123	ml
၁၂၃,၄၅၆,၇၈၉,၁၂၃	my
১,২৩,৪৫,৬৭,৮৯,১২৩	bn_BD
꯱꯲꯳,꯴꯵꯶,꯷꯸꯹,꯱꯲꯳	mni_Mtei_IN
123ወ456ወ789ወ123	gez_ET
१२३,४५६,७८९,१२३		sat_Deva
᱑᱒᱓,᱔᱕᱖,᱗᱘᱙,᱑᱒᱓		sat
𑄷,𑄸𑄹,𑄺𑄻,𑄼𑄽,𑄾𑄿,𑄷𑄸𑄹	ccp_BD
۱۲۳٬۴۵۶٬۷۸۹٬۱۲۳		pa_Arab_PK
১২৩,৪৫৬,৭৮৯,১২৩		mni_Beng_IN
۱٬۲۳٬۴۵٬۶۷٬۸۹٬۱۲۳	ks_Aran_IN
123.456.789.123		es
123456789123		en_US_POSIX
𞥑𞥒𞥓⹁𞥔𞥕𞥖⹁𞥗𞥘𞥙⹁𞥑𞥒𞥓		ff_Adlm_NE

The second column shows a representative locale identifier (out of possibly many) that leads to the string shown on the left.

Besides the obvious differences in the digits themselves, there are about 10 different thousands separators, and in some locales digits are grouped by twos instead of threes, with some interesting rules (see ccp_BD, for example).
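For completeness, the reverse direction also goes through NumberFormatter rather than Int.init; something along these lines should round-trip (locale chosen just for illustration, and I haven't exhaustively tested every locale):

import Foundation

let parser = NumberFormatter()
parser.numberStyle = .decimal
parser.locale = Locale(identifier: "es")
parser.number(from: "123.456.789.123")?.int64Value // expected 123456789123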

4 Likes

i don’t know if FixedWidthInteger.init(_:radix:) is inlinable, but if we wanted it to align with swift integer literals, it could not possibly be inlinable because it would need to evolve with additions to the language.

I can't imagine us adding new integer literals, but even if we did, I don't think there's a problem with evolving the language.

Inlinable functions can still evolve; you just need to be aware that some clients may be using the old implementation. It seems reasonable to ask clients to recompile to get the new functionality (especially since any new formats wouldn't have been considered integer literals when those clients were built, so they never had an expectation of those strings being able to parse).

1 Like

it would be really weird if a string that parses in client code fails to parse when passed to an API in a (binary) dependency that calls the exact same initializer but was compiled with an older toolchain.

it would be even weirder if the behavior was different depending on if the library API itself was inlinable or not, because the inlinable APIs (that expose the call to Int.init(_:radix:)) would themselves be using inlining to achieve the new behavior.

Imagining what we could've used for arbitrary-radix number literals:
10x123    // 123
16x7B     // 0x7B
8x173     // 0o173
2x1111011 // 0b1111011
5x443     // example of currently impossible base

Or some other separator to make it less similar to a multiplication.

This post makes a very, very good point: even if a change seems obviously an enhancement to the standard library, we can't just opt everyone into more permissive string parsing without very plausibly making that very same enhancement someone else's bug.

As far as I'm aware, what counts as a valid string for integer parsing hasn't changed across all publicly released versions of Swift, and changing it would break clients; it probably can't be done now absent a compelling issue that absolutely forces our hand.

Internal thought process:
"Haha, surely this won't just work if I paste this into my browser's address bar..."
Cmd + V
"Oh...oh no. Nonononono. Nope nope nope nope. How do I unsee this?"


The one point of fact I will bring into this discussion is that numeric literals are simultaneously more permissive in some ways and more restrictive in others compared to what's accepted for string conversion.

This thread starts off with an example of the former; as an example of the latter, ".5" is an acceptable string that can be converted to a floating-point value, but .5 is not an acceptable float literal. In the rush of the moment an integer literal example isn't immediately coming to mind, but my recollection is that there are non-trivial examples.
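Concretely:

Double(".5")           // Optional(0.5) -- accepted by the string conversion
// let x: Double = .5  // invalid: '.5' is not an acceptable float literal (must be written 0.5)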

For reasons that are perhaps more intuitive, it would also break clients to make conversion from string stricter than it currently is.

9 Likes

I can’t remember whether Swift allows a leading 0 for octal literals, but at least in JS you have:

> 'use strict'; 066
Uncaught SyntaxError: Octal literals are not allowed in strict mode.
> +'066'
66

octals use 0o, like the 0o664 permissions mask

2 Likes

... and on the flip side, 066 is not an octal literal in Swift; it's decimal, and hence has the value 66.
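In Swift terms (the string results reflect today's behavior):

let mask = 0o664        // 436
let dec  = 066          // decimal 66 -- leading zeros are allowed, not octal
Int("0o664")            // nil
Int("664", radix: 8)    // 436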

+1. I would perhaps view it slightly differently if the docs for Int.init(String) were more vague, but as written they convey very clear validation conditions for the string to be converted to an integer value. Changing this behavior would break the contract that we’ve committed to.

7 Likes

If we wanted to do this, I would introduce a new explicitly-labeled API to interpret a string exactly how we interpret program integer literals, rather than change the behavior of an existing API (does this already exist in, say, swift-syntax? Somewhere like that seems like a more likely home than the stdlib).

9 Likes

Not really, it's a couple of lines of code.
I really wonder about the compiler if such a thing is difficult to parse.
I've written a few compilers, and this is trivial stuff even when processing strings. There should be no performance hit, especially considering how bad the current string-to-number conversions are.

It's actually pretty easy to improve the performance of string-to-number conversions by at least an order of magnitude.

Anyway, I thought this would be helpful; many people in the industry have adopted these string conversions and, at the least, allow radix prefixes. I find plain numeric strings difficult to parse and definitely like how integer literals make numbers easier to read.

Quick quiz for the naysayers: how many zeros are in this number, "1000000000000000000000000"? And again, using the horrible underscores: "1_000_000_000_000_000_000_000_000".

At least we still have integer literals and the new StaticBigInt literal.
It will definitely make life easier when dealing with the larger integer types that are coming. We'll probably see far fewer numeric strings in the future.
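A minimal sketch of how a custom wide type can hook into that (assuming Swift 5.8 or later; the WideNumber type here is made up purely for illustration):

// Hypothetical arbitrary-width type using StaticBigInt as its literal type.
struct WideNumber: ExpressibleByIntegerLiteral {
    var words: [UInt]

    init(integerLiteral value: StaticBigInt) {
        // bitWidth includes the sign bit; read out the two's-complement words.
        let wordCount = max(1, (value.bitWidth + UInt.bitWidth - 1) / UInt.bitWidth)
        words = (0..<wordCount).map { value[$0] }
    }
}

let big: WideNumber = 1_000_000_000_000_000_000_000_000 // no string parsing needed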