[Pitch] Simple Enhancement: make String/Substring isASCII available

This is a very small pitch to gauge interest and figure out whether there are any sharp edges I can’t see. So, without further ado:

Expose whether Swift strings are ASCII

Introduction

This proposal introduces an API that exposes whether a Swift String or Substring is known to be ASCII.

Motivation

There are a number of contexts in which a String that is known to be ASCII can be handled more efficiently than other kinds of strings. This is particularly common in network programming, where non-ASCII strings are frequently either forbidden or require special handling.

Swift strings already keep track of whether they are known to be ASCII, but this information is not currently exposed as API. This means that to answer this question, programmers are forced to scan the String and look at every byte. This is entirely redundant computation that could easily be elided.
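For illustration, the kind of scan people write today looks roughly like the sketch below. The property name `isASCIIByScanning` is hypothetical and is not part of this pitch; it only shows the O(n) walk over the UTF-8 code units that the pitch aims to avoid.

```swift
extension StringProtocol {
    /// Hypothetical helper, not part of the pitch.
    /// Visits every UTF-8 code unit; a value below 0x80 is ASCII.
    var isASCIIByScanning: Bool {
        utf8.allSatisfy { $0 < 0x80 }
    }
}

print("Hello, world".isASCIIByScanning)       // true
print("Hello, world ☔️".isASCIIByScanning)    // false
print("café".dropLast().isASCIIByScanning)    // true (Substring "caf")
```

Because it is written against StringProtocol, the same scan works for both String and Substring, but it always costs a full pass over the bytes.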

As an example of the need, I will note that Apple itself has implemented String-is-ASCII checks at least 5 times in public repositories:

Presumably even more exist in the wider world.

Generally speaking it is considered good practice not to perform unnecessary computations. In this case, all of these computations are unnecessary. All Swift string objects know, statically, whether they are ASCII or not. This is determined by a simple static computation that relies on bitmasking a number of fields in the string.

This pitch proposes making this access public, and regaining the cycles spent checking for something we already know.

Detailed design

This pitch proposes to add a new public API on String and Substring:

  public var isASCII: Bool { get }

This API will be implemented on top of the existing APIs on _StringGuts:

  @inlinable @inline(__always)
  internal var isASCII: Bool  {
    return _object.isASCII
  }

This API will be public and @_alwaysEmitIntoClient. The reason for the latter is that it enables us to back-deploy this accessor. The APIs it relies upon are as available as String, so making this API AEIC ensures that we are able to offer this API without requiring an availability guard.

Alternatives considered

There are very few alternatives. The current status quo is one.

We could consider promoting the full testing API surface to public API, to enable more introspection of String representations. This is probably out of scope. String’s representational choices are mostly of no interest to users of Swift. Whether a String is small or not is rarely immediately relevant. Similarly, its capacity and count are already exposed in other places.

Source compatibility

This proposal has no impact on source compatibility.

Effect on ABI stability

This proposal has no impact on ABI stability, as no new symbols will be added.


Would this property act as a definitive source of information, or will isASCII only be true when the string is already known to be ASCII? In other words, does isASCII == false mean that the String definitely contains at least one non-ASCII scalar, or just that it is not known to be all ASCII? I ask because the name isASCII reads as the former to me (i.e. isASCII == false means that there is at least one non-ASCII character in the string). I had thought that String's isASCII bit represented the latter and is not guaranteed to be known (for cases like lazily bridged NSStrings, etc.), but I might be wrong on that understanding.


Isn’t this precisely why the existing API on UTF8Span is called isKnownASCII?


Yeah, the name needs some work. It might need to be isDefinitelyASCII, because there are some contexts in which this field returns false but the answer is actually true. For example, this field is not updated after the string is initialized:


  1> var f = "Hello, world, it's me ☔️"
f: String = "Hello, world, it\'s me ☔️"
  2> f._classify()._isASCII
$R1: Bool = false
  3> f.removeLast()
$R2: String.Element = "☔️"
  4> f
$R3: String = "Hello, world, it\'s me "
  5> f._classify()._isASCII
$R4: Bool = false

I also believe foreign strings have the same pattern.

Another option would be to have a different behaviour on false, whereby we fall back to doing an explicit scan of the String. This would cause this API to silently get slower on certain Strings, which is probably suboptimal.
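A minimal sketch of that fallback behaviour, for illustration only. The property name `isASCIICheckingIfNeeded` is hypothetical, and `_classify()` is the underscored, testing-only hook used in the REPL session above, standing in here for the accessor this pitch would expose; it is not a supported API.

```swift
extension String {
    /// Hypothetical name, not part of the pitch.
    var isASCIICheckingIfNeeded: Bool {
        // Fast path: the stored flag is only ever set for all-ASCII contents.
        // `_classify()` is an underscored testing hook, used for illustration.
        if _classify()._isASCII { return true }
        // Slow path: the flag is unset, so the answer is unknown. Scan.
        return utf8.allSatisfy { $0 < 0x80 }
    }
}

var f = "Hello, world, it's me ☔️"
f.removeLast()
// Per the session above the flag stays false after the removal,
// so this call is expected to take the scanning path.
print(f.isASCIICheckingIfNeeded)  // true
```

The result is correct either way, but the cost varies between O(1) and O(n) depending on how the string was produced, which is the silent slowdown mentioned above.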

I approve @ksluder’s proposal for isKnownASCII, and if I take this forward to a proposal I’ll amend the name to that.


This seems useful to have, and I agree with the previous commenters that isKnownASCII is an accurate and precedented name for it.

Just for the record, if I had a string consisting of ASCII-only characters but without the _object.isASCII bit set (i.e. not yet known ASCII), what would be the process of converting it into a String value that did?

There is a method on UTF8Span that does exactly this: checkForASCII()

The APIs go even further, introducing another function for checking if the string is in Normalization Form C, if not already ASCII, which can be useful in certain implementations (e.g. in IDNA): checkForNFC()


The checkForASCII() function on UTF8Span does that as well (it’s a mutating function), although only on the span itself, which, while it makes sense, might be suboptimal.

There is probably some way to make String re-check for ASCII and set its isASCII bit, but I haven’t noticed any direct or public APIs.

Well, we just went through this with isIdentical/isKnownIdentical, and the same objections are probably relevant here.

Could I tentatively suggest we go with [brand-new] tradition and consider isTriviallyASCII or something along that line? Or at least avoid having the long discussion about '…Known…' again?


In response to myself, UTF8Span does require a good number of availability guards though, so that can be one thing in favor of introducing the APIs this pitch is suggesting.

I'll be happy for us to have a more direct way for an is-ascii check as well to be honest. It's a very basic thing to have.

My main issue with asking people to generate UTF8Spans to call this is that doing so is nontrivially expensive for bridged NSStrings in some cases. Given that, I’m somewhat in favor of replicating it on String itself.


I see no reason not to expose this property, nameshedding aside. +1.


The isKnownASCII and isKnownNFC names have already been established on UTF8Span by SE-0464. They do not need to be and should not be relitigated for other string types.

The debate about ...Known... for SE-0494 was more around the specific concept of identicality for arbitrary types and its relationship to the underlying memory representation of those types.


I don't want to push too hard on this, and the SE-0494 discussion was very tangled with other issues, but the "known" part of the discussion wasn't about the concept of identity. It was about the semantics of what it is to be known (in this context).

For future developers who aren't aware how we got to a function name using "known", there's a plausible expectation that it means something metaphysically complex, when it's actually the opposite — the name is reaching for something more … um … trivial.

I feel you're being a bit overly prescriptive here. I see that you were the review manager for SE-0464, and I understand you may feel a natural sense of ownership of the result of the review. However, it's not necessary that litigation should be off the table for String, or even that the established name needs to be mirrored here.

An established, prelitigated name has an advantage in this horse race, but it doesn't own the race. :slight_smile:

Finally:
I was also trying to subtly suggest that we seem to have found ourselves a class of functions across various types that have the same naming challenge — the function is intended to be understood to cheaply produce its result, but a false result doesn't mean the absence of an intuitive property, rather just the inability to determine its presence.

Maybe there's value in adopting a consistent term for Swift names that helps developers avoid making incorrect semantic assumptions. :person_shrugging:


FWIW the authors of SE-0494 could — and did — claim that the isIdentical name was already established by SE-0447. Individual members of the LSG put forward the claim that these methods from Span et al were not necessarily a "prior art" because of their opinion about the "nature" of Span… but AFAIK the LSG in the aggregate never reached or endorsed this conclusion.

The LSG approved isTriviallyIdentical(to:) for:

It seems to me that if we already chose isKnownASCII for UTF8Span there should be nothing stopping us from choosing a new name for String if LSG collectively decides that new name is "better". Whether or not the isKnownASCII method on UTF8Span is or is not a legit prior art can be discussed… but its presence in SE-0464 does not necessarily make it a prior art that must — or should — be followed.

While of course it's the case that we can decide to revisit the name of any API, the main points about Span.isIdentical compared to the other types in SE-0494 were that

  1. Span.isIdentical was a single API among a fairly large proposed API surface, and so it may have not received the depth of discussion that it would have gotten had we planned to make it the first of many APIs with that name and behavior.
  2. Even given that last point, isIdentical was a perfectly good name for that API on Span because a Span is tautologically its identity. But given the opportunity to choose a different name for types where the definition of identity was not so clear, it made sense for consistency to update Span to use that new name as well.

Neither of these points holds for isKnownASCII and isKnownNFC: they're a completely different class of API from isIdentical, their meaning for UTF8Span and String is exactly the same, and I can assure you that their names received ample consideration by the Language Steering Group during the discussion of UTF8Span.
