Provide a nameAlias for more Unicode scalars

The stdlib's Unicode.Scalar.Properties.nameAlias property returns nil for some Unicode scalars even though Unicode publishes one or more official name aliases for them. Among other Unicode scalars, this affects the common ASCII control characters 0–31.

Example

Take the Unicode scalars U+0009 (tab) and U+000A (line feed):

let tab = Unicode.Scalar(0x9)!
tab.properties.name      // nil (correct!)
tab.properties.nameAlias // nil ❌

let lineFeed = Unicode.Scalar(0xA)!
lineFeed.properties.name      // nil (correct!)
lineFeed.properties.nameAlias // nil ❌

Observations:

  • The fact that these scalars don't have a name is correct: the "Name" field for these (and all other ASCII 0–31 values) in the Unicode Character Database (UCD) is blank.
  • The fact that nameAlias is nil is arguably wrong. These scalars do have one or more official alias names, as listed in the UCD file NameAliases.txt.

Name aliases in the Unicode Character Database

Excerpt from NameAliases.txt for tab (U+0009) and line feed (U+000A):

# semicolon-separated fields
# Format: code point;alias;alias type
0009;CHARACTER TABULATION;control
0009;HORIZONTAL TABULATION;control
0009;HT;abbreviation
0009;TAB;abbreviation
000A;LINE FEED;control
000A;NEW LINE;control
000A;END OF LINE;control
000A;LF;abbreviation
000A;NL;abbreviation
000A;EOL;abbreviation

As you can see, these code points have multiple aliases with different types, in this case 'control' ("ISO 6429 names for C0 and C1 control functions, and other commonly occurring names for control codes") and 'abbreviation' ("Commonly occurring abbreviations (or acronyms) for control codes, format characters, spaces, and variation selectors").

NameAliases.txt defines 5 possible alias types in total: correction, control, alternate, figment, abbreviation.
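
To make the file format concrete, here is a minimal sketch of reading such lines in Swift. The NameAliasEntry type and the parseNameAliases function are hypothetical names made up for this illustration; this is not the stdlib's actual generator code, and it assumes every data line has exactly three semicolon-separated fields as in the excerpt above.

```swift
// Hypothetical record type for one data line of NameAliases.txt.
struct NameAliasEntry: Equatable {
    let codePoint: UInt32
    let alias: String
    let type: String
}

// Parses the "code point;alias;alias type" format shown above.
// Comment lines (starting with "#") and malformed lines are skipped.
func parseNameAliases(_ text: String) -> [NameAliasEntry] {
    var entries: [NameAliasEntry] = []
    // Splitting on newlines also drops empty lines.
    for line in text.split(whereSeparator: \.isNewline) {
        if line.first == "#" { continue }
        let fields = line.split(separator: ";", omittingEmptySubsequences: false)
        guard fields.count == 3,
              let codePoint = UInt32(fields[0], radix: 16) else {
            continue
        }
        entries.append(NameAliasEntry(codePoint: codePoint,
                                      alias: String(fields[1]),
                                      type: String(fields[2])))
    }
    return entries
}

let sample = """
# Format: code point;alias;alias type
0009;CHARACTER TABULATION;control
0009;HT;abbreviation
000A;LINE FEED;control
"""
let entries = parseNameAliases(sample)
for entry in entries {
    print(entry.codePoint, entry.alias, entry.type)
}
```

Keeping the alias type as a plain string here means an alias type added in a future Unicode version would still parse.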

The stdlib currently only considers 'correction' aliases and ignores the rest

Of the 5 possible alias types defined by the UCD, our parsing code, which translates NameAliases.txt into the tables that ultimately feed Unicode.Scalar.Properties, only considers the 'correction' type and ignores the other 4 types:

…
// Name aliases are only found with correction attribute.
guard components[2] == "correction" else {
  continue
}
…
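
The effect is easy to observe from user code. The following sketch contrasts a scalar that has a 'correction' alias with one that only has other alias types. U+FE18 is the example used in the nameAlias documentation (its official name misspells "BRACKET" as "BRAKCET"); the nil result for tab is the behavior described above and is assumed to hold for the stdlib version you run this on.

```swift
// U+FE18 has a 'correction' alias in NameAliases.txt, so nameAlias is non-nil.
let bracket = Unicode.Scalar(0xFE18 as UInt32)!
print(bracket.properties.name ?? "nil")
print(bracket.properties.nameAlias ?? "nil")

// U+0009 only has 'control' and 'abbreviation' aliases,
// which the current table generation skips.
let tab = Unicode.Scalar(0x9 as UInt32)!
print(tab.properties.nameAlias ?? "nil")
```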

Proposal 1: Take all alias types into account

Should we change this and take all 5 alias types into account? I think yes, or at least I don't see a good reason why we shouldn't do this. At the very least, it would provide a usable nameAlias value for the frequently used ASCII control characters.

Given that nameAlias is of type String? and not an array of strings (or of string/aliasType pairs), we would need to prioritize which alias to pick if a scalar has multiple.

We could order the list of alias types by priority (say, from highest to lowest: correction, alternate, control, figment, abbreviation). And if a scalar has multiple aliases under the same type, we pick the first one in the list. By prioritizing 'correction' highest, we would be preserving the nameAlias value for any scalar that currently has a non-nil value.
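
As a rough sketch of that selection rule (not a proposed implementation), the following declares a hypothetical AliasType enum in the suggested priority order and picks the first alias of the highest-priority type present. The names AliasType and preferredAlias are made up for this example.

```swift
// Hypothetical alias types, declared in priority order (highest first).
// CaseIterable's allCases follows declaration order.
enum AliasType: String, CaseIterable {
    case correction, alternate, control, figment, abbreviation
}

// Picks a single alias: the highest-priority type wins, and within
// a type the first entry in list order wins.
func preferredAlias(from aliases: [(alias: String, type: AliasType)]) -> String? {
    for type in AliasType.allCases {
        if let match = aliases.first(where: { $0.type == type }) {
            return match.alias
        }
    }
    return nil
}

// The aliases for U+0009, in the order they appear in NameAliases.txt.
let tabAliases: [(alias: String, type: AliasType)] = [
    ("CHARACTER TABULATION", .control),
    ("HORIZONTAL TABULATION", .control),
    ("HT", .abbreviation),
    ("TAB", .abbreviation),
]
let chosen = preferredAlias(from: tabAliases)
print(chosen ?? "nil") // prints "CHARACTER TABULATION"
```

Because 'correction' is checked first, any scalar that has a correction alias resolves to that alias regardless of what else is listed for it.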

Proposal 2: Add a nameAliases property

An even bigger change on top of proposal 1 (not instead of it) would be to add a new property that lists all official aliases. Something like this:

extension Unicode.Scalar.Properties {
    public var nameAliases: [(alias: String, type: Unicode.Scalar.AliasType)] { get }
}

extension Unicode.Scalar {
    // Not @frozen because Unicode might add another alias type
    public enum AliasType: String {
        case correction
        case control
        case alternate
        case figment
        case abbreviation
    }
}

What do you think?


The Swift documentation for nameAlias (accurately) only describes providing corrections:

The nameAlias property is provided to issue corrections if a name was issued erroneously.

Granted, it doesn't formalistically say that nameAlias is nil if and only if there has been no correction, but any reading of the documentation other than a strictly mathematical one strongly suggests as much.

If we want to be clearer about it, I think we could rename this to correctionNameAlias going forward, but I'm always leery of silently changing documented behavior, Hyrum's law and all.


nameAliases seems fine :)


I’d argue that the documentation (and the implementation) don’t accurately reflect the accepted proposal SE-0211: Unicode Scalar Properties, which has this to say about nameAlias:

extension Unicode.Scalar.Properties {
  /// Corresponds to the `Name_Alias` Unicode property.
  public var nameAlias: String? { get }
}

And the Name_Alias property in Unicode is defined as all aliases listed in NameAliases.txt, not just corrections:

Name_Alias: Normative formal aliases for characters with erroneous names, for control characters and some format characters, and for character abbreviations, as described in Chapter 4, Character Properties in [Unicode]. Aliases tagged with the type "correction", as well as a selection of aliases of other types, are published in the Unicode Standard code charts.

(The last sentence of that quote shows some bias for corrections, but notably, the aliases for control characters are also among those listed in the Unicode code charts, e.g. page 3 in https://www.unicode.org/charts/PDF/U0000.pdf.)

I searched through the pitch and review threads for SE-0211 to check if the semantics of the nameAlias property were discussed back then, but I didn’t find anything significant.


Yup, for sure: the API that was shipped doesn't exactly align with the accepted proposal, but it has shipped that way for over half a decade now, so the proposal would have to be "shrunk to fit" reality.


Speaking personally here, I would very much like to not introduce more Unicode data in the standard library. My own personal opinion is that SE-0211: Unicode Scalar Properties was an ok idea at the time when it could just reach into ICU to grab the data out, but since we've moved on from ICU in the stdlib, this data has become a pain point with regards to efforts like embedded Swift and statically linking executables in server environments. That's why I'm pretty hesitant to add new Unicode properties to this structure or provide data that didn't already exist. I would much prefer a solution that involved some separate Unicode/Internationalization module in the toolchain or to add this sort of stuff to FoundationInternationalization if it makes sense there too.


As the author of SE-0211, I agree completely with this.

As you said, back then querying any part of ICU was "free" except for the API surface it would add to the stdlib, and it was the only publicly evolvable option available. I don't think we were shipping any other side-car modules with the stdlib at the time where it would have made sense to make it an optional dependency.

With Foundation now having a public evolution process, I could imagine a path forward being to add new extensions to Unicode.Scalar.Properties to FoundationInternationalization, where they can call into that framework's built-in copy of ICU. That would minimize the APIs feeling scattered—you just add an import and then you get new features added to the existing types.


I'd considered pitching Unicode.Script, Unicode.Block, and perhaps other missing APIs. They're needed for Regex, so FoundationInternationalization wouldn't be suitable.

Could the existing APIs and data be moved to the _StringProcessing module?

Would InlineArray with @constInitialized or @section allow implementation of the Unicode data in Swift (instead of C headers)?
