The stdlib's Unicode.Scalar.Properties.nameAlias property returns nil for some Unicode scalars even though Unicode does publish one or more official name aliases for them. Among other Unicode scalars, this affects the common ASCII control characters 0–31.
Example
For example, take the Unicode scalars U+0009 (tab) and U+000A (line feed):
let tab = Unicode.Scalar(0x9)!
tab.properties.name // nil (correct!)
tab.properties.nameAlias // nil ❌
let lineFeed = Unicode.Scalar(0xA)!
lineFeed.properties.name // nil (correct!)
lineFeed.properties.nameAlias // nil ❌
Observations:
- The fact that these scalars don't have a name is correct: the "Name" field for these (and all other ASCII 0–31 values) in the Unicode Character Database (UCD) is blank.
- The fact that nameAlias is nil is arguably wrong. These scalars do have one or more official name aliases, as listed in the UCD file NameAliases.txt.
Name aliases in the Unicode Character Database
Excerpt from NameAliases.txt for tab (U+0009) and line feed (U+000A):
# semicolon-separated fields
# Format: code point;alias;alias type
0009;CHARACTER TABULATION;control
0009;HORIZONTAL TABULATION;control
0009;HT;abbreviation
0009;TAB;abbreviation
000A;LINE FEED;control
000A;NEW LINE;control
000A;END OF LINE;control
000A;LF;abbreviation
000A;NL;abbreviation
000A;EOL;abbreviation
As you can see, these code points have multiple aliases with different types, in this case 'control' ("ISO 6429 names for C0 and C1 control functions, and other commonly occurring names for control codes") and 'abbreviation' ("Commonly occurring abbreviations (or acronyms) for control codes, format characters, spaces, and variation selectors").
NameAliases.txt defines 5 possible alias types in total: correction, control, alternate, figment, abbreviation.
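To make the file format concrete, here is a minimal sketch of how the three semicolon-separated fields could be read. The names NameAliasEntry and parseNameAliases are made up for illustration; this is not the stdlib's actual parsing code, and it assumes lines carry no trailing comments or padding.

```swift
// Hypothetical helper type for illustration; not part of the stdlib.
struct NameAliasEntry: Equatable {
    let codePoint: UInt32
    let alias: String
    let type: String
}

/// Parses lines in the `code point;alias;alias type` format.
/// Comment lines (starting with "#") are skipped, as are lines
/// that don't have exactly three fields or a valid hex code point.
func parseNameAliases(_ text: String) -> [NameAliasEntry] {
    var entries: [NameAliasEntry] = []
    // split(whereSeparator:) drops empty lines by default.
    for line in text.split(whereSeparator: \.isNewline) {
        if line.hasPrefix("#") { continue }
        let components = line.split(separator: ";", omittingEmptySubsequences: false)
        guard components.count == 3,
              let codePoint = UInt32(components[0], radix: 16) else {
            continue
        }
        entries.append(NameAliasEntry(
            codePoint: codePoint,
            alias: String(components[1]),
            type: String(components[2])))
    }
    return entries
}

let excerpt = """
# Format: code point;alias;alias type
0009;CHARACTER TABULATION;control
0009;HT;abbreviation
000A;LINE FEED;control
"""
let entries = parseNameAliases(excerpt)
```

Keeping the alias type as a plain string here avoids committing to a fixed set of types, since the UCD could add more in a future version.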
The stdlib currently only considers 'correction' aliases and ignores the rest
Of the 5 possible alias types defined by the UCD, our parsing code that translates NameAliases.txt into the tables that ultimately feed Unicode.Scalar.Properties only considers the 'correction' type and ignores the other 4 types:
…
// Name aliases are only found with correction attribute.
guard components[2] == "correction" else {
  continue
}
…
Proposal 1: Take all alias types into account
Should we change this and take all 5 alias types into account? I think yes, or at least I don't see a good reason not to. At the very least, it would provide a usable nameAlias value for the frequently used ASCII control characters.
Given that nameAlias is of type String? and not an array of strings (or of string/aliasType pairs), we would need to prioritize which alias to pick if a scalar has multiple.
We could order the list of alias types by priority (say, from highest to lowest: correction, alternate, control, figment, abbreviation). If a scalar has multiple aliases of the same type, we pick the first one in the list. By prioritizing 'correction' highest, we would preserve the nameAlias value for any scalar that currently has a non-nil value.
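The selection rule could be sketched roughly like this. AliasType and preferredAlias(from:) are hypothetical names used only for this example, and the priority order is the one suggested above, not something the UCD prescribes.

```swift
// Hypothetical stand-in type for illustration only; not existing stdlib API.
enum AliasType: Int, Comparable {
    // Raw values encode the example priority: lower value = higher priority.
    case correction = 0, alternate, control, figment, abbreviation

    static func < (lhs: AliasType, rhs: AliasType) -> Bool {
        lhs.rawValue < rhs.rawValue
    }
}

/// Returns the alias whose type has the highest priority.
/// If several aliases share that type, the first one in the list wins.
func preferredAlias(from aliases: [(alias: String, type: AliasType)]) -> String? {
    var best: (alias: String, type: AliasType)? = nil
    for candidate in aliases {
        // Keep the current pick unless the candidate has a strictly higher priority.
        if let current = best, current.type <= candidate.type { continue }
        best = candidate
    }
    return best?.alias
}

// Aliases for U+0009 and U+000A, in the order they appear in NameAliases.txt.
let tabAliases: [(alias: String, type: AliasType)] = [
    (alias: "CHARACTER TABULATION", type: .control),
    (alias: "HORIZONTAL TABULATION", type: .control),
    (alias: "HT", type: .abbreviation),
    (alias: "TAB", type: .abbreviation),
]
let lineFeedAliases: [(alias: String, type: AliasType)] = [
    (alias: "LINE FEED", type: .control),
    (alias: "NEW LINE", type: .control),
    (alias: "END OF LINE", type: .control),
    (alias: "LF", type: .abbreviation),
    (alias: "NL", type: .abbreviation),
    (alias: "EOL", type: .abbreviation),
]

let tabAlias = preferredAlias(from: tabAliases)           // "CHARACTER TABULATION"
let lineFeedAlias = preferredAlias(from: lineFeedAliases) // "LINE FEED"
```

With this rule, tab and line feed would get "CHARACTER TABULATION" and "LINE FEED" respectively, and a scalar with no aliases would still yield nil.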
Proposal 2: Add a nameAliases property
An even bigger change on top of proposal 1 (not instead of it) would be to add a new property that lists all official aliases. Something like this:
extension Unicode.Scalar.Properties {
  public var nameAliases: [(alias: String, type: Unicode.Scalar.AliasType)] { get }
}
extension Unicode.Scalar {
  // Not @frozen because Unicode might add another alias type
  public enum AliasType: String {
    case correction
    case control
    case alternate
    case figment
    case abbreviation
  }
}
What do you think?