Unicode Enthusiasts Unite

Hello all, I thought I’d share some thoughts on future API directions relating to Unicode.

The standard library provides lots of functionality through things like the Unicode namespace and Unicode Scalar Properties. String should continue to expand on this to deliver functionality to those enthusiastic developers who interact with Unicode details on a frequent basis.

CC (@allevato, @xwu)

Normalized Scalar Views

Briefly discussed here, String.UnicodeScalarView should expose properties providing lazily-normalized views for NFC, NFD, NFKC, NFKD, and maybe FCC. We should also have a “Swift canonical form”, which is the unspecified form that’s used for efficient string comparisons, as well as ways to force a string into that normal form.

String’s implementation of comparison, in a fall-back path, performs a lazy normalization of a sliding window into NFC. Unfortunately, we pay some overhead in how we interact with ICU (e.g. transcoding). A native Swift implementation of normalization algorithms would greatly improve, and simplify, comparison’s implementation. This would also improve Swift’s portability by reducing one of the main ways the standard library currently depends on ICU.
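To make the motivation concrete, here is a minimal sketch of what normalized scalar views could feel like. The names `nfcScalars` and `nfdScalars` are hypothetical, not proposed spellings, and this stand-in normalizes eagerly through Foundation rather than lazily in native Swift as the post suggests.

```swift
import Foundation

// Two spellings of "café": precomposed U+00E9 versus "e" followed by U+0301.
let precomposed = "caf\u{E9}"
let decomposed = "cafe\u{301}"

// String comparison already honors canonical equivalence...
let equalAsStrings = (precomposed == decomposed)  // true

// ...but the raw scalar views differ, which is the gap normalized views would fill.
let equalAsScalars = precomposed.unicodeScalars
  .elementsEqual(decomposed.unicodeScalars)       // false

extension String {
  /// Hypothetical stand-in for an NFC view (eager, Foundation-backed).
  var nfcScalars: String.UnicodeScalarView {
    return precomposedStringWithCanonicalMapping.unicodeScalars
  }

  /// Hypothetical stand-in for an NFD view (eager, Foundation-backed).
  var nfdScalars: String.UnicodeScalarView {
    return decomposedStringWithCanonicalMapping.unicodeScalars
  }
}

// After normalizing both sides to the same form, the scalar sequences agree.
let sameNFC = precomposed.nfcScalars.elementsEqual(decomposed.nfcScalars)
let sameNFD = precomposed.nfdScalars.elementsEqual(decomposed.nfdScalars)
```

A lazy view would avoid allocating the intermediate normalized string, which matters when a comparison can stop at the first differing scalar.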

Emoji Analysis

Whether a Character is or is not an emoji is actually complex and environment-dependent. For this reason emoji-analysis was postponed from SE-0221: Character Properties. We should revisit this, watching newer Unicode versions and perhaps consulting with the Unicode Consortium to understand how we can carefully balance usability with source-stability.
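As a small illustration of why this is subtle, the existing scalar properties from SE-0211 already disagree with intuition in some cases. The `looksLikeEmoji` helper below is a hypothetical heuristic written for this example only; it is not a proposed API and it ignores rendering environment, fonts, and many sequence kinds.

```swift
let digit: Unicode.Scalar = "3"
let heart: Unicode.Scalar = "\u{2764}"   // HEAVY BLACK HEART
let grin: Unicode.Scalar = "\u{1F600}"   // GRINNING FACE

// ASCII digits carry the Emoji property because they can start keycap sequences.
let digitIsEmoji = digit.properties.isEmoji                      // true
let digitDefaultsToEmoji = digit.properties.isEmojiPresentation  // false

// The heart is an emoji scalar but defaults to text presentation.
let heartDefaultsToEmoji = heart.properties.isEmojiPresentation  // false
let grinDefaultsToEmoji = grin.properties.isEmojiPresentation    // true

/// Hypothetical heuristic: a character "looks like" an emoji if its first
/// scalar defaults to emoji presentation, or if it is an emoji scalar that is
/// explicitly followed by the emoji variation selector U+FE0F.
func looksLikeEmoji(_ character: Character) -> Bool {
  let scalars = character.unicodeScalars
  guard let first = scalars.first else { return false }
  if first.properties.isEmojiPresentation { return true }
  return first.properties.isEmoji && scalars.contains("\u{FE0F}")
}
```

Even this heuristic says nothing about whether the platform can actually display a given sequence as a single glyph, which is part of what makes the question environment-dependent.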

Bidirectional Properties

Unicode defines Bidirectional Class Values for scalars, and dictates how applications can use them to display text correctly. We can follow up on the work in SE-0211 to add these.

extension Unicode {
  /// The bidirectional classification of a Unicode scalar.
  ///
  /// This classification is used for presenting directionality of text by the
  /// [Unicode Standard](https://www.unicode.org/reports/tr9/#Bidirectional_Character_Types)
  public enum BidirectionalClass {
    /// A strong left-to-right character.
    ///
    /// The value corresponds to the category `Left_To_Right` (abbreviated `L`) in the
    /// [Unicode Standard](https://unicode.org/reports/tr44/#Bidi_Class_Values).
    case leftToRight

    ...
  }
}

extension Unicode.Scalar {
  /// The bidirectional class of the scalar.
  ///
  /// This property corresponds to the "Bidi_Class" property in the
  /// [Unicode Standard](http://www.unicode.org/versions/latest/).
  public var bidiClass: BidirectionalClass { get } 
}
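To show how client code might use such a property, here is a toy sketch. `ToyBidiClass` and `toyBidiClass(of:)` are hypothetical stand-ins with a hard-coded table covering only a handful of ranges (ASCII letters and digits, Hebrew letters, some Arabic letters); a real implementation would be driven by the full Unicode Character Database.

```swift
/// Toy subset of the bidirectional classes; not the proposed enum.
enum ToyBidiClass {
  case leftToRight     // L
  case rightToLeft     // R
  case arabicLetter    // AL
  case europeanNumber  // EN
  case unknown         // everything this toy table does not cover
}

/// Hypothetical lookup over a tiny hard-coded table, for illustration only.
func toyBidiClass(of scalar: Unicode.Scalar) -> ToyBidiClass {
  switch scalar.value {
  case 0x41...0x5A, 0x61...0x7A: return .leftToRight     // ASCII letters
  case 0x30...0x39:              return .europeanNumber  // ASCII digits
  case 0x05D0...0x05EA:          return .rightToLeft     // Hebrew letters
  case 0x0621...0x063A:          return .arabicLetter    // some Arabic letters
  default:                       return .unknown
  }
}

/// Returns true if any scalar has a strong right-to-left class in the toy table.
func containsRightToLeftText(_ string: String) -> Bool {
  return string.unicodeScalars.contains { scalar in
    let bidiClass = toyBidiClass(of: scalar)
    return bidiClass == .rightToLeft || bidiClass == .arabicLetter
  }
}
```

With a real `bidiClass` property, a check like `containsRightToLeftText` would no longer need its own table.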

Current version

We should expose a way to get the run-time version of Unicode that’s available.

extension Unicode {
  public static var currentVersion: Unicode.Version { get }
}
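One way such a version could be used is together with the existing `age` scalar property from SE-0211. In the sketch below, `assumedRuntimeVersion` is a placeholder value chosen for illustration (the real property does not exist yet), and `isAssigned(_:asOf:)` is a hypothetical helper.

```swift
// Placeholder standing in for the proposed run-time version; the value is an
// assumption for this example, not something queried from the system.
let assumedRuntimeVersion: Unicode.Version = (major: 11, minor: 0)

/// Hypothetical helper: true if the scalar was assigned in `version` or
/// earlier, according to the run-time's own `age` data.
func isAssigned(_ scalar: Unicode.Scalar, asOf version: Unicode.Version) -> Bool {
  // `age` is nil for scalars the run-time considers unassigned.
  guard let age = scalar.properties.age else { return false }
  return (age.major, age.minor) <= (version.major, version.minor)
}

let euro: Unicode.Scalar = "\u{20AC}"  // EURO SIGN, added in Unicode 2.1
let euroInUnicode2_0 = isAssigned(euro, asOf: (major: 2, minor: 0))
let euroAtRuntime = isAssigned(euro, asOf: assumedRuntimeVersion)
```

Knowing the run-time version would let code distinguish "unassigned in this Unicode version" from "unassigned in every version so far".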

SIMD-accelerated decoding and analysis

Pending more SIMD feature work (CC @scanon), the standard library should expose more of its internal decoding and analysis functionality.

This includes things such as classifying a code unit, accelerating length calculations between UTF-8 and UTF-16, transcoding, etc.
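For reference, here is a plain scalar (non-SIMD) sketch of two of these operations: classifying a UTF-8 code unit and computing the UTF-16 length of valid UTF-8 without transcoding. The type and function names are made up for this example and say nothing about how the standard library implements this internally.

```swift
/// Hypothetical classification of a single UTF-8 code unit.
enum UTF8CodeUnitKind {
  case ascii, continuation, leadOf2, leadOf3, leadOf4, invalid
}

func classify(_ byte: UInt8) -> UTF8CodeUnitKind {
  switch byte {
  case 0x00...0x7F: return .ascii
  case 0x80...0xBF: return .continuation
  case 0xC2...0xDF: return .leadOf2
  case 0xE0...0xEF: return .leadOf3
  case 0xF0...0xF4: return .leadOf4
  default:          return .invalid  // 0xC0, 0xC1, 0xF5...0xFF never appear
  }
}

/// UTF-16 length of input that is assumed to be valid UTF-8. Scalars encoded
/// in 1 to 3 bytes take one UTF-16 code unit; 4-byte scalars take a surrogate
/// pair. Continuation bytes contribute nothing.
func utf16Length<S: Sequence>(ofValidUTF8 bytes: S) -> Int where S.Element == UInt8 {
  var count = 0
  for byte in bytes {
    switch classify(byte) {
    case .ascii, .leadOf2, .leadOf3: count += 1
    case .leadOf4:                   count += 2
    case .continuation, .invalid:    break
    }
  }
  return count
}

// 1-, 2-, 3-, and 4-byte scalars: "a", "é", "€", and U+1F600.
let sample = "a\u{E9}\u{20AC}\u{1F600}"
let computedLength = utf16Length(ofValidUTF8: sample.utf8)
```

Because each byte is handled independently, this loop is the kind of code that maps naturally onto SIMD lanes.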

Out-of-scope for now: locale

Locale is currently considered out of scope for the standard library. Even simple operations such as determining the current user’s locale can depend on higher-level layers of the OS. Baking in any notion of locale at this stage could limit Swift’s applicability to systems programming.

Conventions for dealing with localized content are best left to the platform (e.g. Cocoa).

I see your point about using views for lazy normalization. However the final design of the APIs for making strings contiguous UTF-8 ends up, I think we should seriously consider emulating it here so that users can choose to “make strings NFC” in the same way.

I realize now that I've had a PR open for a while that implements this: "[stdlib] Add supportedVersion property to Unicode namespace" (apple/swift Pull Request #18180, by allevato).

It looks like we needed to decide whether it, being a single property, needed to go through Evolution. It probably makes sense to bundle it with other related APIs so that we have something with enough "meat" to be proposal-worthy.

Is moving Swift off of the ICU dependency (partially or entirely) in scope for this effort?

This post is talking about APIs. But in my view, moving Swift off of the ICU dependency (partially or entirely) is always in scope for the standard library.

Wouldn't that just be something like let myNFKDStr = String(myStr.unicodeScalars.nfkd)? That is, a String init taking a Collection of Unicode.Scalars.

edit: I misread your post, I thought you were bringing up the eager vs lazy debate. Yes, I agree that we are likely to add some kind of "put me in the best representation" API or initializer, which would include contiguous-immediate and NFC as well as setting any performance flags that we can.

I forgot to mention default case folding (for case-insensitive comparisons), à la "String case folding and normalization APIs".

@Michael_Ilseman sorry to bump this after such a long time, but one thing isn’t entirely clear to me: has it been decided that we will eventually bundle Unicode data (e.g. for normalisation) with the standard library and cut the dependence on the platform’s ICU? Or is it at least a realistic option that the core team would consider?

I've been looking at this (particularly normalisation and case-folding) as something which I might try to implement once a few other projects settle down. I'm planning to implement IDNA transformations anyway, and that requires custom mapping tables defined by Unicode in a way which is similar to other normalisation algorithms. Given that I'm going to be doing a lot of that work anyway, I would prefer a full data + algorithms approach, but I'm not sure if that would be acceptable.

Looking at Tony Allevato's PR for adding this, it occurred to me that Swift Package Manager already has a full-featured Version struct that represents a semantic version.

Since semantic versioning is in widespread use including SPM, would it make sense to have a SemanticVersion type in the standard library? (Especially since the standard library will likely have at least one nested in Unicode)
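As a rough idea of the minimum such a type would need, here is a sketch of a comparable version struct. It is hypothetical and deliberately simplified: it is not the Swift Package Manager type, and it leaves out pre-release identifiers and build metadata.

```swift
/// Hypothetical minimal semantic version (major.minor.patch only).
struct SemanticVersion: Comparable, CustomStringConvertible {
  var major: Int
  var minor: Int
  var patch: Int

  init(_ major: Int, _ minor: Int, _ patch: Int = 0) {
    self.major = major
    self.minor = minor
    self.patch = patch
  }

  /// Orders versions lexicographically by major, then minor, then patch.
  static func < (lhs: SemanticVersion, rhs: SemanticVersion) -> Bool {
    return (lhs.major, lhs.minor, lhs.patch) < (rhs.major, rhs.minor, rhs.patch)
  }

  var description: String {
    return "\(major).\(minor).\(patch)"
  }
}

// A Unicode version such as 11.0 would fit with the patch defaulting to zero.
let unicodeEleven = SemanticVersion(11, 0)
let unicodeTwelveOne = SemanticVersion(12, 1)
let isOlder = unicodeEleven < unicodeTwelveOne
```

Whether Unicode versions should be modeled as semantic versions at all is a separate question; the sketch only shows that the shape is compatible.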
