[Pitch] [Embedded] "Unicode" availability domain for APIs requiring the Unicode tables

Hi all,

In Embedded Swift, we've found that it can be pretty hard to make sure that your code reliably stays within the subset of the standard library that avoids use of the large (50kb-500kb+) Unicode tables. Specifically, accidentally using one of those APIs results in an obscure linker failure about missing "swift_stdlib" symbols that can be very far from the actual code you wrote that, e.g., tries to use == on two strings.

To address this, I have a pitch and implementation that introduces a custom availability domain named Unicode to describe all APIs that need those Unicode tables. Essentially, one annotates those APIs (and anything that uses them) as @available(Unicode), like we're used to from platform availability. Then, there is a compiler flag to remove all code that is marked @available(Unicode) wholesale, including the Unicode tables, so we can be sure it isn't in the final binary.

Proposal is here if you'd like to learn more. I'm not confident in all of the technical decisions in there and would love more eyes on it.

Doug

28 Likes

You inspired me to see where this could apply in my libraries...

In Swift-MMIO, I created an opt-in feature for building with support for interposing MMIO ops. e.g. the user can dependency inject their own load/store functions. I added this feature so users could test device drivers without needing hardware.

Documentation on the feature:

Aside, this feature is even used to test parts of Swift-MMIO itself, like the LLDB plugin ReadCommandTests.swift:

"""
m[0x0000_0000_0000_1012] -> 0x7a7e_cbd9
m[0x0000_0000_0000_1004] -> 0xae64
m[0x0000_0000_0000_1000] -> 0x6204_b303
"""

When building MMIO normally, MMIOInterposer is marked as deprecated, but still visible and the APIs to operate on the interposer are all no-ops.

  #if !FEATURE_INTERPOSABLE
  @available(
    *, deprecated, message: "Define FEATURE_INTERPOSABLE to enable interposers."
  )
  #endif
  public var interposer: (any MMIOInterposer)? {
    @inlinable @inline(__always) get {
      #if FEATURE_INTERPOSABLE
      self._interposer
      #else
      nil
      #endif
    }
    @inlinable @inline(__always) set {
      #if FEATURE_INTERPOSABLE
      self._interposer = newValue
      #endif
    }
  }

  #if FEATURE_INTERPOSABLE
  @usableFromInline
  internal var _interposer: (any MMIOInterposer)?
  #endif

I did this so that I could include MMIOInterposer in the standard doc catalog and provide nicer diagnostics about the feature not being available in "standard" mode without any runtime cost.

--

@tshortli / @Douglas_Gregor do you agree this feature would be better modeled as an availability domain for Interposable?

My goal:

  • No runtime cost when disabled
  • Compile time diagnostics indicating the feature needs to be enabled
  • Documentation shows the availability domain
  • Simple to enable in SwiftPM

It's not entirely clear to me how this would be functionally different from spelling it hasFeature(Unicode). Can you explain why that existing syntax would be insufficient?

For the same reason I'd rather have @available on a function decl for a given OS or platform than it being completely hidden with #if os(YourFavoriteOS). #if hasFeature completely excludes (or includes if I build with it enabled) declarations from symbol graph (and DocC bundles as a consequence) and leads to confusing diagnostic messages.
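A rough illustration of that difference, using ordinary platform availability since custom availability domains are still experimental. The function names here are made up for the example; only the #if os(...), @available, and #available forms are existing Swift.

```swift
// Declared only when compiling for Linux. On every other platform this
// function does not exist at all: no symbol graph entry, no docs, and a
// "cannot find in scope" error if someone calls it.
#if os(Linux)
func linuxOnlyHelper() -> String { "linux" }
#endif

// Always declared and type checked, and always visible to documentation
// tooling. Callers on older OS versions get an availability diagnostic
// instead of a missing-symbol error.
@available(macOS 13, iOS 16, *)
func newerAPI() -> String { "newer" }

// Hypothetical caller that adapts to availability at the use site.
func describe() -> String {
    if #available(macOS 13, iOS 16, *) {
        return newerAPI()
    } else {
        return "fallback"
    }
}
```

The pitch would presumably let @available(Unicode) behave like the second form rather than the first.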

3 Likes

Yes, I think availability domains would work well for conditionally enabling Swift-MMIO's interposition APIs.

When an availability domain is disabled at compile time, no code is generated for the declarations that become unavailable as a result, and if #available branches are constant-evaluated. This gives you stronger guarantees about the runtime and code size impact with availability versus the approach of deprecating the interposition APIs.

Yes, the availability checker would ensure that developers get clear diagnostics when using Interposable APIs in contexts where interposition isn't available and prompt the developer to adapt the code or build settings accordingly.

I'd expect the @available attributes referencing custom availability domains to be rendered in documentation the same way that platform availability is today. There's probably some implementation work remaining to accomplish this but it seems straightforward to me conceptually.

How the availability domain's state is controlled is an interesting problem. An availability domain models a condition that must be globally consistent within a program, both at compile time and runtime. In other words, if #available(Domain) branches must compile to the same code everywhere and linkable symbols must either be present or absent consistently with respect to the state of the domain (enabled, disabled, or conditional).

The simplest design I can think of that achieves the requirements is for availability domains to be both declared and given a visible definition in the build settings for a single module in the build graph, and for dependent compilations to check the definition of the domain when referenced. This is really the only way it can work with separately compiled binaries - the library that declares an availability domain is compiled with the domain configured in one way, and all of its clients must use the same definition for the domain.

The theoretical Unicode and Interposable availability domains, on the other hand, seem to be designed to offer behavioral customization to clients of a library, similar to package traits. That could work when compiling for Embedded Swift, since all of the modules in a program that would use either the Unicode or Interposable domain would get built together from source and can share a definition provided by the overall build. When the Swift standard library is separately compiled and distributed, as it usually is for desktop Swift, I don't think customization of the Unicode domain can be offered, though. Unicode would be "enabled" in the desktop standard library build, and the client can really only choose whether Unicode is "enabled" or "always enabled" (the "always" modifier only controls diagnostics and doesn't need to be consistent in every module).

I think we can accommodate different compilation models and offer customization when it can be supported, but we'll need to be careful in the implementation to ensure that the tooling doesn't allow inconsistent states and diagnoses them well if they do occur. It does seem to me like availability domains that have definitions that can be customized by clients are really similar to package traits, so maybe there's an opportunity to use a similar design or somehow combine the two into complementary features.

1 Like

I'm not paying close attention right now, but I second tshortli's comparison to package traits. I can certainly imagine package authors wanting to provide better guidance when a trait is disabled, and the discussion for this new availability domain has already tripped the Zero-One-Many rule, in that in five posts people have already come up with at least two uses, with likely more to come. It does mean designing the general thing before solving the specific problem though.

3 Likes

I thought immediately of package traits as well, but decided there was no duplication. SwiftPM, or a package author, might implement a trait to enable or disable an availability domain. This is composability working as intended, allowing the language to be agnostic of the package manager and/or build system driving it.

1 Like

I agree with others that Unicode availability annotations seem like a special case of availability annotations for package traits (where the standard library provides different traits for different levels of Unicode tables).

As for how many types of Unicode availability we expose, I think we should have more granularity. If I understand correctly, if Unicode is available, both normalization (~50 KiB) and scalar properties (~500 KiB) are allowed. That’s a huge difference for something like a Web app using Embedded Swift. Imagine a basic web app where we’re getting the user’s email, comparing it to the previous response and sending it to the server. Checking for string equality is a common enough operation that we’re probably going to allow Unicode normalization in the app. But if we accidentally copy some code or import a library that makes subtle use of scalar properties, the binary size would jump by about 500 KiB. That’s detrimental for a web app. Instead of making developers go through a painful debugging session trying to figure out how to get their app’s binary size back down, I think exposing more availability options is crucial for DX. Also, we can still have an availability “macro” of sorts where if we need both normalization and scalar properties, we only need to specify

2 Likes

I think these are not separate toggles but nested ones. How often does one need scalar properties but not normalization?

1 Like

As part of my work with Embedded Swift for WebAssembly (ElementaryUI) I can confirm that the unicode problem is one of the biggest cliffs to accidentally drive off. Having a proper diagnostic at call-site is a VERY welcome improvement.

As described by Doug, understanding where you messed up based on a cryptic linker error is tough enough for someone deeply aware of the situation - but an almost impossible hurdle for some innocent developer just trying to get code to run in Embedded Swift. Thank you for tackling this!

From the web-app perspective: I agree with @filip-sakel in that the granularity may be more important than the proposal describes and worth the added complexity:
My best guess is that most people will want to pay 50K for String equating and hashing - but adding 500K is a very different story. I, for one, would think twice about using these APIs and look a bit harder for alternatives.

tl;dr: I love it, but I think 50K and 500K are different enough to warrant distinct availability.

6 Likes

Without derailing this pitch more into the discussion of availability domains and package traits, I do think there is some overlap in the functionality that both features provide; however, package traits intentionally compile out code that is guarded behind a trait. This allows, for example, SwiftPM to not resolve optional dependencies at all. IIRC, availability domains still require that the code type checks.

1 Like

I'd like to figure out how to make these features work together, because there are a lot of advantages to doing so. A package trait can be used to cause additional targets to get linked in, which nicely models what we want for the standard library: if you ask for the unicode tables via the trait, we depend on the target that includes those tables in the binary.

Spitballing a bit here, but if the answer for Embedded Swift were that Unicode was a trait on the standard library, then the standard library might end up with something like:

#if Unicode
@availabilityDomain(Unicode)
public var _unicodeDomain: Bool { true }
#else
@availabilityDomain(Unicode)
@const public var _unicodeDomain: Bool = false
#endif

i.e., we use the trait to decide between "can be available" and "is never available".

We would still need a separate way for a module to decide between "available" (one can only use Unicode APIs from inside @available(Unicode)) and "always available" (one can always use Unicode APIs).

We'd need some checking of how the Unicode domain is configured when importing a module:

  • A Unicode-never-available module can be imported by a Unicode-never-available module or a Unicode-available module
  • A Unicode-always-available module can be imported by a Unicode-always-available module
  • A Unicode-available module can be imported by anything
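To make sure I'm reading those three rules right, here is a small sketch of them as a predicate. The enum and function names are hypothetical, nothing like this exists in the compiler today; it only encodes the list above.

```swift
// Hypothetical model of how a module is configured for the Unicode domain.
enum UnicodeDomainMode {
    case neverAvailable   // Unicode APIs are never available
    case available        // usable only from inside @available(Unicode)
    case alwaysAvailable  // Unicode APIs can be used anywhere
}

/// Returns true when a module configured as `importer` may import a
/// module configured as `imported`, following the three rules above.
func canImport(_ imported: UnicodeDomainMode, into importer: UnicodeDomainMode) -> Bool {
    switch imported {
    case .available:
        // Can be imported by anything.
        return true
    case .neverAvailable:
        // Only by never-available or available importers.
        return importer == .neverAvailable || importer == .available
    case .alwaysAvailable:
        // Only by another always-available module.
        return importer == .alwaysAvailable
    }
}
```

Written this way, it's easy to see that "available" is the only configuration a prebuilt library can pick if it wants every client to be able to import it unconditionally.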

The standard library would be built in either configuration that can always be imported (Unicode-available or Unicode-never-available), which also means that we don't have to build it from a package for things to work: we can build with Unicode-available in the toolchain, and separately deal with linking in the Unicode tables (or not) based on whether the Unicode trait was provided.

The default for non-Embedded Swift would need to be that the trait is enabled and each module is set to "always available". Unfortunately, this does mean that for embedded, one would need to update each of the libraries you link to either "available" or "never-available" for the Unicode domain to work, making this a bottom-up rollout. We can probably be a little more lax about this domain specifically because the failure mode (a link error) isn't catastrophic.

FWIW, I don't think it's derailing the pitch at all. These features are solving overlapping problems and we should figure that out.

You are correct that availability domains still require that the code type checks. They move the "compiling out code" to a later place in the compilation pipeline: if you always-disable an availability domain, we type check but don't emit any code for anything that has that availability. For an optional feature that doesn't involve dependencies, I think that's a better user experience, because you get the same diagnostics whether the feature is enabled or not, without having to compile twice. But when there are dependencies---say, a module you can import or not---you need the #if that traits provide.

Yeah, with such a huge difference in code size cost between the two, separating them into UnicodeNormalization and UnicodeScalarProperties seems like the best course of action. That 500kb is big enough that one might even want to avoid linking those libraries in a non-Embedded static binary build (e.g., with the static Linux SDK).

They feel orthogonal to me, but I don't have a strong sense of why. When I rework the pull request to separate the domains, we'll see how often they end up being tied together within the standard library itself.

Doug

2 Likes

Before pitching package traits, I had a discussion with @tshortli about the overlap of these two features, and IIRC, the one overlapping use-case was when a trait guards additional API that does not depend on a conditional dependency. This use-case can be better expressed with custom availability domains since the resulting developer experience is better. Now, an interesting direction here is hooking package traits up to enable custom availability domains so that users of a package can turn them on.

There are a few use-cases that only package traits can solve though:

  1. Conditional dependencies based on a trait
  2. Change of implementation behavior based on a trait

Thinking out loud a bit here. Isn't one part of the problem here that the standard library is pre-built; otherwise, we wouldn't need a way to check how a target is configured when building it? I thought that availability domains are set for the entire build graph. So they can either be enabled or disabled. This becomes problematic if a pre-built module from the toolchain wants to support this since it would need to be built in both configurations. If we were able to build the standard libraries from source as a package, this problem would go away, right? We would still need the three different modes, but those could be expressed like this:

  1. Always available: non-embedded Swift builds would pass -define-always-enabled-availability-domain Unicode by default
  2. Enabled availability: The standard library package would define a Unicode package trait which enables the domain. This trait is enabled by default.
  3. Disabled availability: Anything depending on the standard library would pass an empty or custom set of traits

Now, I know that building the standard library as a package isn't trivial and this would have far reaching impact but I just wanted to illustrate that in a pure package world this should be relatively easy to achieve.

but if the answer for Embedded Swift were that Unicode was a trait on the standard library

I don't understand how you would use a trait in the standard library without it being a package. A trait is a resolution/build time configuration expressed in the package graph, but the standard library currently doesn't participate in that graph.

So, IIUC, you’re proposing extending traits to basically be “use-aware”? The current Package Traits proposal says that traits should be additive to keep the dependency graph simple. With what you’re proposing, though, package users specify their preference for a given trait, but we ultimately build the package with a unified trait value. So a package (that e.g. uses the Standard Library) can say “I never want Unicode Normalization enabled” (disabled-availability), “I’m not using Unicode Normalization but other packages can” (enabled-availability), or “I’m using or reserve the right to use Unicode Normalization” (always-enabled-availability).

I like that this approach exposes a lot of information to the build graph and will hence allow for nice diagnostics when something inevitably goes wrong. It also supports our main use cases:

  1. Non-embedded apps and libraries can build normally,
  2. Embedded-app builds can throw an error if one of their dependencies uses UnicodeNormalization (either explicitly as an Embedded library, or implicitly if non-Embedded), and
  3. Embedded libraries can still (a) use Unicode if required, or (b) say they don’t care if another module does.

My main questions are (1) what should the defaults be, and (2) how to generalize this “mergeable” trait strategy?

On the question of defaults, I agree that non-Embedded apps and libraries should default to always-enabled. However, I think we should make a distinction between Embedded apps and libraries. Embedded apps are resource-constrained by definition, so I think we should conservatively default to disabled-availability. That is, if you accidentally import a library that may use Unicode into an Embedded app, you should get an error (or at least a warning) by default. As for Embedded libraries, they don’t know the demands of the final application, so they should default to enabled-availability if they don’t use Unicode themselves. As a result, both Embedded apps that use and don’t use Unicode can import said library without changing any build settings or traits.

So what does a general version of this feature look like? I think we could define something like a “mergeableTrait” where the package defining the trait still ultimately gets a boolean flag of enabled or disabled. However, upstream packages can choose “.set”, “.unset”, or “.reject” (we can bikeshed the names later). Then, we should also be able to create shorthands if we interpret UnicodeNormalization and UnicodeScalarProperties as separate traits instead of nested ones.
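A minimal sketch of how that merge could work, just to pin down the idea. Everything here is hypothetical: the type names, the package names, and in particular the choice to treat “.set” plus “.reject” as a conflict are my assumptions, not anything SwiftPM does today.

```swift
// Hypothetical per-package preference for a mergeable trait.
enum TraitPreference {
    case set     // "I use this feature, or reserve the right to"
    case unset   // "I don't use it, but others may"
    case reject  // "I never want this enabled"
}

// Hypothetical outcome of merging every package's preference.
enum MergeResult: Equatable {
    case enabled
    case disabled
    case conflict(setBy: [String], rejectedBy: [String])
}

/// Merges preferences keyed by package name into one value for the
/// package that declares the trait.
func mergeTrait(_ preferences: [String: TraitPreference]) -> MergeResult {
    // Sort the names so diagnostics come out in a stable order.
    let setters = preferences.filter { $0.value == .set }.keys.sorted()
    let rejecters = preferences.filter { $0.value == .reject }.keys.sorted()

    // Assumption: someone needing the trait while someone else forbids it
    // is an error the build should report, naming both sides.
    if !setters.isEmpty && !rejecters.isEmpty {
        return .conflict(setBy: setters, rejectedBy: rejecters)
    }
    return setters.isEmpty ? .disabled : .enabled
}
```

For example, an Embedded app that chooses .reject together with a dependency that chooses .set would produce a conflict naming both packages, which is the "nice diagnostics" case from earlier in this post.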

Even if we don’t build the stdlib from source, could we put all UnicodeNormalization and UnicodeScalar functionality under separate linker sections that makes them easy to strip out?

Thinking out loud here: incidentally, I recently did some work with traits to track some extra metadata for a trait’s enablers. For example, if a trait “MyTrait” for package “ChildPackage” was enabled by a parent package “Root”, then we would store this trait in the enabled traits map to look something like the following:

```
["ChildPackage": [EnabledTrait("MyTrait", setBy: [.package("Root")])]]
```

I think there could be an opportunity here to expand this model to consider upstream packages that can decide to make use of the enabled trait or if they choose to reject it.
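Roughly, the bookkeeping I have in mind looks like the sketch below. This is a simplified stand-in written for this post, not the actual SwiftPM types; TraitSetter, EnabledTraitsMap, and enable(...) are hypothetical names.

```swift
// Hypothetical: who asked for a trait to be enabled.
enum TraitSetter: Hashable {
    case package(String)
}

// Hypothetical: an enabled trait plus the set of packages that enabled it.
struct EnabledTrait: Hashable {
    let name: String
    var setBy: Set<TraitSetter>

    init(_ name: String, setBy: Set<TraitSetter>) {
        self.name = name
        self.setBy = setBy
    }
}

// Package name -> traits enabled on that package.
typealias EnabledTraitsMap = [String: [EnabledTrait]]

/// Records that `setter` enabled `trait` on `package`. Enabling is additive:
/// a second enabler is added to the existing entry rather than replacing it.
func enable(_ trait: String, for package: String, by setter: TraitSetter,
            in map: inout EnabledTraitsMap) {
    var traits = map[package, default: []]
    if let index = traits.firstIndex(where: { $0.name == trait }) {
        traits[index].setBy.insert(setter)
    } else {
        traits.append(EnabledTrait(trait, setBy: [setter]))
    }
    map[package] = traits
}
```

Tracking rejections would presumably need a second set alongside setBy, which is the part I haven't worked out.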

I’ll have to wrap my head around what the implementation for a “mergeableTrait” would look like here and how it would affect the resulting package graph; like you mentioned, traits are an additive concept so we may have to consider that if a trait is enabled at some point that it remains enabled within the context of the package that declares it. Perhaps we can introduce some flexibility with packages consuming this same package wherein they can decide if they’d ultimately like to use the features exposed by the enablement of a trait or not.

I’m not sure this will even work right out of the box, but if folks see some merit in this idea I can hack out a simple model as a proof of concept.

1 Like

That’s really cool! It’s good to hear that this idea of mergeable traits is plausible.

1 Like

Yes, exactly!

Availability domains don't have to be set for the entire build graph, because some mixing and matching is possible so long as you follow the rules I described in my earlier reply. For example, if the standard library is built with the unicode tables being "available", it would be fine for it to be used with client code with any configuration.

Yes. I don't think this necessarily accounts for the use case where you want to use enabled availability in a module for both embedded and non-embedded, to make sure your code doesn't have different behavior between the two modes.

It currently does not participate in the package graph. I think we should model it in the package graph, even when it is prebuilt, so we understand all of the dependencies. I also have a side project that's working toward building the embedded standard library as a package so that we help flush out some of the issues with doing this.

Yes, that's a good summary.

Those sound like reasonable defaults to me. It becomes more work to get a library building for embedded the first time (because you need to deal with UnicodeNormalization et al. availability), but the result will fit better into the embedded ecosystem once you've done so. It's not hard to do the availability work, so I think that's a good default.

I think I like this idea of generalizing the approach, because I really don't want to be doing one-off hacks for embedded traits. Mergeable traits seem like a reasonable potential solution... but I think I'd want to see more detail about how they affect the package graph and whether they have any use beyond the semantics we want for these availability domains.

I appreciate all of the discussion, folks. I feel like we're making progress toward a design that would make it easier to deal with these optional pieces. Beyond the Unicode tables, Embedded Swift has a couple things that are effectively optional features that can be supported when the underlying platform provides a hook to do so. For example, SystemRandomNumberGenerator and the things built on it require the platform to provide a random number generator (currently, arc4random). There's also print and the things that use it, which require standard output from the platform (currently, putchar). I could imagine adopting our solution here (whatever it is) for all of them.

Doug

4 Likes

I think a similar option for ICU availability would also be helpful for all non-Apple platforms. You sometimes want to use some APIs that are not in FoundationEssentials, like Process or Timer, and you are stuck importing FoundationInternationalization through importing Foundation. (Not to mention that a lot of common packages refuse to drop their Foundation imports when FoundationEssentials is available because it's a "breaking change".)