[Pitch] AttributedString UTF-8 and UTF-16 Views

jmschonfeld · December 3, 2024, 6:56pm

Hi all,

I'd like to propose new API for AttributedString to provide convenient access to the underlying text's UTF-8 and UTF-16 contents mirroring the UTF-8 and UTF-16 views that exist on String today. Let me know if you have any thoughts/comments/questions/concerns!

`AttributedString` UTF-8 and UTF-16 Views

Proposal: SF-NNNN
Authors: Jeremy Schonfeld
Review Manager: TBD
Status: Pitch
Implementation: swiftlang/swift-foundation#1066

Introduction/Motivation

In macOS 12-aligned releases, Foundation added the AttributedString type as a new API representing rich/attributed text. AttributedString itself is not a collection, but rather a type that offers various views into its contents where each view represents a Collection over a different type of element. Today, AttributedString offers three views: the character view (.characters) which provides a collection of grapheme clusters using the Character element type, the Unicode scalar view (.unicodeScalars) which provides a collection of Unicode.Scalars, and the attribute runs view (.runs) which provides a collection of attribute runs present across the text using the AttributedString.Runs.Run element type. These three views form the critical APIs required to interact with an AttributedString via its text (either at the visual, grapheme cluster level or the underlying scalar level) and its runs. However, more advanced use cases require other ways to view an AttributedString's text.

When working with the text content of an AttributedString, sometimes it is necessary to view not only the characters or Unicode scalars, but the underlying UTF-8 or UTF-16 contents that make up that text. This can be especially useful when interoperating with other types that use UTF-8 or UTF-16 code units as their currency types (for example, NSAttributedString and NSString, which use UTF-16 offsets and UTF-16 code units as their index and element types). Today, String itself has UTF-8 and UTF-16 views that can be used to perform these encoding-specific operations; however, AttributedString offers no equivalent. This proposal seeks to remedy this by adding equivalent UTF-8 and UTF-16 views to AttributedString, offering easy access to the encoded forms of the text.
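To make the difference between these views concrete, here is a small sketch using the views that already exist on String. It is only an illustration of why the same text has different lengths and offsets per encoding; the sample text and variable names are made up for this example, and it does not use the proposed AttributedString API.

```swift
// "café 👍" written with explicit scalars so the é is the precomposed U+00E9.
let text = "caf\u{E9} \u{1F44D}"

let characterCount = text.count                // grapheme clusters
let scalarCount = text.unicodeScalars.count    // Unicode scalars
let utf16Count = text.utf16.count              // UTF-16 code units (the emoji is a surrogate pair)
let utf8Count = text.utf8.count                // UTF-8 code units (é is 2 bytes, the emoji is 4)

// Turn a UTF-16 offset, such as one reported by an NSString-style API,
// into a String.Index, then measure the same position in UTF-8 code units.
let utf16Offset = 5  // five code units precede the emoji: c, a, f, é, space
let index = text.utf16.index(text.utf16.startIndex, offsetBy: utf16Offset)
let utf8Offset = text.utf8.distance(from: text.startIndex, to: index)

print(characterCount, scalarCount, utf16Count, utf8Count, utf8Offset)
// prints: 6 6 7 10 6
```

The same index can be measured against any view because String shares one index type across its views, which is the kind of conversion the proposed views would make available for attributed text.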

Proposed solution

Just like String, AttributedString will offer new, immutable UTF-8 and UTF-16 character views via the .utf8 and .utf16 properties. Developers will be able to use these new views like the following example:

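let attrStr = AttributedString("Hello, world")
// Any valid index into attrStr; here, the end of the text
let someOtherIndex = attrStr.endIndex

// Iterate over the UTF-8 code units
for codeUnit in attrStr.utf8 {
    print(codeUnit)
}

// Determine the UTF-8 offset of a particular index
let offset = attrStr.utf8.distance(from: attrStr.startIndex, to: someOtherIndex)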

For the detailed design and more in-depth info, check out the full proposal on the PR to the swift-foundation repo.

benrimmington · December 4, 2024, 10:27am

"All of these changes are additive and have no impact on source compatibility except for the addition to AttributedStringProtocol. The added requirements to AttributedStringProtocol are both source and ABI breaking changes for any clients that have types conforming to this protocol. However, as declared by AttributedStringProtocol's documentation, only Foundation is allowed to conform types to this protocol and other libraries outside of Foundation may not declare a conformance. Therefore, I feel that this is a suitable change to make as we will ensure that Foundation itself does not break and any clients that have declared conformances themselves are in violation of this type's API contract."

According to the library evolution blog post and documentation, new requirements can usually be added if they have a default implementation.

extension AttributedStringProtocol {

  @available(FoundationPreview 6.2, *)
  public var utf8: AttributedString.UTF8View {
    .init(unicodeScalars._guts, in: unicodeScalars._range)
  }

  @available(FoundationPreview 6.2, *)
  public var utf16: AttributedString.UTF16View {
    .init(unicodeScalars._guts, in: unicodeScalars._range)
  }
}
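
As a rough illustration of the general rule (not of Foundation's actual implementation), the sketch below uses a made-up protocol, TextContainer, with made-up conforming types. A requirement added later comes with a default implementation in an extension, so a conformer written before the addition keeps compiling, while a conformer that supplies its own implementation is still dispatched to.

```swift
// Hypothetical protocol, invented for this example.
protocol TextContainer {
    var text: String { get }
    // Imagine this requirement was added in a later release.
    var utf8Count: Int { get }
}

extension TextContainer {
    // Default implementation: satisfies the new requirement for existing conformers.
    var utf8Count: Int { text.utf8.count }
}

// Written before utf8Count existed; still conforms thanks to the default.
struct LegacyContainer: TextContainer {
    let text: String
}

// Supplies its own implementation, which takes precedence over the default.
struct CachingContainer: TextContainer {
    let text: String
    let cachedCount: Int
    var utf8Count: Int { cachedCount }
}

// Generic code goes through the protocol requirement, so each type's
// own implementation (or the default) is selected at run time.
func reportedCount<T: TextContainer>(_ container: T) -> Int {
    container.utf8Count
}

print(reportedCount(LegacyContainer(text: "abc")))                    // prints: 3
print(reportedCount(CachingContainer(text: "abc", cachedCount: 42)))  // prints: 42
```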

jmschonfeld · December 4, 2024, 5:30pm

Ah that's a good point. Originally I hadn't attempted this since I hadn't found a compatible way to get to the guts, but getting the _guts from the unicode scalar view would be a way to add a default implementation here. In practice it'd never be used since AttributedString/AttributedSubstring would provide their own implementation that just provides _guts directly, but this could at least eliminate the possibility of any ABI concerns to make this a bit "safer" to land.

jmschonfeld · December 4, 2024, 6:26pm

I've updated the proposal to account for this by adding default implementations for the protocol requirements as suggested. The Source Compatibility section now reads:

All of these changes are additive and have no impact on source compatibility. The added requirements to AttributedStringProtocol have provided default implementations and as such are not ABI/API breaking changes.

itingliu · December 12, 2024, 7:04pm

This seems like a very straightforward API addition. We are also following a very well-known pattern. I'd like to treat this pitch as an abbreviated review and accept it as is, since there are no outstanding questions.

Karl · December 12, 2024, 8:57pm

By the way, if you want to block public conformances to protocols, I believe this should work:

public protocol CanUseButNotConform {
  func doSomething()

  var _token: ConformanceToken<Self> { get }
}

public struct ConformanceToken<T> {
  internal init() {}
}

Essentially, it imposes a requirement, as part of the protocol, that the code implementing the conformance has access to a ConformanceToken initialiser. Clients outside the module are not able to construct a ConformanceToken because it has no public initialisers, and they cannot forward one from another type because the generic parameter wouldn't be Self.
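
A minimal single-file sketch of how this looks from inside the defining module, where the internal initialiser is visible. The Blessed type and the use(_:) function are invented for illustration; the cross-module restriction itself cannot be demonstrated in one file and is only described in the comments.

```swift
public struct ConformanceToken<T> {
    // Not public: only code inside the defining module can create a token.
    internal init() {}
}

public protocol CanUseButNotConform {
    func doSomething()

    var _token: ConformanceToken<Self> { get }
}

// Inside the defining module the initialiser is accessible, so conforming is easy.
public struct Blessed: CanUseButNotConform {
    public init() {}

    public func doSomething() {
        print("doing something")
    }

    public var _token: ConformanceToken<Blessed> { ConformanceToken() }
}

// Clients can still use the protocol as a generic constraint.
// In another module, writing `ConformanceToken()` would fail to compile
// because the initialiser is internal.
public func use<T: CanUseButNotConform>(_ value: T) {
    value.doSomething()
}

use(Blessed())
```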

You might consider adding something like this to AttributedStringProtocol to ease future evolution.

jmschonfeld · December 12, 2024, 10:06pm

An interesting suggestion! Yeah, that's definitely something we could have considered when introducing this protocol. It doesn't quite prevent a conformance, since someone could still conform with var _token: ConformanceToken<Self> { fatalError() }, but it at least does more to dissuade potential conformances than documentation alone. At this point we probably couldn't add it to AttributedStringProtocol, since that would break any (technically incorrect) clients that already have a conformance, but it's definitely something to consider for any future, similar protocols.
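
For illustration, this is roughly what that workaround would look like. The Sneaky type is hypothetical, and since everything here lives in one file the initialiser is technically reachable; the sketch simply pretends Sneaky is in a client module that cannot see it.

```swift
public struct ConformanceToken<T> {
    internal init() {}
}

public protocol CanUseButNotConform {
    func doSomething()

    var _token: ConformanceToken<Self> { get }
}

// Pretend this type lives in a client module with no access to ConformanceToken.init.
struct Sneaky: CanUseButNotConform {
    func doSomething() {
        print("conformed anyway")
    }

    // Type-checks because fatalError() returns Never, so no token is ever
    // constructed. Reading this property at run time would trap.
    var _token: ConformanceToken<Sneaky> {
        fatalError("no token available outside the defining module")
    }
}

// Works as long as nothing ever reads _token.
Sneaky().doSomething()
```

The conformance compiles, so the token is a deterrent rather than a hard barrier, unless the defining module actually reads the property.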

Karl · December 12, 2024, 10:22pm

Hm, you could try to access the field, I guess. But it's probably not worth trying too hard. Anybody who tries to implement that requirement the obvious way (looking up the ConformanceToken documentation) will see that it is supposed to be impossible.

As for the source break, given the quote ben picked out from the previous draft, it seems you were already willing to break those clients. So a break specifically to introduce a conformance barrier seems reasonable to me, personally.

It's up to you if you think it'll make future evolution easier for yourself and the other Foundation maintainers to more thoroughly block these conformances.

jmschonfeld · December 12, 2024, 10:33pm

Yeah I suspect the fact that someone would have to write fatalError() would be enough to dissuade them (or make them accept suffering the consequences).

You're not wrong, and perhaps we decide it's worth it someday. I don't think that day's today, but I'll keep it in mind for next time