Add isPrependConcatenationMark to Unicode.Scalar.Properties

Hi everyone!

Quick note: this is my first post in the Pitches category so please let me know if there's anything I should change or do differently. Thanks!

Introduction

Swift's Unicode.Scalar.Properties struct includes many useful property checks for scalars, such as isAlphabetic and isHexDigit. I think a new isPrependConcatenationMark property check would also be useful, corresponding to the Prepended_Concatenation_Mark property defined in the Unicode Standard.

Motivation

Grapheme cluster break detection relies on the grapheme break property table, which lists each grapheme break property and its corresponding code points. Some of these values are defined in terms of individual code points, like CR, while others are defined in terms of Unicode properties, like Prepend.

Allowing developers to leverage Swift's Unicode.Scalar.Properties reduces both the amount of duplicate code they need to write and the possibility of errors, which can easily creep in when directly referencing the code points listed in the GraphemeBreakProperty spec.
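
For illustration, here is a rough sketch of the kind of hand-maintained check a developer has to write today. The name isPrependedConcatenationMarkByHand is hypothetical, and the code points are copied by hand from my reading of PropList.txt in recent Unicode versions, so they should be verified against the Unicode data files before use. That manual step is exactly what the pitch is trying to avoid.

```swift
extension Unicode.Scalar {
    /// Hypothetical stand-in for the pitched property.
    /// The table below is hand-copied (assumption: it matches PropList.txt
    /// for recent Unicode versions) and goes stale whenever Unicode adds
    /// new Prepended_Concatenation_Mark code points.
    var isPrependedConcatenationMarkByHand: Bool {
        switch value {
        case 0x0600...0x0605,   // ARABIC NUMBER SIGN ... ARABIC NUMBER MARK ABOVE
             0x06DD,            // ARABIC END OF AYAH
             0x070F,            // SYRIAC ABBREVIATION MARK
             0x0890...0x0891,   // ARABIC POUND MARK ABOVE, ARABIC PIASTRE MARK ABOVE
             0x08E2,            // ARABIC DISPUTED END OF AYAH
             0x110BD,           // KAITHI NUMBER SIGN
             0x110CD:           // KAITHI NUMBER SIGN ABOVE
            return true
        default:
            return false
        }
    }
}
```

Note that the check has to live on Unicode.Scalar rather than on Unicode.Scalar.Properties, because Properties does not publicly expose the scalar it wraps.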

Proposed solution

I propose adding a new property to the Unicode.Scalar.Properties struct.

Detailed design

extension Unicode.Scalar.Properties {
  public var isPrependConcatenationMark: Bool { get } // Prepended_Concatenation_Mark
}

Source compatibility

This change is strictly additive. This proposal does not affect source compatibility.

Effect on ABI stability

This change is strictly additive. This proposal does not affect the ABI of existing language features.

Effect on API resilience

This change is very minor and nearly identical to many of the getters in Unicode.Scalar.Properties. This change will not affect API resilience.

Alternatives considered

Simply not adding the property to the standard library was considered, but it would likely lead to more errors from mistakes in hand-copying the specific code points. It would also be less convenient for developers, because it is very difficult to extend Unicode.Scalar.Properties with a getter that could be shipped in a SwiftPM package or copied into existing code: there is no easy way to access the internal var _value in the Unicode.Scalar.Properties struct.

+1 for this.

Swift is one of a small list of languages with great Unicode support, and this is a small step further. As it is an additive change, let's just do it.

As the original proposer/implementor of Unicode.Scalar.Properties, I would kind of hope that keeping it up-to-date with new ICU additions (at least for things as straightforward as new Boolean properties) would be as simple as a PR vs. the pitch/proposal cycle, but I'm not a policy maker :slight_smile:

The only reason Prepended_Concatenation_Mark isn't already there is that it was added to the standard after the initial implementation in Swift, so I think adding it makes total sense.

The only thing that makes this slightly tricky is that the property will only be available on versions of Apple's OSes where libicucore.dylib is built from a recent enough version of ICU. Some of the emoji properties today have similar constraints. It looks like the following Boolean properties have been added to Unicode/ICU since Unicode.Scalar.Properties was implemented:

I don't happen to know off the top of my head what minimum versions of each OS correspond to ICU 60 and 62, though.

@Michael_Ilseman has discussed in the past the idea of embedding subsets of the Unicode data tables into the standard library to remove the odd coupling between these APIs and specific OS versions, and it would make these kinds of updates much easier, but I don't know what the status of that effort is.

With @Alejandro putting data into the Swift runtime/stdlib, which removes any reliance on ICU by taking the data directly from Unicode, I think we can support these much more easily.

At the same time, we'll want to support a superset of the scalar properties listed in UTS #18 (@nnnnnnnn), so I think this could be a good opportunity to add more functionality and clarify how these get updated.

@Sammcb sorry for the great delay (I don't know why I missed the notification), would you and/or @allevato be interested in pitching an update?

This is a good call, thanks! I'm working on drawing up a table of which properties we've implemented and how, so we can see what's left to cover.

Will post back when I have something to share!

Posted my survey here: Swift Unicode Properties.md · GitHub

Please let me know if I missed anything… Would be happy to discuss on the gist or in this thread.

Also, anyone interested in this would probably also be interested in the AllScalars type we recently added to the swift-experimental-string-processing repository — handy for poking around in the Unicode scalar universe: https://github.com/apple/swift-experimental-string-processing/blob/main/Sources/Util/AllScalars.swift
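
For anyone who just wants to poke around without pulling in that repository, a rough sketch of the same idea is below. This is not the AllScalars implementation from the link; everyScalar is a hypothetical name, and the approach simply relies on the failable Unicode.Scalar(UInt32) initializer returning nil for surrogate code points.

```swift
// Hypothetical helper, not the AllScalars type from the linked repository.
// Walks every code point and keeps only the ones that are valid scalars
// (the initializer returns nil for the surrogates U+D800...U+DFFF).
let codePoints: ClosedRange<UInt32> = 0...0x10FFFF
let everyScalar = codePoints.lazy.compactMap { (v: UInt32) in Unicode.Scalar(v) }

// Example query: collect the scalars that report the Math property.
let mathScalars = everyScalar.filter { $0.properties.isMath }
```

Because the sequence is lazy, nothing is materialized until you iterate or count it.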

Could Unicode.Scalar conform to Strideable, instead of adding an AllScalars type?

(Range<Unicode.Scalar> and ClosedRange<Unicode.Scalar> would then have conditional RandomAccessCollection conformances.)

// in `stdlib/public/core/UnicodeScalar.swift`
extension Unicode.Scalar: Strideable {
  public typealias Stride = Int
  public func advanced(by distance: Stride) -> Unicode.Scalar
  public func distance(to other: Unicode.Scalar) -> Stride
}

// in `stdlib/public/core/Stride.swift`
extension Strideable where Self == Unicode.Scalar {
  public static func _step(
    after current: (index: Int?, value: Self),
    from start: Self, by distance: Self.Stride
  ) -> (index: Int?, value: Self)
}

for u: Unicode.Scalar in "\0"..."\u{10FFFF}" where u.properties.isMath {}
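
One detail the declarations above leave implicit is that U+D800...U+DFFF are not valid scalars, so advanced(by:) and distance(to:) would need to skip that gap. A minimal sketch of that arithmetic, written as free functions with hypothetical names so it does not depend on the conformance existing:

```swift
// Hypothetical helpers; the names are illustrative and not part of any proposal.
let surrogateStart: UInt32 = 0xD800
let surrogateCount: UInt32 = 0x800   // U+D800...U+DFFF

/// 0-based position of `scalar` among all valid scalars, skipping surrogates.
func scalarIndex(of scalar: Unicode.Scalar) -> Int {
    let v = scalar.value
    return Int(v < surrogateStart ? v : v - surrogateCount)
}

/// Inverse of `scalarIndex(of:)`; returns nil when `index` is out of range.
func scalar(atIndex index: Int) -> Unicode.Scalar? {
    guard index >= 0, index < 0x110000 - 0x800 else { return nil }
    let i = UInt32(index)
    return Unicode.Scalar(i < surrogateStart ? i : i + surrogateCount)
}

/// What `distance(to:)` would compute.
func scalarDistance(from a: Unicode.Scalar, to b: Unicode.Scalar) -> Int {
    scalarIndex(of: b) - scalarIndex(of: a)
}

/// What `advanced(by:)` would compute; nil instead of trapping when out of range.
func scalarAdvanced(_ s: Unicode.Scalar, by n: Int) -> Unicode.Scalar? {
    scalar(atIndex: scalarIndex(of: s) + n)
}
```

With this mapping, advancing U+D7FF by 1 lands on U+E000 rather than on a surrogate. A real advanced(by:) would have to trap rather than return nil, since Strideable requires a non-optional result.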

However, a Unicode.Set: SetAlgebra where Element == String type might be more useful:

That is one of the first things I did when I needed to do some scalar processing. I wondered why it isn't already conforming.

I would love to have this. But can you elaborate on how it would work in Swift?

No, I haven't designed or implemented anything yet.

And I'm not sure if all the features from UTS #35 are useful (e.g. {ab}-{cd} string ranges).

But I think that scalar/character sets would be useful outside of regex literals, and perhaps also within the result builder DSL.