Pre-pitch: Let's solve the String.Index encoded offset problem


(Michael Ilseman) #1

SE-0241 is landing too late in the 5.0 release process for it to solve all the issues it set out to solve. It's being gutted to a minimal, urgent, semantics-preserving change. I'd like to discuss what the right solution is for String.Index's (potentially-soon-to-be-deprecated) encodedOffset.

There are other issues with SE-0180, discussed in another thread.

SE-0241 originally introduced a set of API attempting to solve 3 problems:

  1. SE-0180’s encodedOffset, meant for serialization purposes, needs to be parameterized over the encoding in which the string will be serialized
  2. Existing uses of encodedOffset need a semantics-preserving off-ramp for Swift 5, which is expressed in terms of UTF-16 offsets
  3. Existing misuses of encodedOffset, which assume all characters are a single UTF-16 code unit, need a semantics-fixing alternative

Details: String’s views and encodings

String has 3 views which correspond to the most popular Unicode encodings: UTF-8, UTF-16, and UTF-32 (via the Unicode scalar values). String’s default view is of Characters.

let myString = "abc\r\nいろは"
Array(myString.utf8) // UTF-8 encoded
Array(myString.utf16) // UTF-16 encoded
Array(myString.unicodeScalars.lazy.map { $0.value }) // UTF-32 encoded
Array(myString); Array(myString.indices) // Not an encoding, but provides offset-based access to `Characters`
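
To make the difference between the views concrete, here is a small sketch that counts the same string through each of them. The constant names are only illustrative; the calls are ordinary standard library API.

```swift
let sample = "abc\r\nいろは"

// One logical string, four lengths depending on the view.
let characterCount = sample.count               // "\r\n" is a single Character
let utf8Count = sample.utf8.count               // each kana takes 3 bytes in UTF-8
let utf16Count = sample.utf16.count             // each kana takes 1 UTF-16 code unit
let scalarCount = sample.unicodeScalars.count   // "\r" and "\n" are separate scalars

print(characterCount, utf8Count, utf16Count, scalarCount) // 7 14 8 8
```

An integer offset is therefore only meaningful together with the view it was measured in.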

Uses in the Wild

GitHub code search yields nearly 1500 uses, and nearly none of them are for SE-0180’s intended purpose. Below I present the 3 most common uses.

// Common code for these examples
let myString: String = ...
let start: String.Index = ...
let end: String.Index = ...
let utf16OffsetRange: Range<Int> = ...
let nsRange: NSRange = ...

Offset-based Character indexing

The most common misuse of encodedOffset assumes that all Characters in a String are comprised of a single code unit. This is wrong and a source of surprising bugs, even for exclusively ASCII content: "\r\n".count == 1.

let (i, j): (Int, Int) = ... // Something computed in terms of myString.count

// Problematic code
myString[String.Index(encodedOffset: i)..<String.Index(encodedOffset: j)]

// Semantic preserving alternative from this proposal
myString[String.Index(offset: i, within: myString)..<String.Index(offset: j, within: myString)]

// Even better alternative
let myIndices = Array(myString.indices)
let (i, j): (Int, Int) = ... // Something computed in terms of myIndices.count
myString[myIndices[i]..<myIndices[j]]
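
As a rough illustration of why the two kinds of offset cannot be mixed, the sketch below walks the same integer through the Character view and the UTF-16 view using only API that exists today. The string and constant names are made up for the example.

```swift
let text = "a\r\nb"

// Offset 2 counted in Characters lands on "b" ...
let charIndex = text.index(text.startIndex, offsetBy: 2)
let character = text[charIndex]

// ... but offset 2 counted in UTF-16 code units lands on "\n",
// which sits in the middle of the "\r\n" Character.
let utf16Index = text.utf16.index(text.utf16.startIndex, offsetBy: 2)
let codeUnit = text.utf16[utf16Index]

print(character == "b")          // true
print(codeUnit == 0x0A)          // true
print(charIndex == utf16Index)   // false
```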

Range Mapping

Many of the uses in the wild are trying to map between Range<String.Index> and NSRange. Foundation already provides convenient initializers for this purpose, and using them is the preferred approach:

// Problematic code
let myNSRange = NSRange(location: start.encodedOffset, length: end.encodedOffset - start.encodedOffset)
let myStrRange = String.Index(encodedOffset: nsRange.lowerBound)..<String.Index(encodedOffset: nsRange.upperBound)

// Better alternative
let myNSRange = NSRange(start..<end, in: myString)
let myStrRange = Range(nsRange, in: myString)
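
For a concrete round trip, here is a minimal sketch using those Foundation initializers on a string whose first Character takes two UTF-16 code units. The sample string and constant names are assumptions for illustration only.

```swift
import Foundation

let message = "👍 ok"
let okRange = message.range(of: "ok")!   // Range<String.Index>

// String range -> NSRange, measured in UTF-16 code units.
let okNSRange = NSRange(okRange, in: message)
print(okNSRange.location, okNSRange.length)   // 3 2

// NSRange -> String range. The initializer is failable, so unwrap it.
if let back = Range(okNSRange, in: message) {
    print(message[back])   // ok
}
```

The location is 3 rather than 2 because the emoji occupies two UTF-16 code units, which is exactly the detail that hand-written offset arithmetic tends to miss.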

Naked Ints

Some uses in the wild, through no fault of their own, have an Int which represents a position in UTF-16 encoded contents and need to convert that to a String.Index.

// Problematic code
let strLower = String.Index(encodedOffset: utf16OffsetRange.lowerBound)
let strUpper = String.Index(encodedOffset: utf16OffsetRange.upperBound)
let subStr = myString[strLower..<strUpper]

// Semantic preserving alternative from this proposal
let strLower = String.Index(offset: utf16OffsetRange.lowerBound, within: myString.utf16)
let strUpper = String.Index(offset: utf16OffsetRange.upperBound, within: myString.utf16)
let subStr = myString[strLower..<strUpper]
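
Until an API along these lines exists, the same conversion can be approximated with the UTF-16 view's own index arithmetic. The helper below is a hypothetical sketch (the name `slice(_:utf16Offsets:)` is not part of any proposal); it assumes the offsets do not split a surrogate pair and returns nil when they fall outside the string.

```swift
// Hypothetical helper: converts a range of UTF-16 offsets into a Substring.
func slice(_ s: String, utf16Offsets r: Range<Int>) -> Substring? {
    let units = s.utf16
    guard r.lowerBound >= 0,
          let lower = units.index(units.startIndex, offsetBy: r.lowerBound,
                                  limitedBy: units.endIndex),
          let upper = units.index(lower, offsetBy: r.count,
                                  limitedBy: units.endIndex)
    else { return nil }
    // String and its views share String.Index, so the indices can be used directly.
    return s[lower..<upper]
}

print(slice("👍 ok", utf16Offsets: 3..<5) ?? "nil")   // ok
print(slice("abc", utf16Offsets: 2..<10) ?? "nil")    // nil
```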

Original Proposed Solution

Here is a (slightly revised) version of the original proposal:

extension String.Index {
  /// The UTF-16 code unit offset corresponding to this Index
  public func offset<S: StringProtocol>(in utf16: S.UTF16View) -> Int { ... }

  /// The UTF-8 code unit offset corresponding to this Index
  public func offset<S: StringProtocol>(in utf8: S.UTF8View) -> Int { ... }

  /// The Unicode scalar offset corresponding to this Index
  public func offset<S: StringProtocol>(in scalars: S.UnicodeScalarView) -> Int { ... }

  /// The Character offset corresponding to this Index
  public func offset<S: StringProtocol>(in str: S) -> Int { ... }

  /// Creates a new index at the specified UTF-16 code unit offset
  ///
  /// - Parameter offset: An offset in UTF-16 code units.
  public init<S: StringProtocol>(offset: Int, in utf16: S.UTF16View) { ... }

  /// Creates a new index at the specified UTF-8 code unit offset
  ///
  /// - Parameter offset: An offset in UTF-8 code units.
  public init<S: StringProtocol>(offset: Int, in utf8: S.UTF8View) { ... }

  /// Creates a new index at the specified Unicode scalar offset
  ///
  /// - Parameter offset: An offset in terms of Unicode.Scalars
  public init<S: StringProtocol>(offset: Int, in scalars: S.UnicodeScalarView) { ... }

  /// Creates a new index at the specified Character offset
  ///
  /// - Parameter offset: An offset in terms of Characters
  public init<S: StringProtocol>(offset: Int, in str: S) { ... }
}

This gives developers:

  1. The ability to choose a specific encoding for serialization, the original intended purpose.
  2. The ability to fix any code that assumed fixed-encoding-width Characters by choosing the most-natural variant that just takes a String.
  3. The ability to migrate their uses for Cocoa index mapping by choosing UTF-16.

However, it’s not clear this is the best approach for Swift and more design work is needed:

  • Overloading only on view type makes it easy to accidentally omit a view and end up with character offsets. E.g. String.Index(offset: myUTF16Offset, in: myUTF16String) instead of String.Index(offset: myUTF16Offset, in: myUTF16String.utf16).
  • Producing new indices is usually done by the collection itself rather than parameterizing an index initializer. This should be handled with something more ergonomic such as offset-based indexing in a future release.
  • In real code in the wild, almost all created indices are immediately used to subscript the string or one of its views. This should be handled with something more ergonomic such as offset-based subscripting in a future release.
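
The first concern can be reproduced with a small sketch. The two initializers below are hypothetical stand-ins for the pitched overloads, written non-generically on top of existing index arithmetic and given a different label (`sketchOffset`) to make clear they are not the proposed API.

```swift
extension String.Index {
    // Hypothetical stand-in: offset measured in UTF-16 code units.
    init(sketchOffset offset: Int, in utf16: String.UTF16View) {
        self = utf16.index(utf16.startIndex, offsetBy: offset)
    }
    // Hypothetical stand-in: offset measured in Characters.
    init(sketchOffset offset: Int, in str: String) {
        self = str.index(str.startIndex, offsetBy: offset)
    }
}

let s = "👍 ok"
let utf16Offset = 3   // position of "o" in UTF-16 code units

let intended = String.Index(sketchOffset: utf16Offset, in: s.utf16)
let accidental = String.Index(sketchOffset: utf16Offset, in: s)   // forgot .utf16

print(s[intended])     // o
print(s[accidental])   // k
```

Both calls compile without complaint, and the only difference at the call site is the trailing `.utf16`.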

The review thread had some interesting discussion surrounding this area that I'd like to keep going.


SE-0241: Explicit Encoded Offsets for String Indices
(Jeremy David Giesbrecht) #2

May I ask what is probably a stupid question? When you/SE‐0180 say(s) this...

...by “serialization”, you mean putting the various index kinds into a linear series in order to compare which comes first (Comparable), correct? Or do you mean data serialization, where the index instance would be encoded into a series of bits (Codable)?

Until I double‐checked String.Index and noticed that it wasn’t actually Codable, I—mistakenly?—thought you were all referring to the latter. (The word encoded didn’t help.) The idea absolutely terrified me, which was where this random post came from:

I’m glad the standard library doesn’t use it for that—right?—, but its existing name still sort of invites the confusion I had, and suggests that it can be used magically with Codable. But that would be a foot‐shoot fest. Maybe we should all keep that in mind as we design API and write documentation, so that whatever we come up with is clear?

(The API you list above looks fine in this regard.)


#3

The Codable one. The Comparable part doesn’t share the problem we have now. Even if the encodedOffset belongs to a different encoding, it still belongs to the same encoding scheme under the same string, so offsets can be compared just fine.

I don’t know the reason for this myself, but I suppose it’s because they want us to use encodedOffset instead of String.Index for encoding.


(Jeremy David Giesbrecht) #4

Then I am very curious what @Michael_Ilseman has to say.


I hope that isn’t what they want. encodedOffset was a UTF‐16 offset. Now it’s a we‐won’t‐tell‐you‐what offset. Neither is suitable for Codable.

If the encoding scheme ensures bit‐wise fidelity, then encoding a String/String.Index pair is round‐trip. A binary property list might be this way?

But if the encoding scheme is itself text‐based, such as JSON or an XML property list, only an integer describing a Character offset is stable*. The lower levels of the string might change as the file is handled or transferred, as the file or stream might undergo any number of conversions between UTF‐this and UTF‐that, or normalization to NF‐this or NF‐that—all while it is completely out of Swift’s reach.

Advanced workarounds for contexts that need such things might be:

  • Encoding the string indirectly as base‐64 encoded data in order to preserve bit fidelity. Then all indices are safe.
  • Normalizing the string before computing the offsets in the first place, then performing the same pre‐agreed normalization upon decoding. Then all indices are back to pointing at what they intend.

Both have significant drawbacks that make them repugnant for the vastly more common use case: strings transferred or saved without any accompanying indices.

*Technically an integer describing a Character offset is still vulnerable to mismatched versions of Unicode and ICU, but that is significantly less of a concern.
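
A quick sketch of the normalization concern, using two canonically equivalent spellings of the same text. The helper name `offsets(of:in:)` is made up for this example; everything it calls is existing standard library API.

```swift
let composed = "caf\u{E9}!"      // é as a single scalar (NFC form)
let decomposed = "cafe\u{301}!"  // e followed by a combining acute accent (NFD form)

// Hypothetical helper: where does `target` sit, measured in each view?
func offsets(of target: Character, in s: String) -> (character: Int, utf16: Int, utf8: Int)? {
    guard let i = s.firstIndex(of: target) else { return nil }
    return (s.distance(from: s.startIndex, to: i),
            s.utf16.distance(from: s.utf16.startIndex, to: i),
            s.utf8.distance(from: s.utf8.startIndex, to: i))
}

print(composed == decomposed)             // true
print(offsets(of: "!", in: composed)!)    // (character: 4, utf16: 4, utf8: 5)
print(offsets(of: "!", in: decomposed)!)  // (character: 4, utf16: 5, utf8: 6)
```

Only the Character offset survives the change of normalization form; the code unit offsets both shift by one.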


#5

Re-reading SE-0180, it seems encodedOffset has very little to do with the Codable part, so my previous comment was likely wrong. :blush:

I see what you mean. Originally I imagined that people would encode/decode with the UTF-16 view (so an array of UInt16) when using a UTF-16 offset, but that’s likely untrue (looking back, I’m probably guilty of this as well). Perhaps if we can agree on a view-agnostic encoding, we can make String.Index Codable.