SE-0241 is landing too late in the 5.0 release process for it to solve all the issues it set out to solve. It's being gutted to a minimal, urgent, semantics-preserving change. I'd like to discuss what the right solution is for String.Index's (potentially-soon-to-be-deprecated) encodedOffset.
There are other issues with SE-0180, discussed in another thread.
SE-0241 originally introduced a set of APIs attempting to solve 3 problems:
- SE-0180’s encodedOffset, meant for serialization purposes, needs to be parameterized over the encoding in which the string will be serialized
- Existing uses of encodedOffset need a semantics-preserving off-ramp for Swift 5, which is expressed in terms of UTF-16 offsets
- Existing misuses of encodedOffset, which assume all characters are a single UTF-16 code unit, need a semantics-fixing alternative
Details: String’s views and encodings
String has 3 views which correspond to the most popular Unicode encodings: UTF-8, UTF-16, and UTF-32 (via the Unicode scalar values). String’s default view is of Characters.
let myString = "abc\r\nいろは"
Array(myString.utf8) // UTF-8 encoded
Array(myString.utf16) // UTF-16 encoded
Array(myString.unicodeScalars.lazy.map { $0.value }) // UTF-32 encoded
Array(myString); Array(myString.indices) // Not an encoding, but provides offset-based access to `Characters`
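To make the difference between the views concrete, here is a small sketch that counts the elements each view reports for the same string. It only uses existing standard library API; the constant names are mine.

```swift
let myString = "abc\r\nいろは"

// Each view reports its own element count for the same string.
let utf8Count = myString.utf8.count               // UTF-8 code units
let utf16Count = myString.utf16.count             // UTF-16 code units
let scalarCount = myString.unicodeScalars.count   // Unicode scalar values
let characterCount = myString.count               // Characters (grapheme clusters)

print(utf8Count, utf16Count, scalarCount, characterCount)   // prints "14 8 8 7"
```

The Character count is lower than the scalar count because "\r\n" is two scalars but a single Character.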
Uses in the Wild
GitHub code search yields nearly 1500 uses, and almost none of them are for SE-0180’s intended purpose. Below I present the 3 most common uses.
// Common code for these examples
let myString: String = ...
let start: String.Index = ...
let end: String.Index = ...
let utf16OffsetRange: Range<Int> = ...
let nsRange: NSRange = ...
Offset-based Character indexing
The most common misuse of encodedOffset assumes that all Characters in a String consist of a single code unit. This is wrong and a source of surprising bugs, even for exclusively ASCII content: "\r\n".count == 1.
let (i, j): (Int, Int) = ... // Something computed in terms of myString.count
// Problematic code
myString[String.Index(encodedOffset: i)..<String.Index(encodedOffset: j)]
// Semantic preserving alternative from this proposal
myString[String.Index(offset: i, within: myString)..<String.Index(offset: j, within: myString)]
// Even better alternative
let myIndices = Array(myString.indices)
let (i, j): (Int, Int) = ... // Something computed in terms of myIndices.count
myString[myIndices[i]..<myIndices[j]]
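As a rough illustration of why the two kinds of offset cannot be mixed, the sketch below applies the same Int to the Character view and to the UTF-16 view of an ASCII-only string, using only the existing index(_:offsetBy:) API. The sample string and names are mine.

```swift
let text = "a\r\nb"   // Characters: "a", "\r\n", "b"

// Character offset 2, computed on the String itself.
let byCharacter = text.index(text.startIndex, offsetBy: 2)

// UTF-16 code unit offset 2, computed on the UTF-16 view.
let byCodeUnit = text.utf16.index(text.utf16.startIndex, offsetBy: 2)

print(text.count, text.utf16.count)     // prints "3 4"
print(text[byCharacter] == "b")         // prints "true"
print(text.utf16[byCodeUnit] == 0x0A)   // prints "true": the "\n" inside "\r\n"
print(byCharacter == byCodeUnit)        // prints "false"
```

The UTF-16 offset lands in the middle of the "\r\n" Character, which is exactly the kind of position the misuses above end up producing.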
Range Mapping
Many of the uses in the wild are trying to map between Range<String.Index> and NSRange. Foundation already provides convenient initializers for this purpose, and using them is the preferred approach:
// Problematic code
let myNSRange = NSRange(location: start.encodedOffset, length: end.encodedOffset - start.encodedOffset)
let myStrRange = String.Index(encodedOffset: nsRange.lowerBound)..<String.Index(encodedOffset: nsRange.upperBound)
// Better alternative
let myNSRange = NSRange(start..<end, in: myString)
let myStrRange = Range(nsRange, in: myString)
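Here is a self-contained round trip through those Foundation initializers, as a minimal sketch. The sample string is made up; note that the NSRange-to-Range initializer is failable, so the result is optional.

```swift
import Foundation

let message = "hello 👋 world"

// Build a Range<String.Index> covering just the emoji.
let waveStart = message.firstIndex(of: "👋")!
let waveRange = waveStart..<message.index(after: waveStart)

// String range -> NSRange, measured in UTF-16 code units.
let waveNSRange = NSRange(waveRange, in: message)
print(waveNSRange.location, waveNSRange.length)   // prints "6 2"

// NSRange -> String range; nil if the NSRange does not fit the string.
let roundTripped = Range(waveNSRange, in: message)
if let range = roundTripped {
    print(message[range])   // prints "👋"
}
```

The length is 2 because the emoji is a surrogate pair in UTF-16, even though it is a single Character.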
Naked Ints
Some uses in the wild, through no fault of their own, have an Int which represents a position in UTF-16 encoded contents and need to convert that to a String.Index.
// Problematic code
let strLower = String.Index(encodedOffset: utf16OffsetRange.lowerBound)
let strUpper = String.Index(encodedOffset: utf16OffsetRange.upperBound)
let subStr = myString[strLower..<strUpper]
// Semantic preserving alternative from this proposal
let strLower = String.Index(offset: utf16OffsetRange.lowerBound, within: myString.utf16)
let strUpper = String.Index(offset: utf16OffsetRange.upperBound, within: myString.utf16)
let subStr = myString[strLower..<strUpper]
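Until something like the above exists, the same conversion can be approximated with today's API by offsetting within the UTF-16 view. The helper below is hypothetical (its name and shape are mine, not part of the proposal) and returns nil for out-of-range offsets instead of trapping.

```swift
/// Hypothetical helper, not part of the proposal: the index at `offset`
/// UTF-16 code units from the start of `string`, or nil if out of bounds.
func utf16Index(at offset: Int, in string: String) -> String.Index? {
    let utf16 = string.utf16
    guard offset >= 0 else { return nil }
    return utf16.index(utf16.startIndex, offsetBy: offset, limitedBy: utf16.endIndex)
}

let source = "hi 👋 there"
let sampleOffsets = 3..<5   // UTF-16 offsets covering the emoji

if let lower = utf16Index(at: sampleOffsets.lowerBound, in: source),
   let upper = utf16Index(at: sampleOffsets.upperBound, in: source) {
    print(source[lower..<upper])   // prints "👋"
}
print(utf16Index(at: 100, in: source) == nil)   // prints "true"
```

This walks the UTF-16 view rather than trusting the Int to be a valid encoded offset, so it does not depend on how the string happens to be stored.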
Original Proposed Solution
Here is a (slightly revised) version of the original proposal:
extension String.Index {
/// The UTF-16 code unit offset corresponding to this Index
public func offset<S: StringProtocol>(in utf16: S.UTF16View) -> Int { ... }
/// The UTF-8 code unit offset corresponding to this Index
public func offset<S: StringProtocol>(in utf8: S.UTF8View) -> Int { ... }
/// The Unicode scalar offset corresponding to this Index
public func offset<S: StringProtocol>(in scalars: S.UnicodeScalarView) -> Int { ... }
/// The Character offset corresponding to this Index
public func offset<S: StringProtocol>(in str: S) -> Int { ... }
/// Creates a new index at the specified UTF-16 code unit offset
///
/// - Parameter offset: An offset in UTF-16 code units.
public init<S: StringProtocol>(offset: Int, in utf16: S.UTF16View) { ... }
/// Creates a new index at the specified UTF-8 code unit offset
///
/// - Parameter offset: An offset in UTF-8 code units.
public init<S: StringProtocol>(offset: Int, in utf8: S.UTF8View) { ... }
/// Creates a new index at the specified Unicode scalar offset
///
/// - Parameter offset: An offset in terms of Unicode.Scalars
public init<S: StringProtocol>(offset: Int, in scalars: S.UnicodeScalarView) { ... }
/// Creates a new index at the specified Character offset
///
/// - Parameter offset: An offset in terms of Characters
public init<S: StringProtocol>(offset: Int, in str: S) { ... }
}
This gives developers:
- The ability to choose a specific encoding for serialization, the original intended purpose.
- The ability to fix any code that assumed fixed-encoding-width Characters by choosing the most-natural variant that just takes a String.
- The ability to migrate their uses for Cocoa index mapping by choosing UTF-16.
However, it’s not clear this is the best approach for Swift and more design work is needed:
- Overloading only on view type makes it easy to accidentally omit a view and end up with character offsets. E.g. String.Index(offset: myUTF16Offset, in: myUTF16String) instead of String.Index(offset: myUTF16Offset, in: myUTF16String.utf16).
- Producing new indices is usually done by the collection itself rather than parameterizing an index initializer. This should be handled with something more ergonomic such as offset-based indexing in a future release.
- In real code in the wild, almost all created indices are immediately used to subscript the string or one of its views. This should be handled with something more ergonomic such as offset-based subscripting in a future release.
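The first concern can be made concrete with today's API. The sketch below uses a hypothetical generic helper (my own, not the proposed API) that measures an index's offset in whichever collection it is handed; forgetting the .utf16 silently changes the answer.

```swift
/// Hypothetical helper, not part of the proposal: the offset of `index`
/// from the start of `view`, counted in that view's elements.
func offset<C: Collection>(of index: C.Index, in view: C) -> Int {
    return view.distance(from: view.startIndex, to: index)
}

let sample = "a👋b"
let indexOfB = sample.firstIndex(of: "b")!

print(offset(of: indexOfB, in: sample))                  // prints "2" (Characters)
print(offset(of: indexOfB, in: sample.unicodeScalars))   // prints "2" (Unicode scalars)
print(offset(of: indexOfB, in: sample.utf16))            // prints "3" (UTF-16 code units)
print(offset(of: indexOfB, in: sample.utf8))             // prints "5" (UTF-8 code units)
```

All four calls type-check because the views share String.Index, which is what makes the accidental omission so easy to miss.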
The review thread had some interesting discussion surrounding this area that I'd like to keep going.