Now that String.Index is aware of which code units it is counting, and considering the lovely new String processing APIs that are currently going through Swift Evolution, I think it's worth spending a little effort on improving the developer experience of working with string indices.
Motivation
If you ever tried printing a string index, you may have come across a printout like the ones in this example:
let string = "👋🏼 Привіт"
print(string.startIndex) // ⟹ Index(_rawBits: 1)
print(string.endIndex) // ⟹ Index(_rawBits: 1376257)
These are generated via the default reflection-based string conversion paths. The in-memory representation of String.Index is a single 64-bit integer value that is logically broken up into separate fields. Unfortunately, the reflection mechanism is only aware of the raw integer, so that's what gets printed. This output is unhelpful to everyone, including the people who work on String's implementation in the stdlib.
String indices are simply offsets from the start of the string's underlying storage representation, referencing a particular UTF-8 or UTF-16 code unit, depending on the string's encoding. (Most Swift strings are UTF-8 encoded, but strings bridged over from Objective-C may remain in their original UTF-16 encoded form.)
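As a rough illustration of the offset model, public API can recover the UTF-8 offset of an index by measuring its distance from the start of the string's UTF-8 view. This is only a sketch and assumes a native, UTF-8 encoded Swift string; the helper name utf8Offset(of:in:) is made up for this example and is not part of the standard library.

```swift
// Hypothetical helper (not a stdlib API): the number of UTF-8 code units
// between the start of the string and the given index.
func utf8Offset(of index: String.Index, in string: String) -> Int {
    string.utf8.distance(from: string.startIndex, to: index)
}

let greeting = "👋🏼 Привіт"
let r = greeting.firstIndex(of: "р")!

print(utf8Offset(of: greeting.startIndex, in: greeting)) // 0
print(utf8Offset(of: r, in: greeting))                   // 11
print(utf8Offset(of: greeting.endIndex, in: greeting))   // 21
```

The emoji and its skin tone modifier take four UTF-8 code units each, the space takes one, and every Cyrillic letter takes two, which is where 11 and 21 come from.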
I think it would be useful if string indices printed themselves in a way that directly reflects their logical contents.
Proposed Solution
I have a PR up that conforms String.Index to CustomStringConvertible and CustomDebugStringConvertible, implementing human-readable descriptions.
@available(SwiftStdlib x.y, *)
extension String.Index: CustomStringConvertible {
  @available(SwiftStdlib x.y, *)
  public var description: String { ... }
}
@available(SwiftStdlib x.y, *)
extension String.Index: CustomDebugStringConvertible {
  @available(SwiftStdlib x.y, *)
  public var debugDescription: String { ... }
}
Notes:
- The new conformances necessarily need to come with availability. Fortunately, code should never need to call description or debugDescription directly -- the recommended way to convert things to strings is to call the String(describing:)/String(reflecting:) initializers, or to use string interpolation. These initializers work on every version of Swift and they will always return a description, whether or not these conformances are present. (On older versions of the Swift Standard Library, String.Index will continue to print using the original, reflection-based method.)
- Changing the description of String.Index may turn out to be a binary breaking change, in which case we can still apply this change by restricting it to programs built with a particular Swift release. (For reference, the Standard Library does not consider its description and debugDescription implementations to be part of its ABI -- for most types, the strings returned by description and debugDescription may change with any Swift release, without going through a separate Swift Evolution proposal.)
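To make the first note concrete, here is a minimal sketch of the conversion paths that need no availability check. The variable names are illustrative, and the printed text is deliberately not shown because it depends on which version of the standard library the program runs against.

```swift
let message = "👋🏼 Привіт"
let index = message.startIndex

// These conversions are available on every version of the standard library,
// whether or not String.Index has a custom description.
let described = String(describing: index)
let reflected = String(reflecting: index)
let interpolated = "\(index)"

print(described)
print(reflected)
print(interpolated)
```

Because none of these spell out description or debugDescription, the same source compiles against older deployment targets without any availability annotation.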
Description formats
Note: This section merely presents the description strings returned by the proposed implementation to demonstrate the improvement. The exact strings returned wouldn't be normative, and may continue to change in any Swift Standard Library release, to make the displays useful and to make sure they continue to reflect the underlying data.
CustomStringConvertible
For CustomStringConvertible, the index description displays the storage offset value and its encoding:
let string = "👋🏼 Привіт"
print(string.startIndex) // ⟹ 0[any]
print(string.endIndex) // ⟹ 21[utf8]
Note how the start index does not care about its storage encoding -- offset zero is the same location in either case.
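To see why a nonzero offset is only meaningful together with an encoding, the same positions can be measured in both UTF-8 and UTF-16 code units using public API. This sketch relies on the existing utf16Offset(in:) method and the UTF-8 view's distance(from:to:); the variable names are illustrative.

```swift
let text = "👋🏼 Привіт"
let p = text.firstIndex(of: "р")!

// Offset zero is the same position in either encoding.
let startUTF8 = text.utf8.distance(from: text.startIndex, to: text.startIndex)
let startUTF16 = text.startIndex.utf16Offset(in: text)
print(startUTF8, startUTF16) // 0 0

// Any later position generally has different offsets in the two encodings.
let pUTF8 = text.utf8.distance(from: text.startIndex, to: p)
let pUTF16 = p.utf16Offset(in: text)
print(pUTF8, pUTF16) // 11 6
```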
String index ranges print in a compact, easily understandable form:
let i = string.firstIndex(of: "р")!
let j = string.firstIndex(of: "і")!
print(i ..< j) // ⟹ 11[utf8]..<17[utf8]
Exposing the actual storage offsets in the description effectively demonstrates how indices work, helping people gain a better understanding of both the underlying Unicode concepts, and the details of their implementation in Swift.
CustomDebugStringConvertible
The CustomDebugStringConvertible output is a bit more verbose. In addition to the offset + encoding, it also includes detailed information about the bits of the index that are reserved for performance flags and other auxiliary data.
For example, index i below is addressing the UTF-8 code unit at offset 10 in some string, which happens to be the first code unit in a Character (i.e., an extended grapheme cluster) of length 8:
print(String(reflecting: i))
// ⟹ String.Index(offset: 10, encoding: utf8, aligned: character, stride: 8)
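For a sense of where a value like stride: 8 can come from, the length of a Character in UTF-8 code units can be computed with public API. This sketch uses the sample string from the earlier sections, whose first Character is an emoji followed by a skin tone modifier. It does not read the index's internal fields, which are not exposed, so it only approximates what the debug description reports.

```swift
let sample = "👋🏼 Привіт"
let first = sample[sample.startIndex] // the Character "👋🏼"

// Two scalars, four UTF-8 code units each, forming one grapheme cluster.
print(first.unicodeScalars.count) // 2
print(first.utf8.count)           // 8

// The next Character boundary lies eight UTF-8 code units further along.
let next = sample.index(after: sample.startIndex)
print(sample.utf8.distance(from: sample.startIndex, to: next)) // 8
```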
Feedback would be most welcome!
(Edit 2022-05-02: updated sample output to match current implementation.)