Improving `String.Index`'s printed descriptions

Now that String.Index is aware of what code units it is counting, and considering the lovely new String processing APIs that are currently going through Swift Evolution, I think it's worth spending a little effort on improving the developer experience of working with string indices.

Motivation

If you've ever tried printing a string index, you may have come across printouts like the ones in this example:

let string = "👋🏼 Привіт"

print(string.startIndex) // ⟹ Index(_rawBits: 1)
print(string.endIndex) // ⟹ Index(_rawBits: 1376257)

These are generated via the default reflection-based string conversion paths. The in-memory representation of String.Index is a single 64-bit integer value that is logically broken up into separate fields. Unfortunately, the reflection mechanism is only aware of the raw integer, so that's what gets printed. This is completely unhelpful -- not even the people who work on String's implementation in the stdlib can make sense of it at a glance.

String indices are simply offsets from the start of the string's underlying storage representation, referencing a particular UTF-8 or UTF-16 code unit, depending on the string's encoding. (Most Swift strings are UTF-8 encoded, but strings bridged over from Objective-C may remain in their original UTF-16 encoded form.)
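For a quick illustration of those per-encoding offsets through existing public API (the numbers in the comments assume the sample string above):

let string = "👋🏼 Привіт"
let i = string.firstIndex(of: "р")!

// The same position, measured in the two code unit encodings:
print(string.utf8.distance(from: string.utf8.startIndex, to: i))   // ⟹ 11 (UTF-8 code units)
print(string.utf16.distance(from: string.utf16.startIndex, to: i)) // ⟹ 6 (UTF-16 code units)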

I think it would be useful if string indices printed themselves in a way that directly reflects their logical contents.

Proposed Solution

I have a PR up that conforms String.Index to CustomStringConvertible and CustomDebugStringConvertible, implementing human-readable descriptions.

@available(SwiftStdlib x.y, *)
extension String.Index: CustomStringConvertible {
  @available(SwiftStdlib x.y, *)
  public var description: String { ... }
}

@available(SwiftStdlib x.y, *)
extension String.Index: CustomDebugStringConvertible {
  @available(SwiftStdlib x.y, *)
  public var debugDescription: String { ... }
}

Notes:

  1. The new conformances necessarily need to come with availability. Fortunately, code should never need to call description or debugDescription directly -- the recommended way to convert things to strings is to call the String(describing:)/String(reflecting:) initializers, or to use string interpolation, as shown in the snippet after these notes. These initializers work on every version of Swift and they will always return a description, whether or not these conformances are present. On older versions of the Swift Standard Library, String.Index will continue to print using the original, reflection-based method.

  2. Changing the description of String.Index may turn out to be a binary breaking change, in which case we can still apply this change by restricting it to programs built with a particular Swift release.

    (For reference, the Standard Library does not consider its description and debugDescription implementations to be part of its ABI -- for most types, the strings returned by description and debugDescription may change with any Swift release, without going through a separate Swift Evolution proposal.)
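To illustrate note 1, the recommended conversion paths look like this; none of these call sites needs an availability check, and they pick up the improved descriptions automatically when running on a new enough stdlib:

let string = "👋🏼 Привіт"
let i = string.startIndex

print(String(describing: i))  // recommended
print(String(reflecting: i))  // recommended for debugging output
print("start index: \(i)")    // string interpolation works on every version, too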

Description formats

Note: This section merely presents the description strings returned by the proposed implementation to demonstrate the improvement. The exact strings returned wouldn't be normative, and may continue to change in any Swift Standard Library release, to keep the displays useful and to make sure they continue to reflect the underlying data.

CustomStringConvertible

For CustomStringConvertible, the index description displays the storage offset value and its encoding:

  let string = "👋🏼 Привіт"

  print(string.startIndex) // ⟹ "0[any]"
  print(string.endIndex) // ⟹ "21[utf8]"

Note how the start index does not care about its storage encoding -- offset zero is the same location in either case.

String index ranges print in a compact, easily understandable form:

  let i = string.firstIndex(of: "р")!
  let j = string.firstIndex(of: "і")!
  print(i ..< j) // ⟹ 11[utf8]..<17[utf8]

Exposing the actual storage offsets in the description effectively demonstrates how indices work, helping people gain a better understanding of both the underlying Unicode concepts, and the details of their implementation in Swift.

CustomDebugStringConvertible

The CustomDebugStringConvertible output is a bit more verbose. In addition to the offset + encoding, it also includes detailed information about the bits of the index that are reserved for performance flags and other auxiliary data.

For example, index i below is addressing the UTF-8 code unit at offset 10 in some string, which happens to be the first code unit in a Character (i.e., an extended grapheme cluster) of length 8:

print(String(reflecting: i)) 
// ⟹ String.Index(offset: 10, encoding: utf8, aligned: character, stride: 8)

Feedback would be most welcome!

(Edit 2022-05-02: updated sample output to match current implementation.)


Big +1!

Bikeshedding the printouts:

How about:

String.Index(utf8Offset: 10, aligned: character)

How about:

print(string.startIndex) // (offset: 0)
print(string.endIndex)   // (utf8Offset: 21)

Some overloads call the description property directly.
Will callers need availability checks to back-deploy?
(I'm just curious, I don't think it would be a problem for this pitch.)

The only thing I can think of is that we should maybe be a bit cautious when writing string indices as integers. Developers (who may be unfamiliar with the character/codepoint/codeunit relationship) ask for integer subscripting from strings all the time, and perhaps showing them values like this may be confusing (11 might not be the character/codepoint offset), or may make the UTF-8 view look more attractive in situations when it shouldn't be used.

Perhaps it can be helped by writing the encoding first? So we're not saying "11 (oh BTW it's UTF8)" but rather something more like "if viewed as UTF8 it's offset 11". I'm not sure if that would make a big difference though.

Thankfully, no! The extra overloads are just a performance optimization; they do not affect the returned result.

If the deployment target is lower than the availability of the CustomStringConvertible conformance, then the compiler will (silently!) select the unconstrained overload, ignoring the (partially available) constrained ones. This will result in an opaque call into the stdlib that ends up doing a runtime as? check to find description. If the code happens to be running on a stdlib version that includes this proposed conformance, then this check will succeed and we get the nice description. If we are running on an older stdlib, the check will fail and we'll get the original descriptions.

If the minimum deployment target is higher than whatever stdlib release includes these changes, then the compiler will select the constrained overloads, which results in improved performance without affecting the returned result.
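To make that concrete, here is a rough sketch of the mechanism with a stand-in type; the availability version is hypothetical and this is not the actual stdlib code:

struct MyIndex { var rawBits: UInt64 }

// The conformance (not the type) carries the availability, mirroring this pitch.
@available(macOS 13, iOS 16, *)  // hypothetical availability
extension MyIndex: CustomStringConvertible {
  var description: String { "MyIndex(\(rawBits))" }
}

// A caller with an older deployment target needs no #available check:
// String(describing:) resolves to its unconstrained overload and discovers
// the conformance at runtime, when it is actually present.
func describeIndex(_ i: MyIndex) -> String {
  String(describing: i)
}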


I don't want to get too much into bikeshedding the description formats -- this isn't something we can reasonably expect to form a consensus on.

However, we can and should productively argue about what information to include in these descriptions.

For reference, as of Swift 5.7, string indices contain the following information:

  1. storage offset (all versions)
  2. storage encoding (v5.7+, one of: unknown, UTF-8, UTF-16, any)
  3. transcoded offset (all versions, used to select a code unit in a UTF-16 transcoded scalar inside a UTF-8 string, or vice versa)
  4. cached alignment bits (scalar [5.1+], character [5.7+], future: possibly word/sentence/paragraph)
  5. cached extended grapheme cluster length (private impl detail, may go away in future stdlibs)

In the current implementation, description reports the first three of these, while debugDescription also includes the performance bits (4 & 5). Are y'all happy with this setup?

Note: None of these parts are directly accessible through public API. The deprecated String.Index.encodedOffset property returns the storage offset; however, that is useless without also knowing the associated encoding.

At this moment, I don't really see how we can usefully expose any of this information in public APIs. They are much too low-level, and I believe they are way, way, waaay too full of compatibility/semantic pitfalls to be safely usable for anything outside the stdlib. (They are tricky to correctly use within the stdlib, too.)

One concern raised privately is that including information in descriptions that isn't accessible via API would potentially encourage folks to parse these strings. I think that's a very valid concern, but I think the potential benefits of showing people how string indices work are much larger than the potential dangers of the same.

To distinguish between valid indices in the various string views, description needs to include the storage offset, storage encoding (if known), and transcoding offset (if any), preferably with minimal fluff.

I would be fine with removing the (very internal) performance bits from debugDescription, and simply have that describe the same components as description. For valid code, these bits do not affect the result of any string operation, just the time it takes to get that result.

One benefit of having debugDescription return the full picture, warts and all, is that it makes it easier to explain what goes wrong with code that uses an invalid index. (E.g., the perf bits might indicate that the index is pointing to the start of a Character of length 8, when the corresponding data in the string itself might be in the middle of a character that's only 3 code units long.)
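For example (a contrived sketch; the output in the comment uses the proposed format and is purely illustrative):

let a = "👋🏼 Привіт"
let b = "Привіт"

let i = a.index(a.startIndex, offsetBy: 3)  // a valid index into `a`, addressing "р"
// Using `i` with `b` is a programmer error; printing the index at least exposes the mismatch:
print(String(reflecting: i))
// ⟹ String.Index(offset: 11, encoding: utf8, aligned: character, stride: 2)
// ...but offset 11 in `b`'s storage falls in the middle of a 2-byte character.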


<bikeshed mode>

Hm, I had that implemented and used it for a while but I found it less readable in general than having the encoding listed as a separate component:

String.Index(offset: 10, encoding: utf8, aligned: character)
String.Index(offset: 23, encoding: utf16, aligned: character)
String.Index(offset: 7, encoding: unknown, aligned: character)
String.Index(offset: 9, encoding: any, aligned: character)

I found this display far more readable than utf8Offset:/utf16Offset:/unknownOffset:/anyOffset:.

In the end, I switched to the shorthand form, because I grew very accustomed to it through description, and I found the long-form variant unpleasantly verbose. (Especially when transcoded offsets are also present.)

String.Index(offset: 10, encoding: utf8, transcodedOffset: 1)
String.Index(offset: 42, encoding: utf16, transcodedOffset: 3)

This does work fine though; I'd be happy to revive it.

Same issue.

For reference, this year I've spent a couple months debugging tricky string indexing issues, living on a bunch of different CustomStringConvertible implementations throughout this effort.

Here are some of the description formats I experimented with (plus a bunch more I'm sure I'm not remembering):

Index(offset: 45, encoding: utf8, transcodedOffset: 1)
(offset: 45, encoding: utf8, transcodedOffset: 1)
(45, utf8, 1)
#45/utf8+1
45utf8+1       45@utf8+1      45:utf8+1     45.utf8+1
utf8:45+1      utf8.45+1         utf8/45+1    
utf8[45]+1
[45utf8]+1
[45utf8+1]
45[utf8]+1

Some of these I only played with in mock output, but I think I implemented at least one variant from each line.

Anything longer than half a dozen or so characters (or a dozen with a transcoding offset) felt overly verbose in practice.

The format I ended up sticking with the longest was 45[utf8]+1. I changed it to round parens on a whim before posting this pitch.

One important constraint is that the description needs to remain readable at a glance, even when it's printed as part of a range.

I hate the convention of not putting spaces around ..< -- it can make the operator visually blur into its arguments way too easily, and it's tricky to find a display that visually binds stronger than the range operator. The fully bracketed forms like [45utf8+1] do well in this, I think, but I feel they have issues with the unknown/any encoding.

String indices, in a very real sense, are just integers -- just not in the coordinate system people might expect. One thing I noticed while experimenting with these is that I got annoyed when the format tried to deemphasize the numerical offset -- putting it after the encoding, or within parens, or preceded by a label etc. -- these felt like they were obscuring the real nature of an index.

I strongly feel that the encoding "wants" to work like a unit of measure -- and I think the description works best if it accepts this. So the right order is offset first, followed immediately by the encoding. (As in 1km, 16.67ms, 21°C etc)

However, while I found 45utf8 and 23utf16 somewhat acceptable, I did not like 43unknown or 12any at all. Hence the many attempts at trying to figure out a workable spelling.

(Indices with unknown encodings will be encountered when running code built with older versions of Swift. Indices in ASCII strings and the start index of every string are encoding-agnostic, and their encoding is both UTF-8 and UTF-16 at the same time (indicated by the any encoding).)

On the contrary, I'm hoping that showing people what these indices actually are will help them develop a strong mental model of not just how Unicode works, but also how Swift implements it.

Learning a little bit about Unicode is pretty much unavoidable when dealing with text strings these days, no matter what language one is using -- just like learning a little bit about binary floating point math is pretty much unavoidable when doing any numerical work.

I would very much like folks to interpret string indices as sort-of-dimensioned quantities, measuring distance in a specific encoding. I think description ought to encourage this. The more people learn about these things, the less often we'll get naive feature requests.

Ah, do you think the parens make the encoding read like an optional note? If so, perhaps reverting to square brackets 34[utf8] would help. We can also reconsider simple juxtaposition, as in 34utf8 -- clearly this notation works for 1km or 2kHz; I don't see why it wouldn't work here.

I think people should be reaching for the various string views more often than they do.
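For example, the kind of view-based code I mean (all existing API; the counts are for the sample string from the pitch):

let string = "👋🏼 Привіт"

// Pick the granularity the problem actually calls for:
print(string.count)                 // ⟹ 8 (Characters)
print(string.unicodeScalars.count)  // ⟹ 9 (Unicode scalars)
print(string.utf8.count)            // ⟹ 21 (UTF-8 code units)
print(string.utf16.count)           // ⟹ 11 (UTF-16 code units)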

Note: the encoding isn't just a constant. While UTF-8 is going to be the most frequent value, UTF-16 strings are commonly encountered when interfacing with Objective-C; any will be routinely seen in the startIndex and some ASCII cases. (unknown will hopefully be rare, but people will see it while debugging older binaries.)

</bikeshed mode>


Going forward, will most indexes encountered in practice in native Swift code be UTF-8 encoded, with transcoded offset zero? If so, I would prioritize legibility by optimizing for the most common case, at least for description (versus debugDescription). That is:

"48"            // offset: 48, encoding: UTF-8, transcoded offset: 0
"48+1"          // offset: 48, encoding: UTF-8, transcoded offset: 1
"48+1 (UTF-16)" // offset: 48, encoding: UTF-16, transcoded offset: 1 

Other than offset 0, where the encoding obviously does not matter, are all offsets with "any" encoding those of ASCII strings? If so, to extend the approach above, I'd suggest:

"48 (ASCII)" // rather than "any [encoding]"
"0"          // rather than "any"--which any index with offset 0 must be

On a first impression, parens definitely read as being 'more parenthetical' than square brackets.

I think your comment about the encoding being like units is very apt, so it’s important to include. Compare with giving the temperature: 37 is meaningless (without context, at least) - it needs to be 37°F or 37°C or 37K.


Which description format will be used by playgrounds?

Would the REPL (and debuggers) need an LLDB summary provider?

I would prefer utf8@45+1

A unit of measure is perhaps the closest analogue, but I really see it as just a different view. There is some string there (where at a macro level, a "string" is a "collection of characters"), but you can pull out a magnifying glass to view those characters as individual Unicode scalars :mag_right:. The UTF-8/-16 encodings are just different ways those scalars are written in storage, so they are really like twins of each other -- the same information, expressed slightly differently. Like measuring electrons vs. holes, perhaps?

A lot of the tricks that developers use for better performance -- like parsing ASCII bits of strings at the UTF-8 level -- would work at the UTF-16 level, too. We just prefer UTF-8 because String is natively stored that way, but things like "1 code-unit = 1 character" that are useful for reasoning about offsets in ASCII strings apply to both.
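For example, the kind of UTF-8-level trick I mean -- a sketch that checks whether a string consists solely of ASCII digits without ever forming a Character:

func isAllASCIIDigits(_ s: String) -> Bool {
  // ASCII bytes only ever appear as ASCII characters in valid UTF-8,
  // so a plain byte-level scan is safe here.
  !s.isEmpty && s.utf8.allSatisfy { $0 >= UInt8(ascii: "0") && $0 <= UInt8(ascii: "9") }
}

print(isAllASCIIDigits("12345"))  // ⟹ true
print(isAllASCIIDigits("١٢٣"))    // ⟹ false (Arabic-Indic digits are not ASCII)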

But yeah, indices obviously are numbers. It's just that when we're talking about this thing with lots of overlapping sets of numbers, you need to be vigilant to not make mistakes. And it is complex, and it takes some time before people learn it.

Maybe. But we also have progressive disclosure, and I guess what I'm unsure about is when exactly we expect people to develop this strong mental model, and to what extent the technical details should be available vs. being prominent.

The current index description isn't doing anything for anybody -- though I guess you could argue that while it isn't helping, it also isn't hurting or actively misleading.

Yeah, I think I like square brackets slightly better for that reason.


+1 This will be an enormous improvement to String.Index's helpfulness while debugging.

Out of curiosity, what's the motivation for going this route as opposed to implementing CustomReflectable as a way to expose the portions of the underlying bitfield to the default description logic?


If you like 21°C you may like 45₈ and 23₁₆ :smiley:


Excellent points, everyone. I updated the proposed implementation to revert to square brackets.

I believe Playgrounds renders items using CustomStringConvertible when a type does not implement CustomPlaygroundDisplayConvertible; I think that'll do just fine in this case.

po in LLDB does a string conversion in the target process, so it will automatically get the new display.
The regular print command in LLDB does not run code in the target, so it's independent of a type's conformances. However, it seems desirable to have a data formatter for string indices that reproduces the same display. I'll add this to the to do list, as a separate followup task.

The default display created through reflection would be overly verbose. (We can make it work for debugDescription, but not for the shorthand description.)

Adding CustomReflectable would be nice, but I have some worries:

  1. It needs to be clear to everyone that this is for debugging only. The internal components of String.Index aren't public API, and the mirror may arbitrarily change between stdlib releases, without notice.
  2. The custom mirror should not interfere with how indices get printed in Playgrounds by default.

I think something like the impl below might work. The underscored names might discourage misuse; DisplayStyle.struct seems to take care of point 2.

@available(SwiftStdlib x.y, *)
extension String.Index: CustomReflectable {
  @available(SwiftStdlib x.y, *)
  @inline(never)
  public var customMirror: Mirror {
    var children: [(label: String?, value: Any)] = []
    children.reserveCapacity(5)
    children.append(("_encodedOffset", _encodedOffset))
    children.append(("_encoding", _encodingDescription))
    if transcodedOffset > 0 {
      children.append(("_transcodedOffset", transcodedOffset))
    }
    if _isCharacterAligned {
      children.append(("_aligned", "character"))
    } else if _isScalarAligned {
      children.append(("_aligned", "scalar"))
    }
    if let stride = characterStride {
      children.append(("_characterStride", stride))
    }
    return Mirror(self, children: children, displayStyle: .struct)
  }
}

I'd prefer not to introduce new types just to make the mirror work, hence all the string-valued children.
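For what it's worth, those children would then surface through the ordinary reflection entry points; something like the following, with the field values being illustrative:

let string = "👋🏼 Привіт"
let i = string.firstIndex(of: "р")!

for child in Mirror(reflecting: i).children {
  print(child.label ?? "_", child.value)
}
// e.g. _encodedOffset 11, _encoding utf8, _aligned character, _characterStride 2
// (the exact children depend on the index and on the stdlib version)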


Was this a convention to make the spacing consistent with partial ranges?
(e.g. 0..<45 vs ..<45)


  • The description could be comma-separated and parenthesized, like a tuple.
  • The customMirror and debugDescription could use the same underscored names.

print(i..<j)  //-> "(0, any)..<(45, utf8, +1)"

debugPrint(i..<j)  //-> "Range(String.Index(_offset: 0, _encoding: any)..<String.Index(_offset: 45, _encoding: utf8, _transcodedOffset: +1))"

+1 on the pitch

Nit on the format:

  • I like the effort to keep the description short :+1:
  • I agree that square brackets are better than round brackets, to avoid making the encoding info feel "optional"
  • But I think 17@utf8 would convey that even better than 17[utf8]

To me, 17@utf8 better conveys the meaning of "offset 17 in the utf8 view" -- and thus better emphasizes that this is not to be considered a character offset -- while 17[utf8] still feels like it could be read as "character at index 17… oh, and by the way, the string is utf8-encoded".

Nitpick on the example in this pitch: I don't know what the proposal author's direct or indirect intentions were, but please do not include any examples, direct or indirect, explicit or implicit, that show your personal support or opinion for some concrete political conflict. Political topics or opinions, whether they are (semi-)hidden or not, should stay away from language evolution and this forum in general. It might be an unlucky coincidence for the chosen language in the example, but my gut feeling tells me it's not. Yes, I know, it's just ONE WORD, but still. That said, with all my respect to anyone involved in this conversation, I would appreciate it if the examples were amended to use neutral language that has nothing to do with current global political conflicts. Thank you for your time.

There surely is a line which can be crossed there, but I’m not going to treat writing “hello” in Ukrainian as crossing it.


I'm not sure how serious you were, but this seems like a pretty fair option to me. True subscripts (not bracketed) are already used like that in contexts where numbers are being written out in more than one base; see, for example, the Wikipedia articles on Hexadecimal and Octal. This allows the value to be prominent, and the rule to interpret it to be present but in an assistive role.

While you're not wrong, I would appreciate not seeing such a pattern continue. Thank you for your understanding.