Improving `String.Index`'s printed descriptions

I don't want to get too much into bikeshedding the description formats -- this isn't something we can reasonably expect to form a consensus on.

What we can and should productively argue about is what information to include in these descriptions.

For reference, as of Swift 5.7, string indices contain the following information (see the sketch after this list):

  1. storage offset (all versions)
  2. storage encoding (v5.7+, one of: unknown, UTF-8, UTF-16, any)
  3. transcoded offset (all versions, used to select a code unit in a UTF-16 transcoded scalar inside a UTF-8 string, or vice versa)
  4. cached alignment bits (scalar [5.1+], character [5.7+], future: possibly word/sentence/paragraph)
  5. cached extended grapheme cluster length (private impl detail, may go away in future stdlibs)
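
For illustration, here's that information modeled as a made-up Swift struct -- the type and field names below are invented for this post and do not match the stdlib's actual internals:

```swift
// Hypothetical model of the components listed above; names are
// invented for illustration, not the stdlib's real representation.
enum StorageEncoding { case unknown, utf8, utf16, any }

struct IndexComponents {
  var storageOffset: Int          // 1. position in the underlying storage
  var encoding: StorageEncoding   // 2. how that storage is encoded
  var transcodedOffset: Int       // 3. code unit within a transcoded scalar
  var isScalarAligned: Bool       // 4. cached alignment bits
  var isCharacterAligned: Bool    //    (word/sentence/paragraph may come later)
  var characterStride: Int?       // 5. cached grapheme cluster length (private)
}
```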

In the current implementation, `description` reports the first three of these, while `debugDescription` also includes the performance bits (4 & 5). Are y'all happy with this setup?

Note: None of these parts are directly accessible through public API. The deprecated `String.Index.encodedOffset` property returns the storage offset; however, that is useless without also knowing the associated encoding.
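
To illustrate the pitfall, here's a hedged sketch -- the numbers in the comments assume the storage encodings noted, including that the bridged copy keeps UTF-16 storage:

```swift
import Foundation

let native = "😀x"                          // native Swift string: UTF-8 storage
let bridged = ("😀x" as NSString) as String // bridged from Obj-C: likely UTF-16 storage

// The same logical position yields different raw offsets depending on the
// underlying encoding, so the bare number tells you nothing on its own.
print(native.firstIndex(of: "x")!.encodedOffset)  // 4 -- a UTF-8 offset
print(bridged.firstIndex(of: "x")!.encodedOffset) // 2 -- a UTF-16 offset
```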

At this moment, I don't really see how we can usefully expose any of this information in public APIs. They are much too low-level, and I believe they are way, way, waaay too full of compatibility/semantic pitfalls to be safely usable for anything outside the stdlib. (They are tricky to correctly use within the stdlib, too.)

One concern raised privately is that including information in descriptions that isn't accessible via API could encourage folks to parse these strings. I think that's a very valid concern, but I believe the potential benefits of showing people how string indices work are much larger than the potential dangers of doing so.

To distinguish between valid indices in the various string views, `description` needs to include the storage offset, the storage encoding (if known), and the transcoded offset (if any), preferably with minimal fluff.

I would be fine with removing the (very internal) performance bits from `debugDescription` and simply having it describe the same components as `description`. For valid code, these bits do not affect the result of any string operation, just the time it takes to get that result.

One benefit of having `debugDescription` return the full picture, warts and all, is that it makes it easier to explain what goes wrong with code that uses an invalid index. (E.g., the perf bits might indicate that the index is pointing to the start of a `Character` of length 8, when the corresponding position in the string itself falls in the middle of a character that's only 3 code units long.)
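
For instance, here's a contrived sketch of that scenario; the cached values mentioned in the comments are assumptions about the current implementation:

```swift
let flags = "🇺🇸🇨🇦"                       // each flag is one Character, 8 UTF-8 code units
let accents = "e\u{301}e\u{301}e\u{301}"  // each "é" is one Character, 3 UTF-8 code units

// This index likely carries storage offset 8 plus a cached character length of 8.
let i = flags.index(after: flags.startIndex)

// Applying `i` to `accents` is invalid: offset 8 lands in the middle of a
// 3-code-unit character. The result is unspecified, and only a description
// that exposes the cached bits makes the mismatch visible.
print(accents[i])
```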


<bikeshed mode>

Hm, I had that implemented and used it for a while, but I found it less readable in general than having the encoding listed as a separate component:

```
String.Index(offset: 10, encoding: utf8, aligned: character)
String.Index(offset: 23, encoding: utf16, aligned: character)
String.Index(offset: 7, encoding: unknown, aligned: character)
String.Index(offset: 9, encoding: any, aligned: character)
```

I found this display far more readable than `utf8Offset:`/`utf16Offset:`/`unknownOffset:`/`anyOffset:`.

In the end, I switched to the shorthand form, because I had grown very accustomed to it through `description`, and I found the long-form variant unpleasantly verbose. (Especially when transcoded offsets are also present.)

```
String.Index(offset: 10, encoding: utf8, transcodedOffset: 1)
String.Index(offset: 42, encoding: utf16, transcodedOffset: 3)
```

This does work fine though; I'd be happy to revive it.

Same issue.

For reference, this year I've spent a couple of months debugging tricky string indexing issues, living on a bunch of different `CustomStringConvertible` implementations throughout this effort.

Here are some of the description formats I experimented with (plus a bunch more that I'm sure I'm forgetting):

```
Index(offset: 45, encoding: utf8, transcodedOffset: 1)
(offset: 45, encoding: utf8, transcodedOffset: 1)
(45, utf8, 1)
#45/utf8+1
45utf8+1       45@utf8+1      45:utf8+1     45.utf8+1
utf8:45+1      utf8.45+1      utf8/45+1
utf8[45]+1
[45utf8]+1
[45utf8+1]
45[utf8]+1
```

Some of these I only played with in mock output, but I think I implemented at least one variant from each line.

Anything longer than half a dozen or so characters (or a dozen with a transcoded offset) felt overly verbose in practice.

The format I ended up sticking with the longest was `45[utf8]+1`. I changed it to round parens on a whim before posting this pitch.

One important constraint is that the description needs to remain readable at a glance, even when it's printed as part of a range.

I hate the convention of not putting spaces around `..<` -- it can make the operator visually blur into its arguments way too easily, and it's tricky to find a display that visually binds more strongly than the range operator. The fully bracketed forms like `[45utf8+1]` do well in this regard, I think, but I feel they have issues with the `unknown`/`any` encodings.

String indices, in a very real sense, are just integers -- just not in the coordinate system people might expect. One thing I noticed while experimenting is that I got annoyed whenever a format tried to deemphasize the numerical offset -- putting it after the encoding, wrapping it in parens, preceding it with a label, etc. -- these all felt like they were obscuring the real nature of an index.

I strongly feel that the encoding "wants" to work like a unit of measure -- and I think the description works best if it accepts this. So the right order is offset first, followed immediately by the encoding (as in 1km, 16.67ms, 21°C, etc.).

However, while I found `45utf8` and `23utf16` somewhat acceptable, I did not like `43unknown` or `12any` at all. Hence the many attempts at trying to figure out a workable spelling.

(Indices with `unknown` encodings will be encountered when running code built with older versions of Swift. Indices in ASCII strings and the start index of every string are encoding-agnostic; their encoding is both UTF-8 and UTF-16 at the same time, indicated by the `any` encoding.)
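
The encoding-agnostic case is easy to observe with existing public API:

```swift
let s = "abc"         // all-ASCII: every index is encoding-agnostic
let i = s.startIndex
print(s.utf8[i])      // 97 -- works as a UTF-8 position
print(s.utf16[i])     // 97 -- works equally well as a UTF-16 position
```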

On the contrary, I'm hoping that showing people what these indices actually are will help them develop a strong mental model of not just how Unicode works, but also how Swift implements it.

Learning a little bit about Unicode is pretty much unavoidable when dealing with text strings these days, no matter what language one is using -- just like learning a little bit about binary floating point math is pretty much unavoidable when doing any numerical work.

I would very much like folks to interpret string indices as sort-of-dimensioned quantities, measuring distance in a specific encoding. I think `description` ought to encourage this. The more people learn about these things, the less often we'll get naive feature requests.

Ah, do you think the parens make the encoding read like an optional note? If so, perhaps reverting to square brackets (`34[utf8]`) would help. We can also reconsider simple juxtaposition, as in `34utf8` -- clearly this notation works for 1km or 2kHz; I don't see why it wouldn't work here.

I think people should be reaching for the various string views more often than they do.
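
For example, measuring the same string in different "units" is already straightforward with the views and existing public API:

```swift
let s = "café"  // "é" spelled as the precomposed scalar U+00E9

print(Array(s.utf8))   // [99, 97, 102, 195, 169] -- UTF-8 code units
print(Array(s.utf16))  // [99, 97, 102, 233]      -- UTF-16 code units

// The same distance, measured in two different encodings:
print(s.utf8.distance(from: s.startIndex, to: s.endIndex))   // 5
print(s.utf16.distance(from: s.startIndex, to: s.endIndex))  // 4
```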

Note: the encoding isn't just a constant. While UTF-8 is going to be the most frequent value, UTF-16 strings are commonly encountered when interfacing with Objective-C; `any` will routinely be seen for `startIndex` and in some ASCII cases. (`unknown` will hopefully be rare, but people will see it while debugging older binaries.)

</bikeshed mode>
