SE-0445: Improving String.Index's printed descriptions

xwu · September 16, 2024, 10:32pm

Hello Swift community,

The review of SE-0445: Improving String.Index's printed descriptions begins now and runs through September 30, 2024.

Reviews are an important part of the Swift evolution process. All review feedback should be either on this forum thread or, if you would like to keep your feedback private, directly to the review manager via the forum messaging feature. When contacting the review manager directly, please keep the proposal link at the top of the message.

What goes into a review?

The goal of the review process is to improve the proposal under review through constructive criticism and, eventually, determine the direction of Swift. When writing your review, here are some questions you might want to answer in your review:

What is your evaluation of the proposal?
Is the problem being addressed significant enough to warrant a change to Swift?
Does this proposal fit well with the feel and direction of Swift?
If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?
How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

More information about the Swift evolution process is available at

https://github.com/apple/swift-evolution/blob/main/process.md

Thank you,

Xiaodi Wu
Review Manager

jrose · September 17, 2024, 1:16am

Why CustomStringConvertible and not CustomDebugStringConvertible? Conforming to one and not the other still results in the other being used as a fallback in standard interpolation, String inits, and print, so why not pick the one that emphasizes the debug-ness of the chosen output?

The answer might be “the distinction between the two is rarely important if a type is only going to implement one of them anyway, and so we want to encourage people to prefer CSC over CDSC for consistency”. Or it might be “since only the method can be backwards-deployed, not the conformance, we wanted to use the one with the shorter name”. Or perhaps something else. Any of those are fine! I’d just like to see it mentioned in the proposal.
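The fallback behavior mentioned above can be checked with a small sketch. `Marker` is a hypothetical type invented for this illustration (it is not part of the proposal), and it deliberately conforms only to CustomDebugStringConvertible:

```swift
// Hypothetical type for illustration; it conforms ONLY to
// CustomDebugStringConvertible and has no `description` property.
struct Marker: CustomDebugStringConvertible {
  var offset: Int
  var debugDescription: String { "Marker(\(offset))" }
}

let m = Marker(offset: 15)

// With no CustomStringConvertible conformance, each of these paths
// falls back to debugDescription.
let interpolated = "\(m)"
let described = String(describing: m)
let reflected = String(reflecting: m)
print(m)
```

Swapping the conformance to CustomStringConvertible (and renaming the property to description) should leave all of these results unchanged, which is why the choice between the two protocols is mostly about emphasis.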

dimi · September 17, 2024, 6:33am

Love the addition!

Wanted to share a gut reaction to the printouts: I immediately read the [] in 15[utf8]+1 as a subscript, and it didn’t read at all like a unit (which I found out only after reading the pitch thread). I wonder if utf8[15] + int[1] or utf8[15 + 1] would be more recognizable to most Swift devs who would be seeing them, especially since they would immediately read as indexes of a particular view this way.

wes1 · September 17, 2024, 8:51am

A great quality-of-life improvement in a high-traffic element!

Two comments at the margins that do not affect the acceptance of the proposal...

Add alignment and edgeness suffix? [c s u ^ $]
Disclaim any normative intent for position as "offset from start"?

While the output is technically non-reviewable, it's good that it seems to be battle-tested with past use and future-proofed in covering most everything exposed in the documented String.Index ABI. I'm just a little fuzzy on whether the suffix (optional +1) covers all the ABI-visible cases of character- and scalar-aligned that people might want to see when debugging. It might also help if the endIndex (if not startIndex) were flagged as such. Perhaps an optional 1-character suffix would avoid having to revisit the output?

Readers mostly care when indexes are not C/S-aligned or if they are at the edges. So perhaps c would mean not character-aligned, s not scalar-, and u unaligned, i.e., neither. Both end and start index are character and scalar aligned, so they could get the ^ and $ regexp symbols. The result is a small optional suffix, which wouldn't clutter the vast majority of real-world indexes.

But (you might ask) why bother using ^ for 0?

That's another aside: I noticed the proposal mentions that the position is an offset (implicitly 0-based?) from the start of storage. I see that value in String.Index data now (and see the inlined isZeroPosition in the source) but hadn't noticed this API claim before. As a developer I would love to be able to assume it when I persist (and compress) indexes associated with a UTF-8-backed string.

But for API purposes I imagine future Swift would not want to guarantee offset from start (0-based or otherwise) if it doesn't already. Though ABI requires the value to be represented in those bits, couldn't future Swift still use different values for the same combination of enclosing string and substring (assuming it can set aside discriminator bits)? So I assume the proposal mention of position as offset from start is not intended as "normative" and thus ^ is relevant.

lorentey · September 18, 2024, 11:44pm

This is a good question. I've been meaning to start a discussion on the role of CustomDebugStringConvertible, and this may be a good excuse to start it now.

I find that the name and current documentation of CustomDebugStringConvertible (and its debugDescription property) are harmful and misleading, because they aren't at all reflecting their actual purpose.

From what I can tell, the real purpose of CustomDebugStringConvertible is to serve as a secondary variant of CustomStringConvertible to be used when the use of the default description may interfere with understanding, such as when generating the descriptions of aggregate types or collections.

For example, String has an implementation of description that simply returns self, while its debugDescription is careful to provide a quoted display, with properly escaped contents.

let a = "Truman, Harry S."
print(a)      // ⟹ Truman, Harry S.
debugPrint(a) // ⟹ "Truman, Harry S."

let b = "Dwight D. \"Ike\" Eisenhower"
print(b)      // ⟹ Dwight D. "Ike" Eisenhower
debugPrint(b) // ⟹ "Dwight D. \"Ike\" Eisenhower"

Meanwhile, Array always uses debugDescription to print its elements:

let c = [a, b]
print(c)      // ⟹ ["Truman, Harry S.", "Dwight D. \"Ike\" Eisenhower"]
debugPrint(c) // ⟹ ["Truman, Harry S.", "Dwight D. \"Ike\" Eisenhower"]

This is to prevent confusion; if Array did not use the "suitable for debugging" variants when printing its items, then its description could easily become impossible to understand: for example, the comma in Truman, Harry S. would be indistinguishable from the commas that separate array items:

[Truman, Harry S., Dwight D. "Ike" Eisenhower]
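The same effect can be seen with a user-defined type. In this minimal sketch, `Name` is a hypothetical type made up for illustration; it provides both properties so the difference between standalone printing and printing inside an Array is visible:

```swift
// Hypothetical type: `description` is the plain form, `debugDescription`
// is a quoted form that stays unambiguous inside a collection.
struct Name: CustomStringConvertible, CustomDebugStringConvertible {
  var value: String
  var description: String { value }
  var debugDescription: String { "Name(\(value.debugDescription))" }
}

let names = [Name(value: "Truman, Harry S."), Name(value: "Ford, Gerald R.")]

// Printing a single value uses description.
let single = String(describing: names[0])
// Printing the array embeds each element's debugDescription.
let list = String(describing: names)

print(single) // Truman, Harry S.
print(list)   // [Name("Truman, Harry S."), Name("Ford, Gerald R.")]
```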

So, in my (quite deeply held) view, the entire purpose of debugDescription is to be a secondary variant of description that is expected to be safe to embed into syntactic/structural displays. The documentation should talk about specifically what that means -- it should be talking about the need to avoid punctuation such as "naked" spaces, newlines, commas or colons, and unpaired quotes, brackets, parentheses etc. (It is quite tricky to formally specify what a well-formed debugDescription should be, which I expect partially explains why the documentation doesn't even attempt to hint at this as a requirement.)

Notably, debugDescription is mostly invoked when building collection/aggregate descriptions, where brevity is really important. So debugDescription is not at all the right place to add information that isn't already present in description -- in fact, it may sometimes be better to omit or shorten things. When printing an array of 100 items, we really, really do not need to see some over-detailed presentation of each item, repeated 100 times -- brevity is perhaps even more important in this context than it is for description.

Given all that, my first instinct is to say that a type should only conform to CustomDebugStringConvertible if it already conforms to CustomStringConvertible, but its description isn't suitable for unescaped embedding into syntactic formats. (Such as the case with String.) Therefore, the proposal suggests that String.Index should only conform to CustomStringConvertible.

extension String.Index {
  var description: String { get }
}
@available(SwiftStdlib 6.1, *)
extension String.Index: CustomStringConvertible {}

This has the superficial benefit that when the conformance isn't available, people can simply type foo.description to still produce a sensible printout, saving six keystrokes vs debugDescription. I find description is also a little easier to remember -- it's the first thing I'd try to invoke.

However, it would also be fair to argue that if a type's description happens to be "syntactically well-formed" anyway, then it should formally declare this by only conforming to CustomDebugStringConvertible, not CustomStringConvertible:

extension String.Index {
  var debugDescription: String { get }
}
@available(SwiftStdlib 6.1, *)
extension String.Index: CustomDebugStringConvertible {}

I'm open to either of these variants. As explained above, I do have a small preference for CustomStringConvertible, but I wouldn't mind if we went the other way, either.

(However, I would object to conforming String.Index to both CustomStringConvertible and CustomDebugStringConvertible -- conformances have some cost, and we should not pay for two when one will do.)

Whichever we choose, I expect this proposal will serve as a reference for future API design decisions.

lorentey · September 19, 2024, 1:15am

I did experiment with various ways to indicate that the index is known to be on a Character and/or UnicodeScalar boundary, but the notation never felt intuitive enough. It also doesn't feel like the alignment bits are generally useful enough to show in the description: they really don't seem all that useful for debugging. (It also seems weirdly lopsided to display these, but not the cached Character size that's also stored in the index.)

That said, I think it would be worth adding new API to expose the alignment bits for the handful of cases where someone might be interested in seeing them. (These bits are baked into the String.Index ABI; there is no real reason to hide their value.) However, I don't think it's worth displaying them every time we print an index.

We do not track "edgeness" within the index -- in particular, the end index is indistinguishable from a regular index (in some other string) that happens to have the same offset. The start index of a String value is always at offset 0 (and transcoded offset 0), but String.Index is also used as the index type of Substring, whose startIndex can fall on arbitrary nonzero offsets, too. So it would not be practically possible to specially mark start/end indices as such in their descriptions.
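A small sketch of both points, using only standard String and Substring API. The string values are arbitrary examples, and comparing indices taken from different strings is done here purely to illustrate that an index carries no "end" marker; it is not something to rely on in real code:

```swift
let text = "Hello, world"
let tail = text.dropFirst(7)   // Substring "world"

// The substring's startIndex is an ordinary index into the base string,
// so it sits at a nonzero offset.
let sameIndex = tail.startIndex == text.index(text.startIndex, offsetBy: 7)
let startOffset = text.utf8.distance(from: text.startIndex, to: tail.startIndex)

// The end index of one string compares equal to a same-offset index
// from an unrelated string: nothing in the index says "I am the end".
let short = "abc"
let long = "abcdef"
let looksLikeEnd = short.endIndex == long.index(long.startIndex, offsetBy: 3)

print(sameIndex, startOffset, looksLikeEnd)
```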

Hm. The proposal includes this sentence, in normative context:

String indices represent offsets from the start of the string's underlying storage representation, referencing a particular UTF-8 or UTF-16 code unit, depending on the string's encoding.

This is not meant to be new information; it is merely a statement of a preexisting fact. I would not mind if this became the general view -- after all, this is what string indices are.

String indices are always UTF-8 or UTF-16 offsets from the start of a string (plus an optional transcoded offset, and/or some cached information about the Unicode data around that position). This is hardwired into their ABI, and it would not be practical to change it at this point.
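To make the "offsetness" concrete, here is a minimal sketch that measures the same index in three different units. The sample string is an arbitrary choice; it contains a precomposed é so that the UTF-8 offset differs from the other two:

```swift
let word = "caf\u{E9}!"   // "café!" with a precomposed é (2 bytes in UTF-8)
let bang = word.firstIndex(of: "!")!

// The same position, counted in Characters, UTF-8 code units,
// and UTF-16 code units.
let characterOffset = word.distance(from: word.startIndex, to: bang)
let utf8Offset = word.utf8.distance(from: word.startIndex, to: bang)
let utf16Offset = word.utf16.distance(from: word.startIndex, to: bang)

print(characterOffset, utf8Offset, utf16Offset) // 4 5 4
```

For a native Swift string (UTF-8 storage), the UTF-8 figure is the one that corresponds to the storage offset described above.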

A huge reason I'm proposing this conformance is in fact to clearly bring this to the attention of developers -- it is quite difficult to understand how String works without internalizing the nature of its indices, and emphasizing their "offsetness" every time we print them is a really good way to encourage that.

Examples of past proposals that explicitly state the same include SE-0180 and SE-0241. I included some analysis below.

Historical examples of String.Index being treated as an offset

In the context of Swift Evolution proposals, this was quite explicitly the case as far back as SE-0180. That proposal introduced the unified String.Index type, and added the (doomed) encodedOffset initializer and property:

public extension String.Index {
  /// Creates a position corresponding to the given offset in a
  /// `String`'s underlying (UTF-16) code units.
  init(encodedOffset: Int)

  /// The position of this index expressed as an offset from the
  /// beginning of the `String`'s underlying (UTF-16) code units.
  var encodedOffset: Int
}

This pair of APIs was introduced as a way to enable serialization/deserialization of string indices. The init(encodedOffset:) initializer assumes (and normatively requires) that a String.Index is essentially just an offset value.

This API pair (incorrectly) assumed that String values would always be stored in UTF-16 encoding, so once we switched over to using UTF-8, it had to be deprecated, which was done in SE-0241. While that proposal deprecated the API, it too reinforced the notion that indices are storage offsets -- it simply admitted that the encoding is variable.
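As a rough sketch of the serialization use case with the encoding made explicit, the following round-trips an index through the utf16Offset(in:) method and init(utf16Offset:in:) initializer that SE-0241 added to the standard library. The sample string and variable names are arbitrary:

```swift
let message = "na\u{EF}ve text"   // "naïve text"
let space = message.firstIndex(of: " ")!

// Serialize the position as a UTF-16 offset, then rebuild an index from it.
let saved = space.utf16Offset(in: message)
let restored = String.Index(utf16Offset: saved, in: message)

print(saved)                  // 5
print(restored == space)      // true
```

Unlike encodedOffset, this pair names the unit of the offset and takes the string as a parameter, so the result does not depend on how the string happens to be stored.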

benrimmington · September 20, 2024, 11:00pm

CustomStringConvertible is associated with LosslessStringConvertible, and doesn't evoke a debug-only representation.

If String.Index could use the @DebugDescription macro, then would CustomDebugStringConvertible be more appropriate?

A third option is TextOutputStreamable, which only the String, Substring, Character, Unicode.Scalar, and Float{16,32,64,80} types conform to.

lorentey · September 23, 2024, 10:47pm

I do not think there is such a thing as a "debug-only" text representation within the API surface of the Swift stdlib.

I do believe that conforming to CustomStringConvertible is the obvious way for Swift types to provide a custom text representation of their instances -- that is what this protocol is for.

Its relation to LosslessStringConvertible is a second-order matter; but the illustrative description strings in the proposal can indeed be used as the input format for a LosslessStringConvertible conformance for String.Index, if we wanted to add one. (However, I am not planning to add such a conformance; I think it would be a bad idea.)

The @DebugDescription macro does not seem relevant to this proposal -- the macro did not add anything to clarify the semantic distinction between description and debugDescription, and it is able to work with either property.

(Aside: My personal opinion of using this macro in systems contexts remains rather poor; I do not expect String.Index (or any stdlib type) to use it. We're far more likely to either provide hand-written LLDB summary strings, or to continue to maintain data formatters directly in LLDB, like the preexisting String.Index formatter.)

Conforming String.Index to a niche/exotic protocol like TextOutputStreamable would be a non-starter.

(Aside: However, note that TextOutputStreamable does provide a very useful model for streaming descriptions: it allows values to get printed without materializing their full representation all at once in a single String instance. This is exactly what we need in environments where heap allocations do not exist, or are overly expensive. Therefore, there is a good chance that TextOutputStreamable will eventually become the (conceptual) basis for a lower-level CustomStringConvertible alternative in the future. However, even if that future becomes reality at some point, it will not make us regret adding a description property today.)
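For readers unfamiliar with the streaming model, here is a minimal sketch of a TextOutputStreamable conformance. `Interval` is a hypothetical type invented for this example, and the output format is arbitrary:

```swift
// Hypothetical type for illustration.
struct Interval: TextOutputStreamable {
  var lower: Int
  var upper: Int

  // Writes the description piece by piece into the target stream
  // instead of first assembling the whole text in a single String.
  func write<Target: TextOutputStream>(to target: inout Target) {
    target.write("[")
    target.write(String(lower))
    target.write("..<")
    target.write(String(upper))
    target.write("]")
  }
}

// String is itself a TextOutputStream, so it can serve as the target.
var buffer = ""
Interval(lower: 3, upper: 7).write(to: &buffer)

// print and String(describing:) also go through write(to:).
let described = String(describing: Interval(lower: 3, upper: 7))
print(buffer) // [3..<7]
```

This sketch still creates small temporary strings for the integers; a target that writes straight to a fixed buffer or a file descriptor would be where the streaming model pays off.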

xwu · October 2, 2024, 2:16pm

Thank you all for your feedback—the language steering group has decided to accept the proposal with modifications.