History of Swift String.Index design

I'm a PL designer writing up some comparative PL linguistics on the String API design space and was hoping to bother Swifties with some questions about Swift's String.Index, which seems really nice fwiw.
(I'm also working on a language's String design, but nothing that competes head-on with Swift in plausible timelines.)

I noticed "String's ABI and UTF-8", which covers the evolving internal representation and interop with Objective-C and other languages, and I see some pitches about serializability of String indices.

Questions:

Would it be fair to say that one of the design goals of String.Index is a by-construction guarantee that indices fall at grapheme boundaries, with the caveat that an index's behaviour is underdefined when it is used with a string that has a different internal encoding, or a different prefix up to the index, than the string it came from? If not, how would you characterize the goal?

Was this design influenced by other languages?

Were language interop / ABI driving concerns in the initial design?

Do the language maintainers feel that the developer community largely understands and avoids cross-string-value uses of indices?

Do the language maintainers feel that developers largely understand that indices should not be treated as comparable across strings: for i into string a and j into string b, (i < j) does not imply that there are fewer indexable elements in a[..<i] than in b[..<j]? (Assuming that's right)

Are there invariants like that that developers really wish did hold?

On serialization, are string indices often sent across the network, e.g. as part of a JSON payload, or are the serialization concerns mostly related to object persistence?

How essential are string views in parsing? Do you need octets for URL parsing and UTF-16 for JSON parsing, or would parsing infrastructure be able to adapt if only code points or grapheme clusters were available?

cheers


That’s not quite the case. It’s true that the string indices you get when working with the standard String API fall at grapheme boundaries, but there are different string views that all share the same String.Index type.

This matters when working with different views: since the String.Index type is used by String and also by String.UTF8View, String.UnicodeScalarView (and more), you can switch between different string views as you parse a string.

That way, if you parse a string where most of the tokens are in the ASCII range (like most source code), you can use the more efficient String.UTF8View for most of the parsing and switch to String.UnicodeScalarView or String for the tokens that need Unicode properties (checked with, for example, someScalar.properties.isIDContinue).
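A minimal sketch of that view switching, assuming a hypothetical helper named `leadingIdentifierCharacters(in:)` (not a standard library API). It scans ASCII bytes through the UTF-8 view first, then carries the very same index into the unicode scalar view for anything non-ASCII:

```swift
// Hypothetical helper for illustration only.
func isASCIIIdentifierByte(_ byte: UInt8) -> Bool {
    switch byte {
    case UInt8(ascii: "a")...UInt8(ascii: "z"),
         UInt8(ascii: "A")...UInt8(ascii: "Z"),
         UInt8(ascii: "0")...UInt8(ascii: "9"),
         UInt8(ascii: "_"):
        return true
    default:
        return false
    }
}

// Hypothetical helper: returns the leading run of identifier characters.
func leadingIdentifierCharacters(in source: String) -> Substring {
    let utf8 = source.utf8
    var index = utf8.startIndex

    // Fast path: consume ASCII identifier bytes one at a time.
    while index < utf8.endIndex, isASCIIIdentifierByte(utf8[index]) {
        index = utf8.index(after: index)
    }

    // Slow path: reuse the same index in the unicode scalar view. It sits on
    // a scalar boundary because only whole ASCII bytes were consumed so far.
    let scalars = source.unicodeScalars
    while index < scalars.endIndex, scalars[index].properties.isIDContinue {
        index = scalars.index(after: index)
    }

    return source[..<index]
}

let token = leadingIdentifierCharacters(in: "na\u{EF}ve_count = 1")
print(token)
```

A real lexer would also distinguish identifier-start from identifier-continue characters; this sketch only shows one index moving between two views.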


One of the invariants I really wish held is "if the string is ASCII, then you can compute indices via simple arithmetic", but unfortunately CRLF exists. We do have the option of adding a "doesn't contain any CRLFs" flag, but it's not clear at this point that it's a big enough concern to be worth it, so we've cautiously held off for now.
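As a small illustration of why CRLF breaks that arithmetic (the variable names here are made up for the example): "\r\n" is two ASCII bytes but a single Character, so Character offsets and byte offsets drift apart even in a pure-ASCII string.

```swift
let unixText = "ab\ncd"
let windowsText = "ab\r\ncd"

// Both strings are pure ASCII.
let bothASCII = unixText.utf8.allSatisfy { $0 < 0x80 }
    && windowsText.utf8.allSatisfy { $0 < 0x80 }
print(bothASCII)

// Without CRLF, the Character count and the byte count agree ...
print(unixText.count, unixText.utf8.count)
// ... but "\r\n" is two bytes and one Character.
print(windowsText.count, windowsText.utf8.count)

// The Character at offset 3 ("c") therefore does not live at byte offset 3.
let third = windowsText.index(windowsText.startIndex, offsetBy: 3)
let byteOffset = windowsText.utf8.distance(
    from: windowsText.utf8.startIndex, to: third)
print(windowsText[third], byteOffset)
```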


As a user, I wanted to speak specifically to one of these points:

Do the language maintainers feel that the developer community largely understands and avoids cross-string-value uses of indices?

Speaking for myself, it's not always obvious when it is okay to reuse a String.Index across values. For example, if you are writing an "attributed" string library (where attributes are applied to ranges) and you want to write an append() function, you might be surprised to discover that it is not safe to reuse the indices you stored in the left-hand-side String in the new combined string.

In the real-world code that had this problem, I also ran into issues with converting to code units, doing the addition, and converting back, and I ended up replacing the backing data structure with NSAttributedString, which deals in UTF-16 code units in the first place. I'm not sure how the Swift Foundation.AttributedString handles this case, but it might be interesting reading if this particular issue is of interest to you.
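One way the append hazard can show up, as a sketch with made-up strings rather than the code from that project: if the right-hand side starts with a combining scalar, it merges with the last Character of the left-hand side, so a position that was a Character boundary in the left string is no longer one in the combined string, even when it is carried over as a code-unit offset.

```swift
let left = "cafe"
let right = "\u{301}!"        // combining acute accent, then "!"
let combined = left + right   // "e" and U+0301 now form one Character

print(left.count, right.count, combined.count)

// Carry left.endIndex over as a UTF-8 offset and rebuild it in `combined`.
let offset = left.utf8.distance(from: left.utf8.startIndex, to: left.endIndex)
let rebuilt = combined.utf8.index(combined.utf8.startIndex, offsetBy: offset)

// The rebuilt index is a valid scalar boundary, but it falls inside a
// Character, so it no longer names an element of `combined`.
let onScalarBoundary = rebuilt.samePosition(in: combined.unicodeScalars) != nil
let onCharacterBoundary = String.Index(rebuilt, within: combined) != nil
print(onScalarBoundary, onCharacterBoundary)
```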


Thanks. So, since the change that made the views share one index type, a String.Index should align at some kind of boundary, but which kind of boundary depends on the external encoding of the source.

iiuc, views allow for fast path optimizations in parsers. For example, until you know you're parsing an identifier you don't need to deal with full grapheme clusters. (Though when parsing numeric tokens, you might want to ensure that keycap digit one, U+0031 U+FE0F U+20E3, isn't recognized.)
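A quick sketch of the keycap concern, with illustrative names only: the keycap sequence begins with an ordinary ASCII "1" byte, so a byte-level digit scan accepts its first byte, while a Character-level scan sees one non-ASCII unit.

```swift
let keycapOne = "1\u{FE0F}\u{20E3}"   // U+0031 U+FE0F U+20E3, one Character

// A byte-level scan sees an ordinary ASCII "1" first.
let asciiDigits = UInt8(ascii: "0")...UInt8(ascii: "9")
let digitBytes = keycapOne.utf8.prefix(while: { asciiDigits.contains($0) })
print(digitBytes.count)

// A Character-level scan treats the whole keycap as one non-ASCII unit.
let digitCharacters = keycapOne.prefix(while: { $0.isASCII && $0.isNumber })
print(digitCharacters.count)
```

A byte-level number lexer could guard against this by checking that the position after the last digit byte is still a Character boundary before accepting the token.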

you might be surprised to discover that it is not safe to reuse the indices you stored in the left-hand-side String in the new combined string.

Might this come up if you're writing something like commonPrefixOf(String, String) -> Substring ?