SE-0241: Explicit Encoded Offsets for String Indices

The review of SE-0241: Explicit Encoded Offsets for String Indices begins now and runs through Sunday, February 3rd, 2019. This is a very late-breaking proposal which we are considering for Swift 5.

Reviews are a vital part of the Swift evolution process. All review feedback should be either on this forum thread or, if you would like to keep your feedback private, directly to me as the review manager via email or direct message on the forums. If you send me email, please put "SE-0241" somewhere in the subject line.

What goes into a review of a proposal?

The goal of the review process is to improve the proposal under review through constructive criticism and, eventually, determine the direction of Swift.

When reviewing a proposal, here are some questions to consider:

  • What is your evaluation of the proposal?
  • Is the problem being addressed significant enough to warrant a change to Swift?
  • Does this proposal fit well with the feel and direction of Swift?
  • If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?
  • How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

As ever, thank you for contributing to Swift.

John McCall
Review Manager

15 Likes
  • What is your evaluation of the proposal?

This seems like a marked improvement over the encodedOffset approach. It is always tempting to sweep String's complexity under the rug, but doing so usually creates correctness problems. With this design, naïve developers will get what they thought they were getting in encodedOffset—an integer count of characters—and more sophisticated users will have a lot more control over what kind of offset they're getting.

One question: Should the new APIs be written in terms of StringProtocol instead of String so that users can use them with substrings?

  • Is the problem being addressed significant enough to warrant a change to Swift?

Yes. The "Uses in the Wild" section is...bad news.

  • Does this proposal fit well with the feel and direction of Swift?

Yes. String is very much designed to require you to be precise about the level of Unicode semantics you want; this proposal supports that.

  • If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

Most languages don't make as serious of an attempt to tackle this problem as Swift does. Frankly, the fact that we only need to redesign one set of APIs to accommodate changing String's internal encoding is incredible.

  • How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

Quick reading.

1 Like

What is your evaluation of the proposal?

Looks like it solves a real problem in an elegant way.

One thing is unclear to me though: while I understand encodedOffset will be deprecated, it isn't clear to me what it'll do. Will it continue giving the UTF-16 offset to avoid breaking existing code?

Is the problem being addressed significant enough to warrant a change to Swift?

There is little use for an encodedOffset that doesn't tell you which view it refers to, so this needs fixing.

Does this proposal fit well with the feel and direction of Swift?

Yes.

If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

String's multi-encoding support seems unpaired, so difficult to compare.

How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

A quick reading.

I use String as Substring and play with that Substring, one has enumerators next() kinda ++i. Or

Wow... 1500 misuses...

Definitely deprecate encodedOffset.


I have not decided yet about the additional methods.

I would probably use them if they were available just so I can write string one fewer times, but I cannot help but notice how little difference the additions actually make:

let i = string.index(string.startIndex, offsetBy: x)

let i = String.Index(offset: x, within: string)

If developers failed to discover the first before, will they really find the second any more easily now? The deprecation message will help those who are already misusing encodedOffset, but are there other low‐hanging misusables new users might stumble on before finding the real solution?

I also wonder whether the additions obscure the walking of the string which needs to take place. Does the old method communicate that better?

If it were my own library, I would just add them, but discussions about additions to the Standard Library often end with “That is trivially composable.” or “That would obscure the performance penalties.” Both may apply here. I guess that is what we are weighing when it comes to the additions.


Has the label “within” become the new preference? It used to be that “in” was preferred. It was even the main rationale for SE‐0001.

2 Likes

I also want to commend @Michael_Ilseman for getting this ready for review so fast. I noticed the bug report that started it is only four days old.

6 Likes

This makes me think it'd be better if we move init functions inside String, eg.

stringValue.index(offsetFromStart: x)
stringValue.utf8.index(offsetFromStart: x)
...

Especially if they're using these offsets w.r.t particular encoding, likely they'll have that view lying around.

Generally speaking, are there other bad practices/nasties we don't know about yet that ABI stability will solidify?

This seems like an important fix! I'm surprised at the behavior for out of bounds offsets — was there a consideration of making the initializer failable instead of returning the endIndex for any non-viable offset, or trapping for at least negative input?

4 Likes

I also wondered about that. But if a negative offset is intentionally accepted then I would consider startIndex a more natural return value than endIndex.

While the offset methods seem reasonable to me, I think @SDGGiesbrecht’s point here is quite valid that these seem rather similar and duplicative. If it’s burdensome to write out string.startIndex, it seems sensible to have an overload that omits the parameter (i.e., String.index(offsetBy:)) rather than creating a very similar interface String.Index.init(offset:within:).

1 Like

I wouldn't be opposed to this:

string.index(ofOffset: x)
string.utf8.index(ofOffset: x)
string.utf16.index(ofOffset: x)
string.unicodeScalars.index(ofOffset: x)

This seems like a good name for collections you can query by offset. There's a precedent in the BTree library too, although the difference is that BTree's index(ofOffset:) is O(n log n), whereas I assume for string we have O(offset), or O(1) if you pick the same view as the internal encoding. Complexity should be documented in the proposal.

[Edit: Changed from index(atOffset:) to index(ofOffset:) to match BTree.]

StringProtocol's views are not their own protocols, but constrained Bidi collections:

  associatedtype UTF16View : BidirectionalCollection
  where UTF16View.Element == UInt16, // Unicode.UTF16.CodeUnit
        UTF16View.Index == Index

We could make these APIs generic with the same set of constraints. Alternatively, we could make concrete overloads for Substring.XView.

I view this SE as more of a solution for transitioning to Swift 5 than the final desired approach (more on that below), so I was erring on the side of being concrete and as close to the actual use as possible. But these alternatives are also totally reasonable. What do others think?

If this is the worst fallout from a pretty fundamental shift, then I'm very happy.

encodedOffset will be an offset into whatever encoding the String happens to be encoded with, which is hidden from the developer. It cannot give a specific encoding's offset without access to the contents of the String.

As a transitional solution (more on a potential final solution below), this is a blessing. String.Index is a relatively infrequent namespace that the current problematic code is using. By adding to that namespace, we give the most direct migration path to users without throwing in a lot of API bloat to a higher-visibility namespace.

If a better solution comes, we can choose to deprecate these and it might even be something simple enough that the migrator can update for the developer.

That's interesting. I was following precedent from SE-0180. Either are fine with me, whichever the community sees as most consistent. NSRange's initializer from Range<String.Index> uses in.

The goal is to provide solutions matching the semantics of the old code as close as possible. Since those initializers were not failable, we chose to make these not fail either. Trapping would also be a semantic difference, as the old initializer wouldn't trap on construction, but only when you actually try to use the result.

If an out-of-bounds offset is given, we need to return some index that will trap on access, so we could choose either between endIndex (the canonical out-of-bounds index), or some index beyond endIndex. I could see arguments for either approach, but an API providing beyond-end indices isn't really precedented in Swift AFAICT.

Old code would of produced an index that traps on access, as any occurrence of negative encoded offsets represents a more serious bug and we need to trap for safety. Returning startIndex would produce a valid non-trapping index, which is semantically different than current usage.

I think this is what the final solution may look like:

extension [Bidirectional/Ordered?]Collection {
  func index(atOffset offset: Int) -> Index? {
    if offset < 0 {
      // Non-Bidirectional traps instead
      return index(endIndex, offsetBy: offset, limitedBy: startIndex) 
    }
    return index(startIndex, offsetBy: offset, limitedBy: endIndex)
  }
  subscript(offset offset: Int) -> Element? {
    guard let idx = index(atOffset: offset) else { return nil }
    return self[idx]
  }
  subscript(offset range: Range<Int>) -> SubSequence {
  subscript(offset range: Range<Int>) -> SubSequence {
    let lower = index(atOffset: range.lowerBound) ?? endIndex
    let upper = index(atOffset: range.upperBound) ?? endIndex
    guard lower <= upper else { return self[endIndex..<endIndex] }
    return self[lower..<upper]
  }
  ...
}

There's the design decision regarding whether we permit negative offsets for Bidi collections, why some things are failable, corner cases for the slicing subscript, etc.

This would give us offset-based subscripting for String and its views. This would also provide a solution to those who want a failable subscript on Array, the ability to use Int indices more conveniently on Slice<Array> based on the slice's start value, and a way to safely use Data (which may be self-sliced).

This is substantially more API that we cannot push through in time for Swift 5, but should begin the process soon after.

6 Likes

I totally get that — it feels like we're only solving one of two problems with this change, then.

If we were to just take whatever offset they gave us and wrap it in an index, it would act a bit like the Range(uncheckedBounds:) initializer, which gives you back an invalid range if you pass incorrect input. We could have the label uncheckedOffset as a hint that it's up to the user to check their own input. endIndex traps on access, but is valid in other respects (as a boundary, for example), so I'm not as clear that it's the right return value in all cases. (I don't have examples of poor usage to back this up, just thinking defensively.)

+1 to the offset methods. Speaking as someone who was (before the utf-8 transition) using encodedOffset wrong, these methods were exactly what I wanted all along. Something that would be really helpful would be documentation of the performance of these APIs. Can we document that the UTF-8 offset is O(1) and UTF-16 is amortized O(1) for native Swift Strings? The distance method only says that it's O(k) for a bidirectional collection, but the implementation for String is faster than that most of the time.

-0 for the String.Index inits. The bounds behaviour feels weird to me. With the index method, at least you have the argument limitedBy: to tell you what's going on.

2 Likes

This is a good point. My understanding is that this would be O(k) in the general case, but O(1) for the view that matches the string's storage, which is opaque to the user, and also O(1) for certain kinds of strings, regardless of storage or view (ASCII only? others?).

Generally speaking, are there other bad practices/nasties we don't know about yet that ABI stability will solidify?

Almost certainly, yes. That's life with a stable ABI.

8 Likes

looks good!

definitely. for the longest time i thought encodedOffset was a UTF-8 offset since the docs just say “The offset into a string’s code units for this index”. (what kind of code units? i hear the word used most often when talking about UTF-8 code units, not UTF-16 code units)

A lot of the String.Index docs seems out of date at the least too, and a few of them just don’t make any sense at all, for example:

If the index passed as sourcePosition represents either the start of a Unicode scalar value or the position of a UTF-16 trailing surrogate, then the initializer succeeds. If sourcePosition does not have an exact corresponding position in target , then the result is nil . For example, an attempt to convert the position of a UTF-8 continuation byte results in nil .

How can you create a String.Index pointing to a UTF-8 continuation byte in the current system, if the current encodedOffsets are UTF-16 offsets?

Yes. anything promoting UTF-8 over UTF-16, or at least putting them on equal footing in Swift should be welcomed

C mostly just uses UTF-8 and passes around UTF-8 offsets, which corresponds nicely with the proposed API. I wish the performance would be better documented/elaborated on, at first glance, the implementations look like O(n), and i thought the new UTF-8 String ABI was supposed to make these O(1) operations.

Quick reading

1 Like

But there is no out-of-bounds return value that is semantically equivalent, as the string could be in a different encoding. Taking the provided offset as-is might be out-of-bounds in one encoding, but not another. We need to give some out of bounds index, so we have a few options.

  1. We give the canonical endIndex, which has the potential downside of changing == comparisons, though it preserves the behavior of <=.
  2. We give endIndex plus some arbitrary offset beyond that, e.g. +1, so it won't compare equal to endIndex.
  3. We give endIndex plus either a heuristic or an exact calculation based off of the difference between the provided offset in the provided view and the actual encoding used, minimum 1.

Either way, the details are behind a resilient function call and unspecified, which I think is better as it would allow us to choose something less problematic in the future.

We can, but I don't think we've firmly established what "native" Swift string means or how it might change if we introduce new backing representations.

Other than not adding those initializers, what other behavior could we do that's semantics-preserving?

Currently, this will be amortized O(1) for UTF-16 for all strings. For UTF-8, it's O(1) for native Swift strings and Cocoa strings that answer CFStringGetCStringPtr, otherwise O(n). But all of this may (actually, will) be changing over time and this isn't something we need to nail down right now.

Ha! That’s funny. It has init(_:within:), but samePosition(in:).