SE-0241: Explicit Encoded Offsets for String Indices


(Lily Ballard) #41

This is better than the original, but it will still fatalError() when provided with a UTF-8 index that does not lie on a scalar boundary (if the string is backed by UTF-8; if it's lazily bridged from Cocoa it instead just produces an incorrect result).


(Alexis Gallagher) #42

+1

Yes. I was misusing encodedOffset myself just recently, so it was an easy mistake to make for me at least.

Yes, I think so.

I have no experience to report here.

I read the original proposal and spent significant time trying to understand the original issue. I have read all of this thread.


(Daniel Höpfl) #43
  • What is your evaluation of the proposal?

-1

  • Is the problem being addressed significant enough to warrant a change to Swift?

Yes.

  • Does this proposal fit well with the feel and direction of Swift?

Undecided.

  • If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?

I used lots of languages, none of them had an explicit solution for indexing into strings.

  • How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

I took some time to try to understand the problem but I have to admit that I'm not sure if I succeeded. I do understand that having a single Int as index does not work for the four different views that string supports. On the other hand, I do not see how String.Index actually solves that problem.

One of the problems I see is that String.Index does not seem to have any knownledge about the string. Thus there is no defined error behaviour for

let i = String.Index(offset: 1, within: "one")
let x = i.offset(within: "other".utf8)

I do understand that Swift has different views on strings that use different index values for the same position in the string. As shown in the previous point, the proposed solution seems to have problems here. Isn't this the main reason for not using Int as index type? I boiled down this comment to: “Users will mix up the index into different *Views or into the Collection of Characters that String is.”
With a dedicated index type that supports convertion (by retaining the string it comes from) this would indeed not be a problem.
On the other hand, the argument in the comment is not too strong in my opinion as subscript with Int is unquestioned for arrays (where array.sorted() is a “view” on array) .

TL;DR

I think String.Index is too complicated to use when the benefit is unclear.

I would be fine if String.Index was the return type for all offset calculation methods but the views should (also) accept Int as input.

This thread shows that users do not understand why you have to write

let x = "abc"
let c = x[x.index(x.startIndex, offsetBy: 2)]

instead of let c = "abc"[2].


(Lily Ballard) #44

That's a violation of collection index semantics in general. You can't take an index from one collection and use in another in the general case. In the specific case of collections whose index is Int you can do this as long as the index is in-bounds in the new collection, but in the general case, you can only safely use an index in a collection if the index is a member of col.indices, and that's not going to be true for index types other than Int.


(Michael Ilseman) #45

Which proposal are you reviewing? SE-0241 is not proposing String.Index, which already exists. It's deprecating an existing property and method on the type (which was misused in the wild), and replacing it with a more explicit alternative.


(Daniel Höpfl) #46

You are right, that's a violation of collection index semantics but in my opinion the API should be designed in a way that either prevents incorrect usage or has a defined error behaviour.

But, as Michael_Ilseman correctly pointed out: My point if off-topic here.


(Daniel Höpfl) #47

As I understood it, SE-0241 seeks to improve String.Index to minimize misuse of encodedOffset. I think that the proposed change will not fix this problem because people (well: me), will still fail to understand the reason for String.Index in the first place. E.g. I’m not even sure if this will be correct:

let iutf8 = String.Index(offset: 4, in: myString.utf8)
let c = myString[iutf8.offset(in: myString)]

My opinion (which might well be wrong because I, again, admit that I do not get all the hard edges that unicode imposes on Swift strings) is: 1500 people tried to do myString[i..<j] and failed. SE-0241 will not fix this. Making myString[i..<j] behave like String(Array(myString)[i..<j]) will fix it. It will come with its own problems but in most cases it does what people expect it to do. Removing String.Index is the ultimate fix of encodedOffset misuse.


(Michael Ilseman) #48

String.Index(offset:in:) is not in the proposal anymore, since this is now an emergency fix for 5.0. Even with the originally proposed API, this code doesn't make any sense. What are you trying to do?

I agree that indices are the wrong level for the majority of people programming against String, and it's a shame people still have to do it. Personally, I don't think String.Index makes much sense, but it's what we have right now and there are bugs in the wild being exposed in Swift 5.0.

Back to the proposal:

The behavior of encodedOffset shifted in a way that exposes bugs in existing misuses of it, and there is no way to recover the original behavior without access to the string content (even though this is still amortized O(1)). Another way of saying that is that SE-0180 had a bug. This proposal deprecates the encodedOffset pair, and replaces it with a semantics-preserving alternative utf16Offset:in: pair, which explicitly specifies what the user is asking for.

Are you opposed to this change? If so, why?


(Jonas) #49

Review of the updated proposal:

  • What is your evaluation of the proposal?

-1

  • Is the problem being addressed significant enough to warrant a change to Swift?

Yes, there is definitely a need for convenient-to-use functions for converting between String.Index and a utf16 offset.

There is nothing wrong with the implementation of the two methods in the detailed design, but I feel they are declared on the wrong type. In Swift the convention is that the collection type is responsible for constructing indices. These two functions are simply convenience functions built on the existing collection index functions. Therefore, they should be in an extension on String.UTF16View. This avoids polluting the String.Index namespace with a bunch of methods of unrelated encodings.

  • Does this proposal fit well with the feel and direction of Swift?

Sceptical. It is okay-ish only considering Swift 5, but it doesn't make much sense for the future when adding similar functions for utf8 and other encodings.

  • How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

In-depth study of proposal and previous string index proposals.

Alternative proposal

extension String.UTF16View { 
    /// Returns the UTF-16 code unit offset corresponding to the index, as measured from the views startIndex
    public func codeUnitOffset(_ index: String.Index) -> Int {
        return distance(from: startIndex, to: index)
    }
    /// Creates a new index at the specified UTF-16 code unit offset
    ///
    /// - Parameter offset: An offset in UTF-16 code units.
    public func index(codeUnitOffset offset: Int) -> String.Index {
        let (start, end) = (startIndex, endIndex)
        guard offset >= 0,
            let idx = index(start, offsetBy: offset, limitedBy: end)
        else {
            return end.nextEncoded
        }
        return idx
    }
}

Example usage:

let string = "foo"
let index = string.utf16.index(codeUnitOffset: 2)
let codeUnitOffset = string.utf16.codeUnitOffset(index)

(Michael Ilseman) #50

I think it would be much better to extend all Collections with the means to get the offset of a given index, and an index at a given offset. Such as:

and a corresponding offset(of: Index) -> Int method as well. Way out of scope for 5.0 at this point.

Reasons I kept this on String.Index:

  1. It's a off-ramp for misuses of encodedOffset which is already on String.Index.
  2. String.Index is an infrequent namespace; if we had to pollute something that will be superseded in the future, better in an infrequent namespace.
  3. It's name wouldn't conflict with a more general addition to all Collections.

(Lily Ballard) #51

We already have this. Use the pre-existing Collection.distance(from:to:) to calculate an offset from the start index, and Collection.index(_:offsetBy:) to calculate an index from an offset.


(Michael Ilseman) #52

The former still requires a nested use of the collection, which is a little ugly. The latter traps if out of bounds.


(Brent Royal-Gordon) #53

We could provide a couple overloads:

extension Collection {
  public func distance(to i: Index) -> Int {
    return distance(from: startIndex, to: i)
  }
  public func index(offsetBy offset: Int) -> Index {
    return index(startIndex, offsetBy: offset)
    // Could use index(_:offsetBy:limitedBy:) and return Optional,
    // or even allow negative offsets from the endIndex, but those
    // aren’t my preference.
  }
}

(If protocols permitted default arguments, these could be default arguments instead. But they don’t, so they can’t.)


As for offset subscripts, well...

extension StringProtocol {
    public subscript<C>(offset: Int, in view: KeyPath<Self, C>) -> Element
        where C: BidirectionalCollection, C.Index == Index
    {
        let i = self[keyPath: view].index(self[keyPath: view].startIndex, offsetBy: offset)
        return self[i]
    }
    
    public subscript<R, C>(offsetRange: R, in view: KeyPath<Self, C>) -> SubSequence
        where C: BidirectionalCollection, C.Index == Index, R: RangeExpression, R.Bound == Int
    {
        let range = offsetRange.relative(to: 0 ..< .max)
        let start = self[keyPath: view].index(self[keyPath: view].startIndex, offsetBy: range.lowerBound)
        let end = self[keyPath: view].index(start, offsetBy: range.count)
        return self[start..<end]
    }
}

let str = "Thë qûìçk brøwn fôx jūmpęd œvėr thé łãzÿ dōg"
str[10..<25, in: \.self]    // brøwn fôx jūmpę
str[10..<25, in: \.utf8]    // çk brøwn fôx

Modifying this so that something like str.utf8[10..<25, in: \.self] slices the UTF-8 view by character offsets—or, for that matter, str.indices[10..<25, in: \.self] gives you a slice of the string’s indices—is left as an exercise for the reader. (Or me if I get bored late at night again.)


(Daniel Höpfl) #54

The change feels to me like it replaces something broken with something less broken. There is no time to get “the real fix” for Swift 5 and I fear that it will not come afterwards if it is “not that broken anymore” and “already got changed twice”.

So it’s not that I am opposing this change but it’s more that I do not want this change to block the future change I would like to see.


(John McCall) #55

SE-0241 has been accepted; please feel free to discuss the acceptance in that thread.


(John McCall) #56