SE-0241: Explicit Encoded Offsets for String Indices

I’m quite curious, what ABI changes are you referring to?

3 Likes

Having methods to get the encoded offset within a given view raises the specter of silent index rounding. If I take an arbitrary index and want to encode it, how do I know if the UTF-16 or UTF-8 encodings would more accurately represent the index? This is made even more dangerous by the fact that the "obvious" method to get an offset is offset(with: String), which is the least accurate method to use, and in fact will throw a fatal error if the index does not fall on a unicode scalar boundary.

Instead of having methods that produce the encoded offset for an index within a view, I think we should replace encodedOffset with an enum that tells you what view it belongs to. This way you can accurately serialize all indices no matter what the view is, and without accidentally running into fatal errors.

extension String.Index {
    enum Offset {
        case utf8(Int)
        case utf16(Int)
    }
    var encodedOffset: Offset
}

I've only included UTF-8 and UTF-16 as the other views can all be expressed in terms of these two (and the view chosen for encoding would depend on the actual underlying encoding of the string itself).

There is of course value in being able to request the UTF-8 distance for an arbitrary index, but we already have that in the utf8.distance(from:to:) method.

As for creating an index from an offset, we can keep String.Index(encodedOffset:) and have it just take this new String.Index.Offset. Again there's value in being able to construct a String.Index from an arbitrary unicode scalar index, but also again we can do that already with unicodeScalars.index(_:offsetBy:).

1 Like

These offsets are related to Codable right? If I encode a string and an index into JSON, is it safe for me to decode the pair, or am I toast if the JSON file/stream were normalized? Do the offsets need to be in NF‐something‐standard?

JSON does not mandate any particular normalization form. As long as your JSON encoder and decoder don't normalize or denormalize the string, I would expect it to work. I have not tested JSONEncoder or JSONDecoder to see how it behaves, though I would be surprised if these applied any sort of normalization.

String.Index does not know the underlying encoding of the String from which it was derived, so it can't serve up that enum.

It doesn't need to know the underlying encoding. Every String.Index is representable either as a UTF-8 offset or a UTF-16 offset. Given that the native encoding is now UTF-8 I assume that all indices normally correspond to UTF-8 offsets, except ones from the utf16 view that don't lie on a scalar boundary. All we really need to be able to answer is "can I represent this index as a UTF-8 offset, or is it represented as a UTF-16 offset instead?".

No, encodedOffset is for interpretation by the string that vended it. It is otherwise an opaque integer with a partial order against other indices from the same string, and a total order if neither index is from a transcoded view. Lazily bridged Cocoa strings store UTF-16 offsets. Any foreign string concept in the future would likewise store their own offset in there.

String.Index does track a transcoded offset, which is non-zero only when not-scalar-aligned and produced from a transcoded view. Such an index's encoded offset is the earlier scalar boundary.

None of this is fully specified yet, but parts of it are burned into the ABI. We have 8 reserved bits available to us for future use, but we are not going to burn one of them in time for Swift 5.0. Candidates for future use could be a isNative bit which can help catch some issues, or some kind of storage hash to detect invalidation or use on a different string. All of that is speculative for the future.

Ok, that does change things. In that case I'd recommend something like

extension String.Index {
    public enum Offset {
        case utf8(Int)
        case utf16(Int)
    }
    public func offset(within str: String) -> Offset
}

By only providing a single offset(with:) func we avoid accidentally introducing index rounding, or worse, accidentally producing a fatal error by trying to measure a non-boundary index in the wrong view.

As a side note, if I produce an index from a transcoded view, but the index lies on a scalar boundary, I assume encodedOffset still has a total order when compared to an index from a non-transcoded view?

Correct

1 Like

Having said that, you did mention "foreign string concept". I don't know what the plans are (if any) around such foreign strings; is there a desire to support a "foreign string" or string view whose indices cannot be represented in either UTF-8 or UTF-16?

In the existing proposal, presumably such a foreign string would be supported by adding a new offset(within: str.foreignStringView) method. That could still be done with my proposal, the only requirement is that an index taken from that foreign string view would either have to be expressible as a UTF-8 or UTF-16 offset, or we'd have to be ok with offset(within: str) producing a fatal error in that scenario (which, I note, the current proposal already does if you take a UTF-8 non-boundary index and use it with the UTF-16 view, or vice versa).

No current plans. Foreign strings can be added without requiring a new view, they just transcode for both UTF8View and UTF16View. encodedOffset is for their interpretation only, with some ordering guarantees.

1 Like

My last comment is irrelevant. String.Index is not Codable to begin with.

The proposal says:

This gives developers:
...
The ability to migrate their uses for Cocoa index mapping by choosing UTF-16.

But I am worried it doesn't really offer a migration path for all Cocoa index mapping.

One place where I have found encodedOffset, interpreted as a utf16 offset, to be indespensible is in the context of various Text Kit APIs that wend NSRange. There are points where you may need to create an index at a certain offset into a string that doesn't yet exist.

On such example is the NSTextView delegate method:

func textView(_ textView: NSTextView, shouldChangeTextIn affectedCharRange: NSRange, replacementString: String?) -> Bool

Lets say I am using this method to update some text annotations as the user types. Ranges that are after the edited range need to be adjusted to stay put.
Using todays encodedOffset that looks like this (well there is more to it, here simplified for demonstration purposes):


let textAnnotations = TextAnnotaions()

...

func textView(_ textView: NSTextView, shouldChangeTextIn affectedCharRange: NSRange, replacementString: String?) -> Bool {
    guard let replacementString = replacementString else { return }
    let replacementStringLengthUtf16 = replacementString.utf16.count
    let changeInLength = replacementStringLengthUtf16 - affectedCharRange.length
    textAnnotations.processEdit(editedNSRange: affectedCharRange, changeInLength: changeInLength)
}

class TextAnnotations {
   var highlightedRanges: [Range<String.Index>]
   
   func processEdit(editedNSRange: NSRange, changeInLenght delta: Int) {
        let editedRange = Range<String.Index>(uncheckedBounds: (
           String.Index(encodedOffset: editedNSRange.lowerBound),
           String.Index(encodedOffset: editedNSRange.upperBound)
        ))

        // updateEach not shown here is a mutating forEach
        highlightedRanges.array.updateEach({ (range) in
             if range.lowerBound >= editedRange.upperBound {
                let shiftedRange = Range<String.Index>(uncheckedBounds: (
                     String.Index(encodedOffset: range.lowerBound.encodedOffset + delta),
                     String.Index(encodedOffset: range.upperBound.encodedOffset + delta)
                ))
                range = shiftedRange
             } else {
                 // some other code to invalidate ranges that overlapped with edit
             }
        }
   }
}

With the proposal, there is no way to form the indices of the updated ranges, because the UTF16View that the new range points into doesnt yet exist:

String.Index(offset: editedNSRange.lowerBound + delta, within: <hmm what do I put here> )

I am not saying there are not workarounds, but the migration path is not trivial.
Possible workarounds:

  1. Actually form the new string:

    var string = textView.string
    let newString = string.replaceSubrange(editedRange, with: replacementString)
    ... update ranges forming indices into newString ...
    

    But this is not desirable due to efficiency reasons, especially if the string is large, since this is called everytime the user types a character. We are duplicating work of the text storage and also doing several big allocations due to NSString -> String bridging.

  2. Buffer edited ranges until the new string exists (eg. NSTextViewDelegate textDidChange), then process range updates

  3. Use NSRange instead of Range<String.Index> as our range datastructure, and create Range only when needed.
    Ok, but lead to proliferation of NSRange throughout the rest of code

Similar affected APIs are:

NSTextStorageDelgate textStorage(_:willProcessEditing:range:changeInLength:)
UIKit UITextView textView(UITextView, shouldChangeTextIn: NSRange, replacementText: String)

The original proposal sought to solve 3 problems:

  1. SE-0180’s encodedOffset, meant for serialization purposes, needs to be parameterized over the encoding in which the string will be serialized in
  2. Existing uses of encodedOffset need a semantics-preserving off-ramp for Swift 5, which is expressed in terms of UTF-16 offsets
  3. Existing misuses of encodedOffset, which assume all characters are a single UTF-16 code unit, need a semantics-fixing alternative

After discussing with the core team, the hard reality of the Swift 5.0 release is that it is simply too late to be adding a slew of new API. At this point, we can only apply urgent fixes, and only problem #2 qualifies.

However, this review thread has been very valuable and I would like to continue the discussion surrounding how best to address problems #1 and #2, which unfortunately will have to arrive after Swift 5.0.

I spun off another thread for further discussion, and will make a “Public Service Announcement”-like post highlighting the issues and various mitigation strategies for current uses. This proposal is being gutted to a minimal, semantics-preserving off-ramp for current uses of encodedOffset.

3 Likes

In light of that change, I've extended this review by two days (until Tuesday the 5th).

I don't want to assume that previous review feedback still applies to the modified proposal, so if you've already given feedback, I have to ask that you reply again (even if just to say that your previous feedback still stands).

3 Likes

+1.

As it happens, I think I'd prefer this style of naming to the originally-proposed one. Just feels easier to reach for somehow.

I think I only said three things:

  1. Too much misuse. Deprecation needed.

    ✓ Still stands.

  2. The original rearrangement didn’t look all that much different from the existing right way to do things.

    ✗ No longer stands. The newer variant has a new name stating much more more clearly what it does, and it provides additional safety by no longer being a semantically differing overload set.

  3. Why within?

    ✗ No longer stands. The rewritten proposal uses in.

:+1: All three thoughts are now in favour of the new version of the proposal.

1 Like

Now it is a +1 from me. String changes in Swift 5.0 makes this interim fix necessary. I also prefer shorter in: instead of within:.

This is better than the original, but it will still fatalError() when provided with a UTF-8 index that does not lie on a scalar boundary (if the string is backed by UTF-8; if it's lazily bridged from Cocoa it instead just produces an incorrect result).