NSString Indices invalidated potentially after it has been bridged to String and manipulated

Zhao_Xin · July 10, 2019, 6:24am

When doing text replacing, like replacing all "dog" in "A dog is a dog in dogs." to "monkey". A common algorithm is:

get the ranges of "dog" in "A dog is a dog in dogs."
reversed the ranges and for each of the range, call String.replaceSubrange(range, with: "monkey").

Above algorithm works fine with String. However, if we use NSString, the code will not work.

That is because NSString is lazily bridged to String and it uses UTF-16, when getting the ranges, the ranges are in UTF-16 and when replacing the first range, the NSString bridged String with UTF-16 is converted String with UTF-8. All ranges in UTF-16 are invalidated potentially.

So above algorithm won't work in NSString if people use UTF-8 characters in Asia.

The issue harms to all Apple current SDKs as UIKit and Cocoa are used NSString internally in TextFields and TextViews.

I don't know if this is an already known tradeoff or should be consider as a bug.

Below is a sample playground code.

import Foundation

var s:String = "A dog is a dog in dogs."
var ns:NSString = "A dog is a dog in dogs."
var s1:String = "A dog is a dog in dogs. 字 A dog is a dog in dogs."
var ns1:NSString = "A dog is a dog in dogs. 字 A dog is a dog in dogs."

func ranges(_ content:String, of source:String) -> [Range<String.Index>] {
    var ranges = [Range<String.Index>]()
    var startIndex = content.startIndex
    let endIndex = content.endIndex
    
    while let aRange = content.range(of: source, range: startIndex..<endIndex) {
        ranges.append(aRange)
        startIndex = aRange.upperBound
    }
    
    return ranges
}

func replace(_ content: String, in ranges:[Range<String.Index>], to target:String) -> String {
    var source = content
    ranges.reversed().forEach {
        source.replaceSubrange($0, with: target)
    }
    
    return source
}

let sRanges = ranges(s, of: "dog")
s = replace(s, in: sRanges, to: "monkey")

let nsRanges = ranges(ns as String, of: "dog")
ns = replace(ns as String, in: nsRanges, to: "monkey") as NSString

let s1Ranges = ranges(s1, of: "dog")
s1 = replace(s1, in: s1Ranges, to: "monkey")

let ns1Ranges = ranges(ns1 as String, of: "dog")
ns1 = replace(ns1 as String, in: ns1Ranges, to: "monkey") as NSString

Martin · July 10, 2019, 7:10am

It is documented for String.replaceSubRange()

Calling this method invalidates any existing indices for use with this string.

and for the protocol method RangeReplaceableCollection.replaceSubRange():

Calling this method may invalidate any existing indices for use with this collection.

bzamayo · July 10, 2019, 7:16am

I’m not sure if your text replacement algorithm was merely for exposition, or if that is your shipping code. If the latter, consider deleting it all and using replacingOccurrences... instead.

Zhao_Xin · July 10, 2019, 8:31am

The code is just for demonstration.

Zhao_Xin · July 10, 2019, 8:43am

I know that in documentation they did say. But I don't believe the document says what I demoed.

In reality, you will find that String.replaceSubRange() is never called, what always called isRangeReplaceableCollection.replaceSubRange():

And "Calling this method may invalidate any existing indices for use with this collection.", for me, it means that the range after the replacement may invalidate. That is very obvious and consistent with other language.

Also, "Calling this method invalidates any existing indices for use with this string.", this is untrue as I demoed. It is also counterintuitive for a developer with other language experiences.

lukasa · July 10, 2019, 9:18am

It is not untrue. The fact that it works in this case does not change the fact that the documentation makes it clear that the indices are invalid. They may still function, but they are not required to do what you think they will.

This is true of all range-replaceable collections: reusing their index after you call replaceSubRange is always invalid, and what happens is not defined. In this case, sometimes, it appears to do what you want. That does not change the fact that your algorithm is wrong. Even a stopped clock is right twice a day: that doesn't mean it's not fundamentally always wrong.

That's as may be, but it is a fundamental property of Swift collection indices. That will not be changing any time soon.

Zhao_Xin · July 10, 2019, 10:51am

Since you have already agree that this is counterintuitive. Here is a quick question: Why "all range-replaceable collections" are design this way? If there is no gains more than pains, why designing like that?

lukasa · July 10, 2019, 11:01am

Because in all languages Collections implicitly behave this way: they just don't normally write it down in such stark terms.

Consider the following Python program:

my_array = [1, 2, 3, 4, 5]
five_index = my_array.index(5)
my_array[2:3] = [5, 6, 7]
print(my_array[five_index] == 5)

This program prints false.

This is what it means to say that "indices are invalidated". The moment you mutated the array, the old indexes to the array may no longer point to the same objects they previously did.

In this sense, Swift is simply more strict about the rules on index reuse. The rules are most clearly laid out in the standard library index invalidation document, but in this case the rule is simple: anything that rearranges elements in a collection invalidates all previous indices. The invalid indices may continue to actually work, but it doesn't change the fact that in an arbitrary program you should behave as though the indices are all broken.

The sense in which I agree that it is counter intuitive is only that other languages have mostly not formalised this rule. Almost all other languages have the rule: they just exit into the realm of weird, unspecified behaviour when it's violated.

Zhao_Xin · July 10, 2019, 11:26pm

I am not a python guy. But I revised you sample and the result shows what you said is the not opposite of my opinion. Here is my revised sample.

my_array = [1, 2, 3, 4, 5]
one_index = my_array.index(1) // get index before the replace index start position
five_index = my_array.index(5)
my_array[2:3] = [5, 6, 7]
print(my_array[one_index] == 1) // will print true
print(my_array[five_index] == 5)

So even in python, if you replace something from behind, the index before the changing position won't be proved invalidated. I think this is a common agreed rule. And in Swift this rule also applies until a String is bridged from NSString as I demoed. Also I pointed out the reason is not because of the replacing, it is because the hidden delayed/lazy conversion from UTF-16 to UTF-8.

So I think you over extended the explanation of Calling this method may invalidate any existing indices for use with this collection. What you demoed is the same as I already known, that's why my demo reversed the ranges before replacing.

I will redraw my conclusion if you could give a demo pure in Swift (only use String or Collections instead of bridged NS classes) and show a proved invalidated index/range causing by something replaced after its position.

jonprescott · July 11, 2019, 12:23am

I think you are missing the use of the word "may" in @lucasa's replies. Everything you say "may" be true today, but, it is not guaranteed to be in the future, since it is not in the language specification as a constraint on the implementation, as indicated in the documentation. That the fact that it works today is, strictly speaking, an "aberration" probably made for expediency, but, no guaranteed to hold in the future. In C/C++, you can perform operations on collections/containers that technically "invalidate" iterators/indices, but, in some implementations, the value of the iterators hold for a time. You however can't count on that behavior to hold up, however, since the specification specifically states that they shouldn't. The fact that they do is an implementation detail that change at a moment.

Zhao_Xin · July 11, 2019, 12:51am

I don't think I am missing the use of the word "may" in @lucasa's replies.

I think my point is just another explanation of "may". Mine is that "may" means that the indices after the lowerBound of the replacing range will invalidate if the replacing target is not the same size with the replacing source. All indices before the range should always keep valid.

I think the above point/rule is true in all languages and also in pure Swift without Objective-C bridging.

I am looking for an opposite example so I could redraw my point.

jrose · July 11, 2019, 1:09am

There's definitely a mystery here, and I'd like to know why the standard library behaves the way it does here, but it doesn't change the fact that you must follow the docs. It's not okay to assume that indexes before the replacement point won't change if the docs don't make that promise. You can certainly request that that be true (in a bug report or a library evolution pitch), but (for example) if String.Index was implemented as a pointer into the buffer of characters, it would not be the case.

An Index in Swift is not an offset.

EDIT: We've had similar discussions about whether it's okay to assume that the sort method performs a stable sort, because it does today. That assumption isn't safe, because it's perfectly okay, according to the docs, to change it to be non-stable in next year's release of Swift.

lukasa · July 11, 2019, 8:33am

That is fundamentally in direct opposition to the language you quoted. Let me add some emphasis on the key words: "Calling this method may invalidate any existing indices for use with this collection.". I appreciate we're down into the nitty-gritty of native language speaker territory here, but to me this leads to the following definition of the two highlighted words:

"may": An implementation can choose to do this, or not. No guarantees are made. Notably, this is the opposite of a requirement that the indices must not be invalidated: it is an explicit note that an implementation is welcome to invalidate the indices if it chooses to.
"any": There are no constraints as to which indices this sentence covers. That is, this sentence does not only allow an implementation to invalidate the indices after the mutated index: it is free to invalidate any index it wishes to.

This meets the requirements as laid out by both @jrose and @jonprescott. Specifically, while it is possible that the behaviour of the index of a given type is more forgiving than the documentation allows, if you want to program safely in Swift you should assume that indices behave exactly the way the documentation says. Otherwise, weird stuff will happen!

The key insight is what @jrose stated:

This is not a hypothetical case. Consider SwiftNIO's CircularBuffer. This data type has an opaque Index type, which fundamentally stores the Index into the backing Array storage. If you replaceSubrange on us, we may have to resize the underlying Array. This involves us copying elements around to ensure that elements are still contiguous. The act of copying those elements will change their index in the underlying array, which means an old indices may no longer be valid. This affects both indices before and after the mutation point.

Swift's permissive rules around Collection Index types allow us to implement the buffer this way, rather than by using offset-based indexing. If you are not careful in your program and you ever get fed a NIO CircularBuffer instead of an Array, but you rely on Array's semantics for indices, you will likely crash your program.

I understand why you construct things this way, but fundamentally Swift disagrees with almost all other languages because as @jrose stated above, Indexes in Swift are not offsets. This differs from every language that allows the program you wrote above, as their indexes fundamentally are offsets. This distinction is really critical in understanding why Swift behaves the way it does, as well as why it allows this invalidation behaviour.

(Incidentally, this can also be observed in Set and Dictionary if you remove elements from them.)

Michael_Ilseman · July 11, 2019, 5:02pm

As others have mentioned, the existence of a counter example today is not required, merely the potential for there to be a counter example in the future. That is the whole point of such API guarantees and contracts in documentation.

But if you must have a counter example that doesn't require bridging, here is one:

var str = "abcde"
let lastIdx = str.index(before: str.endIndex)

print(str[lastIdx] == "e") // true
print(str.count) // 5

str += "\u{301}"
let newLastIdx = str.index(before: str.endIndex)
print(str.count) // 5
print(lastIdx == newLastIdx) // true
print(str[lastIdx] == str[newLastIdx]) // false

print(str[newLastIdx] == "e\u{301}") // true
print(str[lastIdx] == "e\u{301}") // false

But this is just due to implementation details. Using lastIdx in this way may even cause a trap in the future, which is a commonly requested feature.

bobergj · July 12, 2019, 5:06am

The docs for RangeReplaceableCollection insert, removeSubrange, replaceSubrange removeLast, removeFirst etc says "Calling this method may invalidate any existing indices for use with this collection.".

The documentation for RangeReplaceableCollection append, append(contentsOf:) do not mention anything about index invalidation. append(_:) | Apple Developer Documentation
Neither does the String.append documentation.

Is the documentation wrong or could Michaels example above be considered a bug in String append?
My point is, if all operations that do invalidate indices are not explicitly documented, using indices at all feels like pointing a shotgun to your foot (I have myself relied on append not invalidating indices in production code..)

Regarding bridging String/NSString bridging and indices, there is the fact that you can invalidate indices by the mere act of bridging, which IMO is a huge pitfall that should be classified as a bug. Example from Issues · apple/swift-issues · GitHub

let str = "同意"
let range = str.startIndex..<str.index(after: str.startIndex)
print(str[range])    //  "同" OK

class SomeSwiftClass {
    var str: String = ""
}
let someSwiftClass = SomeSwiftClass()
someSwiftClass.str = str
let str2 = someSwiftClass.str
print(str2[range])          //  "同" OK and arguably safe, we have not mutated the string

let someObjcClass = SomeObjcClass()  // Objc class with @property (nonatomic, strong) NSString *str;
someObjcClass.str = str
let str3 = someObjcClass.str
print(str3[range])          //  BOOM  Fatal error: String index range is out of bounds

xwu · July 12, 2019, 10:32am

The documentation for Collection is very clear:

Saved indices may become invalid as a result of mutating operations.

This applies to all mutating operations. It is incorrect to used indices saved prior to a mutating operation after that operation unless the method is specifically documented with guarantees about indices.

bobergj · July 12, 2019, 10:40am

I think your quote was a bit subjective, the full paragraph on that page says:

Saved indices may become invalid as a result of mutating operations. For more information about index invalidation in mutable collections, see the reference for the MutableCollection and RangeReplaceableCollection protocols, as well as for the specific type you’re using.

I do not know how you conclude "This applies to all mutating operations" from this paragraph. It may very well be that you are right, but I think "very clear" is a bit of an exagerration.

Edit: to explain why I think it is not clear: the "Calling this method may invalidate any existing indices for use with this collection." sentence on only some methods in RangeReplaceableCollection gives the impression that only those methods invalidate indices. | am not aware of any method in the whole SDK that specifies "this method will not invalidate indices". For example String.reserveCapacity doesn't (I know this is a straw man, but I think it illustrates my point).

Edit 2: I did some edits since replies below, I understand your parsing of "may", but my wish is not to debate the semantics of this single sentence, what I want to say is: Hey, maybe the documentation on this could be better, and Hey, maybe we need some stronger guarantees, especially with regards to append.

xwu · July 12, 2019, 10:46am

Again, if something is not explicitly guaranteed by a protocol, then it's incorrect to make that assumption. The phrase "may become invalid" is even stronger than that, because it is explicitly calling out (not just implicitly by omission) that you should not make this assumption. If you choose to do that anyway, unexpected stuff will happen.

Lantua · July 12, 2019, 10:51am

Not to put word in @xwu mouth, but I'd reach the same conclusion.

It applies to all mutating operations in that

All mutating operations may invalidate saved indices"

NOT

"All mutation operations will invalidate saved indices".

So any indices not invalidated by mutation is implementation detail, unless noted by protocol/concrete interface and may be subjected to change.

lukasa · July 12, 2019, 12:42pm

In general I think the only mutating operation that the standard library guarantees will not mutate the indices of its data structures is assigning through the Index subscript. I am not confident whether this rule is explicitly part of the protocol though: perhaps it should be?