String indices validity

It's 6pm on friday in my place, so it's runtime crashes on production time. Yeeey!

The issue is following:

class Label: UILabel {
  ...
  public var stylableText: String? {
    // get { super.attributedText?.string.cString(using: .utf8).flatMap(String.init(cString:)) }
    get { super.attributedText?.string }
    set { ... }
    }
}

let string = "Hello world!"
let range = string.range(of: "world!")!
let label = Label()
label.stylableText = string

If I apply the range at the string, everything works as expected. But when I apply range at label.stylableText, the app crashes on iOS older than 16.

The commented-out "hack" on stylableText.getter fixes the issue on all iOS versions that I have tested.

My assumption is, that the app crashes because the String instace from label.stylableText stores the text differently than on newer iOS versions.

But is it my fault to assume, that the indices would work? If I am wrong, what is the recommended way to express indices across different instances of String?

Indices are only valid for the string that you're indexing. The recommendation is that you must not try to use a range of indices from one string to subscript another. Swift reserves the option in future versions to use reserved bits in the memory storage of String.Index to "remember" the original string an index was obtained from so that trying to do this always crashes. You'll need to rework your code to accomplish your task in another way if you want to avoid having to deal with this over and over again in the future; consider, for example, if you want to store your string as a canonicalized array of UTF-8 code units and operating on the UTF-8 offset.

2 Likes

Thanks for the reply.

So if I understand it correctly, there is no guarantee, that an index would be valid even for copies (or 1-1 copies) of the original string?

I understand, that in my case there is a different underlying buffer. But after reading your post, I’m curious what are the precise rules for using indices.

It looks to me, like I need to essentially create an extension, that would allow me to reference substrings via Range<Int> !

But it seems so wrong.

I believe the "1:1 copy" can (at least conceptually, but perhaps in practice as well) change the underlying string representation (e.g. change it from being stored as utf16 to utf8) – so long as the copy == original – swift considers it'd fulfilled its obligations, yet the index created from the original will not work with the copy.

You didn't show the rest of the code that actually uses the range, if you do maybe we'll suggest something else.

how does one obtain a canonicalized array of UTF-8 data?

1 Like

There is nothing secret about the class, I just think there was a lot of unnecessary code, that would make my point just more obscure. (I carry it from project to project, here is a gist.)

The problem is with handling of the additionalAttributes variable, which allows me to decorate substrings of the stylableText.

The problem really boils down to:

which I anticipated once I've isolated the issue. (Or to be more precise, I thought that the issue is, that the indices are not valid, because the buffer comes from bridged NSString, thus the tag bridging.)

At any rate, this

really bothers me. I would not be surprised, that equal Substring and String would have different indices.

What bothers me is the fact we're talking about equal instances of String, and further that some platforms reject my code, while others do not. I have been working with String quite a lot recently so I felt pretty confident in my understanding of the validity scope of indices. But this thing really surprised me and I think maybe the documentation should be more clear about the undefined nature of what appears as indices of two identical instances of String.

two strings who are == may still not have the same underlying representation, either the encoding could be different (utf8 vs utf16) or they share the same encoding but both are not in the canonical representation that is used to compare two strings.

1 Like

Well...

2 Likes

yep…

1 Like

I was wondering about that unfinished fragment:

that's not fully shown (neither in code nor in pseudocode), e.g. how you apply the range. The question is: can you not reorder it to take the range from the proper string?

let string = "Hello world!"
let label = Label()
label.stylableText = string
let range = label.stylableText.range(of: "world!")!

Do you have two strings and apply range to both? or do you want to calculate the range once (e.g. because it is expensive?). Just wanted to understand the higher level picture of what you are doing here.

FWIW, there are a bunch of string usability additions we would like to propose soonish, and these normalization APIs are something we'd like to propose as part of this. cc: @lorentey

5 Likes

The general rule, as Xiaodi described, is that Index values aren't exchangeable between distinct collection values, even if the collections compare equal.

In the specific case of String, indices are (effectively) offsets into the string's storage, and these offset values depend on the string's encoding -- a string value that is encoded in UTF-8 will have very different indices to a string value that is encoded in UTF-16. (UTF-16 String values can occur when NSString instances get bridged from Objective-C.)

The thing that causes your problem here is that label.stylableText isn't at all an identical copy of the original string -- NSAttributedString is (intentionally!) not preserving the identity of the string passed to it, so what you get back from NSAttributedString.string is a brand new string value. In this case, this new value is backed by a bridged Cocoa NSString instance. As such, its storage is encoded in UTF-16, and indices in your original strings are no longer valid in it.

(Note: The precise details of exactly what [NS]AttributedString is doing to the original string value are subject to change in any OS release. Do not assume anything other than that the result string has the same contents.)

Prior to iOS 16, the Swift Stdlib did not keep track of the encoding associated with the offset in String.Index. So when code applied an (invalid) UTF-8 index to a UTF-16 string, String could not precisely diagnose this violation. UTF-8 offsets tend to be higher than the equivalent UTF-16 offsets, so as you are seeing, such errors tend to lead to out of bounds accesses, nonsensical results, or (in some cases) even full-blown undefined behavior.

In Swift 5.7 (that shipped in iOS 16), the Stdlib started marking each String.Index value with its expected encoding, to better diagnose such problems. Unfortunately, this uncovered a large volume of existing code that includes these issues. So instead of trapping, String is now bending over backwards to allow these to continue working, this time without nonsensical behavior: in iOS 16+, it transparently converts the offset to the right encoding when necessary. This can be extraordinarily expensive, as this requires transcoding the string every single time such an index is used.

Note: this change in Swift 5.7 did not make such indices valid; the use of such indices continue to be programmer errors. (We just haven't figured out how to properly report the issue.)

11 Likes

The fact that passing a Swift string to NSAttributedString isn't preserving its identity is sadly not clear when reading code that uses this class. This is an unfortunate consequence of Swift's implicit bridging and the way NSAttributedString works.

(NSAttributedString makes a copy of the string passed to it, so that it avoids it getting mutated behind its back. Copying a Swift String into a Cocoa NSString instance involves transcoding it from UTF-8 to UTF-16, as that is the native NSString encoding. The Swift-native AttributedString does a similar thing, but in reverse -- because it relies on quick access to UTF-8 code units. Luckily, this newer type does not provide a string property, so it avoids similar issues.)

2 Likes

The following remembers original plain text string for as long as possible:

class Label: UILabel {
    ...
    private var plainText: String? // cached value

    var stylableText: String? { // can return nil !
        get { plainText }
        set {
            plainText = newValue
            super.attributedText = ... // based on new plainText
        }
    }
    var attributesText: NSAttributedString {
        didSet {
            plainText = nil
        }
    }
}

Typing in a browser so there are typos.

The high-level idea is to NOT convert attributedText back to string (for the purposes of whatever OP's code is trying to do when applying the range created from the original plain text string to the new string).

Interesting. I guess it won't be able to catch the other source of indices disparity: individual characters could be stored in either decomposed or precomposed form (and in principle this could vary on a per character basis now, although we could probably prohibit such crazy scenarios going forward).

Oops. What was the behaviour of the code that used wrong indices before this change? I guess it was getting wrong results and/or crashing. Instead of "bending over backwards to allow these to continue working" (better than it worked originally!) maybe continue following the old behaviour of getting wrong results and/or crashing, along with reporting the incident to the console and also trapping when some diagnostic variable is set?

3 Likes

That is good advice.

That is correct. Strings that consist of the same Characters, but encode them with different scalars do not have interchangeable indices. Attempts at such misuse will not be detected by the encoding indicator bits in 5.7. (String will not currently help diagnose this problem, beyond catching attempts at trying to access data outside the bounds of the string.)

The point of fixing bugs in the Standard Library is to, uh, actually fix those bugs. :wink:

The precise issue that kept poor @stuchlej in the office used to be quite a bit more common before 5.7 got released, to the point that making it work anyway became the least bad option. I expect the frequency of this ruining someone's day on an earlier system will continue to diminish as time goes on, until it fades away completely. Bug fixes sometimes need time to roll out.

Reporting a runtime warning (i.e. not just printing to the console) is already a FIXME that I'd like to get around to resolving. PRs are welcome!

1 Like

And it eventually would (even without the acrobatics you mentioned above!)... just not immediately. For starters runtime would crash less right away (when it sees the wrong index being used it could early return something without crashing straight away (along with screaming to the console and trapping when diagnostic enabled).

As Xcode user I'd prefer those traps to be conditional when upgrading Xcode version. It can put something like this to the console during trap:

The new runtime version now checks <...> it didn't check previously. This new diagnostic option is turned on by default, however you can turn it off (by doing <...>) if it's Friday night and you want to go home early.

Swift's philosophy is and has always been that it is safer to crash early rather than continue on in an invalid state; when the user violates a precondition, the standard library never knowingly doubles down on it—either it is recoverable or execution halts.

6 Likes

The mentioned "bending over backwards" workaround would "help" string index related bugs hiding undercover, unfixed :slightly_frowning_face:

I suggest we have this new diagnostic option, enabled by default in some new Xcode version:

2 Likes

Thank you all for an enlightening discussion. <3

@tera Since I haven't found any solution (in the standard library) for my issue, I have decided to work do a around by creating a type called CharacterRange which is "sort of" similar to NSRange:

CharacterRange
public struct CharacterRange {

    var buffer: Range<Int>

    var nsRange: NSRange {
        NSRange(location: buffer.startIndex, length: buffer.count)
    }

    init(_ buffer: Range<Int>) {
        self.buffer = buffer
    }

    init(_ range: NSRange) {
        self.buffer = range.lowerBound..<range.upperBound
    }

    init(range: Range<String.Index>, in string: String) {
        guard
            string.indices.contains(range.lowerBound) || string.endIndex == range.lowerBound,
            string.indices.contains(range.upperBound) || string.endIndex == range.upperBound
        else {
            assertionFailure()
            self.buffer = 0..<0
            return
        }

        let lowerBound = string[..<range.lowerBound].count
        let upperBound = lowerBound + string[range].count
        self.buffer = lowerBound..<upperBound
    }

    func indices(in string: String) -> Range<String.Index>? {
        guard
            let start = string.index(string.startIndex, offsetBy: buffer.lowerBound, limitedBy: string.endIndex),
            let end = string.index(string.startIndex, offsetBy: buffer.upperBound, limitedBy: string.endIndex),
            start <= end
        else {
            return nil
        }

        return start..<end
    }
}

Since it is invariant, that indices provided to the stylable text are valid, I have just added some assets here and there

private func reformatAndStore(string: String?) {
    super.attributedText = string.flatMap { text in
        var attrStr = NSMutableAttributedString(string: text, attributes: attributes)
        let stringRange = NSRange.init(location: 0, length: attrStr.length)
        additionalAttributes.forEach { attrRange in
            // There was an index-out-of-bounds issue in the past
            guard NSEqualRanges(NSIntersectionRange(stringRange, attrRange.range.nsRange), attrRange.range.nsRange) else {
                assertionFailure()
                return
            }
            attrStr.addAttributes(attrRange.attributes, range: attrRange.range.nsRange)
        }
        return attrStr
    }
}

This would be in my opinion great addition.