AttributedString index fetching causes internal unwrap of nil value

seanmrich · August 9, 2023, 4:55pm

I have an attributed string that has a substring being replaced, but there's a fatalError raised:

var string = AttributedString("café")
let replaceIndex = string.index(beforeCharacter: string.endIndex)
let range = replaceIndex..<string.endIndex
string.replaceSubrange(range, with: AttributedString("e"))
let next = string.index(afterCharacter: replaceIndex)
//                ^---- Unexpectedly found nil while unwrapping an Optional value
assert(next == string.endIndex)

This was surprising since I thought indexes before the point of the change would remain stable. What's stranger is that changing how I create the range avoids the failure. This works fine:

var string = AttributedString("café")
let range = string.range(of: "é")!
string.replaceSubrange(range, with: AttributedString("e"))
let next = string.index(afterCharacter: range.lowerBound)
assert(next == string.endIndex)

Using ASCII characters instead of the acute 'e' doesn't cause the failure with either range technique. I pieced through the open source implementation to try and divine the source of the nil optional, but I couldn't see anything and figured it's different than the current release code.

Any advice on where my logic has gone wrong?

I'm using macOS 13.4.1 and Xcode 15b5.

tera · August 9, 2023, 5:57pm

Interesting find. Moreover:

    let range = replaceIndex..<string.endIndex
    let range2 = string.range(of: "é")!
    print(range)  // ... _rawBits: 197127))..< ... _rawBits: 327687))
    print(range2) // ... _rawBits: 196613))..< ... _rawBits: 327685))
    precondition(range == range2) // passes

Could be a bug. Or it could be some rule along the lines of "after you mutate a string the previous indices are invalidated, and it's mere luck that it works in one case since you are triggering UB"; I'm not sure about that. That the two ranges look different internally (yet compare equal) seems a bit dodgy, especially given that using them leads to different behaviour afterwards.

BTW, the same applies to normal strings; with them it is just more obvious that you should not use the old indices with the new string (there's no "replace" defined on string, only "replacing", which returns a new one).

seanmrich · August 9, 2023, 6:35pm

I did notice the ranges were equal, but not that their values were different. That was helpful! You're probably right that the indexes are no longer valid after mutation. It isn't intuitive that earlier indexes of the collection are affected, but perhaps this is for performance reasons. That little assumption is going to cause me a bunch of work.

Incidentally, there is an in-place replacement method: String.replaceSubrange(_:with:)

itaiferber · August 9, 2023, 6:56pm

To be clear, RangeReplaceableCollection mutation is allowed to invalidate existing indexes, as those indices may contain information that's no longer true for the mutated collections (in the case of strings, calculated byte offsets which are no longer valid, for example). From the RangeReplaceableCollection.replaceSubrange(_:with:) docs:

Calling this method may invalidate any existing indices for use with this collection.

And, this method serving as the backbone for pretty much all other operations on RangeReplaceableCollection, one should assume (unless documented otherwise for a specific type) that any mutating operation which could change anything about indexes will invalidate existing ones.

seanmrich · August 9, 2023, 7:21pm

Understood. My intuition is that it would work effectively like an Array where changes down the line don't affect indices earlier in the collection, but that's clearly wrong.

I need to be able to access the lowerBound of a replaced range, and the best way to do that isn't obvious. I suppose I could maintain UTF-16 offsets and convert when needed:

var a = AttributedString("abcd")
let replaceRange = NSRange(location: 2, length: 1)
if let stringRange = Range(replaceRange, in: a) {
  a.replaceSubrange(stringRange, with: AttributedString("e"))
  let fetchRange = NSRange(location: 2, length: 0)
  if let fetchStringRange = Range(fetchRange, in: a) {
    let c = a[fetchStringRange]
  }
}

Doing the conversion each time would be painful and slow. Is there a better way to do this?

itaiferber · August 9, 2023, 8:00pm

It at least partially depends on the text replacing part of the existing string: is it known to you statically, or is it user content? The reason I ask is that how you need to search for a lower bound may depend on the content of the replacement.

For example:

let replacement = "eteria"
var string = "caffeine"
string.replaceSubrange(string.firstRange(of: "feine")!, with: replacement)
print(string) // => cafeteria

// From Collection (working at the grapheme cluster level)
print(string.firstRange(of: replacement) != nil) // => true

It seems pretty straightforward that you'd find the replacement inside of the updated string — but this is not always the case: if replacement starts with a combining character, the replacement string may compose with the end of string and the individual characters (as defined by Swift: grapheme clusters) may change:

// continued from above
let replacement2 = "\u{0301}" // combining acute accent
string.replaceSubrange(string.firstRange(of: "teria")!, with: replacement2)
print(string) // => café

// From Collection (working at the grapheme cluster level)
print(string.firstRange(of: replacement2) != nil) // => false
// "café" does not contain a bare "́"

Finding the start of your replacement string inside of the original may be less straightforward than anticipated: if the replacement string contains data "lower level" than grapheme clusters, you may similarly need to drop down to that lower level (e.g., Unicode scalar/UTF-8/UTF-16 offsets — as you suggest, though there are slightly nicer ways of expressing similar operations).
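To make the level mismatch concrete, here is a small sketch (the sample text and variable names are chosen only for illustration). It shows that appending a combining scalar leaves the Character count unchanged, while the Unicode scalar view can still locate it:

```swift
var word = "cafe"
let accent: Unicode.Scalar = "\u{0301}" // combining acute accent

let countBefore = word.count                 // number of Characters before
word.unicodeScalars.append(accent)           // composes with the trailing "e"
let countAfter = word.count                  // number of Characters after
let scalarCount = word.unicodeScalars.count  // number of Unicode scalars

// At the Character (grapheme cluster) level the bare accent cannot be found...
let foundAsCharacter = word.firstIndex(of: Character(accent)) != nil
// ...but at the Unicode scalar level it can.
let foundAsScalar = word.unicodeScalars.firstIndex(of: accent) != nil

print(countBefore, countAfter, scalarCount, foundAsCharacter, foundAsScalar)
// 4 4 5 false true
```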

If the content isn't arbitrary, and you know this won't happen (or can guard against it), the right way to do this would be to do something similar, but with the Collection APIs; i.e., while the indexes themselves might change, you can use index distances to re-fetch a specific index with, e.g., string.index(string.startIndex, offsetBy: ...) without needing to perform any conversions at a lower level.
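As a rough sketch of the index-distance idea, assuming the replacement does not compose with its neighbours (the offsets 3 and 5 are hard-coded purely for illustration):

```swift
var string = "caffeine"

// Hypothetical target: the five Characters starting at Character offset 3 ("feine").
let lower = string.index(string.startIndex, offsetBy: 3)
let upper = string.index(lower, offsetBy: 5)

// Record the position as a Character distance while `lower` is still valid.
let lowerOffset = string.distance(from: string.startIndex, to: lower)

let replacement = "eteria"
string.replaceSubrange(lower..<upper, with: replacement)

// Re-derive fresh indices from the mutated string instead of reusing `lower`.
let newLower = string.index(string.startIndex, offsetBy: lowerOffset)
let newUpper = string.index(newLower, offsetBy: replacement.count)

print(string)                      // cafeteria
print(string[newLower..<newUpper]) // eteria
```

Each offset computation walks the string from the start, so this is linear in the length of the text.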

seanmrich · August 9, 2023, 8:40pm

Appreciate the feedback. The strings are user-generated so I can't do the work with static information. Also, since there's no way to know whether the target substring is the first match, I don't think a find operation would work. Will have to rethink my approach, I suppose. Thanks again!

wadetregaskis · August 9, 2023, 10:26pm

itaiferber:

let replacement = "eteria"
var string = "caffeine"
string.replaceSubrange(string.firstRange(of: "feine")!, with: replacement)
print(string) // => cafeteria

// From Collection (working at the grapheme cluster level)
print(string.firstRange(of: replacement) != nil) // => true

In addition to being pretty expensive - O(String.count) - this might yield incorrect results if the replacement string already occurs in the original string.
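A small sketch of that failure mode, with made-up sample text and assuming Swift 5.7 or later for firstRange(of:): the replacement lands at Character offset 4, but searching for it afterwards reports the earlier, pre-existing occurrence.

```swift
var string = "tea time"
let replacement = "tea"

// Replace "time", which starts at Character offset 4.
let lower = string.index(string.startIndex, offsetBy: 4)
let replacedOffset = string.distance(from: string.startIndex, to: lower)
string.replaceSubrange(lower..<string.endIndex, with: replacement)

// Searching finds the first "tea", not the one that was just inserted.
let found = string.firstRange(of: replacement)!
let foundOffset = string.distance(from: string.startIndex, to: found.lowerBound)

print(string)                      // tea tea
print(replacedOffset, foundOffset) // 4 0
```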

Karl · August 9, 2023, 11:10pm

This is a known design flaw in RangeReplaceableCollection.

However, by extraordinary coincidence, RRC also includes a plain init(). That means you can implement a generic "replace-all" operation (or similar) by constructing an empty result and appending to it.

In other words, instead of replacing the "é", copy all characters up to it in to a new string, append a plain "e", then append characters after it. This technique works with any RRC-conforming type.
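A minimal sketch of that rebuild technique, written as a free function. The name `rebuilding(_:replacing:with:)` is hypothetical and not part of the standard library:

```swift
// Hypothetical helper: builds a new collection instead of mutating in place,
// so no index of `source` is ever used after a mutation.
func rebuilding<C: RangeReplaceableCollection, S: Sequence>(
    _ source: C,
    replacing subrange: Range<C.Index>,
    with newElements: S
) -> C where S.Element == C.Element {
    var result = C()                                          // RRC's plain init()
    result.append(contentsOf: source[..<subrange.lowerBound]) // everything before
    result.append(contentsOf: newElements)                    // the replacement
    result.append(contentsOf: source[subrange.upperBound...]) // everything after
    return result
}

let original = "caf\u{E9}" // "café" with a precomposed é
let last = original.index(before: original.endIndex)
let rebuilt = rebuilding(original, replacing: last..<original.endIndex, with: "e")
print(rebuilt) // cafe

let numbers = rebuilding([1, 2, 3, 4], replacing: 1..<3, with: [9])
print(numbers) // [1, 9, 4]
```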

What I've done in my own data types is to implement replaceSubrange with a different signature:

@discardableResult
public mutating func replaceSubrange(
  _ subrange: Range<Index>,
  with newContents: some Sequence<Element>
) -> Range<Index>

Where the idea is that the returned range contains the locations of newContents, so you can continue processing the collection.
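One way such a range-returning variant could be layered on top of the existing API is sketched below. This is not Karl's implementation: the method name `replaceSubrangeReturningRange` is hypothetical, and the sketch assumes the element counts before and after the replaced region are unchanged by the mutation (which may not hold for strings when the new content composes with its neighbours).

```swift
extension RangeReplaceableCollection where Self: BidirectionalCollection {
    // Hypothetical name; not part of the standard library.
    @discardableResult
    mutating func replaceSubrangeReturningRange<S: Sequence>(
        _ subrange: Range<Index>,
        with newContents: S
    ) -> Range<Index> where S.Element == Element {
        // Measure how many elements surround the region while indices are valid.
        let prefixCount = distance(from: startIndex, to: subrange.lowerBound)
        let suffixCount = distance(from: subrange.upperBound, to: endIndex)

        replaceSubrange(subrange, with: Array(newContents))

        // Re-derive both bounds from the mutated collection.
        let lower = index(startIndex, offsetBy: prefixCount)
        let upper = index(endIndex, offsetBy: -suffixCount)
        return lower..<upper
    }
}

var text = "caffeine"
let start = text.index(text.startIndex, offsetBy: 3)
let inserted = text.replaceSubrangeReturningRange(start..<text.endIndex, with: "eteria")
print(text, text[inserted]) // cafeteria eteria

var values = [1, 2, 3, 4, 5]
let valueRange = values.replaceSubrangeReturningRange(1..<3, with: [9, 9, 9])
print(values, valueRange) // [1, 9, 9, 9, 4, 5] 1..<4
```

For String the distance calls are linear, so this trades some performance for indices that are safe to keep using.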

It is not possible to make this change to RRC in an ABI compatible way, so we would need to deprecate and replace it with a new protocol. Unfortunately that won't help you today, but I think there could be support for making that kind of change to the standard library one day.

seanmrich · August 10, 2023, 12:30am

This is an intriguing approach. Here's a naive implementation for discussion:

extension AttributedString.CharacterView {
	public mutating func replaceRange(
		_ subrange: Range<Index>,
		with newContent: some Sequence<Element>
	) -> Range<Index> {
		var new = Self()
		new.reserveCapacity(underestimatedCount + newContent.underestimatedCount)
		new.append(contentsOf: self[..<subrange.lowerBound])
		let lower = new.endIndex
		new.append(contentsOf: newContent)
		let upper = new.endIndex
		new.append(contentsOf: self[subrange.upperBound...])
		self = new // store the rebuilt view; without this the mutation is lost
		return lower..<upper
	}
}

The issue is that the indices lower and upper are invalidated as new is mutated. How can you capture the replaced range?

Also, a quick check shows this to be about three orders of magnitude slower than the standard library's replaceSubrange(_:with:) on a test string of about 3,000 characters. Is there a more efficient method than my naive one?

seanmrich · August 10, 2023, 11:10am

Since there's an API for switching from NSRange to Range<AttributedString.Index>, I presume that maintaining UTF-16 offsets and switching to a Range when needed is supported. The optional return value of the NSRange -> Range conversion is a hassle but workable.

Performance of this approach is significantly better than rebuilding the string to extract the indexes. In my tests, the delay is on the order of hundredths of milliseconds for a 3,000 character string.

itaiferber · August 10, 2023, 2:11pm

Sorry, I should have clarified in the example code — I was only using firstRange(of:) as a shorthand for fetching the indexes of the replacement text (e.g., just like the original code has hardcoded locations); this wasn't intended semantically.

Yes, that's effectively what you're doing, though what I was going to originally suggest was the possibility of doing this with native String operations, rather than the NSRange ↔︎ Range round-trip.

- If you want to stick to UTF-16 offsets (more efficient for Strings which are NSStrings under the hood), you can use String.Index.utf16Offset(in:) and String.Index.init(utf16Offset:in:) to fetch the UTF-16 offset of the lower bound of your replacement range, perform the replacement, and then use the same offset to recreate the index.
- If your underlying string is UTF-8, using UTF-16 offsets is likely less efficient than getting the UTF-8 offset of a string index using String.Index.samePosition(in:) for the string's UTF-8 view (.utf8), storing the distance from .utf8.startIndex, performing the replacement, then converting back by offsetting the distance from .utf8.startIndex and re-forming a String.Index with samePosition(in:) on the resulting UTF-8 view index. Unfortunately, this doesn't help eliminate the optionals, but it might be more performant.
  - Yes, if you check, you'll see that String.UTF8View.Index = String.Index, but AFAIK, the actual formed index bits under the hood are different based on the view you created them from. What I'm not sure about is whether the conversion to/from UTF-8 view indices is necessary, or whether the default String.Index representation already matches UTF-8 these days. (@Michael_Ilseman likely knows best here)
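Both options can be sketched on a plain String as follows. The variable names are illustrative, and the sample reuses the "café" case from the top of the thread:

```swift
var string = "caf\u{E9}" // "café" with a precomposed é
let lower = string.index(before: string.endIndex)

// Capture encoding-level offsets while `lower` is still valid.
let utf16Offset = lower.utf16Offset(in: string)
let utf8Offset = string.utf8.distance(from: string.utf8.startIndex, to: lower)

string.replaceSubrange(lower..<string.endIndex, with: "e")

// Option 1: rebuild the index from the stored UTF-16 offset.
let viaUTF16 = String.Index(utf16Offset: utf16Offset, in: string)

// Option 2: rebuild it from the stored UTF-8 offset; samePosition(in:) is
// optional because the position might not fall on a Character boundary.
let utf8Index = string.utf8.index(string.utf8.startIndex, offsetBy: utf8Offset)
let viaUTF8 = utf8Index.samePosition(in: string)

let next = string.index(after: viaUTF16)
print(next == string.endIndex, viaUTF8 == viaUTF16) // true true
```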

seanmrich · August 10, 2023, 3:21pm

Makes sense that the UTF-8 offsets would be more efficient for a Swift String, but I’m using AttributedString which doesn’t have much API surface to manipulate the underlying string. I can’t find any documentation that discusses the format of the CharacterView. Maybe it depends on the string it was initialized with?

I did test range conversions on an AttributedString created from an NSString and a string literal. The NSString was almost 50% slower. In any case, the delay for converting is small enough to not be an issue in my app.

Greatly appreciate your guidance!

itaiferber · August 10, 2023, 3:30pm

Ah, right, I forgot that AttributedString doesn't offer a way to access the underlying string like NSAttributedString does — you can convert it to a String, but only by copying.

Your best bet is likely to stick with the NSRange ↔︎ Range conversions, for better or worse. Sorry for any confusion!
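For reference, here is a hedged sketch of how the original example might look with that round-trip, assuming Foundation's NSRange(_:in:) and Range(_:in:) conversions for AttributedString are available on the target platform. The variable names are illustrative:

```swift
import Foundation

var string = AttributedString("caf\u{E9}") // "café" with a precomposed é
let replaceIndex = string.index(beforeCharacter: string.endIndex)
let range = replaceIndex..<string.endIndex

// Record the UTF-16 location of the lower bound before mutating.
let location = NSRange(range, in: string).location

string.replaceSubrange(range, with: AttributedString("e"))

// Re-create a valid index from the stored location; the conversion is optional.
let fresh = Range(NSRange(location: location, length: 0), in: string)
let next = fresh.map { string.index(afterCharacter: $0.lowerBound) }

print(next == string.endIndex)
```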