How to find range in new string after replacingCharacters(in: range, with: c)

RonAvitzur · June 22, 2022, 4:45am

Given a String and a Range<String.Index> valid for it, and a new string made by

let newString = originalString.replacingCharacters(in: range, with: c)

How do I find the range of that particular c in newString?

Edit: Does the problem become well-defined to ask instead for an index in newString after the replaced characters?

SDGGiesbrecht · June 22, 2022, 6:44am

The short answer is you do not want to. Instead you want to refactor your code to avoid it.

The long answer is strings are complicated and the constraints of the question are not specific enough to know what the expected result of the operation should even be. See the code below. You can paste it into an XCTest case verbatim if you need proof I am not crazy.

func demonstrate(
  originalString: String,
  range: Range<String.Index>,
  c: String
) -> (
  afterReplacement: String,
  scalarRange: Range<String.UnicodeScalarView.Index>,
  characterRange: Range<String.Index>?
) {
  // Given the requisite replacement...
  let newString = originalString.replacingCharacters(in: range, with: c)
  // ...then...

  // ...we must first check that it actually did what we thought...
  guard let rangeLowerBoundScalar
      = range.lowerBound.samePosition(in: originalString.unicodeScalars),
    range.upperBound.samePosition(in: originalString.unicodeScalars) != nil else {
      fatalError("The replacement was malformed and corrupted the string.")
  }

  // ...and then we can figure out one or both
  // of two definitions of the insertion’s range:
  let scalarRange: Range<String.UnicodeScalarView.Index> = {
    let distanceToRangeStartInOriginalString
      = originalString.unicodeScalars.distance(
        from: originalString.unicodeScalars.startIndex,
        to: rangeLowerBoundScalar
      )
    let rangeStartInNewString = newString.unicodeScalars.index(
      newString.unicodeScalars.startIndex,
      offsetBy: distanceToRangeStartInOriginalString
    )
    let replacementEndInNewString = newString.unicodeScalars.index(
      rangeStartInNewString,
      offsetBy: c.unicodeScalars.count
    )
    return rangeStartInNewString..<replacementEndInNewString
  }()
  let characterRange: Range<String.Index>? = {
    guard let lower = scalarRange.lowerBound.samePosition(in: newString),
      let upper
        = scalarRange.upperBound.samePosition(in: newString) else {
          return nil
    }
    return lower..<upper
  }()

  // The relevant bit is finished,
  // but if we return everything, we can test it easily.
  return (
    afterReplacement: newString,
    scalarRange: scalarRange,
    characterRange: characterRange
  )
}

// For ease of testing, let’s be able to look up what we are pointing at...
func lookup(
  _ result: (
    afterReplacement: String,
    scalarRange: Range<String.UnicodeScalarView.Index>,
    characterRange: Range<String.Index>?
  )
) -> (scalars: String, characters: String?) {
  return (
    String(result.afterReplacement[result.scalarRange]),
    result.characterRange.map({ String(result.afterReplacement[$0]) })
  )
}

// And now we can show how it performs, proving why all that was necessary.

let helloWorld = "Hello, world!"
XCTAssert(
  lookup(
    demonstrate(
      originalString: helloWorld,
      range: helloWorld.dropFirst(7).startIndex
        ..< helloWorld.dropLast(1).endIndex,
      c: "universe"
      // “Hello, universe!”
    )
  )
  ==
  (scalars: "universe", characters: "universe")
)

let café = "cafe"
XCTAssert(
  lookup(
    demonstrate(
      originalString: café,
      range: café.endIndex..<café.endIndex,
      c: "\u{301}" // ◌́
      // “café”
    )
  )
  ==
  (scalars: "\u{301}", characters: nil)
)

let jalapeño = "jalapeño"
// Uncomment this one to trigger the fatal error:
/*
_ = demonstrate(
  originalString: jalapeño,
  range: jalapeño.utf8.dropLast(2).endIndex..<jalapeño.dropLast().endIndex,
  c: "n"
  // “jalapen�no”
)
*/

RonAvitzur · June 22, 2022, 2:33pm

Thank you! That is very helpful.

What is originalString.scalars? (It does not compile here on Xcode 13.)

Does the problem become well-defined to ask instead for an index in newString after the replaced characters?

tera · June 22, 2022, 3:47pm

What do you do with the found range? Is this for setting a style in, say, a resulting attributed string?

RonAvitzur · June 22, 2022, 4:00pm

Actually, I only need an index after the replaced string. I'm representing a selection within a String as a Range<String.Index>. After typing replaces the range with a new Character, I represent the new insertion point as a range with lowerBound == upperBound, after the new character.

This is used inside of a structured 2D mathematical expression editor when editing short variable names - using a system text editor here is not feasible.
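For the narrower goal of an insertion point after the replacement, one possible sketch is to measure the untouched prefix in UTF-8 code units before replacing, then re-derive the index in the new string from that offset. The function name `replacingAndLocatingEnd` is hypothetical. The sketch assumes `range` is valid for `original` with scalar-aligned bounds, and it uses the standard library's `replaceSubrange` rather than Foundation's `replacingCharacters(in:with:)`.

```swift
// Hypothetical helper, not a standard library or Foundation API.
func replacingAndLocatingEnd(
  in original: String,
  range: Range<String.Index>,
  with replacement: String
) -> (result: String, insertionPoint: String.Index) {
  // Length of the untouched prefix, measured in UTF-8 code units.
  let prefixLength = original.utf8.distance(
    from: original.utf8.startIndex,
    to: range.lowerBound
  )
  var result = original
  result.replaceSubrange(range, with: replacement)
  // Re-derive the position in the new string; the old indices are not reused.
  let insertionPoint = result.utf8.index(
    result.utf8.startIndex,
    offsetBy: prefixLength + replacement.utf8.count
  )
  return (result, insertionPoint)
}

let formula = "\u{1D465}+\u{1D466}" // 𝑥+𝑦
let first = formula.startIndex ..< formula.index(after: formula.startIndex)
let edited = replacingAndLocatingEnd(in: formula, range: first, with: "\u{1D6FC}")
print(edited.result)                            // 𝛼+𝑦
print(edited.result[edited.insertionPoint...])  // +𝑦
```

The returned index sits on a scalar boundary, but it can still fall inside a character when the replacement combines with its neighbors, so callers that need a character boundary would have to check for that separately.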

Michael_Ilseman · June 22, 2022, 5:31pm

We have a strong need for a replaceSubrange that returns the index range of the newly replaced portion (potentially @discardableResult). That would allow you to have this behavior performantly as well as allow for multiple in-place mutations (e.g. a replaceAll) without creating a brand new string / RRC to do so.

CC @Alejandro for thoughts.

tera · June 22, 2022, 6:02pm

is this not alright?

let resultString = originalString.replacingCharacters(in: range, with: c)
let upperIndex = resultString.index(range.lowerBound, offsetBy: c.count)

RonAvitzur · June 22, 2022, 6:06pm

I think range.lowerBound is valid only in originalString. (It worked by accident, since ASCII strings are most commonly used in my app, until a user who used a lot of Unicode Plane-1 Math Symbols showed how it led to a crash.)
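A small sketch of why ASCII-only testing can hide this kind of problem: for ASCII text every view of the string has the same length, while a Plane-1 symbol has a different length in each encoding. This only illustrates the divergence; it is not a diagnosis of the specific crash.

```swift
let ascii = "x"
let math = "\u{1D465}" // 𝑥, a Plane-1 mathematical symbol

// Characters, UTF-8 code units, UTF-16 code units.
print(ascii.count, ascii.utf8.count, ascii.utf16.count) // 1 1 1
print(math.count, math.utf8.count, math.utf16.count)    // 1 4 2
```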

tera · June 22, 2022, 6:09pm

I see.

Is this any better?

var resultString = originalString
resultString.replaceSubrange(range, with: c)
let upperIndex = resultString.index(range.lowerBound, offsetBy: c.count)

Nevin · June 22, 2022, 8:44pm

No.

Indices from one string are not valid in another string, and replacing a substring does not guarantee the length of the string changes as expected.

var s = "abc"
let combiningAcuteAccent = "\u{301}"
let i = s.index(after: s.startIndex)
let r = i ..< s.index(after: i)
s.replaceSubrange(r, with: combiningAcuteAccent)
print(s, s.count)    // ác 2

We replaced one character ("b") with one character (the combining acute accent "\u{301}"), and the original string shrank from 3 characters ("abc") to 2 characters ("ác"). Even if the indices could be used, the calculation would produce "c".

In this example, there is in fact no subrange of the resulting string which corresponds to the inserted characters.
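That statement holds at the level of characters. As a minimal sketch using the same example, the Unicode scalar view still contains the inserted accent as a separate element, which is why a scalar-level range can exist where a character-level range cannot:

```swift
var s = "abc"
let i = s.index(after: s.startIndex)
let r = i ..< s.index(after: i)
s.replaceSubrange(r, with: "\u{301}")

// Character view: the accent has merged with the preceding "a".
print(s.count)                 // 2
// Scalar view: the inserted scalar is still individually addressable.
print(s.unicodeScalars.count)  // 3
print(s.unicodeScalars.map { String($0.value, radix: 16) }) // ["61", "301", "63"]
```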

xwu · June 22, 2022, 9:31pm

Note that there is a distinction between the behavior using standard library APIs, which operate consistently at the level of characters (i.e., extended grapheme clusters), versus Foundation APIs such as replacingCharacters(in:with:) and range(of:).

For instance, even though \r\n is a single character, Foundation will happily tell you the range of \n (or, correspondingly in the case of your example, combiningAcuteAccent).
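The standard library half of that distinction can be sketched without Foundation. The Foundation calls are deliberately left out here, since their results depend on the Foundation implementation in use.

```swift
let text = "a\r\nb"
let crlf: Character = "\r\n"

// "\r\n" is one Character, so the string has three characters.
print(text.count)                          // 3
print(text.firstIndex(of: crlf) != nil)    // true
// A character-level search does not find a bare "\n"...
print(text.firstIndex(of: "\n") == nil)    // true
// ...although the scalar is present in the underlying storage.
print(text.unicodeScalars.contains("\n"))  // true
```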

SDGGiesbrecht · June 22, 2022, 9:33pm

Sorry. It is supposed to be unicodeScalars. (It is shortened to scalars by an alias in one of the packages I use a lot and there must have been a stray import while I was writing it. Thanks Xcode¡) I went back and fixed the example; copying and pasting works now.

Do you mean by searching newString for c? Not unless you are certain there are no other instances of c in the string. Determining the narrowed range in which to search for it suffers from the very problem stated in the question.

Correct. (And if it is declared var, the range becomes invalid for originalString as well as soon as it mutates.)

What might help is storing the data as [String], possibly wrapped in something with a conformance to Collection using Character elements. Then you have stable offsets to reason about, since nothing ever jumps from one segment to another, and you can simply use joined() when you need to display it.

This strategy is helpful for some things, but just extra work for others. You will have to judge for yourself which is the case for what you are doing.
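A minimal sketch of the segmented idea, assuming whole segments are the unit of editing. The type name `SegmentedText` and its members are hypothetical, and the Collection conformance mentioned above is omitted for brevity.

```swift
// Hypothetical wrapper; segments are addressed by stable integer offsets.
struct SegmentedText {
  var segments: [String]

  // The displayable text, produced on demand.
  var text: String { segments.joined() }

  // Replaces one whole segment; the offsets of other segments never change.
  mutating func replaceSegment(at offset: Int, with replacement: String) {
    segments[offset] = replacement
  }
}

var line = SegmentedText(segments: ["Hello, ", "world", "!"])
line.replaceSegment(at: 1, with: "universe")
print(line.text)         // Hello, universe!
print(line.segments[1])  // universe
```

The replaced content is always recoverable as `segments[1]`, even in cases where the joined text would merge it with a neighboring character.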

lorentey · June 24, 2022, 4:34pm

Here is a neat trick:

extension String {
  // Like the standard `replaceSubrange`, but updates its subrange
  // to remain valid in the new string.
  mutating func replaceSubrange<C: Collection>(
    _ range: inout Range<Index>, 
    with replacement: C
  ) where C.Element == Element {
    var temp = self[range]
    self = "" // Prevent unnecessary copy-on-write copies
    temp.replaceSubrange(range, with: replacement)
    self = temp.base
    range = temp.startIndex ..< temp.endIndex
  }
}

var s = "Hello, world"
let i = s.index(s.startIndex, offsetBy: 7) // w
var r = i ..< i
s.replaceSubrange(&r, with: "nice ")
print(s) // "Hello, nice world"
print(s[r]) // "nice "

The key here is that

- Substring conforms to RangeReplaceableCollection,
- it generally (but, frustratingly, not always) implements its mutations by ~forwarding them to its base string rather than replacing it, and
- it needs to make sure its start and end indices remain valid before/after these mutations.

(Note: if the subrange replacement affects characters in the string surrounding the replaced range (e.g., because the replacement collection starts with a combining scalar), then the updated bounds may no longer be valid (i.e., reachable) indices in the string, as they will no longer fall on character boundaries. This is an unavoidable consequence of Unicode's grapheme breaking rules; it isn't specific to this particular solution.)

Beware: prior to Swift 5.7, Substring.replaceSubrange failed to properly update its bounds in some (rare) edge cases.

lorentey · June 24, 2022, 4:47pm

it generally (but, frustratingly, not always) implements its mutations by ~forwarding them to its base string rather than replacing it

To be specific, Substring.replaceSubrange works correctly, but Substring.append doesn't.

var s = "Hello, world"
let i = s.index(s.startIndex, offsetBy: 7) // w
var t = s[i..<i]
print(t.base) // "Hello, world"
t.append(contentsOf: "nice ")
print(t.base) // "nice " 😭

lorentey · June 24, 2022, 8:12pm

It needs to be noted that calling replaceSubrange in a loop is often not a good idea -- each invocation will take O(utf8.count) time, so replacing a series of ranges this way tends to lead to accidentally quadratic behavior.

(This is one of the reasons we don't have a direct function like this in the stdlib (yet?).)

Unless you know for sure this isn't going to be the case, it's usually much faster to instead build a brand new string piecewise, by starting with an empty string and successively appending to it slices of either the original or one of the replacements. (This will usually take time that is proportional to the size of the overall result.) You can usually even reserve capacity in advance, preventing the result string from needing to resize its storage -- this tends to result in a measurable speedup (by a constant factor).
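The piecewise approach can be sketched as follows. The function name `applying(_:to:)` is hypothetical, and the sketch assumes the replacements are sorted by position, do not overlap, and use indices valid for `original`.

```swift
// Hypothetical helper: builds the result in one pass instead of
// mutating the string once per replacement.
func applying(
  _ replacements: [(range: Range<String.Index>, text: String)],
  to original: String
) -> String {
  var result = ""
  // Reserve room up front; this over-estimate is in UTF-8 code units.
  let added = replacements.reduce(0) { $0 + $1.text.utf8.count }
  result.reserveCapacity(original.utf8.count + added)

  var cursor = original.startIndex
  for (range, text) in replacements {
    // Copy the untouched slice before this replacement, then the replacement.
    result.append(contentsOf: original[cursor ..< range.lowerBound])
    result.append(contentsOf: text)
    cursor = range.upperBound
  }
  // Copy whatever follows the last replacement.
  result.append(contentsOf: original[cursor...])
  return result
}

let source = "a-b-c"
let dashes = source.indices
  .filter { source[$0] == "-" }
  .map { $0 ..< source.index(after: $0) }
let spaced = applying(dashes.map { (range: $0, text: ", ") }, to: source)
print(spaced) // a, b, c
```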

xwu · June 24, 2022, 9:52pm

Is this a guarantee for Substring.replaceSubrange or is this just a current implementation detail?

lorentey · June 27, 2022, 6:10pm

The semantics of base are woefully ill-defined. Predictably, this has led to diverging behavior across the stdlib.

The correct thing to do, in my view, is for a mutable SubSequence type to always preserve its entire base collection on mutations -- dropping the sliced-off parts is highly questionable behavior. (If I didn't care about preserving the rest of the collection, I would've simply copied the data into a new collection value instead of mutating the slice.)

Substring.append(contentsOf:) got this badly wrong: in exchange for a performance boost, we got unusable semantics. I really want to fix this, but Hyrum's Law is in full effect for String APIs, and that makes this a high-risk change.

Sadly, as far as I am aware, the behavior as implemented does not violate any documented requirement -- base properties aren't part of any protocol, and there are no documented expectations about them.

tera · June 28, 2022, 11:41am

Can dilemmas like this be resolved by forcing developers to explicitly choose (to get past some introduced compilation error) between the new correct "but now you have to test your code" behaviour and old incorrect "don't have time for this now, remind me via a warning" behaviour, when they switch to a new compiler/sdk?