My self-crafted string replacer produces an unwanted result

I am trying to get rid of the dependency on Foundation, which is not pure Swift.
Therefore, I tried crafting my own tool for replacing occurrences within a string.

extension StringProtocol {
  subscript(offset: Int) -> Character { self[index(startIndex, offsetBy: offset)] }
  subscript(range: Range<Int>) -> SubSequence {
    let startIndex = index(self.startIndex, offsetBy: range.lowerBound)
    return self[startIndex..<index(startIndex, offsetBy: range.count)]
  }  
  internal func swapping(_ target: String, with newString: String) -> String {
    var result = ""
    var buffer = ""
    var sleepCount = 0
    for index in 0 ..< count {
      let currentChar = self[index]
    ripCheck: if currentChar == target.first {
      let range = index ..< (Swift.min(index + target.count, count))
      let ripped = self[range]
      sleepCount = range.count
      if ripped != target {
        break ripCheck
      } else {
        result.append(buffer)
        result.append(newString)
        buffer.removeAll()
      }
    }
      if sleepCount < 1 {
        buffer.append(currentChar)
      }
      sleepCount -= 1
    }
    result.append(buffer)
    buffer.removeAll()
    return result
  }
}

However, in the following special case, the replacement still happens even though the target doesn't match:

let yu = "ㄉㄧㄠ 1"
print(yu.swapping("ㄉㄧㄝ", with: "die"))

I don't know why.

P.S.: replacing(_:with:) isn't available on macOS 10.9.

Try replacingOccurrences(of:with:).

That requires Foundation when targeting an OS earlier than macOS 13. I need a Foundation-free approach that works on macOS 10.9 Mavericks.


Your code is accidentally quadratic, because getting a character at an integer offset is a linear-time operation.
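A minimal sketch (mine, not from this thread) contrasting the two traversal strategies:

```swift
let s = String(repeating: "a", count: 1_000)

// Quadratic: String has no random access, so every Int-offset
// subscript re-walks the string from startIndex, costing O(i) per
// access and O(n²) for the whole loop.
var viaOffsets = 0
for i in 0 ..< s.count where s[s.index(s.startIndex, offsetBy: i)] == "a" {
  viaOffsets += 1
}

// Linear: advance through the characters once instead.
var viaIteration = 0
for c in s where c == "a" {
  viaIteration += 1
}

assert(viaOffsets == 1_000)
assert(viaIteration == 1_000)
```

In practice the fix is to carry a String.Index through the loop rather than recomputing offsets from the start on every iteration.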


I didn't feel like debugging your code, but here's my version.

It could have performance problems if there are a lot of partial or full matches of target in your string.

I didn't test it extensively; it might be buggy.

extension String {
    internal func swapping(_ target: String, with newString: String) -> String {
        var result = self
        var potentialTargetIndex = result.startIndex
        while potentialTargetIndex != result.endIndex {
            // look for `target` starting at `potentialTargetIndex`
            if result[potentialTargetIndex...].starts(with: target) {
                // the end of the target word that we just found
                let targetEnd = result.index(potentialTargetIndex, offsetBy: target.count)
                // replace the target with the new word
                result.replaceSubrange(potentialTargetIndex..<targetEnd, with: newString)
                // skip to the end of the word we just inserted to make sure we won't start looking for `target` inside `newString`
                result.formIndex(&potentialTargetIndex, offsetBy: newString.count)
            } else {
                // didn't find the target starting at `potentialTargetIndex`, let's check starting from the next character
                result.formIndex(after: &potentialTargetIndex)
            }
        }
        return result
    }
}

Why?

The main reason I’ve seen people avoid Foundation is to reduce code size when targeting Linux or Wasm. But Foundation is dynamically linked on macOS, so avoiding it shouldn’t have nearly as big an impact there. Besides, if you’re using AppKit or SwiftUI, those require Foundation internally anyway.

I don’t want to preclude the possibility that you have a good reason I’m not aware of, though.

  1. potentialTargetIndex is invalid after replaceSubrange and could potentially trap, such as if the string swapped its internal representation when you asked it to mutate. As a general rule, never do this with any collection.
  2. If the beginning of newString was able to combine with its predecessor, a character boundary will have been eliminated and all your offset math will be invalid, potentially reaching out of bounds and trapping on illegal memory access.

The rest of what you have is on the right track. You just need to accumulate into a separate String instance instead to avoid modifying the one you are scanning through. (The only optimization opportunity I see would be to get target.count once before the loop instead of repeating it with each iteration, which you alluded to yourself.)
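For concreteness, here is a hedged sketch of that accumulate-into-a-separate-String shape. The name swappingViaAccumulator is my own, and it matches at the grapheme-cluster level, with all the caveats that entails:

```swift
extension String {
  // Scan `self` without ever mutating it, appending either the
  // replacement or one character at a time to a fresh result string.
  func swappingViaAccumulator(_ target: String, with newString: String) -> String {
    guard !target.isEmpty else { return self }
    var result = ""
    var cursor = startIndex
    let targetCount = target.count  // hoisted out of the loop
    while cursor != endIndex {
      if self[cursor...].starts(with: target) {
        result.append(newString)
        cursor = index(cursor, offsetBy: targetCount)
      } else {
        result.append(self[cursor])
        formIndex(after: &cursor)
      }
    }
    return result
  }
}
```

Because `result` is a separate instance, no index into the string being scanned is ever invalidated by a mutation.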

I am not sure what was on the OP’s mind, but cluster‐by‐cluster replacement (Character, String) is fundamentally different from replacement by UTF‐16 code unit (NSString), and they produce vastly different results:

“café” (NFD) × (“e” → “x”):
Clusters: “café”
UTF‐16: “cafx́”

For this reason, I actually recommend genericizing such methods over Collection, so that you can call them on the unicodeScalars, utf16 or utf8 views in addition to the string itself depending on what each situation calls for.
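As a hedged sketch of what such a genericization could look like (the function name and shape are my own invention, not an established API):

```swift
// A replacement algorithm generic over Collection, so the same logic
// can run over a String's characters or over its unicodeScalars /
// utf16 / utf8 views, depending on what the situation calls for.
func replacingOccurrences<C: Collection, T: Collection>(
  in source: C, of target: T, with replacement: [C.Element]
) -> [C.Element] where C.Element == T.Element, C.Element: Equatable {
  guard !target.isEmpty else { return Array(source) }
  var result: [C.Element] = []
  var cursor = source.startIndex
  while cursor != source.endIndex {
    if source[cursor...].starts(with: target) {
      result.append(contentsOf: replacement)
      cursor = source.index(cursor, offsetBy: target.count)
    } else {
      result.append(source[cursor])
      cursor = source.index(after: cursor)
    }
  }
  return result
}

let input = "cafe\u{301}"  // "café" in NFD (e + combining acute)

// Cluster level: "e" never equals the cluster "é", so nothing changes.
let byCluster = String(replacingOccurrences(in: input, of: "e", with: Array("x")))

// Scalar level: the lone "e" scalar does match, leaving "x" + U+0301.
let byScalar = String(String.UnicodeScalarView(
  replacingOccurrences(in: input.unicodeScalars,
                       of: "e".unicodeScalars,
                       with: Array("x".unicodeScalars))))
```

The two calls reproduce exactly the “café” divergence above: the cluster-level pass leaves the string untouched, while the scalar-level pass yields the stranded accent.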


When would you want the latter substitution? I guess that if I don't need such fanciness, I should stay away from running an algorithm over any of the unicodeScalars, utf16, or utf8 views, and only run it on the string itself?

I finally found the root cause: I forgot the highlighted statement here:


I tried modifying it a little. However, it has memory leaks.

  internal func swapping(_ target: String, with newString: String) -> String {
    let selfArray = map(\.description)
    let arrTarget = target.map(\.description)
    var result = ""
    var buffer = ""
    var sleepCount = 0
    for index in 0 ..< selfArray.count {
      let currentChar = selfArray[index]
      ripCheck: if currentChar == arrTarget.first {
        let range = index ..< (Swift.min(index + arrTarget.count, selfArray.count))
        let ripped = Array(selfArray[range])
        if ripped.isEmpty { continue }
        sleepCount = ripped.count
        if ripped != arrTarget {
          result.append(buffer)
          result.append(ripped.joined())
          buffer.removeAll()
          break ripCheck
        } else {
          result.append(buffer)
          result.append(newString)
          buffer.removeAll()
        }
      }
      if sleepCount < 1 {
        buffer.append(currentChar)
      }
      sleepCount -= 1
    }
    result.append(buffer)
    buffer.removeAll()
    return result
  }

The UTF‐16 one you pretty much never want, unless you are parsing a document format that was specified that way. (Its treatment of surrogates would be surprising to any user.)

But despite all the places you see grapheme clusters being described as “what a user perceives as characters”, that is false more often than it is true.

Grapheme clusters would better be described as the linear unit of a text’s 1‐dimensional “chain”, between which order is meaningful, but within which order is not meaningful in the same way or not at all. The user of the text does not perceive the “e” as coming before or after the acute accent, but at the same time. However, to represent it in a binary stream, we need to agree on one or the other order for the stream’s sake. The agreed order is a matter of encoding, not of the text, and the user does not care. Concerning the finger stream, some keyboards type the accent first (mechanically advantageous), others type it after (algorithmically advantageous), even when both keyboards target the same language. But the user still (usually) perceives them as distinct graphemes.

To the French speaker, “le café” contains two “e”s; try to tell him there is only one and he will roll his eyes at technology’s ineptness. The Greek speaker who enters “◌̀” to find‐and‐replace it with “◌́” expects “ψυχὴ” to turn into “ψυχή”; if it doesn’t, he will be miffed at having to do roughly 80 separate find‐and‐replace operations to accommodate all the combinations the text engine foisted on him.

When in doubt, do any text processing of this sort in scalars, and maintain NF(K)D throughout.
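To make that Greek example concrete, a scalar-level pass over the unicodeScalars view might look like this (a sketch, assuming the input stays in decomposed form):

```swift
// Replace every combining grave accent (U+0300) with the combining
// acute (U+0301) in one pass over the scalars of an NFD string.
// Cluster-level replacement would instead need a separate operation
// for each vowel-plus-grave combination.
let decomposed = "ψυχη\u{300}"  // "ψυχὴ", decomposed
var scalars = Array(decomposed.unicodeScalars)
for i in scalars.indices where scalars[i] == "\u{300}" {
  scalars[i] = "\u{301}"
}
let replaced = String(String.UnicodeScalarView(scalars))
assert(replaced == "ψυχή")  // canonical equivalence makes this hold
```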

Now, there certainly is a time and place for operating in grapheme clusters. Present either of the above users with a crossword puzzle five squares wide and he will never think to solve the puzzle with “cafe◌́” or “ψυχη◌́”. Show him that “solution” and he will say your puzzle is defective. That is an example of a time to think in clusters. But only when you know the end use case can you choose the best string unit for an operation. Often the best option is to provide both and let the user choose, in the same vein as a case sensitivity option.

Swift’s unique decision to provide grapheme cluster processing out‐of‐the‐box was very wise and helpful. However, renaming them “characters” and elevating them to the default string unit probably went too far. The ensuing obsession the community acquired with switching all logic to them is making many aspects of text processing more complicated, less reliable, less intuitive to users, and generally worse, not better. Do yourself a favour and refrain from “updating” an implementation to use Character—especially when porting from another programming language—unless you know exactly what you are doing and why you are doing it.


@ShikiSuen, you are getting closer. Your choice to convert to an array effectively evades the pitfalls the rest of us have been talking about. There is still a lot of indirection that seems unnecessary to me, and it isn’t the fastest way to go about it, but nothing jumps out at me as logically unsound anymore.

Swift manages its own memory; you should not have to worry about that. If it really is leaking (which I doubt), it is Swift’s fault, not yours.

As you continue to improve your method, if you want a good test to verify your handling of offsets, try this:

assert("Ἄιδης".swapping("ι", with: "\u{345}") == "ᾌδης")

The replacement will be assimilated into the first character, so unintuitively 5 − 1 + 1 = 4. The code you originally posted would have reached out of bounds and trapped.
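The assimilation itself is easy to observe directly (a quick self-contained check, independent of any swapping implementation):

```swift
// Appending the combining ypogegrammeni (U+0345) does not create a
// new Character; it joins the preceding grapheme cluster.
var s = "Ἄ"
assert(s.count == 1)
s.append("\u{345}")
assert(s.count == 1)  // still a single cluster
assert(s == "ᾌ")      // canonically equivalent to the precomposed form

// Hence the unintuitive arithmetic: 5 characters in, 4 out.
assert("Ἄιδης".count == 5)
assert("ᾌδης".count == 4)
```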


Thanks for your idea on how to unit-test it.

I finally made one that really works, despite its performance problems.

// (c) 2022 and onwards The vChewing Project (MIT-NTL License).
// ====================
// This code is released under the MIT license (SPDX-License-Identifier: MIT)
// ... with NTL restriction stating that:
// No trademark license is granted to use the trade names, trademarks, service
// marks, or product names of Contributor, except as required to fulfill notice
// requirements defined in MIT License.

extension StringProtocol {
  func has(string target: any StringProtocol) -> Bool {
    let selfArray = Array(unicodeScalars)
    let targetArray = Array(target.description.unicodeScalars)
    guard !target.isEmpty else { return isEmpty }
    guard count >= target.count else { return false }
    for index in 0 ..< selfArray.count {
      let range = index ..< (Swift.min(index + targetArray.count, selfArray.count))
      let ripped = Array(selfArray[range])
      if ripped == targetArray { return true }
    }
    return false
  }

  func sliced(by separator: any StringProtocol = "") -> [String] {
    let selfArray = Array(unicodeScalars)
    let arrSeparator = Array(separator.description.unicodeScalars)
    var result: [String] = []
    var buffer: [Unicode.Scalar] = []
    var sleepCount = 0
    for index in 0 ..< selfArray.count {
      let currentChar = selfArray[index]
      let range = index ..< (Swift.min(index + arrSeparator.count, selfArray.count))
      let ripped = Array(selfArray[range])
      if ripped.isEmpty { continue }
      if ripped == arrSeparator {
        sleepCount = range.count
        result.append(buffer.map { String($0) }.joined())
        buffer.removeAll()
      }
      if sleepCount < 1 {
        buffer.append(currentChar)
      }
      sleepCount -= 1
    }
    result.append(buffer.map { String($0) }.joined())
    buffer.removeAll()
    return result
  }

  func swapping(_ target: String, with newString: String) -> String {
    let selfArray = Array(unicodeScalars)
    let arrTarget = Array(target.description.unicodeScalars)
    var result = ""
    var buffer: [Unicode.Scalar] = []
    var sleepCount = 0
    for index in 0 ..< selfArray.count {
      let currentChar = selfArray[index]
      let range = index ..< (Swift.min(index + arrTarget.count, selfArray.count))
      let ripped = Array(selfArray[range])
      if ripped.isEmpty { continue }
      if ripped == arrTarget {
        sleepCount = ripped.count
        result.append(buffer.map { String($0) }.joined())
        result.append(newString)
        buffer.removeAll()
      }
      if sleepCount < 1 {
        buffer.append(currentChar)
      }
      sleepCount -= 1
    }
    result.append(buffer.map { String($0) }.joined())
    buffer.removeAll()
    return result
  }
}

Unit tests:


let nmsl = "狗死了"
let sl = "死了"
print(nmsl.has(string: sl))
print("250".has(string: sl))

let serial = "P2MV9ㄉㄧㄠJYYQ6ㄉㄧㄠJM44KㄉㄧㄠQMYTHㄉㄧㄠ8RB2W"
print(serial.sliced(by: "ㄉㄧㄠ").filter { !$0.isEmpty })
print(serial.sliced(by: "ㄉㄧㄝ").filter { !$0.isEmpty })
print(serial.swapping("ㄉㄧㄠ", with: "-"))
print(serial.swapping("ㄉㄧㄝ", with: "123"))

let yu = "ㄉㄧㄠ "
print("[\(yu.swapping("ㄉㄧㄝ", with: "die"))]")
print("[\(yu.swapping(" ", with: "1"))]")

print("\nThe following is supposed to be true.")
print("Ἄιδης".swapping("ι", with: "\u{345}") == "ᾌδης")

let serial1 = "P2MV9+-@JYYQ6+-@JM44K+-@QMYTH+-@8RB2W"
print("===")
print(serial1.has(string: "-@"))
print(serial1.sliced(by: "-@").filter { !$0.isEmpty })
print(serial1.swapping("-@", with: ""))


let serial2 = "P2MV9--@JYYQ6--@JM44K--@QMYTH--@8RB2W"
print("===")
print(serial2.has(string: "-@"))
print(serial2.sliced(by: "-@").filter { !$0.isEmpty })
print(serial2.swapping("-@", with: ""))

P.S.: This should be the final version.

Why indeed? Especially on a macOS version that's older than Swift itself? Is there a practical reason, or is this sort of a brain teaser?

There are still some users of my input method who are on macOS 10.9. They have individual reasons to run certain deprecated industrial software on their Macs (which has problems on later OS versions).

macOS 10.9 understood. But what's wrong with Foundation on macOS 10.9 through current?
I would have understood if you were on some exotic platform, say a 16-bit Unix, or needed to fit the app into 32K of RAM, etc.

I am also thinking of porting the input method to Linux.