Counting occurrences of a substring in a string

Given the fact that Swift String class doesn't has any method for counting occurrences of a substring in a string. i.e
Counting occurrences of "Swift" in "Hello Swift Swift" => 2

There are many alternatives available for counting occurrences of substring but I think it would be more intuitive if we add support of counting occurrences of a substring in a string. i.e

func occurences(of substring: String, compareOption: CompareOptions? = nil) -> Int {
	....
}

I feel like it could be part of a more general regex api, though NSRegularExpression already exist.

Yes true, but it requires few lines of code and not very succinct version. Perhaps adding a specific method for only counting of occurrences using NSRegularExpression API makes more sense.

I’m not sure if it needs to be in the standard library, and I’m also not sure why the pitch is limited to String. Wouldn’t the same idea work for any Sequence?

Also, what do you want to happen for overlaps?

"nananana".countOccurrences("nana")
// 2 or 3?

And what about returning an array of indices where each match begins?

Regarding overlaps it should works akin to the following method of standard library and its answer will be 2.

func components(separatedBy separator: String) -> [String]

I think it will be a great idea to generalise this pitch to Sequence.

I think that a generic version of this, that counts the occurrences of a subsequence (a collection with the same element type) in a collection, would be helpful. Something along the lines of the following I think would fit the bill:

extension Collection where Element: Equatable {
    
    /// Returns the number of occurences of a subsequence within the collection.
    ///
    /// Any collection with the same element type as `self` may be passed to the
    /// `sequence` parameter. In the following example, both a `String` and a
    /// `Substring`'s occurrences can be counted in a string because they both have
    /// the same element type as `String`,  `Character`.
    ///
    ///     let someString = "abcabcbc"
    ///     let matchString = "abc"  // "abc"
    ///     let matchSubstring = someString.suffix(2)  // "bc"
    ///
    ///     print(someString.occurrences(of: matchString))
    ///     // Prints "2"
    ///
    ///     print(someString.occurrences(of: matchSubstring))
    ///     // Prints "3"
    ///
    /// - Parameter subsequence: The collection whose occurrences of its sequence
    /// of elements, in order, are being counted.
    ///
    ///
    /// - Returns: The number of occurrences of `subsequence` in the collection.
    ///
    /// - Complexity: O(*n*), where *n* is the length of the collection.
    func occurrences<OtherCollection>(of subsequence: OtherCollection) -> Int where OtherCollection: Collection, OtherCollection.Element == Element {
        guard !isEmpty, !subsequence.isEmpty else {
            return 0
        }
        var runningCount = 0
        var currentIndex = startIndex
        var currentMatchIndex = subsequence.startIndex
        var isMatching = false
        while currentIndex < endIndex {
            let element = self[currentIndex]
            let otherElement = subsequence[currentMatchIndex]
            if element == otherElement {
                isMatching = true
                currentMatchIndex = subsequence.index(after: currentMatchIndex)
            }
            currentIndex = self.index(after: currentIndex)
            if isMatching, currentMatchIndex == subsequence.endIndex {
                runningCount += 1
                isMatching = false
                currentMatchIndex = subsequence.startIndex
            }
        }
        return runningCount
    }
}

Of course, the question is whether we want subsequence or substring, or both.

I feel like both would be the way to go, like in the generic method that I showed, because why limit yourself when you could use any collection with the same element type as the initial collection.

Also, I think that a generic method should also probably include an overlaps parameter to allow for overlapped matches to be counted. Here is an altered version of my generic implementation with an overlaps argument.

extension Collection where Element: Equatable {
    
    /// Returns the number of occurences of a subsequence within the collection.
    ///
    /// Any collection with the same element type as `self` may be passed to the
    /// `sequence` parameter. In the following example, both a `String` and a
    /// `Substring`'s occurrences can be counted in a string because they both have
    /// the same element type as `String`,  `Character`.
    ///
    ///     let someString = "abcabcbc"
    ///     let matchString = "abc"  // "abc"
    ///     let matchSubstring = someString.suffix(2)  // "bc"
    ///
    ///     print(someString.occurrences(of: matchString))
    ///     // Prints "2"
    ///
    ///     print(someString.occurrences(of: matchSubstring))
    ///     // Prints "3"
    ///
    /// In some cases, a subsequence may have overlapping occurrences in the collection. The
    /// following example shows the effect of the `overlaps` parameter when counting occurrences
    /// of a subsequence in a collection that contains overlapping occurrences of said subsequence.
    ///
    ///     let source = "bbbabbabbb"
    ///     let match = "bb"
    ///
    ///     // Overlapping matches
    ///     print(source.occurrences(of: match, overlaps: true))
    ///     // Prints "5"
    ///
    ///     // Non-overlapping matches
    ///     print(source.occurrences(of: match))
    ///     // Prints "3"
    ///
    /// - Parameter subsequence: The collection whose occurrences of its sequence
    /// of elements, in order, are being counted.
    /// - Parameter overlaps: A Boolean value indicating whether overlapping occurrences
    /// of `subsequence` will be counted as separate occurrences. Default value is `false`.
    ///
    /// - Returns: The number of occurrences of `subsequence` in the collection.
    ///
    /// - Complexity:
    ///    * If `overlaps` is `false`: O(*n*), where *n* is the length of the
    ///      collection.
    ///    * If `overlaps` is `true`:  O(*nm*) where *n* is the length of the collection and *m*
    ///      is the length of the subsequence.
    func occurrences<OtherCollection>(of subsequence: OtherCollection, overlaps: Bool = false) -> Int where OtherCollection: Collection, OtherCollection.Element == Element {
        guard !isEmpty, !subsequence.isEmpty, subsequence.count <= count else {
            return 0
        }
        var runningCount = 0
        var currentIndex = startIndex
        var currentOffset = 0
        var currentMatchIndices = [subsequence.startIndex]
        repeat {
            let element = self[currentIndex]
            for (i, matchIndex) in currentMatchIndices.enumerated().reversed() {
                let otherElement = subsequence[matchIndex]
                if element == otherElement {
                    let nextMatchIndex = subsequence.index(after: matchIndex)
                    if nextMatchIndex == subsequence.endIndex {
                        runningCount += 1
                        currentMatchIndices.remove(at: i)
                    } else {
                        currentMatchIndices[i] = nextMatchIndex
                    }
                } else {
                    currentMatchIndices.remove(at: i)
                }
            } while currentIndex < endIndex 
            currentIndex = self.index(after: currentIndex)
            currentOffset += 1
            if overlaps, count - currentOffset >= subsequence.count {
                currentMatchIndices.append(subsequence.startIndex)
            }
        }
        return runningCount
    }
}

Just a few things:

  • The outer loop could use for-in for more Swifty code.

  • The second algorithm should be O(nm), not O(n).

  • Your first algorithm is maximally counting non-overlapping subsequences. The second one is counting substrings.

    You’re using the same readme for both of them, so I’m not sure if you intend to have different algorithms, or not.

    • If you meant to have two algorithms, maybe we should name them differently to avoid confusion?

I don't understand the difference you are getting at between subsequences and substrings. Substring is String.SubSequence. Substrings are subsequences of strings. Also how is the overlapping occurrences algorithm O(nm)?

Also, bikeshedding some alternative names:

  • We could use elements instead of subsequence for the first parameter
  • We could use omittingOverlappingOccurrences instead of the overlaps parameter
  • We could call the method count(of:omittingOverlappingOccurrences:) instead of occurrences

Actually I understand why the latter algorithm is O(nm), I'll update the function.

See both wikis. Subsequence can be non-consecutive. Substring (not String.Substring) needs to be consecutive.

Note: Some use subarray as a more generic substring. Maybe we can switch to that.

No, that’s incorrect:

https://developer.apple.com/documentation/swift/collection/1641276-subsequence

https://developer.apple.com/documentation/swift/string/subsequence

That's not the subsequence Lantua is talking about. Check out the posted wiki links

1 Like

Yes, thank you.

Terms of Service

Privacy Policy

Cookie Policy