Count(of:) and contains(other:) for Collection

anthonylatsis · March 21, 2018, 10:23pm

Introduction

This is inspired by the proposal of adding count methods to Sequence. Apart from count(of: Element) and count(where:), there is another count method and, in addition, a contains method that I consider a sensible addition to Collection.

Methods

count<T: Collection>(of other: T, overlapping: Bool) -> Int where T.Element == Element

Returns the number of occurrences of the given subcollection in the collection. When overlapping is true, overlapping occurrences are also counted.

contains<T: Collection>(_ other: T) -> Bool where T.Element == Element

Returns a boolean value indicating whether the collection contains the given subcollection.

Motivation

Strings

There is a rather tricky and non-trivial way (for a regular user) to count the number of occurrences of a substring in a string:

str.components(separatedBy: "substring") - 1

The method used in this approach not only is unintuitive for the task in question, but also unnecessarily inefficient – a shortcoming that is partially solved with .lazy without gaining any advantage in semantics, ergonimics an readability.

StringProtocol has a contains<T>(_ other: T) method as part of the NSString compatibility API, which can become a special case of the proposed method, thus avoiding breaking changes and redundancy.

Arrays and Data

There is neither a way nor a relatively short workaround for counting the number of occurrences of a given subcollection in a collection or verify whether an ordered collection contains a given subcollection.
As @xwu pointed out, It is sensible to wonder and have a efficient way to find out if or how many times a sequence occurs in an instance of Data, Array and other ordered collections.

Issues

Counting how many times a subsequence appears in a sequence can be done either considering or ignoring overlapping occurrences.

[1, 1, 1].count(of: [1, 1]) == 1 // non-overlapping
[1, 1, 1].count(of: [1, 1]) == 2 // overlapping

A straight-forward and natural solution is an additional parameter, overlapping: Bool.

Splitting the method and hence extending count for semantic integrity, in my opinion, is an intuitively harmful approach that also isn't consistent with the existing count name family.

Implementation

`StringProtocol`

It is reasonable to assume the current StringProtocol.contains(subsequence:) method is well optimized for strings, therefore it should be left as is or replaced by a Hashable optimization of the proposed method if appropriate.

`Collection` default implementation

(*) *If the provided collection is empty, the below implementations return 1 and true respectively. However, I understand this can be misleading and it is yet to be discussed which variant is most convenient.

`count`

Code

extension Collection where Element: Equatable {
    
    @_inlineable
    public func count<T: Collection>(of other: T, overlapping: Bool) -> Int where T.Element == Element {
        
        if other.startIndex == other.endIndex { return 0 }
        if self.startIndex == self.endIndex { return 0 }
        
        var count = 0
        
        var currentMainSelfIndex = self.startIndex
        var currentHelperSelfIndex = self.startIndex
        var currentOtherIndex = other.startIndex
        
        if overlapping {
            while currentMainSelfIndex < self.endIndex {
                
                while other[currentOtherIndex] == self[currentHelperSelfIndex] {
                    
                    if other.index(after: currentOtherIndex) == other.endIndex {
                        
                        count += 1
                        break
                    }
                    if self.index(after: currentHelperSelfIndex) == self.endIndex { return count }
                    
                    currentHelperSelfIndex = self.index(after: currentHelperSelfIndex)
                    currentOtherIndex = other.index(after: currentOtherIndex)
                }
                currentMainSelfIndex = self.index(after: currentMainSelfIndex)
                currentHelperSelfIndex = currentMainSelfIndex
                currentOtherIndex = other.startIndex
            }
            return count
        }
        while currentMainSelfIndex < self.endIndex {
            
            while other[currentOtherIndex] == self[currentHelperSelfIndex] {
                
                if other.index(after: currentOtherIndex) == other.endIndex {
                    
                    count += 1
                    currentMainSelfIndex = currentHelperSelfIndex
                    break
                }
                if self.index(after: currentHelperSelfIndex) == self.endIndex { return count }
                
                currentHelperSelfIndex = self.index(after: currentHelperSelfIndex)
                currentOtherIndex = other.index(after: currentOtherIndex)
            }
            currentMainSelfIndex = self.index(after: currentMainSelfIndex)
            currentHelperSelfIndex = currentMainSelfIndex
            currentOtherIndex = other.startIndex
        }
        return count
    }
}

Complexity

n is the collection count, m is the subcollection count.
Time
- best: ϴ(n)
- worst: ϴ(nm)
- average: O(nm)
Memory ϴ(1)

`contains`

Code

extension Collection where Element: Equatable {
    
    @_inlineable
    public func contains<T: Collection>(_ other: T) -> Bool where T.Element == Element {
        
        if other.startIndex == other.endIndex { return true }
        
        var currentMainSelfIndex = self.startIndex
        var currentHelperSelfIndex = self.startIndex
        var currentOtherIndex = other.startIndex
        
        while currentMainSelfIndex < self.endIndex  {
            
            while other[currentOtherIndex] == self[currentHelperSelfIndex] {
                
                if other.index(after: currentOtherIndex) == other.endIndex {
                    
                    return true
                }
                if self.index(after: currentHelperSelfIndex) == self.endIndex { return false }
                
                currentHelperSelfIndex = self.index(after: currentHelperSelfIndex)
                currentOtherIndex = other.index(after: currentOtherIndex)
            }
            currentMainSelfIndex = self.index(after: currentMainSelfIndex)
            currentHelperSelfIndex = currentMainSelfIndex
            currentOtherIndex = other.startIndex
        }
        return false
    }
}

Complexity

n is the collection count, m is the subcollection count.
Time
- best: ϴ(n)
- worst: ϴ(nm)
- average: O(nm)
Memory ϴ(1)

Being closely related to the aforementioned proposal, the motivation is pretty much the same – to introduce a method for a common task, make it intuitive, easy to use and prevent the user from accidentally writing inefficient code.

xwu · March 21, 2018, 10:28pm

A wrinkle here that isn't found in the other pitch is that there's not one single possible interpretation of "count" for substrings, just as there isn't for sub-sequences generally of a sequence. For example, a user could rightly expect "aaaa".count(of: "aaa") to evaluate to 2.

anthonylatsis · March 21, 2018, 10:32pm

Yes, but that's for the different representations of a string (UTF8View, ...), which are : Collection at least. Am I missing something?

xwu · March 21, 2018, 10:36pm

Sorry, I don't understand what you mean here.

What I'm thinking of has to do with the meaning of "count" and applies generally for any sub-sequence of a sequence. For instance, it's not inherent in the naming whether [1, 1, 1, 1].count(of: [1, 1, 1]) should evaluate to 1 or 2. By contrast, the other pitch is about counting elements and not sub-sequences, for which there's only one possible interpretation.

Consider, for instance, if we had both firstIndex(of:) and lastIndex(of:) (we don't yet, but it's not so far-fetched), and if arguments could be sub-sequences [though, in that case, it'd probably best be named firstSubrange(of:) and lastSubrange(of:)]:

In the case of [1, 1, 1, 1], I think a user would be right to expect that the first subrange of [1, 1, 1] would not match the last subrange of [1, 1, 1]. Yet count(of:), as pitched here, would be 1! So there are some wrinkles here to be considered.

jarod · March 21, 2018, 10:55pm

How about a boolean parameter like distinct or unique to allow the developer to choose the behavior?

"aaaa".count(of: "aaa", distinct: true) → 1
"aaaa".count(of: "aaa", distinct: false) → 2

anthonylatsis · March 21, 2018, 10:58pm

Ah, my bad, your call was clear enough.

Indeed, this has to be considered. I imagined this method to work without intersections, linearly traversing the string.

jarod:

How about a boolean parameter like distinct or unique to allow the developer to choose the behavior?
"aaaa".count(of: "aaa", distinct: true) → 1
"aaaa".count(of: "aaa", distinct: false) → 2

Not sure about the semantics but it could be a solution, considering we default distinct to true

xwu · March 21, 2018, 11:02pm

And I agree that it's a potentially useful addition.

My suggestion would be to look into whether you can generalize the idea to counting sub-sequences of sequence types, and whether we could come up not one but two clearly named methods, either through the use of an additional parameter as @jarod suggests or through distinct labels. However, since I don't think that one interpretation is more "intuitive" than the other, I'd argue that even if we use an additional parameter it shouldn't default to one or the other.

[On a quick search, it looks like some commonly used terms to distinguish between the two semantics are "overlapping," versus non-overlapping or "consecutive."]

anthonylatsis · March 21, 2018, 11:12pm

I am always keen to generalizing, but when I think of an array of arbitrary objects, count(of: SubSequence) doesn't seem to be useful at all (It is expectable for someone to use it with Ints for instance, but why would someone need array.count(of: [view1, view2]) ).

xwu · March 21, 2018, 11:19pm

Besides arrays of value types, it's sensible to wonder how many times a particular sequence of bytes appears in an instance of Data, and I could probably come up with others if I thought about it.

You're right that any proposed method should have good motivation for being generic over X rather than Y, but I think if you were keen on trying to generalize, it'd be possible to come up with a respectable number of strong examples.

But in any case, I like the pitch.

anthonylatsis · March 22, 2018, 11:09pm

I just realized that Sequence doesn't have a count requirement – they can be single-pass.
That said, I think it would be sensible to start off with ~~Collection~~ and it's ~~Slice<Self>~~ .

Edit: Better BidirectionalCollection, since the methods in question become senseless for unordered collections.

SusanCheng · March 26, 2018, 1:45am

I have done this before
this is Boyer–Moore–Horspool algorithm

extension RandomAccessCollection where Indices : RandomAccessCollection {
    
    @_inlineable
    public func range<C : RandomAccessCollection>(of pattern: C, where isEquivalent: (Element, Element) throws -> Bool) rethrows -> Range<Index>? where C.Indices : RandomAccessCollection, C.Element == Element {
        
        let pattern_count = IndexDistance(pattern.count)
        if count < pattern_count {
            return nil
        }
        let reverse_pattern = pattern.reversed()
        var cursor = self.index(startIndex, offsetBy: pattern_count - 1, limitedBy: endIndex) ?? endIndex
        while cursor < endIndex {
            guard let not_match = try zip(self.indices.prefix(through: cursor).reversed(), reverse_pattern).first(where: { try !isEquivalent(self[$0], $1) }) else {
                let strat = self.index(cursor, offsetBy: 1 - pattern_count)
                let end = self.index(cursor, offsetBy: 1)
                return strat..<end
            }
            let notMatchValue = self[not_match.0]
            if let pos = try reverse_pattern.dropFirst().index(where: { try isEquivalent(notMatchValue, $0) }) {
                cursor = self.index(not_match.0, offsetBy: IndexDistance(reverse_pattern.distance(from: reverse_pattern.startIndex, to: pos)), limitedBy: endIndex) ?? endIndex
            } else {
                cursor = self.index(not_match.0, offsetBy: pattern_count, limitedBy: endIndex) ?? endIndex
            }
        }
        if try self.reversed().starts(with: reverse_pattern, by: isEquivalent) {
            let strat = self.index(endIndex, offsetBy: -pattern_count)
            return strat..<endIndex
        }
        return nil
    }
}

anthonylatsis · March 26, 2018, 4:12pm

Hello Susan,

I assume you meant this was an analogue for the contains method. It seems you didn't try to compare the speed with that of the algorithm I proposed.

Your algorithm is very hard to read, to be honest. It wouldn't matter, of course, if it was faster, but unfortunately it isn't. A quick test shows an approximate 20% delay even though it is supposed to be faster being written for RandomAccessCollection. With @_inlinable the difference grows up to 35%. Anyway, thank you for contributing. I'll see if I can extract something useful from your algorithm.

Please at least provide some complexity information if you plan on suggesting something else.

Edit. Actually, your algorithm is faster. I just fixed an issue that resulted in a slowdown. I will see if I can take advantage of the Boyer–Moore techniques you suggested as an optimization for Hashable types.

SusanCheng · March 27, 2018, 1:21am

My code doesn’t optimized. The idea comes from here

And here
https://www.google.com.hk/amp/blog.krzyzanowskim.com/2015/08/16/fast-pattern-search-in-swift-since-1974/amp/

I also think that it’s more useful with the api tell us the range of found pattern instead of just telling the count of pattern

anthonylatsis · March 27, 2018, 1:33am

You mean from here, originally.

The problem is that the Boyer–Moore algorithm is only useful in concrete cases, since we need some information to be able to preprocess the pattern. For example, strings. StringProtocol already has a well optimized (I assume) contains<T>(_ other: T) method, so we're not touching that part.

Those are two separate methods

CTMacUser · April 4, 2018, 7:40pm

This doesn't make sense.

BidirectionalCollection and unordered collections have nothing to do with each other. Adding this restriction for no reason doesn't just take out Dictionary and Set, but any other collection that happens to be forward-only.
There's no guarantee that Dictionary and Set won't become bi-directional in the future, ruining your correspondence.
Technically, the unordered collections are ordered! Although the regular user never uses Dictionary and Set in an ordered manner, they are ordered since they conform to both Collection and Sequence. (This is acknowledged by Dictionary and Set hiding their ordered docs in a sub-page, but those docs do exist.)

This means that count(of:) and contains(other:) should be on plain Collection, but:

You didn't want to put the methods on Sequence because count isn't on that protocol because it allows single-pass sequences. But Sequence has plenty of methods that have "so sad, too bad" consequences if the conforming type is truly single-pass. (One is a single-element contains!)

This finally means that count(of:) and contains(other:) should be on Sequence instead of BidirectionalCollection.

I'll think about this some more....

anthonylatsis · April 4, 2018, 7:57pm

Having useless methods on unordered collections is senseless, and since the Standard Library doesn't have an appropriate protocol for ordered collections, it is best to place those methods in BidirectionalCollection, to which all ordered collections of the Standard Library currently conform. That is at least due to the fact that a bidirectional collection has to be ordered. Hence, regarding your second statement – it is very hard to believe Set or Dictionary will become bidirectional.
Sequence is not a choice not only for this very reason and the absence of count – it is undesirable to load hierarchically high-level protocols with extended functionality.

lancep · April 4, 2018, 8:04pm

The "too bad, so sad" consequences for single pass sequences are at least sensible for the single element contains because that method only needs to visit each element in the sequence at most once. So the bad stuff in that case is that you might consume the entire sequence.

For the proposed additions, you can (potentially) visit each element more than once. Which may not be possible for all sequences.

anthonylatsis · April 4, 2018, 8:06pm

That is a good point. In other words, contains(element:) is O(n) with a constant <= 1, the proposed methods are O(mn).

CTMacUser · April 6, 2018, 9:15am

The problem is that you’re punishing all forward-only Collections to make sure Dictionary and Set don’t get the interface. That’s a price too high (unless the algorithm really needs two-way iteration). Design against Murphy, not Machiavelli; and this is deep in Machiavelli territory.

What’s the harm in letting Dictionary and Set have those methods? There is no (real) consequences. No one would be forcing you to call them on your objects.

Jens · April 15, 2018, 5:15pm

I think it's reasonable to assume that a method that checks whether a sequence contains a possible subsequence should be added on Sequence.

Especially since Sequence already has starts(with: Sequence) and our new contains(Sequence) would just be a more general variant of that, checking for the possible subsequence at every position instead of only the first.

So, assuming we add contains(Sequence) to Sequence …

(click to see implementation)

extension Sequence where Element: Equatable {
    /// Returns a Boolean value that indicates whether the sequence contains
    /// the given subsequence.
    ///
    /// - Parameter possibleSubsequence: The subsequence to find within the
    ///   the sequence. `possibleSubsequence` must be finite.
    /// - Returns: `true` if the sequence contains `possibleSubsequence`;
    ///   otherwise, `false`.
    public func contains<S>(_ possibleSubsequence: S) -> Bool
        where Element == S.Element, S : Sequence
    {
        // Pardon my quick and dirty implementation:
        var myIter = self.makeIterator()
        var subsIter = possibleSubsequence.makeIterator()
        guard let subsFirst = subsIter.next() else { return false }
        while let myNext = myIter.next() {
            if myNext == subsFirst {
                while true {
                    let myNext = myIter.next()
                    let subsNext = subsIter.next()
                    if myNext != subsNext {
                        if myNext == nil { return false }
                        if subsNext == nil { return true }
                        break
                    } else {
                        if myNext == nil { return true }
                    }
                }
                subsIter = possibleSubsequence.makeIterator()
                let _ = subsIter.next()
            }
        }
        return false
    }
}

… we can now do this:

let a = [1, 2, 3, 4, 5]
let b = [3, 4]
print(a.contains(b)) // true

But note that we can also do this:

let a = [1, 2, 3, 4, 5] as Set
let b = [3, 4] as Set
print(a.contains(b)) // false or true

I guess that the behavior of this last example is not in accordance with what most people would expect, from simply reading the code. Yet, we have done nothing wrong except forgetting that an unordered set is a sequence.

(Note that set a may or may not contain set b, depending on the order that their elements happens to be in, and this will be different between runs in recent (development snapshot) versions of the compiler.)