Removing elements at more than one index from an array?

Thanks! That's way more efficient than the simple solution I used:

extension Array {
    mutating func remove(elementsAtIndices indicesToRemove: [Int]) -> [Element] {
        let removedElements = indicesToRemove.map { self[$0] }
        for indexToRemove in indicesToRemove.sorted(by: >) {
            remove(at: indexToRemove)
        }
        return removedElements
    }
}

It would be interesting to see how this method would be properly (including more generically) implemented, for example I guess that it should trap if indicesToRemove contains duplicated indices.

Thank you for the small challenge, I hope I did not forget any corner case.

extension RangeReplaceableCollection where Index: Hashable {
  public mutating func removeAll<C>(at collection: C) -> Self where
    C: Collection,
    C.Element == Index
  {
    let indices = Set(collection)
    // Trap if number of elements in the set is different from the collection.
    // Trap if an index is out of range.
    precondition(
      indices.count == collection.count &&
      indices.allSatisfy(self.indices.contains)
    )
    // Create a new instance and reserve enough capacity to store all removed
    // elements.
    var result = Self.init()
    result.reserveCapacity(indices.count)
    // Remove the elements and fill the new collection.
    indices
      .lazy
      .sorted()
      .enumerated()
      .forEach { (offset, index) in
        let shiftedIndex = self.index(index, offsetBy: -offset)
        let element = remove(at: shiftedIndex)
        result.append(element) 
      }
    // Return the result collection.
    return result
  }
}

The whole trick is, you need to fix the indices for each iteration, which is quite easy since we know that each removal will decrease the count by one. With that knowledge we can simply offset every index by its negative enumerated offset. The indices must also be sorted before we do that.

1 Like

Checking sequence.underestimatedCount might be a not so good idea, a sequence is free to return zero (or some other value less than the actual number of elements).

var a = Array(0..<10)
let toRemove = a.indices.lazy.filter { $0 % 3 == 0 }

print(toRemove.underestimatedCount) // 0
let b = a.removeAll(at: toRemove) // 💣
2 Likes

Thank you, I updated the method to operate on collections instead.

Can someone answer the question if inout means the same as in-place if we would restructure the above example like this?!

extension RangeReplaceableCollection where Index: Hashable {
  public mutating func removeAll<C>(at collection: C) -> Self where
    C: Collection,
    C.Element == Index
  {
    let indices = Set(collection)
    // Trap if number of elements in the set is different from the collection.
    // Trap if an index is out of range.
    precondition(
      indices.count == collection.count &&
      indices.allSatisfy(self.indices.contains)
    )
    return indices
      .lazy
      .sorted()
      .enumerated()
      .reduce(into: .init()) { result, value in
        let (offset, index) = value
        if offset == 0 {
          result.reserveCapacity(indices.count)
        }
        let shiftedIndex = self.index(index, offsetBy: -offset)
        let element = remove(at: shiftedIndex)
        result.append(element)
      }
  }
}

Nice, but it seems to be even less efficient than my naive solution above.

Here's the time it takes to remove (a random list of) half the elements from an Array of Ints (containing 1024 * 256 elements):

0.013 seconds (@anon31136981's here)

2.9 seconds (mine here)

5.9 seconds (@DevAndArtist's here)

Shifting elements in order to be able to remove all elements in one go using removeLast(_:) (like @anon31136981's method does) is probably much faster than using remove(at:) for each element.

I used a command line program, release build, times are averages from 10 repeated runs in a loop, code was written to prevent any possible dead code elimination, MBP late 2013, 2 GHz i7.

Can you share your code that you used to benchmark the algorithms please?

This should be sufficient for trapping on duplicated indices, because shiftLeft() will be called with the same index twice in a row iff it was included at least twice in the array of indices to remove.

extension Array {
    mutating func remove(elementsAtIndices indicesToRemove: [Int]) -> [Element] {
        guard !indicesToRemove.isEmpty else {
            return []
        }
        
        // Copy the removed elements in the specified order.
        let removedElements = indicesToRemove.map { self[$0] }
        
        // Sort the indices to remove.
        let indicesToRemove = indicesToRemove.sorted()
        
        // Shift the elements we want to keep to the left.
        var destIndex = indicesToRemove.first!
        var srcIndex = destIndex + 1
        var prevRemovedIndex = destIndex
        func shiftLeft(untilIndex index: Int) {
            precondition(index != prevRemovedIndex)
            prevRemovedIndex = index
            while srcIndex < index {
                self[destIndex] = self[srcIndex]
                destIndex += 1
                srcIndex += 1
            }
            srcIndex += 1
        }
        for removeIndex in indicesToRemove[1...] {
            shiftLeft(untilIndex: removeIndex)
        }
        shiftLeft(untilIndex: self.endIndex)
        
        // Remove the extra elements from the end of the array.
        self.removeLast(indicesToRemove.count)
        
        return removedElements
    }
}

I’ll try to make it more generic.

1 Like

Here's the microbenchmark program I used, and it now uses Zach's updated version, which didn't change the time (still 0.013 s on my machine):

import QuartzCore // Only for using CACurrentMediaTime for timing.

extension Array {
    mutating func zachRemove(elementsAtIndices indicesToRemove: [Int]) -> [Element] {
        guard !indicesToRemove.isEmpty else {
            return []
        }
        
        // Copy the removed elements in the specified order.
        let removedElements = indicesToRemove.map { self[$0] }
        
        // Sort the indices to remove.
        let indicesToRemove = indicesToRemove.sorted()
        
        // Shift the elements we want to keep to the left.
        var destIndex = indicesToRemove.first!
        var srcIndex = destIndex + 1
        var prevRemovedIndex = destIndex
        func shiftLeft(untilIndex index: Int) {
            precondition(index != prevRemovedIndex)
            prevRemovedIndex = index
            while srcIndex < index {
                self[destIndex] = self[srcIndex]
                destIndex += 1
                srcIndex += 1
            }
            srcIndex += 1
        }
        for removeIndex in indicesToRemove[1...] {
            shiftLeft(untilIndex: removeIndex)
        }
        shiftLeft(untilIndex: self.endIndex)
        
        // Remove the extra elements from the end of the array.
        self.removeLast(indicesToRemove.count)
        
        return removedElements
    }
    
    mutating func jensRemove(elementsAtIndices indicesToRemove: [Int]) -> [Element] {
        let removedElements = indicesToRemove.map { self[$0] }
        for indexToRemove in indicesToRemove.sorted(by: >) {
            remove(at: indexToRemove)
        }
        return removedElements
    }
}

extension RangeReplaceableCollection where Index: Hashable {
    public mutating func removeAll<C>(at collection: C) -> Self where
        C: Collection,
        C.Element == Index
    {
        let indices = Set(collection)
        // Trap if number of elements in the set is different from the collection.
        // Trap if an index is out of range.
        precondition(
            indices.count == collection.count &&
                indices.allSatisfy(self.indices.contains)
        )
        // Create a new instance and reserve enough capacity to store all removed
        // elements.
        var result = Self.init()
        result.reserveCapacity(indices.count)
        // Remove the elements and fill the new collection.
        indices
            .lazy
            .sorted()
            .enumerated()
            .forEach { (offset, element) in
                let index = self.index(element, offsetBy: -offset)
                let element = remove(at: index)
                result.append(element)
        }
        // Return the result collection.
        return result
    }
}


func test() {
    let count = 1024*256
    var a = (0 ..< count).map({ $0 }).shuffled()
    let indicesToRemove = Array(a.indices.shuffled().prefix(count/2))
    
    let t0 = CACurrentMediaTime()
    let b = a.zachRemove(elementsAtIndices: indicesToRemove)
    //let b = a.jensRemove(elementsAtIndices: indicesToRemove)
    //let b = a.removeAll(at: indicesToRemove)
    let t1 = CACurrentMediaTime()

    print("  ", t1 - t0, "seconds",
        "(checksum:", a.reduce(1, &+) &+ b.reduce(1, &+), ")")
}

for _ in 0 ..< 7 { test() }

Note that it's very rudimentary, and you have to comment out / in the line using the method you want to test.

This is a nice implementation! It would make a good addition to the std lib. Some suggestions on tweaks:

  • Making it generic over MutableCollection where Self: RangeReplaceableCollection is pretty straightforward. The compiler will tell you what you need to change. Int becomes Index, += 1 becomes formIndex(after:)
  • The input indices can trivially be converted to a generic Sequence
  • If shiftLeft returns destIndex, you can use that to trim the end using replaceSubrange(lastRemaining..<endIndex) instead of using the count, which wouldn't be valid with a non-Int or non-zero-based index.
  • It's a shame, with an efficient in-place operation like this, to have to still allocate memory for the result and for the sorted indices
    • For the indices: if you didn't want to require the indices come pre-sorted, you could have a variant that checks each index is increasing, and when it isn't, falls back to sorting the remainders. I suspect the vast majority of use cases the input would be sorted
    • For returning the elements: while the user could discard the result, they'll still pay the price of allocating it unless the optimizer is really on their game. It may be better to expect users to save down the to-be-removed elements externally first. As you show, it's a simple one-liner
  • I'd recommend using MutableCollection.swapAt(_:_) rather than sliding elements over each other. It can be more efficient for non-trivial types as it avoids unnecessary reference count ops. It's slightly less efficient for trivial types (since you're copying the to-be-discarded bits) but trivial types are so fast to copy that's a trade-off worth making. It also means it'll still work in a future of collections of move-only types.
18 Likes

Thanks for the suggestions, Ben! Here’s my more generic version, with no memory allocations in the case where indicesToRemove is already sorted. indicesToRemove is a generic Collection rather than a Sequence because we need to iterate through it twice (once to check if it’s sorted, once to perform the removals) and as I understand it, Sequences don’t guarantee that they return the same elements each time.

extension MutableCollection where Self: RangeReplaceableCollection {
    private mutating func remove<C>(elementsAtSortedIndices indicesToRemove: C) where
        C: Collection,
        C.Element == Index
    {
        // Shift the elements we want to keep to the left.
        var destIndex = indicesToRemove.first!
        precondition(indices.contains(destIndex), "Index out of range")
        
        var srcIndex = index(after: destIndex)
        var previousRemovalIndex = destIndex
        func shiftLeft(untilIndex index: Index) {
            precondition(index != previousRemovalIndex, "Duplicate indices")
            while srcIndex < index {
                swapAt(destIndex, srcIndex)
                formIndex(after: &destIndex)
                formIndex(after: &srcIndex)
            }
            formIndex(after: &srcIndex)
        }
        let secondIndex = indicesToRemove.index(after: indicesToRemove.startIndex)
        for removeIndex in indicesToRemove[secondIndex...] {
            precondition(indices.contains(removeIndex), "Index out of range")
            shiftLeft(untilIndex: removeIndex)
        }
        shiftLeft(untilIndex: endIndex)
        
        // Remove the extra elements from the end of the collection.
        removeSubrange(destIndex..<endIndex)
    }
    mutating func remove<C>(elementsAtIndices indicesToRemove: C) where
        C: Collection,
        C.Element == Index
    {
        guard !indicesToRemove.isEmpty else {
            return
        }
        
        // Check if the indices are sorted.
        var isSorted = true
        var prevIndex = indicesToRemove.first!
        let secondIndex = indicesToRemove.index(after: indicesToRemove.startIndex)
        for index in indicesToRemove[secondIndex...] {
            if index < prevIndex {
                isSorted = false
                break
            }
            prevIndex = index
        }
        
        if isSorted {
            remove(elementsAtSortedIndices: indicesToRemove)
        } else {
            remove(elementsAtSortedIndices: indicesToRemove.sorted())
        }
    }
}
1 Like

It seems to me that isSorted might be useful as a property on Collection.

Also, I'm still figuring out this ownership thing, but I was wondering whether it makes sense to annotate the indicesToRemove argument as __owned and sort the indices in place where C: RandomAccessCollection & MutableCollection.

1 Like

I tried to ask @Ben_Cohen about the application he used to benchmark algorithms in his recent talk but it always resulted in radio silence. Luckily I just accidentally stumbled around the project on GitHub. :hugs: So if you want to benchmark the improvements of the algorithms in this thread feel free to try it with that tool and share your result.

In general I think we should use that tool more often in the evolution process.

3 Likes

This is pretty interesting and would actually fit in quite well with a proposal I was thinking of writing for the addition of distinct() and distinct(by:) to the stdlib (for the removal of duplicate elements from a collection).

There might be a few proposals that are worth chaining together here for some convenient collection additions.

You can probably get even more efficient if you use IndexSet and do a reverse enumeration on its contiguous subranges (using its rangeView). Building up an IndexSet from a collection of indices is by itself pretty efficient.

Something like (not tested in compiler):
mutating func removeElements(atIndexes indexes: IndexSet) { for subrange in indexes.rangeView.reversed() { removeSubrange(subrange) } }

Apologies for the formatting.

The method is called remove(elementsAtIndices:) and changes the number of elements (and therefore invalidates all outstanding indexes), so it's a RangeReplaceableCollection method. That is, make it generic over RangeReplaceableCollection where Self: MutableCollection.

I was thinking of a variant that doesn't depend on MutableCollection, which has to selectively copy into a new instance and then replace.

...

Looking through "RangeReplaceableCollection.swift," this philosophy is used by removeAll(where:), which has implementations that do and don't depend on MutableCollection. (Remember that a RRC+MC implementation can't be used by String!)

Hmm, here's some cool potential docs:

extension RangeReplaceableCollection {

    /// Removes the elements from the collection at the indices in the specified sequence.
    ///
    ///     var bugs = ["Aphid", "Bumblebee", "Cicada", "Damselfly", "Earwig"]
    ///     bugs.removeOnly(thoseAt: [1, 4])
    ///     print(bugs)
    ///     // Prints "["Aphid", "Cicada", Damselfly"]"
    ///
    /// Calling this method may invalidate any existing indices for use with this
    /// collection.
    ///
    /// - Parameter list: The positions of the elements to remove. Each index must
    ///   be a valid index of the collection that is not equal to the collection's
    ///   end index.
    ///
    /// - Complexity: O(*n*), where *n* is the length of the collection.
    public mutating func removeOnly<S: Sequence>(thoseAt list: S) where S.Element == Index {
        // Put one of those awesome algorithms here.
    }

    /// Removes the elements from the collection at the specified indices.
    ///
    ///     var bugs = ["Aphid", "Bumblebee", "Cicada", "Damselfly", "Earwig"]
    ///     bugs.removeOnly(elementsAt: 1, 4)
    ///     print(bugs)
    ///     // Prints "["Aphid", "Cicada", Damselfly"]"
    ///
    /// Calling this method may invalidate any existing indices for use with this
    /// collection.
    ///
    /// - Parameter list: The positions of the elements to remove. Each index must
    ///   be a valid index of the collection that is not equal to the collection's
    ///   end index.
    ///
    /// - Complexity: O(*n*), where *n* is the length of the collection.
    public mutating func removeOnly(elementsAt list: Index...) {
        removeOnly(thoseAt: list)
    }

}

For removeOnly(thoseAt:), I assumed that sorting the list and/or needing repeated access is an internal detail, not needed for the user to worry about. Now the second version should probably stay an extension method, but should the first one move to being a requirement? Its philosophy violates how replaceSubrange(_: with:) works since the targeted elements may be discontiguous, so optimized implementations may be desired as needed, but can still have a reasonable default implementation using replaceSubrange(_: with:). Having the first version as a requirement may also ease using the RRC-only vs. RRC+MC split default implementations that removeAll(where:) does.

I've been writing up a similar idea, based on adaptations of the C++ functions adjacent_find, unique, and unique_copy. I couldn't figure out a good names for some of these, but right now I got:

  • For std::adjacent_find( ForwardIt first, ForwardIt last, BinaryPredicate p )
    • Collection.firstAdjacentRange(by:)
    • BidirectionalCollection.lastAdjacentRange(by:)
  • For std::unique( ForwardIt first, ForwardIt last, BinaryPredicate p )
    • MutableCollection.partitionAwayAdjacentValues(by:)
    • RangeReplaceableCollection.removeAdjacentValues(by:), which also needs MC
  • For std::unique_copy( InputIt first, InputIt last, OutputIt d_first, BinaryPredicate p )
    • Sequence.filterUnique(storingAs: by:)
    • LazySequenceProtocol.filterUnique(by:)

where the last one's return type of LazyFilterUniqueSequence follows Collection or BidirectionalCollection if the wrapped sequence does.

Does the distinct(by:) method grabs all subsequent copies of a value, or only the ones that are contiguous (like C++ and my methods)? The difference doesn't matter if the collection is sorted first, like the implementations of remove in this thread probably would.

Calling removeSubrange() repeatedly might be marginally faster than remove(at:) in specific circumstances (where you have lots of contiguous indices), but it's still quadratic. My implementation could be adapted to use IndexSet where Index == Int though. You wouldn't have to worry about whether or not the indices are sorted then.

At least for arrays, I think if we get the unsafe access to uninitialized contents proposal it will make it fairly easy to implement one that both avoids unnecessary reference ops and copies like this (note: This is the version that doesn't save the elements it's about to remove)

extension Array {
	mutating func remove(elementsAtSortedIndices indicesToRemove: [Int]) {
		withUnsafeMutableBufferPointerToStorage(capacity: count) { (buffer, initializedCount) -> () in
			// `buffer.baseAddress` can only be nil if the array is already empty
			guard let basePtr = buffer.baseAddress else { return }
			var removalIndexIterator = indicesToRemove.makeIterator()

			var nextRemoval = removalIndexIterator.next()
			// Guarantees that the first loop will hit the removal condition
			// This way we're guaranteed to not attempt to moveInitialize a value from itself, since I have no idea if that's allowed
			guard var writeIndex = nextRemoval else { return }
			for readIndex in writeIndex..<initializedCount {
				if readIndex == nextRemoval {
					(basePtr + readIndex).deinitialize(count: 1)
					nextRemoval = removalIndexIterator.next()
				}
				else {
					(basePtr + writeIndex).moveInitialize(from: basePtr + readIndex, count: 1)
					writeIndex += 1
				}
			}

			initializedCount = writeIndex
		}
	}
}
2 Likes

Yup your algorithm is going to be more efficient overall, although it could probably still use an IndexSet version working on contiguous ranges so the library can block-move stuff around.

Noting too that if we're not mutating in place then it becomes a variant of an IndexSet subscript (just use the inverse of the removal index set over the full initial array's range), which is an easy enough algorithm to implement efficiently.