OrderedSet bulk insert operation?

Is there a way to bulk insert into an OrderedSet?

If I insert 1_000_000 items individually at index 0 into an ordered set it doesn't finish. I think a bulk insert would allow me to do this in o(n) by only having to update internal indexes and array once?

let count = 1_000_000
var collection = OrderedSet<Int>(minimumCapacity: count)
for i in 0..<count {
    collection.insert(i, at: 0)
}

On the other hand if I append those same 1_000_000 items to an OrderedSet it takes a reasonable 0.097 sec.

let count = 1_000_000
var collection = OrderedSet<Int>(minimumCapacity: count)
for i in 0..<count {
    collection.append(i)
}

Thanks,
Jesse

1 Like

Not an answer, but you can try a range

let count = 1_000
var collection = OrderedSet<Int>(minimumCapacity: count)
collection = OrderedSet(0...1_000_000)

Thanks, and yes that performs well. Most operations perform well. It's just the case where you need to insert many items at list start that is slow... and seems like bulk operation would solve that issue pretty easily.

Don't think there is any bulk insert and from the OrderedSet Documentation,

Inserting or removing a single member (or a range of members) needs to perform the corresponding operation in the storage array, in addition to renumbering any subsequent members in the hash table. Therefore, these operations are expected to have performance characteristics similar to an Array : inserting or removing an element to the end of an ordered set is expected to execute in O(1) operations, while they are expected to take linear time at the front (or in the middle) of the set.

Thus repeated calls to insert(_: at:) at the front or the middle of an OrderedSet will take longer than repeated append(_:) calls.

From Meet the Swift Algorithms and Collections packages

while OrderedSet is able to quickly look up existing elements and append new ones, it cannot efficiently implement removing or inserting an item at the front or middle of the set. Like array, these operations need to slide elements around in the storage array, but they also need to renumber subsequent indices in the hash table. This means that removals and insertions turn into operations with linear complexity, making these slower than the regular set. There is always a trade-off!

3 Likes

I think an insert(contentsOf:at:) method would probably be a reasonable addition, with some caveats.

The current version of OrderedSet doesn’t provide an insert(contentsOf:at:) method because

  1. the need to filter out duplicates makes it somewhat tricky to efficiently implement this — the number of elements we need to insert can be less then the number of items in the source collection, so we cannot easily make room for the new items in one pass.
  2. probably more importantly, I’m worried that the semantics of such insertions would not be obvious — i.e., clients may not expect the number of inserted elements to be less than the count of the input. insert(contentsOf:at:) is uncomfortably close to the RangeReplaceableCollection requirement, but it would work in a potentially unexpected way.

Like Set, OrderedSet does provide a direct bulk insertion operation in formUnion — the name of the method helps set expectations about the result there. On the other hand, the name formUnion doesn’t tell me where exactly the new elements will be inserted, so that isn’t a perfect name, either. formUnion is hardwired to append items; it would be possible to add a set of overloads of it that take an explicit target index, but it’s not clear to me if it’d better/clearer than insert(contentsOf:at:) would be.

Bulk insertions can also be done through the OrderedSet.elements view, which is a read-write property.

var set: OrderedSet<Int> = …
set.elements.insert(contentsOf: 0 ..< 100, at: 0)

Mutations of elements need to finish by regenerating the entire hash table & filtering out duplicates to restore uniqueness. This adds overhead, but this is still an O(n) operation, as opposed to O(n^2) for your original insert loop.

Note that the order of elements in the resulting ordered set is very sensitive to the way the insertions are implemented:

var s0: OrderedSet = [2, 4, 6]
for value in 0 ..< 8 { s0.insert(value, at: 0) } // O(n^2) đź’Ł
// s0 is [7, 5, 3, 1, 0, 2, 4, 6]

var s1: OrderedSet = [2, 4, 6]
s1.formUnion(0 ..< 8) 
// s1 is [2, 4, 6, 0, 1, 3, 5, 7]

var s2: OrderedSet = [2, 4, 6]
s2.elements.insert(contentsOf: 0 ..< 8, at: 0)
// s2 is [0, 1, 2, 3, 4, 5, 6, 7]

var s3: OrderedSet = [2, 4, 6]
s3.insert(contentsOf: 0 ..< 8, at: 0)
// s3 is [0, 1, 3, 5, 7, 2, 4, 6]

Of these, the behavior exhibited by s1 and s3 seem the most appropriate. Reordering existing items or reversing newly added items seems undesirable in the typical use case.

There are several ways we could resolve point (1) above — e.g., by appending items then moving them in place at the end, or by copying new items into a temp buffer first. In fact, the trickiness of these solutions would actually make a really good argument in favor of adding these in the package.

Point (2) I’m more worried about — for example, while RRC does not technically require insertions to increase the count of the collection by the number of elements given, but this seems like an obvious default expectation. On the other hand, it is true that this expectation is already violated by e.g. String:

var string = “hello”
var new = “\u{301}” // combining acute accent
string.count // 4
new.count // 1
string.insert(contentsOf: new, at: string.index(after: string.startIndex))
string // “héllo”
string.count // 4, not 4 + 1

However, while this weirdness is a niche edge case with String, in the case of OrderedSet, such behavior would be the norm, not the exception.

I think if we added OrderedSet.insert(contentsOf:), it would probably make sense to consider conforming OrderedSet to the full RangeReplaceableCollection. The question is: are there any generic RRC algorithms that would stop working correctly when operating on such an unruly type? For instance, any code that tried to reorder elements by inserting duplicates before removing the old items wouldn’t work. But then again, generic algorithms must assume that RRC methods invalidate indices, so perhaps this isn’t a huge concern. But still, RRC conformance would probably be a step too far.

I guess this post is just a long-winded way of saying that I think we should implement OrderedSet.insert(contentsOf:at:), in O(count + new.count) time, implementing the same ordering as the s3 example above. The “append new items then move them in place” strategy seems the most appropriate — it reduces the insertion to a formUnion followed a range-based move operation. (Which may make sense to itself add as a standalone public member.)

PRs would be welcome! :relaxed:

5 Likes