[API Review] Sorted Collections

Sorted Collections API Review

Hello! I'd like to post the overview of the new SortedDictionary and SortedSet APIs being brought to swift-collections. Please feel free to take a look below and leave any feedback!

Introduction

A module of Swift Collections containing new data structures which efficiently maintain their elements in sorted order. The initial release includes SortedSet and SortedDictionary.

Motivation

The Swift Collections package currently includes two data structures which maintain their members in a well-defined order: OrderedSet and OrderedDictionary. These types are useful in a variety of use cases, as they enumerate their elements in insertion order. However, there are also many situations where it's desirable to maintain elements in an order specified by a predefined comparator. For example, a common pattern in user interfaces is displaying a list with entries sorted in some order, such as chronological order.

Quick ad hoc implementations of sorted data structures can have many pitfalls. A naive implementation can devolve to quadratic performance, and a smarter implementation using binary search is difficult to get right and has subtle potential pitfalls. For this reason, there is a natural place to provide high-performance, production-grade sorted data structures.

Proposed Solution

Two new types, SortedSet and SortedDictionary, which guarantee logarithmic performance, mimicking their traditional Set and Dictionary counterparts as closely as possible.

Underlying these types is a B-Tree data structure. The use of a tree is notable in that it departs from a single contiguous storage buffer (as used by hash tables or arrays). Instead, a B-Tree can be thought of as a variant of a binary search tree in which each node stores multiple 'keys' in a contiguous buffer. The number of keys per node is a careful balancing act: binary search is typically fastest within a single contiguous buffer, but as a contiguous buffer grows, random insertions become slower and large allocations become more costly. A B-Tree implementation optimizes between these two characteristics.

B-Trees are also notably efficient at implementing the CoW optimization. Because the data is not stored within a single contiguous buffer, only the nodes along the path to the mutated element must be copied. The following example demonstrates this on a tree mutating element E:

         β”Œβ”€β”€β”€β”                        β”Œβ”€β”€β”€β”
         β”‚ D β”‚                        β”‚ D*β”‚
     β”Œβ”€β”€β”€β”΄β”€β”€β”€β”΄β”€β”€β”€β”                β”Œβ”€β”€β”€β”΄β”€β”€β”€β”΄β”€β”€β”€β”
     β”‚           β”‚                β”‚           β”‚
   β”Œβ”€β”΄β”€β”       β”Œβ”€β”΄β”€β”            β”Œβ”€β”΄β”€β”       β”Œβ”€β”΄β”€β”
   β”‚ B β”‚       β”‚ F β”‚    ───►    β”‚ B β”‚       β”‚ F*β”‚
  β”Œβ”΄β”€β”€β”€β”΄β”     β”Œβ”΄β”€β”€β”€β”΄β”          β”Œβ”΄β”€β”€β”€β”΄β”     β”Œβ”΄β”€β”€β”€β”΄β”
  β”‚     β”‚     β”‚     β”‚          β”‚     β”‚     β”‚     β”‚
β”Œβ”€β”΄β”€β” β”Œβ”€β”΄β”€β” β”Œβ”€β”΄β”€β” β”Œβ”€β”΄β”€β”      β”Œβ”€β”΄β”€β” β”Œβ”€β”΄β”€β” β”Œβ”€β”΄β”€β” β”Œβ”€β”΄β”€β”
β”‚ A β”‚ β”‚ C β”‚ β”‚ E β”‚ β”‚ G β”‚      β”‚ A β”‚ β”‚ C β”‚ β”‚ E*β”‚ β”‚ G β”‚
β””β”€β”€β”€β”˜ β””β”€β”€β”€β”˜ β””β”€β”€β”€β”˜ β””β”€β”€β”€β”˜      β””β”€β”€β”€β”˜ β””β”€β”€β”€β”˜ β””β”€β”€β”€β”˜ β””β”€β”€β”€β”˜

For larger trees, this means the vast majority of the data structure can be re-used.
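To illustrate, here is a minimal sketch (not from the proposal; the variable names are made up for the example) of what this looks like from the caller's perspective:

let original: SortedSet = ["A", "B", "C", "D", "E", "F", "G"]
var copy = original      // O(1): both values share the same tree nodes
copy.remove("E")         // copies only the nodes on the path to "E"
original.contains("E")   // true; the original value is unaffected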

Sorted data structures based on B-Trees are not bound to a uniqueness constraint. Dropping that constraint opens the door to efficient 'multi' variants of these data structures, such as dictionaries capable of storing multiple key-value pairs with the same key, or sets with multiple instances of the same element. This module focuses on developing a versatile and high-performance B-Tree data structure, along with sorted, unique variants of Set and Dictionary; the place of these 'multi' data structures is an open question enabled by the implementation of the underlying tree data structure.

Detailed Design

SortedSet

A SortedSet mimics the API of a traditional Set, but automatically maintains its elements in ascending sorted order:

var states: SortedSet = ["Nebraska", "Arkansas", "Colorado"]

// Familiar methods from SetAlgebra
states.formUnion(["Iowa", "Virginia"])

// Element-range based operations
states["D"...] // SubSequence with ["Iowa", "Nebraska", "Virginia"]

states // ["Arkansas", "Colorado", "Iowa", "Nebraska", "Virginia"]

Definition

struct SortedSet<Element: Comparable>
: BidirectionalCollection,
  SetAlgebra,
  ExpressibleByArrayLiteral,
  CustomStringConvertible, CustomDebugStringConvertible, CustomReflectable {
  struct Index { ... }
  typealias Indices = DefaultIndices<Self>
  struct Iterator { ... }
  struct SubSequence: BidirectionalCollection, Equatable {
    typealias Element = SortedSet.Element
    typealias Index = SortedSet.Index
    typealias SubSequence = Self
    struct Iterator { ... }

    var base: SortedSet { get }

    subscript(range: Range<Element>) -> Self
    subscript(range: PartialRangeFrom<Element>) -> Self
    subscript(range: PartialRangeThrough<Element>) -> Self
    subscript(range: PartialRangeUpTo<Element>) -> Self
  }
}

// Conditional conformances
extension SortedSet: Hashable where Element: Hashable {}
extension SortedSet: Encodable where Element: Encodable {}
extension SortedSet: Decodable where Element: Decodable {}

extension SortedSet.SubSequence: Hashable where Element: Hashable {}

By using a custom Index, SortedSet can maintain a constant-time subscript. Additionally, it provides a fast path for advancing an index. However, index-based mutations such as remove(at:) are by nature still logarithmic.
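As a rough usage sketch (my own, assuming the API above), index advancement and subscripting stay cheap, while index-based removal pays the logarithmic cost:

var values: SortedSet = [10, 20, 30, 40]
var i = values.startIndex
values.formIndex(after: &i)   // index advancement uses the fast path
values[i]                     // 20, via the constant-time subscript
values.remove(at: i)          // index-based mutation is still logarithmic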

SortedSet Specific APIs

SortedSet provides new methods which do not exist on a traditional Set:

extension SortedSet {
  /// Creates a sorted set from a sequence of sorted elements.
  ///
  /// If duplicates are encountered within the sequence, the last instance of
  /// the element is kept within the collection and preceding instances are
  /// discarded.
  ///
  /// This provides an optimized method of constructing a ``SortedSet`` when
  /// the initializer does not need to sort the data.
  ///
  /// On debug builds, this will trap if the provided sequence is not sorted.
  /// Otherwise, providing a non-sorted sequence can put the collection in an
  /// undefined state.
  ///
  ///  - Complexity: O(`n`) where `n` is the number of elements in `elements`.
  init<S: Sequence>(sortedElements elements: S) where S.Element == Element
}

Additionally, SortedSet offers new and convenient APIs which are unique to its sorted nature:

extension SortedSet {
  /// Returns a sequence of elements in the collection bounded by
  /// limiting element values.
  ///
  /// This is particularly useful when applied with a bound corresponding to some
  /// group of elements.
  ///
  ///     let students: SortedSet = ...
  ///     students["A"..<"B"]  // Sequence of students with names beginning with "A"
  ///
  /// - Complexity: O(log(`self.count`))
  subscript(range: Range<Element>) -> SubSequence
  subscript(range: PartialRangeFrom<Element>) -> SubSequence
  subscript(range: PartialRangeThrough<Element>) -> SubSequence
  subscript(range: PartialRangeUpTo<Element>) -> SubSequence
}

Individual implementations must exist for each of the Range types. The reason a generic implementation over RangeExpression cannot be used is that there are no known bounds to clamp to in order to obtain a concrete range.

These range subscripts are also implemented upon SortedSet.SubSequence.
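A small sketch (not from the proposal) showing that slicing composes, since the same subscripts exist on the slice type:

let words: SortedSet = ["apple", "banana", "cherry", "date", "fig"]
let early = words[..<"e"]     // SubSequence: ["apple", "banana", "cherry", "date"]
early["b"..<"c"]              // slicing a slice: ["banana"]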

SetAlgebra and the Set Interface Mimicked

SortedSet contains respective implementations of almost all the methods
specific to Set:

extension SortedSet: SetAlgebra {
  /// Creates an empty sorted set.
  public init()

  ///  - Complexity: O(`n` log `n`) where `n` is the number of elements in `elements`.
  public init<S: Sequence>(_ elements: S) where S.Element == Element

  // Testing for membership

  /// - Complexity: O(log(`self.count`))
  func contains(_ element: Self.Element) -> Bool


  // Adding and Removing Elements

  /// - Complexity: O(log(`self.count`))
  mutating func insert(_ newMember: Self.Element) -> (inserted: Bool, memberAfterInsert: Self.Element)
  /// - Complexity: O(log(`self.count`))
  mutating func update(with newMember: Self.Element) -> Self.Element?
  /// - Complexity: O(log(`self.count`))
  mutating func remove(_ member: Self.Element) -> Self.Element?


  // Combining Sets

  /// - Complexity: O(`self.count` + `other.count`)
  func union(_ other: Self) -> Self
  /// - Complexity: O(`self.count` + `other.count`)
  mutating func formUnion(_ other: Self)
  /// - Complexity: O(`self.count` + `other.count`)
  func intersection(_ other: Self) -> Self
  /// - Complexity: O(`self.count` + `other.count`)
  mutating func formIntersection(_ other: Self)
  /// - Complexity: O(`self.count` + `other.count`)
  func symmetricDifference(_ other: Self) -> Self
  /// - Complexity: O(`self.count` + `other.count`)
  mutating func formSymmetricDifference(_ other: Self)
  /// - Complexity: O(`self.count`)
  mutating func subtract(_ other: Self)
  /// - Complexity: O(`self.count`)
  func subtracting(_ other: Self) -> Self
   

  // Comparing Sets

  /// - Complexity: O(max(`self.count`, `other.count`))
  func isStrictSubset(of other: Self) -> Bool
  /// - Complexity: O(max(`self.count`, `other.count`))
  func isStrictSuperset(of other: Self) -> Bool
  /// - Complexity: O(max(`self.count`, `other.count`))
  func isDisjoint(with other: Self) -> Bool
  /// - Complexity: O(max(`self.count`, `other.count`))
  func isSubset(of other: Self) -> Bool
  /// - Complexity: O(max(`self.count`, `other.count`))
  func isSuperset(of other: Self) -> Bool
}
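A quick usage sketch (mine, not part of the proposal); results are shown in ascending order, per the type's invariant:

let evens: SortedSet = [2, 4, 6, 8]
let odds: SortedSet = [1, 3, 5, 7]

evens.union(odds)             // [1, 2, 3, 4, 5, 6, 7, 8]
evens.intersection(odds)      // []
evens.subtracting([4, 8])     // [2, 6]
evens.isDisjoint(with: odds)  // true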

The methods which are not implemented are those related to capacity. See below
for the rationale on exposing capacity:

init(minimumCapacity: Int)
var capacity: Int { get }
func reserveCapacity(Int)

Custom Implementations

SortedSet provides optimized implementations of certain methods from Collection:

/// - Complexity: O(log(`self.count`))
func randomElement<T: RandomNumberGenerator>(using generator: inout T) -> Element?

An optimized implementation can reduce the number of (non-trivial) indexing operations that the default implementation would otherwise perform.

Partial RangeReplaceableCollection

SortedSet implements a subset of the RangeReplaceableCollection methods it can
support, similar to OrderedSet. SortedSet cannot fully conform to
RangeReplaceableCollection as some of its methods could result in violating the sortedness invariant.

extension SortedSet {
  /// - Complexity: O(`self.count`)
  func filter(_ isIncluded: (Element) throws -> Bool) rethrows -> Self
  /// - Complexity: O(log(`self.count`))
  mutating func popFirst() -> Element?
  /// - Complexity: O(log(`self.count`))
  mutating func popLast() -> Element?
  /// - Complexity: O(log(`self.count`))
  @discardableResult
  mutating func removeFirst() -> Element
  /// - Complexity: O(log(`self.count`) + `k`)
  mutating func removeFirst(_ k: Int)
  /// - Complexity: O(log(`self.count`))
  @discardableResult
  mutating func removeLast() -> Element
  /// - Complexity: O(log(`self.count`) + `k`)
  mutating func removeLast(_ k: Int)
  /// - Complexity: O(log(`self.count`))
  @discardableResult
  mutating func remove(at index: Index) -> Element
  /// - Complexity: O(log(`self.count`) + `range.count`)
  mutating func removeSubrange(_ range: Range<Self.Index>)
}
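A brief sketch (mine) of removing a block of adjacent elements; firstIndex(of:) here is just the default linear-time implementation from the standard library:

var readings: SortedSet = [1, 2, 3, 4, 5, 6]
if let lower = readings.firstIndex(of: 3),
   let upper = readings.firstIndex(of: 6) {
  readings.removeSubrange(lower..<upper)   // removes 3, 4, and 5
}
// readings == [1, 2, 6]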

The methods which were not implemented due to design decisions are:

// Not implemented due to the potential to violate the uniqueness
// and sortedness invariants.
//
// However, it may be possible to design a variant of this which
// does not allow violating the invariants.
mutating func applying(_ difference: CollectionDifference<Element>) -> Self?

// In the case of duplicate elements, this does not make it clear
// which collection's element takes precedence.
//
// Instead, use of formUnion(_:) and union(_:) is preferred. This 
// is the same approach taken by Set.
static func +(Self, Other) -> Self

// See below for rationale on capacity-related methods
func reserveCapacity(Int)

The methods which are not logical to implement are:

// Would violate the uniqueness invariant
init(repeating: Element, count: Int)

// May violate the sortedness invariant
mutating func append(_ element: Element)

// May violate the sortedness invariant
mutating func append<S: Sequence>(contentsOf newElements: S) where S.Element == Element

// May violate the sortedness invariant
mutating func insert(_ element: Element, at index: Index)

// May violate the sortedness invariant
mutating func replaceSubrange<C: Collection>(_ range: Range<Index>, with: C) where C.Element == Element

SortedDictionary

A SortedDictionary mimics the API of a traditional Dictionary, but automatically maintains its elements in ascending order of their keys:

var httpCodes: SortedDictionary = [
  200: "OK",
  403: "Access forbidden",
  404: "File not found",
  500: "Internal server error"] 

// Familiar operations from Dictionary
httpCodes[301] = "Moved permanently"

httpCodes
// [200: "OK",
//  301: "Moved permanently",
//  403: "Access forbidden",
//  404: "File not found",
//  500: "Internal server error"]

Definition

struct SortedDictionary<Key: Comparable, Value>
: BidirectionalCollection,
  ExpressibleByDictionaryLiteral,
  CustomStringConvertible, CustomDebugStringConvertible, CustomReflectable {
  typealias Element = (key: Key, value: Value)
  struct Index { ... }
  typealias Indices = DefaultIndices<Self>
  struct Iterator { ... }
  struct SubSequence: BidirectionalCollection {
    typealias Element = SortedDictionary.Element
    typealias Index = SortedDictionary.Index
    typealias SubSequence = Self
    struct Iterator { ... }

    var base: SortedDictionary { get }
  }
}

// Key and Value views
extension SortedDictionary {
  struct Keys: BidirectionalCollection, Equatable {
    typealias Element = Key
    typealias Index = SortedDictionary.Index
    typealias Indices = Range<Index>
    struct Iterator { ... }
    struct SubSequence: BidirectionalCollection {
      typealias Element = SortedDictionary.Key
      typealias Index = SortedDictionary.Index
      typealias SubSequence = Self
      struct Iterator { ... }

      var base: SortedDictionary.Keys { get }
    }
  }

  struct Values: BidirectionalCollection {
    typealias Element = Value
    typealias Index = SortedDictionary.Index
    typealias Indices = Range<Index>
    struct Iterator { ... }
    struct SubSequence: BidirectionalCollection {
      typealias Element = SortedDictionary.Value
      typealias Index = SortedDictionary.Index
      typealias SubSequence = Self
      struct Iterator { ... }

      var base: SortedDictionary.Values { get }
    }
  }
}

// Conditional conformances
extension SortedDictionary: Equatable where Value: Equatable {}
extension SortedDictionary: Hashable where Key: Hashable, Value: Hashable {}
extension SortedDictionary: Encodable where Key: Encodable, Value: Encodable {}
extension SortedDictionary: Decodable where Key: Decodable, Value: Decodable {}

extension SortedDictionary.SubSequence: Equatable where Value: Equatable {}
extension SortedDictionary.SubSequence: Hashable where Key: Hashable, Value: Hashable {} 

extension SortedDictionary.Keys: Hashable where Key: Hashable {} 
extension SortedDictionary.Keys.SubSequence: Hashable where Key: Hashable {}

extension SortedDictionary.Values: Equatable where Value: Equatable {}
extension SortedDictionary.Values: Hashable where Value: Hashable {}
extension SortedDictionary.Values.SubSequence: Equatable where Value: Equatable {}
extension SortedDictionary.Values.SubSequence: Hashable where Value: Hashable {}

SortedDictionary Specific APIs

SortedDictionary exposes a few APIs which are not available on a traditional Dictionary.

extension SortedDictionary {
  /// Creates a sorted dictionary from a sequence of sorted elements.
  ///
  /// If duplicate keys are encountered within the sequence, the last key-value
  /// pair is kept within the collection and preceding instances are
  /// discarded.
  ///
  /// This provides an optimized method of constructing a ``SortedDictionary``
  /// when the initializer does not need to sort the data.
  ///
  /// On debug builds, this will trap if the provided sequence is not sorted.
  /// Otherwise, providing a non-sorted sequence can put the collection in an
  /// undefined state.
  ///
  ///  - Complexity: O(`n`) where `n` is the number of elements in `elements`.
  init<S: Sequence>(sortedKeysWithValues keysAndValues: S) where S.Element == Element

  /// Allows in-place mutation of a dictionary's value at a particular key.
  ///
  /// - Parameters:
  ///   - key: The key to look up (or insert). If `key` does not already exist
  ///       in the dictionary, it is inserted with the supplied default value.
  ///   - defaultValue: The default value to insert if `key` doesn't exist in
  ///       the dictionary.
  ///   - body: A function that performs an in-place mutation on the dictionary
  ///      value.
  /// - Returns: The return value of `body`.
  /// - Complexity: O(log(`self.count`))
  mutating func modifyValue<R>(
    forKey key: Key,
    default defaultValue: @autoclosure () -> Value,
    _ body: (inout Value) throws -> R
  ) rethrows -> R 
}

The modifyValue(forKey:default:_:) method allows for efficient in-place mutation of a dictionary's value at a key. This API parallels a similar method on OrderedDictionary.
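For instance, a usage sketch (mine, not from the proposal) that builds a frequency table with a single lookup per word:

var counts: SortedDictionary<String, Int> = [:]
for word in ["swift", "sorted", "swift", "set"] {
  counts.modifyValue(forKey: word, default: 0) { count in
    count += 1   // mutates the stored value in place
  }
}
// counts == ["set": 1, "sorted": 1, "swift": 2]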


Additionally, SortedDictionary offers new and convenient APIs which are unique to its sorted nature:

extension SortedDictionary {
  /// Returns a sequence of elements in the collection bounded by
  /// limiting key values.
  ///
  /// This is particularly useful when applied with a bound corresponding to
  /// some group of keys.
  ///
  ///     let httpCodes: SortedDictionary = ...
  ///     let successCodes = httpCodes[200..<300]
  ///
  /// - Complexity: O(log(`self.count`))
  subscript(range: Range<Key>) -> SubSequence
  subscript(range: PartialRangeFrom<Key>) -> SubSequence
  subscript(range: PartialRangeThrough<Key>) -> SubSequence
  subscript(range: PartialRangeUpTo<Key>) -> SubSequence
}

Individual implementations must exist for each of the Range types. The reason a generic implementation over RangeExpression cannot be used is that there are no known bounds to clamp to in order to obtain a concrete range.

These range subscripts are also implemented upon:

  • SortedDictionary.SubSequence
  • SortedDictionary.Keys
  • SortedDictionary.Keys.SubSequence
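Putting these together, a small usage sketch (mine) of key-range slicing on a dictionary and its keys view:

let codes: SortedDictionary = [
  200: "OK", 301: "Moved Permanently", 404: "Not Found", 500: "Internal Server Error"]

codes[400..<500]     // [404: "Not Found"]
codes.keys[300...]   // [301, 404, 500]
codes[..<300].count  // 1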

Dictionary Interface Mimicked

SortedDictionary implements almost entirely the same API as a standard dictionary:

extension SortedDictionary {
  // Initializers

  /// - Complexity: O(1)
  init()
  /// - Complexity: O(n log n) where n is the number of elements in `keysAndValues`
  init<S: Sequence>(_ keysAndValues: S) where S.Element == Element
  /// - Complexity: O(n log n) where n is the number of elements in `keysAndValues`
  init<S: Sequence>(grouping: S, by: (S.Element) -> Key) where Value == [S.Element]

  
  // Accessing Keys and Values
  /// - Complexity: O(log(`self.count`))
  subscript(key: Key) -> Value? { get set _modify }
  /// - Complexity: O(log(`self.count`))
  subscript(
    key: Key,
    default defaultValue: @autoclosure () -> Value
  ) -> Value { get set _modify }

  /// - Complexity: O(log(`self.count`))
  func index(forKey key: Key) -> Index?

  /// - Complexity: O(1)
  var keys: Keys { get }
  /// - Complexity: O(1)
  var values: Values { get set _modify }


  // Adding Keys and Values
  /// - Complexity: O(log(`self.count`))
  mutating func updateValue(_ value: Value, forKey key: Key) -> Value?
  /// - Complexity: O(`self.count` + `other.count`)
  mutating func merge<S: Sequence>(_ other: S, uniquingKeysWith combine: (Value, Value) throws -> Value) rethrows
  /// - Complexity: O(`self.count` + `other.count`)
  mutating func merge(_ other: SortedDictionary<Key, Value>, uniquingKeysWith combine: (Value, Value) throws -> Value) rethrows 
  /// - Complexity: O(`self.count` + `other.count`)
  func merging<S: Sequence>(_ other: S, uniquingKeysWith combine: (Value, Value) throws -> Value) rethrows -> SortedDictionary<Key, Value>
  /// - Complexity: O(`self.count` + `other.count`)
  func merging(_ other: SortedDictionary<Key, Value>, uniquingKeysWith combine: (Value, Value) throws -> Value) rethrows -> SortedDictionary<Key, Value>


  // Removing Keys and Values
  /// - Complexity: O(log(`self.count`))
  mutating func removeValue(forKey key: Key) -> Value?
  

  // Transforming a Dictionary
  /// - Complexity: O(`self.count`)
  func mapValues<T>(_ transform: (Value) throws -> T) rethrows -> SortedDictionary<Key, T>

  /// - Complexity: O(`self.count`)
  func compactMapValues<T>(_ transform: (Value) throws -> T?) rethrows -> SortedDictionary<Key, T>
}
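A brief usage sketch (mine, not from the proposal) exercising the defaulting subscript and mapValues:

var histogram: SortedDictionary<Character, Int> = [:]
for character in "sorted collections" {
  histogram[character, default: 0] += 1   // defaulting subscript, as on Dictionary
}
histogram["o"]                  // Optional(3), an O(log(count)) lookup
histogram.mapValues { $0 * 2 }  // a new SortedDictionary with doubled counts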

The methods not implemented are the following:

// SortedDictionary can efficiently handle duplicate keys, so these
// initializers are not as useful.
init<S: Sequence>(uniqueKeysWithValues: S) where S.Element == Element
init<S: Sequence>(S, uniquingKeysWith: (Value, Value) -> Value) where S.Element == Element

// See below for the rationale behind not implementing capacity-related
// methods.
init(minimumCapacity: Int)
var capacity: Int
func reserveCapacity(_ capacity: Int)

Custom Implementations

SortedDictionary provides optimized implementations of certain methods from Collection:

func randomElement<T: RandomNumberGenerator>(using generator: inout T) -> Element?

Partial RangeReplaceableCollection

SortedDictionary implements most of the methods specified by RangeReplaceableCollection. These additional methods are possible to implement thanks to the well-defined ordering of a SortedDictionary:

extension SortedDictionary {
  // Implemented on Dictionary
  /// - Complexity: O(`self.count`)
  func filter(_ isIncluded: (Element) throws -> Bool) rethrows -> Self

  /// - Complexity: O(log(`self.count`))
  mutating func popFirst() -> Element?

  /// - Complexity: O(log(`self.count`))
  mutating func popLast() -> Element?
  /// - Complexity: O(log(`self.count`))
  @discardableResult
  mutating func removeFirst() -> Element
  /// - Complexity: O(log(`self.count`) + `k`)
  mutating func removeFirst(_ k: Int)
  /// - Complexity: O(log(`self.count`))
  @discardableResult
  mutating func removeLast() -> Element
  /// - Complexity: O(log(`self.count`) + `k`)
  mutating func removeLast(_ k: Int)
  /// - Complexity: O(log(`self.count`))
  @discardableResult
  mutating func remove(at index: Index) -> Element
  /// - Complexity: O(log(`self.count`) + `range.count`)
  mutating func removeSubrange(_ range: Range<Self.Index>)
}

The methods which were not implemented due to design decisions are:

// Not implemented due to the potential to violate the uniqueness
// and sortedness invariants. For this reason, this is also not
// implemented on a traditional Dictionary.
//
// However, it may be possible to design a variant of this which
// does not allow violating the invariants.
mutating func applying(_ difference: CollectionDifference<Element>) -> Self?

// In the case of duplicate keys, this does not make it clear
// which collection's element takes precedence.
//
// Instead, use of the merge(...) methods is preferred. This is the
// same approach taken by the standard Dictionary type.
static func +(Self, Other) -> Self

// See below for rationale on capacity-related methods
func reserveCapacity(Int)

The methods which are not logical to implement are:

// Would violate the uniqueness invariant
init(repeating: Element, count: Int)

// May violate the sortedness invariant
mutating func append(_ element: Element)

// May violate the sortedness invariant
mutating func append<S: Sequence>(contentsOf newElements: S) where S.Element == Element

// May violate the sortedness invariant
mutating func insert(_ element: Element, at index: Index)

// May violate the sortedness invariant
mutating func replaceSubrange<C: Collection>(_ range: Range<Index>, with: C) where C.Element == Element

Considerations

Constant-time startIndex

The sorted collections do notably depart from the specified requirement of an O(1) Collection.startIndex. Maintaining a reference to the first node adds considerable complexity to the implementation and has non-trivial performance trade-offs on all other operations.

Exposing offset-based subscripts

Many of the other collections expose offset-based subscripts. Specifically, Array, Deque, and the ordered collections module primarily use integer-offset subscripts. The sorted collections can also efficiently identify their elements by ordinal position and could therefore expose such an API as well. As a result, there is motivation to provide some extent of offset-based APIs.

However, those data structures conform to RandomAccessCollection and can provide an O(1) integer subscript. This is not the case for a sorted collection, which requires logarithmic time to find elements. Additionally, in contrast to OrderedSet and OrderedDictionary, a sorted collection cannot efficiently use an integer as its index type (as it would not be able to provide an O(1) subscript or offset-advancement fast path). Using an optimized implementation, a custom Index can achieve O(1) amortized performance when advancing an index†, while integer offsets would only be able to achieve O(log n) performance.

This means that if integer offsets were exposed, they wouldn't make sense to use as the Index type, and a parallel set of integer-offset based APIs would need to exist. This increases API surface area considerably and adds maintenance burden.

Furthermore, there is a potential for ambiguity in the case of SortedDictionary:

let httpCodes: SortedDictionary = [
  200: "OK",
  403: "Access forbidden",
  404: "File not found",
  500: "Internal server error"]

httpCodes[200] // Does `200` refer to the offset or the key?

These sorted collections are not the only type facing these concerns. The String type in the standard library shares similar performance questions. Rather than implement an ad hoc solution for this, we believe the best approach would be future language-level or standard library developments which ease the syntactical burden of expressing index manipulations.

Instead, it is currently recommended to use the Index type to perform offset-based operations:

let measurements: SortedSet<Int> = ...
var medianIndex = measurements.startIndex
measurements.formIndex(&medianIndex, offsetBy: measurements.count / 2)
let median: Int = measurements[medianIndex]

† While a custom Index can in principle achieve O(1) amortized advancement, the current implementation does not, as doing so requires maintaining a memory-heavy stack of all nodes along the path -- a trade-off deemed too significant. Instead, O(1) amortized advancement is achieved through the iterator implementations on the sorted data structures and their respective subsequences. The current index implementation does, however, use an efficient fast path, which makes advancement near constant-time. Currently, for all practically possible dataset sizes, the log(n) factor is below 3.

Violating Sortedness Invariants

It is expected that an object's Comparable implementation is properly defined. If not,
the collections will not panic or trap, but their behavior is undefined and they may
return incorrect values. An example of an ill-defined Comparable implementation is:

struct Person: Comparable {
  var name: String
  var age: Int

  // Ill-defined: the result depends on mutable external state.
  static func <(lhs: Self, rhs: Self) -> Bool {
    if externalSource.useNameForComparison() {
      return lhs.name < rhs.name
    } else {
      return lhs.age < rhs.age
    }
  }
}

The standard library in multiple places trusts that an implementation does indeed fulfill its stated invariants. Since the sorted collections neither trap, panic, nor violate Swift's memory safety, we've decided it is acceptable to expect this.

Conforming to RandomAccessCollection

While RandomAccessCollection does not provide new methods on its own, other tooling, libraries, and frameworks often depend on performant random access by checking conformance to RandomAccessCollection. Examples are the Swift Algorithms package and SwiftUI.

While the proposed implementation's access performance grows very slowly with collection size, almost all operations are O(log collection.count), so it would be wrong to conform to RandomAccessCollection.

Exposing the concept of capacity

The B-Trees used within these sorted collections do have a concept of capacity (in contrast to other trees such as AVL or Red-Black Trees). However, there are a few pitfalls to exposing it.

  • It is not immediately obvious how varying a B-Tree's capacity affects performance, in contrast to other data structures such as Array. An Array is a contiguous buffer, so reserveCapacity(_:) provides a clear optimization by reducing future allocations. With B-Trees, however, performance as a function of node capacity follows a 'U'-shaped curve: a larger node capacity will likely mean slower lookups, since the default capacities have already been tuned for optimal performance.

  • In a tree where self.count < self.capacity, inserting an element may still require an allocation, as the node with the free slot may differ from the node into which the element must be inserted to maintain sortedness.

  • Efficiently maintaining the overall tree's free capacity would add memory and runtime overhead, and it is not clear what a capacity property would be used for, aside perhaps from debugging and instrumentation, to warrant the trade-offs.

Providing custom comparators

A common need is to customize the property by which elements are sorted within these collections. A few of the most obvious and ergonomic approaches are:

  • Passing a KeyPath parameter to specify a property on which to sort
  • Providing a closure implementing some comparison predicate such as < similar to the Sequence.sorted(by:) API

The primary shortfall is that both approaches would induce significant runtime overhead, as every internal method would need to store the key path or closure and defer to it for every comparison operation.

The current goal of the package is to provide low-level data structures capable of performing common operations efficiently. In their current form, it is possible to build such an interface on top of the existing SortedSet and SortedDictionary implementations, for example:

struct CustomSort<Element, Key: Comparable>: Comparable {
  var element: Element
  var property: KeyPath<Element, Key>

  init(_ value: Element, on property: KeyPath<Element, Key>) {
    self.element = value
    self.property = property
  }

  static func ==(lhs: Self, rhs: Self) -> Bool {
    lhs.element[keyPath: lhs.property] == rhs.element[keyPath: rhs.property]
  }

  static func <(lhs: Self, rhs: Self) -> Bool {
    lhs.element[keyPath: lhs.property] < rhs.element[keyPath: rhs.property]
  }
}

let sortedStudents = SortedSet(
  students.map { student in
    CustomSort(student, on: \.name)
  }
)

Whether this is within scope, and a common enough pattern to warrant an additional data structure in the package, may be a future question.

30 Likes

This is great to see. Is there any rough timeframe when these data structures might appear in the current Collections package?

I'm currently using the BTree implementation provided by @lorentey over on GitHub. I use it in a file manager application for managing an internal data structure that loosely maps to the hierarchical nature of file system entries.

I use the raw BTree interface over the provided Map and SortedSet implementations because in that package, the BTree API allows for more control during iteration.

Being able to start the iteration at a known offset and being able to skip the iterator forward a known amount have been beneficial in my use-cases.

Looking forward to seeing how this turns out. Thanks for sharing.

3 Likes

Is it the plan to make the underlying BTree implementation public as well? That would provide a fantastic building block for many other kinds of structures, too.

5 Likes

Our plan is to release version 1.0 of the package relatively soon with the three original data structures, then follow it up quickly with a 1.1 containing new additions, including these sorted collections.

(If things go well, I'm hoping for the new additions to get into a stable release later this year.)

The equivalent solution in the proposed API is to iterate using indices. Would that work for you?

The only potential pitfall I can see is that iterating using indexing might be slightly slower than using the iterator -- I guess we'll see whether this is a problem worth addressing once we compare benchmarks between the two. Adding new methods to reposition an iterator isn't necessarily a bad idea -- however, indices are the "official" abstraction for such use cases, so they would be the preferred approach. (E.g., generic algorithms wouldn't be able to use custom iterator methods.)

For simply iterating from a specific position, I expect slices would work well. (The BTree package (incorrectly!) used self-slicing which would have made this approach a bit more costly.)

Some of those other kinds of structures would likely make great additions to this package. (For instance, we already have a feature request for sorted multisets and -multimaps. A tree-based list structure could also be interesting.)

I would prefer not to expose the raw B-tree implementation (yet?), as that would mean we'd need to commit to a source stable interface for it. (Its interface is full of ad hoc solutions and sharp edges that make it a lot less polished/safe than the public types.) I also expect we'll need to change parts of the current interface when we implement some data structures that support duplicate keys, like the multiset above.

For custom one-off data structure implementations that wouldn't make good additions to the package, it may be a good strategy to manually copy the internal B-tree code. (Following the terms of the license, of course.) I'm hoping this would rarely be necessary though.

5 Likes

I honestly don't know and haven't yet explored it enough to comment on, sorry. I only recently started playing around with your private BTree package after realizing I had purchased the ObjC book years ago...

FWIW, I'll provide a bit more context to explain my use-case just to provide one datapoint of how the package might be used.

I used a BTree to store sorted, hierarchical items that represent a file system hierarchy. The items are presented to the user in an outline view. The key is an IndexSet while the value is a Swift struct.

Because my key represents hierarchical information, I sometimes want to "skip all the children" of a given node to get to the next sibling while I'm iterating. This might be the case if the user had collapsed an item and I want to find the next entry that is a sibling of the entry that was collapsed. (Because all entries in-between will now be hidden as they represent children of the collapsed node.)

The iterator + seek/offsets proved to be great for that because I just stay within a single while loop, skipping the iterator ahead as necessary.

More than happy to explore other techniques, just figured I'd share what I'm currently doing. Can't wait to see this as an official package.

1 Like

Oh, outline views are a really interesting use case.

It sounds like indices would work really well for this. SortedSet and SortedDictionary implement the standard index(_:offsetBy:) method in O(log(count)) time -- so there is no algorithmic advantage to iterator repositioning.
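For example, something along these lines (just a sketch) should cover the repositioning case:

let set: SortedSet = [1, 2, 3, 5, 8, 13, 21]
var position = set.startIndex
set.formIndex(&position, offsetBy: 3)   // O(log(set.count))
set[position]                           // 5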

1 Like

offsetBy: requires that I know the offset of the next key, doesn't it? In my case, I don't necessarily know what the offset is, I just know what the next key might be. Given an iterator's current position, I only ever want to move it forward to a given key, not necessarily a given offset. (Though perhaps that makes no difference from a "Big O" perspective...)

Note that I also use the BTreeCursor API for doing batch removal and insertion. In my case, if the user were to expand an item, I might re-fetch the directory entries from the file system, remove all existing child entries from the BTree and replace them with the newly fetched entries. I guarantee that the ordering is correct because the entries will be sorted as part of the function that reads them from disk.

Here is some rough code showing how I might use a cursor. This code might not necessarily be the correct way to do this.

// treeItem is the item that represents a node
// in the outline that is being expanded.

let startKey = treeItem.key
let endKey = treeItem.key.nextSiblingKey()

myTree.withCursor(onKey: startKey) { cursor in
  
  // Skip over the item being expanded.
  if !cursor.isAtEnd, cursor.key == startKey {
    cursor.moveForward()
  }
  
  // Remove all items that are children of the item being expanded.
  while !cursor.isAtEnd, cursor.key < endKey {
    cursor.remove()
  }
  
  // Move the cursor back to one-past the item being expanded. (Necessary?)
  cursor.move(to: startKey)
  cursor.moveForward()

  // Insert the new items as children of the item being expanded.
  cursor.insert(incomingItems.lazy.map { ($0.key, $0) })
}
1 Like

I feel like this is more generally useful for other collections as well, like in an array of integers where they would be returned in the same order, or a dictionary where the order could be undefined as long as the keys are within the range, or a String where…
Maybe this could be moved to a separate, more generally applicable proposal? The same for the SortedDictionary subscripts.

What's unique about SortedSet and SortedDictionary is that they can calculate this in O(log n) time and return a SubSequence because the elements are guaranteed to be adjacent.

A similar affordance on Array, Dictionary, or String would be O(n) and would need to allocate to return the result (or be a lazy algorithm). For those cases, filter is probably sufficient:

let x = (0...5).shuffled() 
x.filter { (1...4).contains($0) } // [2, 1, 4, 3]
6 Likes

I’d be terribly disappointed if this API shipped and if we couldn’t set the comparator ourselves, or even produce a copy of the SortedCollection with a different comparator.

The rest of the API seems great! I just would find myself having to use SortedDictionary instead of SortedSet 90% of the time, and calculate keys such that my desired sorting is achieved, and that’s… incredibly inconvenient. Or I’d have to box them in a struct that compares the way I prefer? Either way, ouch.

5 Likes

In the proposed implementation, would the sort always be ascending? I think this feature could be even more useful if sorted types had an additional Algorithm generic type that would conform to a SortingAlgorithm protocol that would define how the elements in the collection should be sorted. I would suggest having a default Ascending type for this generic type, but since that's not a thing yet, I won't.

protocol SortingAlgorithm {
    associatedtype Element

    func sort(_ lhs: Element, before rhs: Element) -> Bool
}

struct SuffixAscendingSort: SortingAlgorithm {
    func sort(_ lhs: String, before rhs: String) -> Bool {
        guard let lhsLast = lhs.last else { return true }
        guard let rhsLast = rhs.last else { return false }
        return lhsLast < rhsLast
    }
}

let ascSec = SortedSet<String, SuffixAscendingSort>(["Bar", "Ahz", "Foo", "", "Pi"])
// "", "Pi", "Foo", "Bar", "Ahz"

We also discussed this in the PriorityQueue review:

No comparator is still my preference for the sorted collections, as it's consistent with the existing types in the stdlib and the rest of the collections package.

This is what I'd recommend.

Part of my thinking is that if we keep everything consistent, it improves the likelihood we can identify a language feature that addresses this and applies equally well to all collections. For example, imagine something like a macro that synthesizes the boxes on your behalf:

Equal(person, by: \.name)
Compare(person, by: \.name)
Hash(person, by: \.name)
3 Likes

Unfortunately, this feels like abuse of the Dictionary type, and if I wanted to sort people by name, then by age, I have no intrinsic idea about how I would go about generating a suitable sortable key.

Alphabetical sorting is just one of many ways you might want to sort strings.

This was my thought too. SortedSet and SortedDictionary shouldn't require Comparable items; they should take comparator functions ((Element, Element) -> Bool) and then have convenience initializers that show up when the element happens to also be Comparable.

But I agree: this API otherwise looks terrific.

5 Likes

Protocol conformances are the bread and butter of Swift. Just like Dictionary and Set rely on the Hashable protocol, the (core versions of) SortedDictionary and SortedSet must rely on Comparable.

This isn't really negotiable -- Comparable gives us significant benefits that are impossible to achieve with closures:

  • Closures aren't equatable, by their nature. This means that there'd be no way to quickly determine/assume that two sorted sets use the same sort order. This would e.g. slow down set operations. In contrast, a particular type can only implement Comparable once, so the type system guarantees that two SortedSets of the same type will always use the same ordering.
  • When using Comparable, the implementation of </== can be known at compile time, so the compiler is able to optimize these during specialization -- e.g., by making dispatch static, or inlining their implementations, or inferring aspects of their behavior: e.g. a "pure" comparison can be better optimized than one with observable side effects. None of these optimizations are possible with closures.

Protocols like Comparable are there for a reason; we should not be afraid to use them, and to define custom conformances as needed. To customize the sort order, we simply need to define a wrapper type over Element (or Key). This isn't difficult -- conforming types to protocols is a basic Swift skill.

For instance, here is a simple adapter type that reverses the sort order.

struct Decreasing<Base: Comparable>: Comparable {
  var base: Base 

  static func ==(left: Self, right: Self) -> Bool { left.base == right.base }
  static func <(left: Self, right: Self) -> Bool { right.base < left.base }
}

let set = SortedSet<Decreasing<Int>>((0 ... 10).lazy.map { Decreasing(base: $0) })
// `set` now contains the values 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
print(set.first!.base) // 10

The big limitation, of course, is that the sort order must be known at compile time, because it is encoded in the type of SortedSet. Sorted collections that support selecting a sort order at runtime have some very important applications, especially in human interfaces. I would certainly not be opposed to adding additional collection types that support a dynamic sort order -- that would be obviously desirable followup work. There are several ways we can go about it; key paths (or some swiftified version of sort descriptors) seem like a reasonable idea. This is followup work that does not need to block the landing of the APIs proposed here.

(Edit: Note that the order in which data is presented on screen doesn't necessarily need to be reflected in the underlying model. Arguably, user interface engineers shouldn't need to manually deal with sorting their data themselves -- rather, in contexts where the user can select the sort order, the UI widget (such as a multi-column table view) should itself implement sorting in whatever way makes the most sense in its context. Of course, a dynamic sorted collection type could come useful there -- but this would be an implementation detail of the UI library that wouldn't necessarily be exposed to clients.)

18 Likes

Ah, this is an interesting point -- I assumed each node in the outline hierarchy would know how many descendants it has. If that's not a reasonable assumption in general, then we could add some variation of index(_:offsetBy:) that takes an element/key instead of an offset.

One complication is that for this to be useful, we'd want the operation to return an index even if the target element isn't in the collection -- it shouldn't return an optional like index(of:)/index(forKey:) does. This makes naming the method a bit tricky. (The current design elegantly sidesteps such naming issues by using key-based range subscripts.)

// Unsatisfying naming attempts:

extension SortedSet {
  func firstIndex(
    ofLowerBound target: Element, 
    relativeTo hint: Index
  ) -> Index
}

extension SortedSet.Iterator {
  mutating func skip(upTo target: Element) // skips to first member >= target
  mutating func skip(through target: Element) // skips to first member > target
}

The engineering complication is that indices do not store direct references to all the nodes on the path they address -- they only have a reference to the final node. Implementing fully relative search requires us to be able to traverse the tree upwards, which isn't directly possible in this representation. (Changing the index representation is not out of the question, but we probably wouldn't do it just for the sake of this particular operation.) But this isn't necessarily a problem! Assuming we keep the current Index representation, we could still have a relative search operation that only operated within the current node (and its descendants) -- reverting to a regular lookup if the target is outside of this subtree.

Such a limited relative search may work better anyway: I'm not entirely convinced of the practical performance benefits of a fully relative lookup -- this would need to be proven by benchmarks. (I was somewhat reckless about adding public APIs to the BTree package; just because an operation is implemented there doesn't necessarily mean it's a good idea to use it. :see_no_evil:) If the target key is guaranteed to be nearby, then a relative search will probably be faster, but if targets can be more distant, it probably makes more sense to just start from the root (or the current Node, if it turns out to be an ancestor of the one we're looking for). For example, in the case of an outline view over filesystem contents, I wonder if we can reasonably assume that skipping over a subtree would result in a nearby target... :thinking:

1 Like

I don't fully follow the wording here. Can having an improper comparison operation lead to memory unsafety? For comparison, here's Rust's BTreeSet which explicitly states that you won't have undefined behavior on such logic errors:

It is a logic error for an item to be modified in such a way that the item's ordering relative to any other item, as determined by the Ord trait, changes while it is in the set. This is normally only possible through Cell, RefCell, global state, I/O, or unsafe code. The behavior resulting from such a logic error is not specified, but will not result in undefined behavior. This could include panics, incorrect results, aborts, memory leaks, and non-termination.

Removing items one by one through the cursor (or from the top level) will definitely be much, much slower than removeSubrange*. I don't see a reason this would be meaningfully (if at all) faster when done through a mutable cursor. (BTreeCursor did not provide a key-based range removal method. It did have a log(n) operation for removing a given number of items, but it wasn't doing anything clever.)

* Once we replace its current implementation with log(n) splits and joins, that is.

I think the same applies for bulk insertions -- BTreeCursor implemented this by constructing a standalone tree out of the new items, then splicing it into existing tree using split/join operations. I expect this would be approximated pretty well by the top-level tree merge operation.

So I expect that for this usecase, a removeSubrange followed by formUnion with a tree initialized from the replacement items will easily match or exceed the performance of the old BTreeCursor. (The BTree package is also unfairly disadvantaged because it predates the advent of @inlinable, so it isn't invited to the specialization party. If you're happy with cursor performance in the old package, I think you'll love the fully implemented SortedSet.)

The problem with the cursor abstraction is that it adds significant API complexity for a very small practical benefit. Properly using it is tricky -- it tends to encourage elementwise mutations instead of bulk operations, which isn't a good idea. So I think something like BTreeCursor could perhaps be interesting as an internal implementation detail of select high-level in-place mutations, but I believe making it public in attaswift/BTree was a mistake.

We have the same guarantees. If the Comparable conformance is incorrect, or a key in the tree is mutated in a way that changes its ordering, then elements may appear out of order, or some operations might trap, but such mistakes will not lead to out of bounds memory accesses. These collections are memory safe no matter how badly the keys violate Comparable requirements.

This is similar to how Set and Dictionary behave when given incorrectly Hashable keys.

Of particular note is that we cannot tell where NaN values will appear in sorted collections with floating point keys.

6 Likes