SummarizedCollection: Flexible B+tree looking for feedback

Jesse_Grosjean · January 31, 2023, 3:26pm

I'm working on a flexible b+ tree and looking for anyone who would like to try it out and provide feedback. Hopefully the code will become better and I'll learn some things.

This package is tailored to my own wants and needs and has a fairly large surface area. I'm hopefully that the Swift Collections project will offer a more documented and focused alternative in the future.

The code has tests and benchmarks, but isn't yet used in any real projects. I expect to use in my own projects and I hope other people will use it too, but I don't expect to put a huge effort into further documentation at this time.

SummarizedCollection stores all elements in buffers at the tree's leaf nodes. The internal nodes store a summary (in simplest case just a count) of the elements they contain. Those summaries make it fast to find the leaf where a particular element is stored.

For simplest usage you can use the provided List<Element>. Comparing to Array this list will:

Be slower for smaller numbers of elements
Be faster at most mutations for larger numbers of elements
Be faster/use less memory at COW mutations due to structural sharing

For usage somewhat similar to what Swift Collections OrderedSet provides you can use IdentifiedList<Element: Identifiable>. This works like List<Element>, while also maintaining a Dictionary from element IDs to the tree leaf that contains them. This allows you to quick find the offset of a particular element by finding its leaf and then walking up the tree counting offsets.

The benefit over OrderedSet is that you can mutate the list without having to rebuild the position lookup table each time, so mutations are much faster. On the other hand index lookups are slower (but still very fast) since you need to walk the tree backwards instead of just read value direction.

For the most flexible option you can implement your own SummarizedTreeContext. A good start is to see how the above mentioned List and IdentifiedList do this.

Jesse_Grosjean · January 31, 2023, 3:31pm

While I'm at it here's a question... The the as! here seems uneccessary:

github.com

jessegrosjean/SummarizedCollection/blob/77fad1c63103706c30554cf74e79647fe5ba1778/Sources/SummarizedCollection/SummarizedTree/SummarizedTree+Cursor.swift#L523


      
              )
          
          
}
          
          
public typealias ContainsClosure = (_ start: Summary, _ node: Summary) -> Bool
          public typealias SeekClosure<C> = (_ start: Summary, _ node: Summary, Slot, C) -> Slot? where C: RandomAccessCollection, C.Index == Slot
          
          
@inlinable
          public mutating func seek<C>(contains: ContainsClosure, seek: SeekClosure<C>) -> Int? where C: RandomAccessCollection, C.Index == Slot {
              seekInternal(contains: contains) { start, node, slot, subsequence in
                  seek(start, node, slot, subsequence as! C)
              }
          }
          
          
// MARK: Seek Internal
          
          
@inlinable
          mutating func seekInternal(contains: ContainsClosure, seek: SeekClosure<LeafStorage.SubSequence>) -> Int? {
              if isAfterEnd {
                  return nil
              }

But I can't figure out how to get rid of it. My goal is to have the public API SeekClosure work on a RandomAccessCollection, while allowing the internal code to work on a more specific internal LeafStorage.SubSequence type.

jrose · January 31, 2023, 4:56pm

If you want it to work on LeafStorage.SubSequence, then your closure argument should be SeekClosure<LeafStorage.SubSequence>. On the other hand, if you’re trying to avoid exposing this type to users, then you’ll have to use any RandomAccessCollection in the definition of SeekClosure, because Swift does not allow functions with unknown generic parameters to be passed as values.

jeremyp · January 31, 2023, 4:57pm

The problem is that the last parameter of the seek closure in you pass to seek is of type C. However the last parameter of the closure passed to seekInternal is of type LeafStorage.SubSequence which means that in the closure you pass to seekInternal, the compiler infers the type of subsequence to be LeafStorage.SubSequence which is fine, but then in that closure, you call the closure you passed to seek (confusingly also called seek) which expects the last parameter to be of type C.

The problem becomes more obvious if you type annotate the subsequence parameter. ie. if seek looks like this:

    public mutating func seek<C>(contains: ContainsClosure, seek: SeekClosure<C>) -> Int? where C: RandomAccessCollection, C.Index == Slot {
		seekInternal(contains: contains) { (start, node, slot, subsequence: C) -> Slot? in
			seek(start, node, slot, subsequence)
        }
    }

The error now flags on the declaration of subsequence, basically saying it can't convert C to LeafStorage.SubSequence although it looks like a mess because it expands all the type aliases out.

Jesse_Grosjean · January 31, 2023, 5:47pm

@jrose @jeremyp Thanks. I think I'll wrap the internal type in a public type. It's a bit of extra code, but makes things strait forward.