Understanding the internals of the Set type

jasonzurita · May 1, 2018, 9:15pm

I am hoping to learn a little more about the internals of the Set type: swift/Set.swift at main · apple/swift · GitHub

Some guidance on the below topics would be great:

1. It looks like the worst-case runtime to make a Set from an Array is O(n). Is my understanding correct?
2. I see that Set uses VariantSetBuffer and NativeBuffer under the hood. Where can I learn more about these two? I have looked around but keep coming up short, so maybe I am missing something here.

Thanks!

Karl · May 3, 2018, 1:57pm

Do you have any specific questions? The declarations are in the file you linked to.

_NativeSetBuffer (together with _RawNativeSetStorage) contains the native Swift Set implementation:

apple/swift/blob/a2a8c5d17c58cbd6021d5ae7bb85b71e07f8f7f5/stdlib/public/core/Set.swift#L1719

          /// A wrapper around _RawNativeSetStorage that provides most of the
          /// implementation of Set.
          ///
          /// This type and most of its functionality doesn't require Hashable at all.
          /// The reason for this is to support storing AnyObject for bridging
          /// with _SwiftDeferredNSSet. What functionality actually relies on
          /// Hashable can be found in an extension.
          @usableFromInline
          @_fixed_layout
          internal struct _NativeSetBuffer<Element> {
          
            internal typealias RawStorage = _RawNativeSetStorage
            internal typealias TypedStorage = _TypedNativeSetStorage<Element>
            internal typealias Buffer = _NativeSetBuffer<Element>
            internal typealias Index = _NativeSetIndex<Element>
          
            internal typealias Key = Element
            internal typealias Value = Element
            internal typealias SequenceElementWithoutLabels = Element

_VariantSetBuffer is an enum that acts as a union of the possible set buffers that can provide the Set's backing storage. On non-ObjC platforms, there is a single case: .native(_NativeSetBuffer). On ObjC platforms, the Set might have been bridged from an NSSet, so there is an additional case: .cocoa(_CocoaSetBuffer).

apple/swift/blob/a2a8c5d17c58cbd6021d5ae7bb85b71e07f8f7f5/stdlib/public/core/Set.swift#L2641

            internal static func fromArray(_ elements: [SequenceElementWithoutLabels])
              -> _CocoaSetBuffer {
          
              _sanityCheckFailure("this function should never be called")
            }
          }
          #endif
          
          @usableFromInline
          @_frozen
          internal enum _VariantSetBuffer<Element: Hashable>: _HashBuffer {
          
            internal typealias NativeBuffer = _NativeSetBuffer<Element>
            internal typealias NativeIndex = _NativeSetIndex<Element>
          #if _runtime(_ObjC)
            internal typealias CocoaBuffer = _CocoaSetBuffer
          #endif
            internal typealias SequenceElement = Element
            internal typealias SequenceElementWithoutLabels = Element
            internal typealias SelfType = _VariantSetBuffer
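
To make the "union of buffers" idea concrete, here is a minimal sketch of the same pattern outside the standard library. The names `NativeStorage` and `VariantStorage` are hypothetical stand-ins, not the real internal types, and the array-backed "native" storage is only a placeholder for the actual hash table:

```swift
#if canImport(ObjectiveC)
import Foundation
#endif

// Hypothetical stand-in for the native storage. The real type is a hash
// table; an array keeps this sketch short.
struct NativeStorage<Element: Hashable> {
    var elements: [Element]

    func contains(_ member: Element) -> Bool {
        return elements.contains(member)
    }
}

// Hypothetical stand-in for the variant buffer: one case per possible
// backing store, with the bridged case only compiled on ObjC platforms.
enum VariantStorage<Element: Hashable> {
    case native(NativeStorage<Element>)
#if canImport(ObjectiveC)
    case cocoa(NSSet)
#endif

    // Every operation switches on the active case and forwards the call.
    func contains(_ member: Element) -> Bool {
        switch self {
        case .native(let storage):
            return storage.contains(member)
#if canImport(ObjectiveC)
        case .cocoa(let bridged):
            return bridged.contains(member)
#endif
        }
    }
}

let storage = VariantStorage.native(NativeStorage(elements: [1, 2, 3]))
print(storage.contains(2))
```

The point of the enum is that callers never need to know which backing store is in use; each operation dispatches once and then runs against that store.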

jasonzurita · May 4, 2018, 9:24pm

Ahh, I totally missed that they are defined in the same file, thanks!!

I don't have any specific questions at the moment. I am just trying to dig into the implementation details to better understand the runtime complexity of the different Set operations. This will definitely help!

Any thoughts on my first question? If not, no worries :).

saagarjha · May 7, 2018, 11:59am

It sure looks like it, since the initializers copy the array into their own buffer.
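
As a rough mental model (an assumption about the shape of the work, not a copy of the standard library's initializer), building a Set from an Array amounts to reserving storage and inserting each element once. `makeSet(from:)` is a hypothetical helper written only for this illustration:

```swift
// Hypothetical helper: one insert per array element, so n inserts in total.
// Each insert is cheap only if hashing is cheap and collisions are rare.
func makeSet<Element: Hashable>(from array: [Element]) -> Set<Element> {
    // Reserve room up front so the storage does not need to grow mid-loop.
    var result = Set<Element>(minimumCapacity: array.count)
    for element in array {
        result.insert(element) // duplicates are detected and skipped here
    }
    return result
}

let unique = makeSet(from: [3, 1, 3, 2, 1])
print(unique.count)
```

Seen this way, the total cost is n times the cost of a single insert, which is where the quality of the hash values comes in.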

Karl · May 7, 2018, 11:08pm

At least O(n). Going from a possibly non-unique array to a set of unique elements is quadratic in the worst case, because each element may have to be tested against every element seen before it. Actual performance will depend on the quality of your data's hash values.

If a unique hash can be calculated in constant time (best case), the entire operation will be linear. If your data takes a long time to hash, or produces poor hash values which frequently collide, you will get closer to quadratic behaviour (worst case).
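
One way to see the worst case is to force every element to collide and count equality checks. This is a sketch under stated assumptions: `CollidingKey`, `ComparisonCounter`, and `comparisonsNeeded(toInsert:)` are hypothetical names made up for the experiment, and the exact counts are an implementation detail of the hash table that may differ between Swift versions.

```swift
// Hypothetical counter shared by all keys in one experiment.
final class ComparisonCounter {
    var count = 0
}

// Hypothetical key type whose hash is deliberately useless: every value
// feeds the same data to the hasher, so all keys collide.
struct CollidingKey: Hashable {
    let id: Int
    let counter: ComparisonCounter

    static func == (lhs: CollidingKey, rhs: CollidingKey) -> Bool {
        lhs.counter.count += 1 // record every equality check the set performs
        return lhs.id == rhs.id
    }

    func hash(into hasher: inout Hasher) {
        hasher.combine(0) // same input for every key
    }
}

// Builds a set of n distinct but colliding keys and reports how many
// equality checks that took.
func comparisonsNeeded(toInsert n: Int) -> Int {
    let counter = ComparisonCounter()
    let keys = (0..<n).map { CollidingKey(id: $0, counter: counter) }
    let set = Set(keys)
    precondition(set.count == n) // all ids are distinct
    return counter.count
}

let small = comparisonsNeeded(toInsert: 100)
let large = comparisonsNeeded(toInsert: 200)
print(small, large)
```

With a well-distributed hash the number of equality checks should stay roughly proportional to n. With the colliding hash above, doubling n should increase the count by much more than a factor of two, which is the quadratic behaviour described in the post.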