Pitch: ContainmentSet

Gist version: Proposal: ContainmentSet · GitHub

Adding the ContainmentSet Protocol to Swift

A new public protocol named ContainmentSet unifies set semantics.

Motivation

A major source of "why doesn't this just work?" frustration rises from the turbulent edge of Cocoa and Swift. Consider the disparity between CharacterSet and Set<Character>. The former, bridged from Foundation, is an algorithmic set describing Unicode.Scalar values. The latter is an actual Set of extended grapheme clusters.

Being a struct, Set<Element> cannot perform simple set-like operations. It cannot be inverted like a CharacterSet. This is a common operation to perform when parsing a string, for example. Given a CharacterSet that describes allowed characters, the inversion describes disallowed characters.

let newlines = CharacterSet.newlines
let charactersThatAreNotNewlines = newlines.inverted

Set<Element> is a Collection. Inverting it is impossible. Set<Character> cannot know the entire domain of Character values over which it should iterate.

Inverting CharacterSet over a Swift type is also impractical. It is limited to Unicode.Scalar values and cannot, for example, contain emoji. Most emoji are composed of multiple Unicode.Scalar values.

Swift needs a way to describe an abstract set, which can test containment but does not promise iteration support.

Detailed Design

public protocol ContainmentSet {
    associatedtype ContainedElement
    
    /// Returns a truth value indicating whether an element is a
    /// member of a conforming instance.
    /// Where possible, this should be O(1).
    func contains(_ element: ContainedElement) -> Bool
    
    /// Returns a union of two instances of the conforming type
    /// This is a logical-or operation.
    func union(_ other: Self) -> Self
    
    /// Returns the intersection (possibly empty) of two instances
    /// of the conforming type.
    /// This is a logical-and operation.
    func intersection(_ other: Self) -> Self
    
    /// Returns `self`, removing any items that appear in the union of
    /// `self` and `other`.
    func subtracting(_ other: Self) -> Self
    
    /// Returns the union of items unique to `self` and `other`.
    /// This is a logical exclusive-or operation.
    func symmetricDifference(_ other: Self) -> Self

}

ContainmentSet will also have a protocol extension that defines:

public extension ContainmentSet {
    
    /// Returns a `ContainmentSet` over the associated type domain, 
    /// with the current elements removed.
    /// This is a logical-not operation.
    var inverted: PredicateSet<ContainedElement> { get }
    
    func union<C: ContainmentSet>(_ other: C) -> PredicateSet<ContainedElement> where C.ContainedElement == ContainedElement
    
    func intersection<C: ContainmentSet>(_ other: C) -> PredicateSet<ContainedElement> where C.ContainedElement == ContainedElement
    
    func symmetricDifference<C: ContainmentSet>(_ other: C) -> PredicateSet<ContainedElement> where C.ContainedElement == ContainedElement
    
    func subtracting<C: ContainmentSet>(_ other: C) -> PredicateSet<ContainedElement> where C.ContainedElement == ContainedElement
    
}

The Standard Library will also add a new type, called PredicateSet, which adopts ContainmentSet:

public struct PredicateSet<T>: ContainmentSet {
    
    public typealias ContainedElement = T
    
    /// Initializes a `PredicateSet` with another `ContainmentSet`,
    /// effectively making a type-erased version of the other `ContainmentSet`
    public init<C>(_ set: C) where C : ContainmentSet, ContainedElement == C.ContainedElement
   
    /// Create a `PredicateSet` containing all elements that pass the predicate
    public init(predicate: @escaping (T) -> Bool) 
    
    /// Create a `PredicateSet` that contains everything in the provided `Collection`
    public init<C>(collection: C) where C : Collection, C.Element == ContainedElement
    
    /// Updates `self` by inverting its predicate
    public mutating func invert()
   
    /// Updates `self` by forming a union with `other`
    public mutating func formUnion<C>(_ other: C) where C : ContainmentSet, ContainedElement == C.ContainedElement
   
    /// Updates `self` by removing any members that do not appear
    /// in `other`.
    public mutating func formIntersection<C>(_ other: C) where C : ContainmentSet, ContainedElement == C.ContainedElement
   
    /// Updates `self` by removing members that appear in `other` and
    /// adding items that appear in `other` that are not in `self`.
    public mutating func formSymmetricDifference<C>(_ other: C) where C : ContainmentSet, ContainedElement == C.ContainedElement
   
    /// Updates `self` by removing any members that appear in `other`.
    public mutating func subtract<C>(_ other: C) where C : ContainmentSet, ContainedElement == C.ContainedElement
    
}

SetAlgebra will be made to conform to ContainmentSet, and it will lose its direct declarations of union, subtracting, etc, since those will be provided by ContainmentSet.

Since SetAlgebra will be adopting ContainmentSet, this means that Set<Element>, IndexSet, OptionSet, and CharacterSet will all adopt ContainmentSet. No change will be needed to their respective implementations.

Usage in the Standard Library

While ContainmentSet has utility on its own, it is clear that deeper integration with the standard library is warranted.

Since ContainmentSet's core functionality is fundamentally a (Element) -> Bool check, it stands to reason that other instances of (Element) -> Bool checks throughout the library could reasonably be supplemented with analogous ContainmentSet versions.

A survey of the various methods on Sequence, Collection, and MutableCollection indicates the following additions might be appropriate:

extension Sequence {
    func filter<C>(_ isIncluded: C) where C : ContainmentSet, C.ContainedElement == Element
    func prefix<C>(in: C) where C : ContainmentSet, C.ContainedElement == Element
    func first<C>(in: C) where C : ContainmentSet, C.ContainedElement == Element
    func drop<C>(while: C) where C : ContainmentSet, C.ContainedElement == Element
    func last<C>(in: C) where C : ContainmentSet, C.ContainedElement == Element // assuming SE-0204 passes
}

extension Collection {
    func index<C>(in: C) where C : ContainmentSet, C.ContainedElement == Element
    func lastIndex<C>(in: C) where C : ContainmentSet, C.ContainedElement == Element // assuming SE-0204 passes
    func split<C>(maxSplits: Int = default, omittingEmptySubsequences: Bool = default, whereSeparator isSeparator: C) -> [Self.SubSequence] where C : ContainmentSet, C.ContainedElement == Element
}

extension MutableCollection {
    func partition<C>(by belongsInSecondPartition: C) -> Self.Index where C : ContainmentSet, C.ContainedElement == Element
}

Additionally, existing uses of CharacterSet in the standard library can now be expanded to take ContainmentSets instead.

Thus, these methods on StringProtocol:

extension StringProtocol {
    func components(separatedBy separator: CharacterSet) -> [String]
    func addingPercentEncoding(withAllowedCharacters allowedCharacters: CharacterSet) -> String?
    func trimmingCharacters(in set: CharacterSet) -> String
    func rangeOfCharacter(from aSet: CharacterSet, options mask: String.CompareOptions = default, range aRange: Range<Self.Index>? = default) -> Range<Self.Index>?
}

will become:

extension StringProtocol {
    func components<C>(separatedBy separator: C) -> [String] where C : ContainmentSet, C.ContainedElement == Element
    func addingPercentEncoding<C>(withAllowedCharacters allowedCharacters: C) -> String? where C : ContainmentSet, C.ContainedElement == Element
    func trimmingCharacters<C>(in set: C) -> String where C : ContainmentSet, C.ContainedElement == Element
    func rangeOfCharacter<C>(from aSet: C, options mask: String.CompareOptions = default, range aRange: Range<Self.Index>? = default) -> Range<Self.Index>? where C : ContainmentSet, C.ContainedElement == Element
}

Open Questions

Character Classes

CharacterSet has a bunch of nice static properties to define things like alphanumerics, illegal characters, symbols, uppercase letters, etc.

Should PredicateSet provide these same static properties when its contained element is Character?

Operators

ContainmentSet types are naturally usable with operators.

  • Union is expressible as concatenation (+) or logical-or (||)
  • Intersection is expressible using the logical-and operator (&&)
  • Subtraction is expressible using the subtraction operator (-)
  • Symmetric Difference is expressible using the exclusive-or operator (^)
  • Inversion is expressible using the prefix operators for negation (!) and bitwise-not (~)

Should the standard library provide implementations of these operators that work on ContainmentSets?

Expressibility

Under certain circumstances, you may want to create a ContainmentSet using literals:

let everything: PredicateSet<Foo> = true
let someBars: PredicateSet<Bar> = [bar1, bar2, bar3]

Should PredicateSet conform to ExpressibleByBooleanLiteral and ExpressibleByArrayLiteral?

Additionally, you might want shorthand syntax for defining a set based on the characters in a string:

let allowed: PredicateSet<Character> = "0123456789"
let nowAllowed = allowed.inverted

Should PredicateSet conform to ExpressibleByStringLiteral when the contained element is Character?

Source compatibility

While this proposal is strictly additive, it will involve changes to existing standard library methods.

However, since the changes will be to relax the type constraints on the methods (ex: Allowing a ContainmentSet of Unicode.Scalar where previously only a CharacterSet was allowed), there is no source compatibility issue.

Functionality Compatibility

There is a potential for minor functionality changes. To illustrate these, consider the current StringProtocol method:

func trimmingCharacters(in set: CharacterSet) -> String

Since this takes a CharacterSet, it is limited to trimming Unicode.Scalar values off the ends of the receiver. However, if this method is changed to:

func trimmingCharacters<C>(in set: C) -> String where C : ContainmentSet, C.ContainedElement == Element

then it changes to trimming Characters off the ends of the receiver.

Effect on ABI stability

This proposal does not affect ABI stability.

Effect on API resilience

This proposal does not affect ABI resilience.

Alternatives Considered and Rejected

Add this functionality to SetAlgebra directly.

This sounds appealing at first, but SetAlgebra defines methods that are at odds with the non-iterability of ContainmentSet, such as isEmpty or isSuperset(of:).

inverted should be part of the base protocol, and not solely implemented in an extension.

The logical way to do this would have var inverted: Self { get }. However, it's impossible to implement this on Set<Element>, because a Set does not know its entire domain over which to iterate.

Thus, the only rational course is to implement it in a protocol extension, where we can specify an explicit return type of PredicateSet<ContainedElement>.

I did consider adding more associated types to the base protocol, so an adopter of the protocol could define a custom type to return for inversion, unioning, subtraction, etc. However, the utility of this seemed quite minimal, as there are only 4 SetAlgebra types.

Sequence (or Collection) should adopt ContainmentSet.

The fundamental idea of ContainmentSet is to express the notion that you can query a value to see if it contains other values.

From this point of view, it seems natural to make Sequence adopt ContainmentSet. However, it seems unnatural to want to union two arrays. Additionally, if we adopt operators for ContainmentSet operations, then + would become ambiguous: it could mean either concatenation or unioning for two arrays.

Other Names Considered In Lieu of Containment

  • Space
  • Manifold
  • Surface
  • Group
  • Class
  • Domain
  • Container
13 Likes

The protocol needs at least 1 associatedtype for inversion, unioning, subtraction, etc.
Because otherwise the extension methods returning PredicateSet would be overloads, and not default implementations which would be preferable.
However it may be 1 associatedtype for all of them, as opposed to for each of the methods.

I'm not sure I follow. I've implemented this locally, and I don't get build warnings when I have a simple PredicateSet<T>: ContainmentSet struct.

After reading your proposal, and it is not clear to me how ContainmentSet differs from SetAlgebra.

The Motivation section describes a situation where you found the existing concrete types Set and CharacterSet to be insufficient, which to me suggests what you want is a new concrete type. Immediately following that, the Detailed Design section describes a protocol almost identical to SetAlgebra, and a concrete type implementing it.

Could you please explain what you find insufficient with SetAlgebra, and why you want a whole new protocol rather than making small adjustments to that one?

2 Likes

Of course!

SetAlgebra has a couple of semantics that make it impractical for being the base of this functionality:

  1. Implied iterability - there are some methods on SetAlgebra that only work if you're able to deterministically know everything in the set. For example: isSubset(of:) is only possible to implement if you can know what every element in the set is, to check for containment in the proposed superset. This is possible to know for finite sets, but it is impossible to know for infinite sets.

  2. Self - the types defined by SetAlgebra are all relative to Self. This works great for most operations, but again, because of the implied iterability, it does not work for others (mainly inversion).

An alternative approach to this proposal would be to add a specifically-typed inverted extension to SetAlgebra (ie var inverted: PredicateSet<Element>). However, the resulting type (PredicateSet) could not be a SetAlgebra type, because it wouldn't be iterable (and thus couldn't implement isSubset(of:)). So in that case, you'd end up duplicating most of the SetAlgebra functionality on PredicateSet. This might seem simpler at first glance, but I'm of the opinion that it's worth lifting out common declarations like this in to protocols. This means you need a protocol that is most of SetAlgebra, but without the methods that require iterability: ContainmentSet.

I would love to have ContainmentSet in the standard library. SetAlgebra is extremely restrictive, and it seems to be designed around the commonalities of Set and OptionSet way more than based on the abstract notion of what a set mathematically is. I think your proposal might need a bit more work on highlighting the exact difference between the two protocols though.

At first I thought the range types in the standard library would be good candidates for ContainmentSet since those all have a constant-time contains method, but the union or symmetric difference of two ranges isn't necessarily another range. You could think about whether ContainmentSet really needs the func union(_ other: Self) -> Self (and other) requirements, or if the default implementations that return predicate sets are sufficient.

Lastly, I doubt whether the overloads on filter, prefix etc. can be justified. array.filter(set) is not a lot shorter than array.filter(set.contains) and IMO not as expressive.

1 Like

I don't have any comments yet on the viability of the specific APIs presented here (from either a Swift point of view or an algebraic point of view), but IMO the proposal would be improved by excising the parts about CharacterSet, strings, characters from it. I understand that part of the motivation here is the behavior of CharacterSet vs Set<Character>, but jumping back and forth between CharacterSet and more abstract set theory concepts makes it feel a bit unfocused and more difficult to see what the big picture is. The set-theoretic aspects stand well on their own.

CharacterSet has other significant issues that aren't related to its SetAlgebra conformance or its differences compared to Set<Character> and it really deserves its own proposal if someone wants to try to overhaul that.

2 Likes

OK, I'll rework the prose.

Yeah, I'm not proposing overhauling CharacterSet; I'm trying to build an API that would be sufficient to replace it.

A quick thought on source compatibility:
In cases where a user is relying on dot syntax type inference in methods whose type constraints are being relaxed,
e.g.

let words = line.components(separatedBy: .whitespacesAndNewlines)

we now have ambiguity in that .whitespacesAndNewlines is not an immediate property of ContainmentSet where ContainedElement == Character.

Hi Dave,

Thanks for bringing this up. I've long felt that there's a need for a non-enumerable protocol underneath SetAlgebra that doesn't have the enumeration-requiring parts (i.e. isSubset) so I'd definitely like to see this idea progress. Here are some initial quick thoughts:

I'm unclear on whether associatedtype ContainedElement should just be Element. I suspect it should. It's Element for SetAlgebra. One case where it differs is ExpressibleByArray.ArrayLiteralElement but that's mostly for backwards-compatibility (because when we were finally able to add the constraint Sequence.Element == Iterator.Element, there were some collections in the wild that were expressible by a different type to their element).

Big +1 for adding PredicateSet and the extension using it to implement union etc. But since the methods on ContainmentSet all return Self, this means that the extension won't fulfill the protocol requirements. There would need to be another associated type for this to work, similar to SubSequence (though hopefully without a single unifying type being so problematic).

I would strongly recommend not touching the third rail of operator overloading on a first pass :slight_smile:

Should PredicateSet conform to ExpressibleByArrayLiteral: yes! But there's a moratorium on ExpressibleByBooleanLiteral conformances outside of the current 3. Conformance to ExpressibleByStringLiteral probably isn't necessary. Instead I'd suggest init<C: Collection>(_: C) which would allow PredicateSet("foo") (and which ExpressibleByArrayLiteral could also be implemented in terms of).

Re adding overloads that take a ContainmentSet to the standard library: this doesn't seem necessary. If you have a type that has a contains method, you can just pass it point-free style into the predicate-taking versions:

let set: Set = [1,2,3]
(0..<10).filter(where: set.contains)

Seconded. The idea of using operators for SetAlgebra operations was highly contentious during the initial Evolution process and ultimately rejected. It would be best not to revisit that decision unless there are clear-cut new use cases that would motivate their addition now.

So, something like this?

protocol ContainmentSet {
    associatedtype Element
    associatedtype UnionType : ContainmentSent where UnionType.Element == Element
    func union(_ other: Self) -> UnionType
}

I was also thinking that union could have a generic parameter, instead of just being Self, like so:

func union<C>(_ other: C) -> UnionType where C : ContainmentSet, C.Element == Element

IIRC I ran in to some conceptual problems with that route, because SetAlgebra et. al. all want that parameter type to be Self.

Yes, like that, but with one associated type to return for union, intersection, subtracting and the rest (for which maybe UnionType would still be a good name, idk...):

protocol ContainmentSet {
    associatedtype Element
    // the = here is just a default rather than a requirement, 
    // implementations can choose something else, like Self, if they want
    associatedtype UnionType: ContainmentSent = PredicateSet<Element>
    where UnionType.Element == Element

    func union(_ other: Self) -> UnionType
    func intersection(_ other: Self) -> UnionType
    func subtracting(_ other: Self) -> UnionType
}

I should probably expand my slightly vague comment about SubSequence. Sequence follows a similar pattern, where prefix, suffix (and even split :exploding_head: ) all return SubSequence. This is actually a royal pain, because it means you have to use type erasure, because there is no common general implementation of both prefix and suffix that return the same type. However, since PredicateSet, being closure-based, is fundamentally type-erasing, I don't think that will be a problem here (I hope!)

I'm conflicted on this front. If an API handed me a PredicateSet sight-unseen, I would assume the contains check happens in constant time. With ExpressibleByArrayLiteral (assuming it's not constrained to Hashable), the contains check would be O(n).

Do O(1) semantics make sense for ContainmentSet? Should it be documented either way?

Set also conforms to ExpressibleByArrayLiteral, which makes it convenient to initialize a set from an Array literal. I assume this would be the same thing.

Either way, you could write this and end up with O(n) time:

let array: [NonHashable] = //..
let predicateSet = PredicateSet<NonHashable>(predicate: array.contains)

I'm asking if this should be allowed or not (semantically).

PredicateSet could conditionally conform to ExpressibleByArrayLiteral.

Overall in support of this.

I don't know if it belongs in this proposal, or maybe another, but there's a very important operation on predicate sets that should be supported, and that's "contramap".

If you start using something like PredicateSet<T> in real world code, you will quickly come to a situation where you want to map it like you would an array, but if you try to define such a function you will run into problems. That's because PredicateSet<T> doesn't have a standard map, but has something that is flipped, and therefore called contramap (this should remind you of covariance/contravariance in subtyping if you are familiar with those concepts).

Its signature is the following:

extension PredicateSet {
  func contramap<U>(_ f: @escaping (U) -> T) -> PredicateSet<U> {
    // implementation left to the reader
  }
}

Notice that the transformation (U) -> T is flipped from what you would expect in map, which goes (T) -> U.

You can think of it as allowing you to lift predicates on simple types up to more complicated types. For example, a predicate on Int could be lifted to a predicate on User:

let isEven = PredicateSet { $0 % 2 == 0 }
let userIdIsEven: PredicateSet<User> = isEven.contramap { $0.id }

Even better, if Swift ever bridged key paths to functions you could even do:

let isEven = PredicateSet { $0 % 2 == 0 }
let userIdIsEven = isEven.contramap(\User.id)
let userNameCountIsEven = isEven.contramap(\User.name.count)
// ...

I think this operation should be named contramap since it's a term of art, much like map. Also the contra prefix is generally used to "flip" a concept, such as the covariance/contravariance in subtyping I mentioned. However, if that's a sticking point, other names could be lifting or projecting:

isEven.lifting(\User.id)
isEven.projecting(\User.id)

Just some food for thought.

5 Likes

Looking at recent discussions about Set (SE-0203 — Rename Sequence.elementsEqual - #51 by adamkemp), it's a pity that we can't just replace that with your datatype - the current situation should make every mathematician cry ;-)
But I guess our "collection with hidden order and without duplicates" is more useful for common applications...

Have you thought about starting with Predicate, and building a set type on top of that? NSPredicate needs an overhaul as well ;-)

Imagine, for a moment, that we could extend structural types, so we could use a solution like this instead:

extension <Element> @escaping (Element) -> Bool {
  var inverted: (Element) -> Bool {
    return { !self($0) }
  }
  func union(_ other: @escaping (Element) -> Bool) -> (Element) -> Bool {
    return { self($0) || other($0) }
  }
  func intersection(_ other: @escaping (Element) -> Bool) -> (Element) -> Bool {
    return { self($0) && other($0) }
  }
  // etc.
}

Would we prefer this solution? We already have all sorts of APIs which take predicate functions; this would work better with them. On the other hand, we also can't do this today, while we can do ContainmentSet.

2 Likes