Range based Collection mutations

Pitch: Range based Collection mutations

Introduction

Finding, replacing, removing, counting, and splitting one ordered collection based on the contents of another is extremely useful (and would fill some holes in today's String API as well). We should add facilities to BidirectionalCollection and RangeReplaceableCollection that allow users to do this.

Two of these additions are currently being pitched here:
Count(of:) and contains(other:) for BidirectionalCollection - Pitches - Swift Forums
I included these two methods here as well because I think it makes sense to consider them as a whole rather than individually as they should all be named more or less uniformly.

Motivation

Mutating a collection based on its contents is a common thing to do. For example, you might want to split a collection up based on a delimiter. We have this in Swift today:

func split(separator: Element, maxSplits: Int = default, omittingEmptySubsequences: Bool = default) -> [AnyBidirectionalCollection<Element>]

This is very useful, but doesn't allow you you split the collection up by another ordered collection that contains the same element. For example, it would be useful to be able to do this*:

let moons = "Metis, Adrastea, Amalthea, Thebe".split(separatedBy: ", ")

*you can do this on String today but it requires bridging to NSString. It also doesn't work on Array .

It is also useful to be able to find ranges, replace, remove, count, check for containment, and split the collection based on ordered collections that contain the same type.

Proposed Solution

extension RangeReplaceableCollection where Self: BidirectionalCollection, Element: Equatable {
	//remove
  mutating func removeFirst<C: BidirectionalCollection>(occurrenceOf pattern: C) where C.Element == Element
  mutating func removeLast<C: BidirectionalCollection>(occurrenceOf pattern: C) where C.Element == Element
  mutating func removeAll<C: BidirectionalCollection>(occurencesOf pattern: C) where C.Element == Element

	//replace
  mutating func replaceFirst<C: BidirectionalCollection>(occurrenceOf pattern: C, with replacement: C) where C.Element == Element
  mutating func replaceLast<C: BidirectionalCollection>(occurrenceOf pattern: C, with replacement: C) where C.Element == Element
	mutating func replaceAll<C: BidirectionalCollection>(occurrencesOf pattern: C, with replacement: C) where C.Element == Element
}

extension BidirectionalCollection where Element: Equatable {
  func contains<C: BidirectionalCollection>(occurrenceOf pattern: C) -> Bool where C.Element == Element

  func count<C: BidirectionalCollection>(of pattern: C, overlapping: Bool = false) -> Int where C.Element == Element

	//Ranges
  func lastRange<C: BidirectionalCollection>(of pattern: C) -> Range<Index>? where C.Element == Element
  func firstRange<C: BidirectionalCollection>(of pattern: C) -> Range<Index>? where C.Element == Element
  func ranges<C: BidirectionalCollection>(of pattern: C, overlapping: Bool = false) -> [Range<Index>] where C.Element == Element

  func split<C: BidirectionalCollection>(separatedBy pattern: C, maxSplits: Int = Int.max, omittingEmptySubsequences: Bool = true) -> [SubSequence] where C.Element == Element
}

Implementation

This change is on the large side for Standard Library additions, but I think it's worth considering these APIs together.

All these can be implemented in terms of firstRange(of:), thus, there are benefits to designing these all together.

Performance

Many of these methods would benefit from specializations from some of the more concrete Swift types, namely Array and String. Therefore, we should make the ones that make sense, customization points on the respective collection. That is, they would be requirements on the protocol with default implementations, rather than being extension methods only.

Alternatives considered

We could put all the BidirectionalCollection methods on Collection, none of the new methods really make sense on unordered collections like Dictionary and Set. BidirectionalCollection also makes all of the last methods much more performant as we can search from the back. There are also algorithmic benefits for specializations of these methods where having the ability to traverse the supplied collection in both directions would yield better performance.

Impact on String

Under this proposal, the following String methods would (eventually?) be deprecated as their functionality is replaced by the above methods*:

extension String {
  public func contains<T : StringProtocol>(_ other: T) -> Bool
  public func components<T : StringProtocol>(separatedBy separator: T) -> [String]
  public func components(separatedBy separator: CharacterSet) -> [String]
}

*Note that the only thing that will be deprecated is the extension on String that exposes these NSString methods on String, users who still need these could cast to NSString and get them back.

11 Likes

I really, really like this pitch. I think it makes strings easier to work with for common cases, and the way the methods are written on generic types, we'll see value out of this that we didn't expect. I love it.

One thought I do have is that because these methods are not string specific, they can't gain any of the string specific options that users might want, like those found in Foundation: .diacriticInsensitive, .caseInsensitive, etc.

Currently, if you compare the version of "find me the range of this substring" found in Swift to the version in Foundation, the two methods look like this:

// on String
string.firstRange(of: substring) // returns a Range<Index>

// on NSString
string.range(of: substring, options: .caseInsensitive) // returns an NSRange

It would be really nice if these new methods had the same basic signature as the Foundation ones (namely, adding first to the Foundation version), so that just adding the options parameter would let you bridge between the two.

Caveats to this idea:

  • gotta fix the return types/ranges somehow
  • there's a bit of magic, and inexperienced users wouldn't realize they're calling a totally different method in a totally different module
  • the string manifesto suggests a different approach for options (one parameter for each option, with a default)
1 Like

That's a great point, both range(of:) and replacingOccurrences(of:) have overloads which take CompareOptions and that functionality would not be completely replaced by this. We'd only deprecate the methods in the String overlay that would be completely redundant. I updated the Impact on String` section accordingly.

Luckily, if you call range(of:) on String today, you already get back a Range<Index>, you only get NSRange if you as NSString cast your String.

There are benefits to having firstRange and range(of:) be named similarly but I'm also concerned about not having a clear enough separation between the NSString method and the new one. I probably lean more towards leaving the old name intact to keep them distinct which has the side benefit of introducing less migration induced code churn. This also leaves room in the future to augment the API in the future to cover the cases like locale and the search options that we currently don't handle (should we chose to).

1 Like

I just mentioned on that thread that the reasons for restricting those methods on BidirectionalCollection, instead of generalizing on Sequence, are unjustified. So maybe you can redo that section. I'm thinking:

  • contains(occurrenceOf:), count(of: overlapping:), and split(separatedBy: maxSplits: omittingEmptySubsequences:) on Sequence
  • firstRange(of:) and ranges(of: overlapping:) on Collection
  • lastRange(of:) on BidirectionalCollection.

For those following along at home, I posted a response there since that's where the bulk of your reply is.

I'm with Daryle that it's really weird to have the searchy methods restricted to BidirectionalCollection. They don't make sense on basically-unordered collections, sure, but they totally do make sense on linked lists or sequences-made-from-functions. (Well, most of them.)

I'll note that I had a use case recently where I wanted split to actually preserve the separators in the results. You can do this from the API that's here, but it's, uh, not great:

let splits = bigLongString.split(separatedBy: "### ")
let starts = splits.lazy.map { $0.startIndex }
let splitsWithSuffixesExceptLast = zip(starts, starts.dropFirst()).map {
  bigLongString[$0.0..<$0.1]
}
let splitsWithSuffixes =
  splitsWithSuffixesExceptLast + [bigLongString[starts.last!...]]

Perhaps the right answer here is an enum:

enum IncludingSeparatorsBehavior {
  case omitted
  case asPrefix
  case asSuffix
}

and then the behavior I want is includingSeparators: .asSuffix. (Note that the only empty subsequences when including separators will be at the beginning or end of the result.)

Or perhaps the right answer is something like what I wrote above, which doesn't pull its weight in making the implementation more complex.

1 Like

Thanks Daryle and Jordan, I was thinking about this more today and I agree. The right tradeoff is probably to put most of these on Collection, with the exception of the last methods which still should require BidirectionalCollection so we can search from the back (and the RangeReplaceableCollection ones stay where they are obviously)

If we add specializations of these in the future, there is benefit to keeping the parameter's bidirectional at least because some of the String search algorithms benefit from working with the end of the search string. Does that seem ok?

Here is where I have the methods falling currently:

extension Collection where Element: Equatable {
  public func count<C: BidirectionalCollection>(of pattern: C, overlapping: Bool = false) -> Int where C.Element == Element
  public func contains<C: BidirectionalCollection>(occurrenceOf: C) -> Bool where C.Element == Element
  public func firstRange<C: BidirectionalCollection>(of pattern: C) -> Range<Index>? where C.Element == Element
  public func allRanges<C: BidirectionalCollection>(of pattern: C, overlapping: Bool = false) -> [Range<Index>] where C.Element == Element
  public func split<C: BidirectionalCollection>(separatedBy separator: C, maxSplits: Int = Int.max, omittingEmptySubsequences: Bool = true) -> [SubSequence] where C.Element == Element
}

extension BidirectionalCollection where Element: Equatable {
  public func lastRange<C: BidirectionalCollection>(of pattern: C) -> Range<Index>? where C.Element == Element
}

extension RangeReplaceableCollection where Element: Equatable {
  public mutating func removeFirst<C: BidirectionalCollection>(occurrenceOf pattern: C) where C.Element == Element
  public mutating func removeAll<C: BidirectionalCollection>(occurencesOf pattern: C) where C.Element == Element
  public mutating func replaceAll<C: BidirectionalCollection>(occurrencesOf pattern: C, with replacement: C) where C.Element == Element
  public mutating func replaceFirst<C: BidirectionalCollection>(occurrenceOf pattern: C, with replacement: C) where C.Element == Element
}

extension RangeReplaceableCollection where Self: BidirectionalCollection, Element: Equatable {
  public mutating func removeLast<C: BidirectionalCollection>(occurrenceOf pattern: C) where C.Element == Element
  public mutating func replaceLast<C: BidirectionalCollection>(occurrenceOf pattern: C, with replacement: C) where C.Element == Element
}

I'm not sure how I feel about adding the IncludingSeparatorsBehavior customization. While useful for your use-case, I can see other people wanting the separators to be interleaved as separate elements in the output, do we have an enum case for that as well? If this is a common use of split, we could consider adding it, but we'd want the right list of additional behaviors. On the other hand, what options are there besides the three you listed and the addition I made to the list? That list might be comprehensive enough. The code you wrote was not ideal.

I still wonder if it's useful to have some of these operations on Sequence as well. After all, ad-hoc Sequences are easy to make (you just need a generator function). Ad-hoc Collections a little less so.

(Specifically, count and contains "make sense" on Sequences, though you'd have to be careful about overlapping-count.)


I do like included as another case of IncludingSeparatorsBehavior, good point. That said, if you don't make split a customization point / requirement of Collection, then nothing says IncludingSeparatorsBehavior has to be frozen just yet. And FWIW, I can't see a reason to make it a customization point. It's basically built out of allRanges (except not literally because that would allocate an extra array for no reason).

I agree some of these could be useful on Sequence, but we'd need a cache the size of the search collection which kind of stinks. As @Michael_Ilseman mentioned offline, we probably don't want to add more methods on Sequence until we have a better answer about the future of it's single/multi-passness.

Split won't be a customization point, only first/last rangeOf would be that I think. Let's make the enum non-frozen for now and then we can lock it down in the future.

1 Like