SE-0270 (fourth review): Add Collection operations on noncontiguous elements

Hello, Swift community.

SE-0270 was originally reviewed in a series of reviews beginning in late 2019 and accepted into the Preview package in January 2020. The proposal authors have decided that they would like to graduate these APIs into the standard library as part of winding down the Preview package. This step requires an evolution review, which will run until December 18th. As part of this review, the original proposal has been modified in three ways:

  • The subranges(of:) and subranges(where:) methods have been renamed to indices(of:) and indices(where:) to avoid conflicts with certain string processing APIs.
  • The conformance of DiscontiguousSlice to MutableCollection has been removed.
  • Several missing conformances to Equatable, Hashable, and Sendable have been added.

The focus of this review is on those three revisions, as well as the general questions we ask below about the appropriateness of the proposal. Review feedback on the overall proposal will not be ignored, but I would like to encourage reviewers with questions outside of the focus to read through previous reviews, especially the third review, to see if their questions were answered there.

Trying it out

The differences between the latest proposal and the existing version in the Preview package are relatively minor, and we think the Preview package should be good enough for people who'd like to try this proposal out.

What goes into a review?

The goal of the review process is to improve the proposal under review through constructive criticism and, eventually, determine the direction of Swift. When writing your review, here are some questions you might want to answer in your review:

  • What is your evaluation of the proposal?
  • Is the problem being addressed significant enough to warrant a change to Swift?
  • Does this proposal fit well with the feel and direction of Swift?
  • If you have used other languages or libraries with a similar feature, how do you feel that this proposal compares to those?
  • How much effort did you put into your review? A glance, a quick reading, or an in-depth study?

More information about the Swift evolution process is available at:

https://github.com/apple/swift-evolution/blob/main/process.md

Thank you,

John McCall
Review Manager

19 Likes

Naming is in an unfortunate place now. As I recall it was named RangeSet, in part, to avoid collision with Foundation.IndexSet. Now the operations here are all indices but return the RangeSet type. I don't know that there's any way to fix it now, other than being okay with the collision with Foundation.IndexSet.

4 Likes

No, the name was chosen after arduous discussion that it is the best name because this type models a set of ranges; there was lengthy discussion prior to and in the first review, and speaking personally I would really, really not like to see this rehashed four reviews and four years later—given that the type still (as was discussed in the fourth pitch preceding this fourth review) models a set of ranges.

As it happens, indices is also as originally proposed; here, this proposal is simply reverting back to the first version of the API. None of the options were to universal acclaim but this has been discussed at length in the second and third reviews.

7 Likes

Fair enough. I don't mind the naming, just pointing it out.

The more important question is, what happens to the preview package once this is merge? It was never really used, but there are likely at least a few users out there. Will it be kept online?

Personally, I'm really disappointed the preview package never caught on, as, at the least, it seemed like an excellent way to effectively backport new features once they were part of the language. I'd hoped we'd see continual, official usage for things like Regex, but that never materialized. Is anything like this possible, or does this shut the door on any sort of package mirroring newer APIs?

1 Like

I'm going to make a proposal to formally sunset it, but it'll continue to exist for source-stability purposes. That's mostly beside the point of this review, however.

I don't think this is a problem, the indices operations do return a set of index ranges, which I think is perfectly fine. That is, it does indeed return indices, just that they are expressed as a set of ranges.

2 Likes

I happy to see this finally making its way to the main swift standard library and I'm happy with the naming updates, they make sense. I read the proposal 'back in the day' and was happy with it then and I'm still happy with it now.

1 Like

Overall it looks great.

My only minor gripe is one aspect of the naming.(*)

It seems that there's an analogy in the types, RangeSet : Range :: DiscontiguousSlice : Slice, and so I wish their names were similarly analogous (eg, {RangeSet, SliceSet} or {DiscontiguousRange, DiscontiguousSet}). In both cases, the left is a "contiguous" one, and the right is either a "discontiguous" one or a "ordered chain of contiguous" ones, depending on how you want to view them. Seems like their corresponding names should reflect that in the same way?

(*) I didn't participate in the prior reviews and tried a cursory search on them to see if this was brought up but didn't see it, so sorry if this is a rehash.

2 Likes

I don't think the original reviews discussed DiscontiguousSlice vs. SliceSet directly, but they did discuss the naming of RangeSet vs. other names like DiscontiguousRange at length so I don't want to rehash that part of the discussion here. As for the DiscontiguousSlice vs. something like SliceSet point - I still feel that DiscontiguousSlice is fitting. My main reasoning here is that DiscontiguousSlice isn't really a set - it doesn't implement any of the SetAlgebra-like APIs that RangeSet does and it doesn't act like a set per se in its use. Rather it acts just like a Slice does, so I see the names as analogous since RangeSet is a set-like type dealing with Ranges of indices and DiscontiguousSlice is a slice-like type dealing with discontiguous views into the base collection.

4 Likes

please, please… if func has taught us anything, it's that names don't really matter in the long run. no matter how silly the name, you get used to it pretty quickly. we need this in stdlib so badly. no more about the names, I beg you.

1 Like

Please don't police other people's conversations. This isn't going to show up any faster in the standard library if we don't have a naming discussion.

3 Likes

I’m struggling to reconcile that with @xwu’s comment that the naming discussion already took place:

I am the review manager for this proposal. I described the scope of the review in my initial post in this thread.

2 Likes

Quite so, and to be clear, my comment is not that the naming discussion took place: a naming discussion took place and (as per @John_McCall's urging) it's encouraged to see if something's already been covered previously.

Please don't take that to mean that we ought to accept a less good design instead of having a thorough discussion: think of it as an invitation to fast forward, not to eject (how's that for a VCR analogy?).

2 Likes

"Noncontiguous" as a title doesn't point to the virtues of the proposal. I like the document title "RangeSet and collection operations" because the key operations derive from the properties of RangeSet, i.e., distinct, ordered ranges over the comparable Index.

indices as an API seems to lose these RangeSet properties.

In the third review, I didn't see the term select or selection considered, and I don't see a conflict of that with string API's.

Unlike indices, the concept of selection seems to support adjunct semantics, the shadow of the query used. Selection strictly just means elements-from, but practical usage it gets additional per-query semantics e.g., that selecting with query order preserves a partial order, or that selecting on a key type makes the key visible. Selection is flexible enough to mean the bag of elements themselves, or traversable/navigable references that require reification. Unlike filter, a selection is clearly distinct from the collection itself (not a wrapper but a view), with a dependent lifecycle. So the action and the result get the right query-matching semantics.

So in this case DiscontiguousSlice can be thought of as Selection<RangeSet, Element>.

But as that suggests, selection as a concept is more abstract and could start to consume settled names. e.g., some might argue that parameter names could be selection as well:

  • fr: subscript(subranges: RangeSet<Index>)
  • to: subscript(selection: RangeSet<Index>)

But this proposal is only for ranges, so selection should be reserved for another time, and perhaps select for the same reason: to avoid a slide into generality when support a specific feature to leverage the comparability of collection indexes.

Perhaps then indices becomes selectRanges(...), leaving API space for other type- or query-specific select API's?


As an aside, I didn't see in the API documentation what happens when some portion of the indexes have changed, e.g., if an element in the subrange was deleted, and then the subrange is used to move or delete. In the same vein, I didn't see an API to check if the RangeSet is still valid, or to recover or re-run the query to renew the RangeSet validity (that's another API entirely).

2 Likes

I think I still feel going back to the first revision's indices naming would be best suited here over going with something like selectRanges. I don't believe we use the term "select" in any Swift stdlib APIs currently. While there could be future API space around query-based "select" APIs if that were desired, I don't think we should use this API to define that term and open up that naming space. Rather, "indices" is a term that we use in other APIs that makes it clear to the caller what is being provided and I think it's the best option given the naming discussions that have happened in the previous revisions.

As for your aside, you are correct there is no API for this - there also isn't any API for checking/renewing index validity for swift collections in general. Once a collection is mutated, indices of that collection are invalidated and can no longer be used with the collection. So developers should treat the indices within a RangeSet as any other index: valid until the collection has been mutated. Usage of invalid indices in a collection does not have defined behavior (in many cases will lead to an abort for "smarter" indices, or usage of incorrect elements in more trivial indices)

2 Likes

Overall, this looks very good to me and I'm hope that it will make its way into the standard library as soon as possible.

Elements view

In an earlier version of the proposal, RangeSet included a collection view of the indices, conditionally available when the Bound type was strideable with an integer stride. This collection view was removed for the same reasons as covered in the section above. Users can access these indices by using a RangeSet as a subscript for the source collection's indices property.

I haven't had time to look through all the previous threads in detail, so forgive me if that has been discussed there already. I understand the rationale for removing the Elements view, but could we at least get a convenience function like

extension RangeSet {
    func elements<C>(using collection: C) -> DiscontiguousSlice<C.Indices>
        where C: Collection, C.Index == Bound {
        collection.indices[self]
    }
}

I think that would increase discoverability, because people would normally search for the elements on RangeSet instead of subscripting collection.indices.
I'm not so sure about the name elements(using:) though, other ideas are welcome (maybe indices(within:) which would be similar to RangeSet.init(_:within:)?).

The first version of this pitch originally introduced an elements(within:) function that provided this convenience. It was renamed to individualIndices(within:) and then eventually removed as part of the discussion of the first pitch. The consensus of the discussion about this during the first pitch was that the convenience wasn't necessary because we want to encourage people to use the collection.indices[myRangeSet] spelling instead. We can certainly ensure there's adequate documentation to convert back and forth between a RangeSet and its contained indices, but the original review decided that shouldn't be a convenience in the API contract.

It might be worth mentioning a similar data type that I created: SegmentedLine<Bound, Element>

[Code | Tests]

It is very much like the proposed RangeSet<Bound>, but each range is associated with a value. You could think of it as being to RangeSet what Dictionary is to our regular Set. SegmentedLine is perhaps a bit of a weird name, but it's less weird than RangeDictionary, and it has a very visual quality that I quite like.

The reason I created this type was to generate optimised Unicode databases: my WebURL library contains an implementation of IDNA, which is a way of normalising and validating Unicode domain names, and requires several pieces of data about each scalar, such as:

  • Bidi_Class, for detecting ambiguous uses of bidirectional text
  • Joining_Type, for detecting disallowed uses of joining characters
  • Whether the scalar is a virama
  • Whether the scalar is a spacing/non-spacing/enclosing mark

Performing several database lookups for each and every scalar would clearly be less than ideal, so I wanted to create a combined database which gave me all of this information in a single lookup. Binary size was also a significant concern - Swift packages are embedded in to applications, meaning this database would contribute to an App's overall download and install size.

But I couldn't find a good data type (either in a package or the standard library) which let me do this.

  1. I wanted to parse the raw Range -> Data values out from the Unicode tables, then map the raw data to simplified representations containing the minimum needed to implement the IDNA algorithm, then

  2. I wanted to merge all of those data-sets in to one big Range -> Data table, then

  3. I wanted to combine adjacent ranges with the same Data.

  4. I also wanted this to be built using high-level operations. The table I've described here is for validation data, which is actually the second stage of the IDNA process. The first stage uses a different table, and it may be possible to reduce the table size further by identifying redundancies across the two.

So that's why I created SegmentedLine. Here's the code which does the things I just described and generates the validation table. I think it's pretty easy to read and maintain, and I'm very happy with how it turned out.

I've considered whether this data structure might be useful to other developers, and here are two usecases which I think are promising:

  1. A generalised AttributedString, able to tag any range of any collection with any associated value (here we're tagging ranges of a String, where each range has an Array of Any associated with it, just to illustrate the concept -- but really you can tag ranges of any collection with anything).
let string = "Bob is feeling great"

// Create a SegmentedLine for the collection's contents.
// Start by setting a font attribute over the entire string.

var tags = SegmentedLine(
  bounds: string.startIndex..<string.endIndex,
  value: [Font.custom("Comic Sans")] as [Any]
)

// Set each word to a different color.
// Use 'modify' to append the attribute, but only for the region
// we're modifying.

for word: Substring in string.split(separator: " ") {
  tags.modify(word.startIndex..<word.endIndex) { attributes in
      attributes.append(Color.random())
  }
}

// Check the result.
// - ✅ Every segment still contains the font attribute.
// - ✅ Each word also contains its individual color attribute.

for (range, attributes) in tags.segments {
  print(#""\#(string[range])""#, "-", attributes)
}

// "Bob"     [Font.custom("Comic Sans"), Color.orange]
// " "       [Font.custom("Comic Sans")]
// "is"      [Font.custom("Comic Sans"), Color.green]
// " "       [Font.custom("Comic Sans")]
// "feeling" [Font.custom("Comic Sans"), Color.pink]
// " "       [Font.custom("Comic Sans")]
// "great"   [Font.custom("Comic Sans"), Color.yellow]
  1. Timelines naturally fit this kind of structure. There's no collection containing "all dates", but they are Comparable and so you can create ranges of them, and there interesting range-based operations which apply to them. For instance, a simple calendar:
import Foundation

struct Appointment {
    // ...
}

enum CalendarEntry {
    case available
    case busy(Appointment)
}

let cal   = Calendar(identifier: .gregorian)
let start = cal.startOfDay(for: .now)
let end   = cal.date(byAdding: .day, value: 1, to: start)!

var timeline = SegmentedLine<Date, CalendarEntry>(
    bounds: start..<end, value: .available
)

// Make an appointment.
let oneHourAgo     = cal.date(byAdding: .hour, value: -1, to: .now)!
let oneHourFromNow = cal.date(byAdding: .hour, value: 1, to: .now)!
timeline.set(oneHourAgo..<oneHourFromNow, to: .busy(Appointment()))

print(timeline)
// ┬
// ├ [2023-12-17 23:00:00 +0000..<2023-12-18 13:31:11 +0000]: available
// ├ [2023-12-18 13:31:11 +0000..<2023-12-18 15:31:11 +0000]: busy(Appointment())
// ├ [2023-12-18 15:31:11 +0000..<2023-12-18 23:00:00 +0000]: available
// ┴

The reason I bring this up is because I think it raises questions about future directions: if we decide that RangeSet belongs in the standard library, wouldn't we also want to one day introduce a version which can take an associated value? As I described it above, a type which is to RangeSet what Dictionary is to our regular Set.

For instance, if you wanted to build a timeline using RangeSet, you could mark regions of Date as being included in the set, but you wouldn't be able to associate those ranges with data containing appointment information.

And while the proposal is focussed on tagging Collection indexes, it can only mark presence. That works if you consider DiscontiguousSlice to be the only interesting client, but more advanced clients which want to, say, associate ranges of a String with font/colour information (as demonstrated above) would still find this to be somewhat lacking. There's this hole in its capabilities which becomes hard to ignore once you know about it.

Another interesting future direction is indexing.

While SegmentedLine is the thing that I use for building the Unicode database, when I actually emit it, since the Bound type is a binary integer, I also emit an index of the top N bits (each table can customise N). This can dramatically improve lookup performance for large tables by limiting the range that needs to be binary-searched.

For instance, the validation table I mentioned previously has 909 entries for the BMP Unicode plane (0x0000...0xFFFF). The index captures the exact locations of the top 6 bits, and we binary-search in that region for the remaining 10. Given the way the data is distributed, even in the worst case those 10 bits only consist of 165 entries, which still saves us almost 3 rounds of binary search and is more cache-friendly since it doesn't hop around as much of the table. That's the worst space; most spaces are much smaller than that - for Latiny code-points, we only need to search over 39 entries, and for CJK code-points, we can pinpoint the exact entry just from the index and don't even need to binary search at all. But of course, there are costs to calculating and storing the index: it's definitely worth it for this table (especially since the index is calculated offline), but it may not be worth it for all tables.

RangeSet is also built atop binary search, and it stands to reason that a large RangeSet which is queried often might also benefit from an index. Should this be exposed as a separate IndexedRangeSet type? I'm not sure; perhaps, but there are reasonable arguments either way.


All of this makes me wonder: is this really a good fit for the standard library?

I mean, I think putting it in swift-collections would not be unreasonable. I don't see it as more fundamental than SortedSet, Deque, Heap, or TreeDictionary, for instance.

And if it does go in the standard library, should it go in the core module? It appears we have recently adopted a trend of making things in to separate modules in the stdlib, so I wonder if this should belong in some kind of separate Collections or RangeCollections module?

AFAIK there isn't any sort of guiding principle which determines when something gets broken off in to a separate module, so it's hard for me to say.

Perhaps it depends how far we want to go with it? If we think we may be interested in something like SegmentedLine in the future, and perhaps variants of that and of RangeSet which incorporate an index, then we would have a clear logical grouping containing enough range-based collections that it may make sense to namespace them.

The thing that makes RangeSet different is that it enables a group of operations that apply across many different collection types. That makes it akin to the group of range types that are already in the standard library, in addition to the use cases where it can be useful as a standalone data structure on its own.

Your SegmentedLine type certainly looks useful! I don't think there's anything in this proposal that would prohibit consideration of a range-based dictionary as an addition to the standard library or another package.