It might be worth mentioning a similar data type that I created: SegmentedLine<Bound, Element>
[Code | Tests]
It is very much like the proposed RangeSet<Bound>
, but each range is associated with a value. You could think of it as being to RangeSet
what Dictionary
is to our regular Set
. SegmentedLine
is perhaps a bit of a weird name, but it's less weird than RangeDictionary
, and it has a very visual quality that I quite like.
The reason I created this type was to generate optimised Unicode databases: my WebURL library contains an implementation of IDNA, which is a way of normalising and validating Unicode domain names, and requires several pieces of data about each scalar, such as:
Bidi_Class
, for detecting ambiguous uses of bidirectional text
Joining_Type
, for detecting disallowed uses of joining characters
- Whether the scalar is a virama
- Whether the scalar is a spacing/non-spacing/enclosing mark
Performing several database lookups for each and every scalar would clearly be less than ideal, so I wanted to create a combined database which gave me all of this information in a single lookup. Binary size was also a significant concern - Swift packages are embedded in to applications, meaning this database would contribute to an App's overall download and install size.
But I couldn't find a good data type (either in a package or the standard library) which let me do this.
-
I wanted to parse the raw Range -> Data
values out from the Unicode tables, then map the raw data to simplified representations containing the minimum needed to implement the IDNA algorithm, then
-
I wanted to merge all of those data-sets in to one big Range -> Data
table, then
-
I wanted to combine adjacent ranges with the same Data
.
-
I also wanted this to be built using high-level operations. The table I've described here is for validation data, which is actually the second stage of the IDNA process. The first stage uses a different table, and it may be possible to reduce the table size further by identifying redundancies across the two.
So that's why I created SegmentedLine
. Here's the code which does the things I just described and generates the validation table. I think it's pretty easy to read and maintain, and I'm very happy with how it turned out.
I've considered whether this data structure might be useful to other developers, and here are two usecases which I think are promising:
- A generalised
AttributedString
, able to tag any range of any collection with any associated value (here we're tagging ranges of a String
, where each range has an Array of Any
associated with it, just to illustrate the concept -- but really you can tag ranges of any collection with anything).
let string = "Bob is feeling great"
// Create a SegmentedLine for the collection's contents.
// Start by setting a font attribute over the entire string.
var tags = SegmentedLine(
bounds: string.startIndex..<string.endIndex,
value: [Font.custom("Comic Sans")] as [Any]
)
// Set each word to a different color.
// Use 'modify' to append the attribute, but only for the region
// we're modifying.
for word: Substring in string.split(separator: " ") {
tags.modify(word.startIndex..<word.endIndex) { attributes in
attributes.append(Color.random())
}
}
// Check the result.
// - ✅ Every segment still contains the font attribute.
// - ✅ Each word also contains its individual color attribute.
for (range, attributes) in tags.segments {
print(#""\#(string[range])""#, "-", attributes)
}
// "Bob" [Font.custom("Comic Sans"), Color.orange]
// " " [Font.custom("Comic Sans")]
// "is" [Font.custom("Comic Sans"), Color.green]
// " " [Font.custom("Comic Sans")]
// "feeling" [Font.custom("Comic Sans"), Color.pink]
// " " [Font.custom("Comic Sans")]
// "great" [Font.custom("Comic Sans"), Color.yellow]
- Timelines naturally fit this kind of structure. There's no collection containing "all dates", but they are
Comparable
and so you can create ranges of them, and there interesting range-based operations which apply to them. For instance, a simple calendar:
import Foundation
struct Appointment {
// ...
}
enum CalendarEntry {
case available
case busy(Appointment)
}
let cal = Calendar(identifier: .gregorian)
let start = cal.startOfDay(for: .now)
let end = cal.date(byAdding: .day, value: 1, to: start)!
var timeline = SegmentedLine<Date, CalendarEntry>(
bounds: start..<end, value: .available
)
// Make an appointment.
let oneHourAgo = cal.date(byAdding: .hour, value: -1, to: .now)!
let oneHourFromNow = cal.date(byAdding: .hour, value: 1, to: .now)!
timeline.set(oneHourAgo..<oneHourFromNow, to: .busy(Appointment()))
print(timeline)
// ┬
// ├ [2023-12-17 23:00:00 +0000..<2023-12-18 13:31:11 +0000]: available
// ├ [2023-12-18 13:31:11 +0000..<2023-12-18 15:31:11 +0000]: busy(Appointment())
// ├ [2023-12-18 15:31:11 +0000..<2023-12-18 23:00:00 +0000]: available
// ┴
The reason I bring this up is because I think it raises questions about future directions: if we decide that RangeSet
belongs in the standard library, wouldn't we also want to one day introduce a version which can take an associated value? As I described it above, a type which is to RangeSet
what Dictionary
is to our regular Set
.
For instance, if you wanted to build a timeline using RangeSet
, you could mark regions of Date
as being included in the set, but you wouldn't be able to associate those ranges with data containing appointment information.
And while the proposal is focussed on tagging Collection indexes, it can only mark presence. That works if you consider DiscontiguousSlice
to be the only interesting client, but more advanced clients which want to, say, associate ranges of a String with font/colour information (as demonstrated above) would still find this to be somewhat lacking. There's this hole in its capabilities which becomes hard to ignore once you know about it.
Another interesting future direction is indexing.
While SegmentedLine
is the thing that I use for building the Unicode database, when I actually emit it, since the Bound
type is a binary integer, I also emit an index of the top N bits (each table can customise N). This can dramatically improve lookup performance for large tables by limiting the range that needs to be binary-searched.
For instance, the validation table I mentioned previously has 909 entries for the BMP Unicode plane (0x0000...0xFFFF). The index captures the exact locations of the top 6 bits, and we binary-search in that region for the remaining 10. Given the way the data is distributed, even in the worst case those 10 bits only consist of 165 entries, which still saves us almost 3 rounds of binary search and is more cache-friendly since it doesn't hop around as much of the table. That's the worst space; most spaces are much smaller than that - for Latiny code-points, we only need to search over 39 entries, and for CJK code-points, we can pinpoint the exact entry just from the index and don't even need to binary search at all. But of course, there are costs to calculating and storing the index: it's definitely worth it for this table (especially since the index is calculated offline), but it may not be worth it for all tables.
RangeSet
is also built atop binary search, and it stands to reason that a large RangeSet which is queried often might also benefit from an index. Should this be exposed as a separate IndexedRangeSet
type? I'm not sure; perhaps, but there are reasonable arguments either way.
All of this makes me wonder: is this really a good fit for the standard library?
I mean, I think putting it in swift-collections
would not be unreasonable. I don't see it as more fundamental than SortedSet
, Deque
, Heap
, or TreeDictionary
, for instance.
And if it does go in the standard library, should it go in the core
module? It appears we have recently adopted a trend of making things in to separate modules in the stdlib, so I wonder if this should belong in some kind of separate Collections
or RangeCollections
module?
AFAIK there isn't any sort of guiding principle which determines when something gets broken off in to a separate module, so it's hard for me to say.
Perhaps it depends how far we want to go with it? If we think we may be interested in something like SegmentedLine
in the future, and perhaps variants of that and of RangeSet
which incorporate an index, then we would have a clear logical grouping containing enough range-based collections that it may make sense to namespace them.