Add `SortedArray`, `SortedSet` and `SortedDictionary` with binary search

Binary search was asked for on these forums quite some time ago, and that discussion also went in the direction that we first need the relevant data structures, since binary search is only possible when there are sorted structures to run it on. Seeing the release of the Collections library now, I think it's time to add binary search and the related structures.

The very simplest use case I'd have for this in code form:

let sortedNumbers: SortedArray = [10, 1, 9, 2, 8, 3, 7, 4, 6, 5] // => [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
if let firstMatchingIndex = sortedNumbers.index(where: { $0 > 5 }) { // .index(where:) uses binary search and runs in O(log n)
  let firstMatch = sortedNumbers[firstMatchingIndex]
}

If you want to see a possible implementation, feel free to check out my HandySwift library, where I implemented the SortedArray type for my own needs.

This feature was also requested as the very first GitHub issue on the Collections repository, but I feel like it's more a discussion topic than an "issue", thus I'm creating this topic.

What do you think, am I misunderstanding something or is this finally the right place and time to add sorted data structures and binary search?

2 Likes

I think SortedArray would work great as an immutable type. As soon as we allow insertions, though, we hit a performance wall at around 2,000 word-sized items, rendering the data structure impractical. To me, this makes it less likely to be a useful addition to the Collections package.
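To illustrate the wall: a flat sorted array can locate the insertion point in O(log n), but the insertion itself must shift every element after that point. A minimal sketch, with a hypothetical insertSorted helper (not a real API):

extension Array where Element: Comparable {
  // Binary-search for the first slot whose element is >= `element`,
  // then insert there. The search is O(log n), but insert(_:at:)
  // shifts the tail, so each insertion is O(n) overall.
  mutating func insertSorted(_ element: Element) {
    var lo = startIndex
    var count = self.count
    while count > 0 {
      let half = count / 2
      if self[lo + half] < element {
        lo += half + 1
        count -= half + 1
      } else {
        count = half
      }
    }
    insert(element, at: lo)
  }
}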

But the new package is indeed the right place to add sorted collections, and the time is ripe, too. In fact, we have scheduled time & resources to implement them. (Which is why issue #1 is currently assigned to myself -- it's to claim the issue for the Stdlib team. :wink:)

As explained in the issue, we're planning to base the implementation on B-trees, similar to the old attaswift/BTree project. (Yes, I do have a secret plan to replace all my previous work with (far better) official reimplementations.)

Binary search is a separate topic -- I prefer to consider it an algorithm that can be applied to any random-access collection, rather than a data structure. I.e., I want to be able to run binary searches within any array slice, deque, unsafe buffer pointer etc. etc.
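For example, a sketch along these lines would work on all of those (partitionPoint is an illustrative name, not a proposed API):

extension RandomAccessCollection {
  // Precondition: the collection is already partitioned, i.e. every
  // element for which `predicate` returns false comes before every
  // element for which it returns true. Returns the start of the
  // "true" region (endIndex if there is none), in O(log n).
  func partitionPoint(where predicate: (Element) -> Bool) -> Index {
    var lo = startIndex
    var count = self.count
    while count > 0 {
      let half = count / 2
      let mid = index(lo, offsetBy: half)
      if predicate(self[mid]) {
        count = half
      } else {
        lo = index(after: mid)
        count -= half + 1
      }
    }
    return lo
  }
}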

The binary search algorithm does have an unusually complex precondition, but I don't think this precondition needs to be modeled in the type system. In fact, I think trying to do so would make it less useful. @nnnnnnnn and @timv may have other opinions though!

Note that SortedSet, SortedDictionary (SortedBag, SortedMultimap) won't be random-access collections, so the binary search algorithm wouldn't apply to them. Index-based binary search would be a terrible way to find elements in a search tree! (Of course, these types would rely on binary search to find keys within B-tree nodes, as an internal implementation detail.)
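To sketch that internal detail (with illustrative types, not the planned implementation): each node keeps its keys in a small sorted array, and lookups binary-search that array to find the matching slot or the child to descend into.

struct BTreeNode<Key: Comparable, Value> {
  var keys: [Key]
  var values: [Value]
  var children: [BTreeNode]?  // nil for leaf nodes

  // First slot whose key is >= `key`; a plain lower-bound binary
  // search over the node's key array.
  func slot(of key: Key) -> Int {
    var lo = 0
    var count = keys.count
    while count > 0 {
      let half = count / 2
      if keys[lo + half] < key {
        lo += half + 1
        count -= half + 1
      } else {
        count = half
      }
    }
    return lo
  }
}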

6 Likes

Thanks for the update. Good to hear sorted types are being worked on soon. But I just want to mention that when Swift Algorithms was released, I asked for clarification on what belongs there and what doesn't, and IIRC the gist of it was that binary search requires a data structure and therefore isn't considered an "algorithm" but more of a structure type. Hence I'm confused now, because you see it the other way around.

I also want to mention that I haven't had any use case for sorted types other than performance (through binary search, for example), so I'd be interested in learning what other problems sorted types can solve.

Please also note that I don't consider my shared SortedArray type good or complete enough to base an official type on -- it's just there to show the use case. For example, right after sharing it I found that it has some limitations that aren't documented well enough.

Please note that O(log n) binary search is already part of Swift Algorithms, just hidden behind a slightly inconvenient API and the undiscoverable name partitioningIndex(where:).
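For example (assuming the swift-algorithms package is available; any sorted collection satisfies the partitioning precondition):

import Algorithms

let numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
// Binary search for the first element greater than 5. The collection
// must already be partitioned by the predicate -- a sorted collection
// always is. O(log n) on random-access collections.
let i = numbers.partitioningIndex { $0 > 5 }
if i < numbers.endIndex {
  print(numbers[i]) // 6
}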

3 Likes

A bit off-topic, but I looked at the implementation of that and noticed the bidirectional version of partition has two open-coded loops reimplementing firstIndex(where:) and lastIndex(where:) back-to-back.

Digging deeper, I found that the standard-library implementation does the exact same thing.

It seems rather antithetical to the goal of "use the standard algorithms that are already written" when both the standard library and the Algorithms package choose to spend a dozen lines reimplementing, twice over, something that would be an easy one-liner using the standard index-finding methods (with complicated nested loops whose exit conditions are hidden in the middle, no less!).

1 Like

The Swift Project is not an omniscient, immutable, monolithic entity, and we aren't dealing with absolute truths. Otherwise there wouldn't be a need for this website at all.

This project is made of people. People change. People have different opinions. People have different priorities. Decisions tend to be based on relevant hard-earned experience, and they always involve some careful balancing of competing priorities. However, sometimes even the most well established opinions need to be refined in light of new information. Otherwise, no progress would be possible.

Programming isn't turtles all the way down -- the cost of providing really nice abstractions is that somebody somewhere needs to implement them in terms of more primitive tools. The engineers working on the stdlib and these packages sweat the details of implementing these so that others don't have to.

3 Likes

I think you may have misunderstood my point.

The standard library provides an abstraction for finding the first index satisfying a predicate, spelled firstIndex(where:).

The standard library also includes a partition function, one step of whose implementation involves finding the first index satisfying a predicate.

But, rather than simply calling firstIndex(where:), the partition function instead has an open-coded loop which duplicates the functionality of firstIndex(where:).

(It also has an open-coded loop which duplicates the functionality of lastIndex(where:).)

The Algorithms package has those same open-coded loops as well.

ā€¢ ā€¢ ā€¢

If there is something about firstIndex(where:) which makes the authors of both the standard library and the Algorithms package prefer to rewrite its functionality in open-coded loops, rather than simply call firstIndex(where:), then that speaks quite loudly.

2 Likes

:roll_eyes:

Please do not look for conspiracies where none exist. The firstIndex(where:) and partition(by:) methods exist in the standard library because we think they're very useful, and people should use them when possible. They are also sibling APIs on roughly the same level within the stdlib module. Whether or not they call each other does not indicate anything about anything. (Reducing trivial code duplication within the stdlib is a minor cosmetic concern, not an absolute goal.)

The default MutableCollection.partition(by:) implementation does call firstIndex(where:). However, the bidirectional variant doesn't, simply because it isn't an exact fit. This fact implies absolutely nothing about whether we think people should use firstIndex(where:) -- we do!
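For reference, here's the kind of client code we want to encourage with those two siblings:

var numbers = [30, 40, 20, 30, 30, 60, 10]
// Reorder so the elements satisfying the predicate follow the rest;
// the returned index is the start of the second partition.
let p = numbers.partition(by: { $0 > 30 })
// numbers[..<p] now holds the elements <= 30, numbers[p...] those > 30.
let firstLarge = numbers.firstIndex(where: { $0 > 30 }) // Optional(p)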

Are you derailing this conversation to relitigate PR #34814? I'm sorry I offended you by declining your suggestion. I stated my reasoning in the PR -- but note that I felt this was a minor cosmetic disagreement (a matter of taste). I didn't realize you'd take it as a sign of some sinister intent, and that you'd bring it up months later in a completely unrelated context.

To clear this up, I've spent some time explaining my reasoning in more detail below.

All things being equal, I preferred the code to reflect the algorithm in Hoare's original paper, because I find it has a pleasing symmetry about it. Hoare wrote his paper during the Kennedy administration in 1962, and back then he chose to present his algorithm using a popular programming language at the time, called "English". :wink:

The items to be sorted are scanned by two pointers; one of them, the lower pointer, starts at the item with lowest address, and moves upward in the store, while the other, the upper pointer, starts at the item with the highest address and moves downward. The lower pointer starts first. If the item to which it refers has a key which is equal to or less than the bound, it moves up to point to the item in the next higher group of locations. It continues to move up until it finds an item with key value greater than the bound. In this case the lower pointer stops and the upper pointer starts its scan. If the item to which it refers has a key which is equal to or greater than the bound, it moves down to point to the item in the next lower locations. It continues to move down until it finds an item with key value less than the bound. Now the two items to which the pointers refer are obviously in the wrong positions, and they must be exchanged. After the exchange, each pointer is stepped one item in its appropriate direction, and the lower pointer resumes its upward scan of the data. The process continues until the pointers cross each other, so that the lower pointer refers to an item in higher-addressed locations than the item referred to by the upper pointer. In this case, the exchange of items is suppressed, the dividing line is drawn between the two pointers, and the partition process is at an end.

If we port this code sample to Swift, we get the algorithm I committed.
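Something along these lines -- a sketch in the shape of the committed code, with an illustrative method name:

extension MutableCollection where Self: BidirectionalCollection {
  // Hoare's two-pointer scan: `lo` moves up to the first element of
  // the second partition, `hi` moves down to the last element of the
  // first partition; out-of-place pairs are exchanged until the two
  // indices cross.
  mutating func hoarePartition(
    by belongsInSecondPartition: (Element) throws -> Bool
  ) rethrows -> Index {
    var lo = startIndex
    var hi = endIndex
    while true {
      // Scan up for an element that belongs in the second partition.
      while true {
        if lo == hi { return lo }
        if try belongsInSecondPartition(self[lo]) { break }
        formIndex(after: &lo)
      }
      // Scan down for an element that belongs in the first partition.
      repeat {
        formIndex(before: &hi)
        if lo == hi { return lo }
      } while try belongsInSecondPartition(self[hi])
      // Both are in the wrong place: exchange and continue scanning.
      swapAt(lo, hi)
      formIndex(after: &lo)
    }
  }
}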

If we wanted, we could reformulate this code in terms of firstIndex(where:) and lastIndex(where:), to get something like:

var lo = startIndex
var hi = endIndex
while lo < hi {
  guard let i = try self[lo ..< hi].firstIndex(where: belongsInSecondPartition) else { 
    return hi
  }
  guard let j = try self[i ..< hi].lastIndex(where: { try !belongsInSecondPartition($0) }) else {
    return i
  }
  
  swapAt(i, j)
  lo = index(after: i)
  hi = j
}
return lo

This keeps the structure of the algorithm intact. It does not reduce the complexity of the algorithm at all, though -- this is still operating on the level of raw indices. In fact, I have probably introduced some bugs, because hammering the code into the shape of the two sibling algorithms was not trivial. And the resulting code still includes a raw index(after:) invocation, so the change has achieved essentially nothing, other than slightly obscuring the underlying logic -- but note that this is just my opinion, not an absolute fact.

None of this discussion is relevant to code outside the stdlib and its menagerie of packages though. (And it has absolutely no relevance whatsoever to the topic of sorted collections.) I do wholeheartedly believe that the best way to code is to never deal with raw indices. (Even firstIndex(where:) and partition(by:) are too low-level for direct use for my taste -- because they return naked indices. It is usually a much better idea to express things with higher-level algorithms such as the removeAll(where:) implementation that calls the half stable partition algorithm followed by removeSubrange.)
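For instance:

var scores = [42, 9, 17, 3, 28, 5]
// The caller never touches an index; internally this can be a half
// stable partition followed by a single removeSubrange.
scores.removeAll(where: { $0 < 10 })
// scores == [42, 17, 28]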

Note though that this is just a helpful coding guideline to prevent people from needlessly wasting their time; I'm not aware of any plans to eliminate the low-level indexing operations, and I would likely argue against them if someone proposed such a thing. Only a Sith deals in absolutes. The indexing methods are there for good reasons, and they aren't going to go away:

  1. I think we can build out a rich library of standard algorithms that prevents the need to code raw loops in the vast majority of cases, but I expect there will always be niche problems that are best solved by rolling your own code from scratch.

    The idea is that these should be rare, because they are so much more difficult to implement correctly. I'm not just talking about implementing the raw loop itself: every time we write a raw loop, we should also write a bunch of custom testing code (including benchmarks if there is a perf concern). Every one of these loops adds a large amount of complexity to the codebase, which increases the cost of changing the code later, and elevates the risk of introducing new bugs whenever we do so. It also prevents the codebase from making use of potential future improvements to the standard algorithm implementations.

  2. There is also no problem writing raw loops outside production code -- the ability to do so when necessary is a great skill to have, and working out a correct loop is always a fun puzzle.

    However, in production code, I do strongly believe it's better to use standard algorithms where they exist. (It usually isn't a responsible use of resources to build an industrial laser cutter to open a single packet of chips, no matter how fun it would be.)

  3. It is okay to reinvent the wheel when your job is to design wheels in a wheel factory. The standard library codebase is an algorithms factory; as engineers working on it, it is our business to write raw algorithms, so that our customers don't have to.

    Of course, the stdlib engineers should by all means eat their own dog food -- how else would we know if it tastes good? But I think measuring the quality of the stdlib codebase itself by how pedantic it is about layering would be a costly and counter-productive mistake. We can (and do!) eat our own dog food outside the core stdlib. (Unfortunately, I find that working on the stdlib has tainted me with an insatiable hunger for low-level understanding and optimization -- which is very undesirable & completely impractical in most other software engineering contexts.)

To put it another way -- the stdlib tends to err on the side of Knuth, rather than McIlroy, but the reason for this is that we want to enable more people to use McIlroy-style solutions in more contexts.

Very few people can obtain the virtuoso services of Knuth (or afford the equivalent person-weeks of lesser personnel) to attack nonce problems such as Bentley's from the ground up. But old Unix™ hands know instinctively how to solve this one in a jiffy. (So do users of SNOBOL and other programmers who have associative tables readily at hand -- for almost any small problem, there's some language that makes it a snap.) The following shell script was written on the spot and worked on the first try. It took 30 seconds to handle a 10,000-word file on a VAX-11/750™.

I think you and I both want the same thing -- that Swift should allow people to solve complex problems in a jiffy. We just have an aesthetic disagreement about the internal design of one of the tiny tools that get us there. I don't want to belittle the differences, but as these things go, they are very minor.

The variants you recommended are still raw loops, so in my eyes they have the same complexity as the original code. They also break the symmetry I like so much, and they aren't backed by a well-known paper, so I declined to take them. If there is a technical reason we should go with these variants (say, they lead to better code generation, or smaller code, etc.), then please do submit a PR demonstrating these benefits. I don't think there are any, but I can easily be mistaken.

(Hoare's work is 60 years old at this point, and I'm sure there has been more recent research. This may be a fruitful path to attacking my bias as a Hoare fanboi. :wink:)

2 Likes