Add `SortedArray`, `SortedSet` and `SortedDictionary` with binary search

:roll_eyes:

Please do not look for conspiracies where none exist. The firstIndex(where:) and partition(by:) methods exist in the standard library because we think they're very useful, and people should use them when possible. They are also sibling APIs on roughly the same level within the stdlib module. Whether or not they call each other does not indicate anything about anything. (Reducing trivial code duplication within the stdlib is a minor cosmetic concern, not an absolute goal.)

The default MutableCollection.partition(by:) implementation does call firstIndex(where:). However, the bidirectional variant doesn't, simply because it isn't an exact fit. This fact implies absolutely nothing about whether we think people should use firstIndex(where:) -- we do!

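(To make the "sibling APIs" point concrete, here's a minimal usage sketch of the two methods side by side. The sample values are made up, and the exact element order after partitioning is unspecified by the API.)

```swift
var numbers = [3, 8, 1, 9, 4, 7]

// firstIndex(where:) returns the index of the first element matching the predicate.
let firstLarge = numbers.firstIndex(where: { $0 > 5 })  // Optional(1), the index of 8

// partition(by:) moves the matching elements into a suffix and returns the index
// where that suffix starts; the order within each half is unspecified.
let pivot = numbers.partition(by: { $0 > 5 })
// numbers[..<pivot] now contains 3, 1, 4 (in some order);
// numbers[pivot...] now contains 8, 9, 7 (in some order).
```
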
Are you derailing this conversation to relitigate PR #34814? I'm sorry I offended you by declining your suggestion. I stated my reasoning in the PR -- but note that I felt this was a minor cosmetic disagreement (a matter of taste). I didn't realize you'd take it as a sign of some sinister intent, or that you'd bring it up months later in a completely unrelated context.

To clear this up, I've spent some time explaining my reasoning in more detail below.

All things being equal, I preferred the code to reflect the algorithm in Hoare's original paper, because I find it has a pleasing symmetry. Hoare wrote his paper during the Kennedy administration, in 1962, and back then he chose to present his algorithm using a popular programming language of the time, called "English". :wink:

> The items to be sorted are scanned by two pointers; one of them, the lower pointer, starts at the item with lowest address, and moves upward in the store, while the other, the upper pointer, starts at the item with the highest address and moves downward. The lower pointer starts first. If the item to which it refers has a key which is equal to or less than the bound, it moves up to point to the item in the next higher group of locations. It continues to move up until it finds an item with key value greater than the bound. In this case the lower pointer stops and the upper pointer starts its scan. If the item to which it refers has a key which is equal to or greater than the bound, it moves down to point to the item in the next lower locations. It continues to move down until it finds an item with key value less than the bound. Now the two items to which the pointers refer are obviously in the wrong positions, and they must be exchanged. After the exchange, each pointer is stepped one item in its appropriate direction, and the lower pointer resumes its upward scan of the data. The process continues until the pointers cross each other, so that the lower pointer refers to an item in higher-addressed locations than the item referred to by the upper pointer. In this case, the exchange of items is suppressed, the dividing line is drawn between the two pointers, and the partition process is at an end.

If we port this code sample to Swift, we get the algorithm I committed.

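For illustration, a fairly literal transliteration of that description into Swift could look roughly like this. This is a sketch written for this post -- the method name is invented, and it is not the exact code that was committed -- but it has the same two-pointer shape:

```swift
extension MutableCollection where Self: BidirectionalCollection {
  // Hoare-style two-pointer partition, following the 1962 description.
  // (Illustrative name and sketch for this post; not the stdlib's actual code.)
  mutating func hoareStylePartition(
    by belongsInSecondPartition: (Element) throws -> Bool
  ) rethrows -> Index {
    var lo = startIndex
    var hi = endIndex
    while true {
      // Lower pointer: scan upward until it finds an element that
      // belongs in the second partition.
      while lo != hi {
        if try belongsInSecondPartition(self[lo]) { break }
        formIndex(after: &lo)
      }
      if lo == hi { return lo }
      // Upper pointer: scan downward until it finds an element that
      // belongs in the first partition.
      while true {
        formIndex(before: &hi)
        if lo == hi { return lo }
        if try !belongsInSecondPartition(self[hi]) { break }
      }
      // Both pointers stopped on misplaced elements: exchange them, step
      // the lower pointer, and resume the scan.
      swapAt(lo, hi)
      formIndex(after: &lo)
    }
  }
}
```
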
If we wanted, we could reformulate this code in terms of firstIndex(where:) and lastIndex(where:), to get something like:

```swift
var lo = startIndex
var hi = endIndex
while lo < hi {
  guard let i = try self[lo ..< hi].firstIndex(where: belongsInSecondPartition) else {
    return hi
  }
  guard let j = try self[i ..< hi].lastIndex(where: { try !belongsInSecondPartition($0) }) else {
    return i
  }

  swapAt(i, j)
  lo = index(after: i)
  hi = j
}
return lo
```

This keeps the structure of the algorithm intact. It does not reduce the complexity of the algorithm at all, though -- this is still operating on the level of raw indices. In fact, I have probably introduced some bugs, because hammering the code into the shape of the two sibling algorithms was not trivial. And the resulting code still includes a raw index(after:) invocation, so the change has achieved essentially nothing, other than slightly obscuring the underlying logic -- but note that this is just my opinion, not an absolute fact.

None of this discussion is relevant to code outside the stdlib and its menagerie of packages, though. (And it has absolutely no relevance whatsoever to the topic of sorted collections.) I do wholeheartedly believe that the best way to code is to never deal with raw indices. (Even firstIndex(where:) and partition(by:) are too low-level for direct use, for my taste -- because they return naked indices. It is usually a much better idea to express things with higher-level algorithms, such as the removeAll(where:) implementation that calls the half-stable partition algorithm followed by removeSubrange.)

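To sketch what I mean: the names halfStablePartition(isSuffixElement:) and removeAllSketch(where:) below are made up for this post, and this is not the verbatim stdlib source, but the stdlib's removeAll(where:) has roughly this shape.

```swift
extension RangeReplaceableCollection where Self: MutableCollection {
  // Moves every element satisfying `isSuffixElement` to the back, keeping
  // the relative order of the remaining (prefix) elements intact, and
  // returns the start of the suffix.
  // (Illustrative name; the stdlib uses an internal helper of a similar shape.)
  mutating func halfStablePartition(
    isSuffixElement: (Element) throws -> Bool
  ) rethrows -> Index {
    guard var i = try firstIndex(where: isSuffixElement) else { return endIndex }
    var j = index(after: i)
    while j != endIndex {
      if try !isSuffixElement(self[j]) {
        swapAt(i, j)
        formIndex(after: &i)
      }
      formIndex(after: &j)
    }
    return i
  }

  // removeAll(where:) expressed as two higher-level steps: partition the
  // doomed elements into a suffix, then remove that suffix in one call.
  // (Illustrative name, to avoid clashing with the real method.)
  mutating func removeAllSketch(
    where shouldBeRemoved: (Element) throws -> Bool
  ) rethrows {
    let suffixStart = try halfStablePartition(isSuffixElement: shouldBeRemoved)
    removeSubrange(suffixStart..<endIndex)
  }
}
```
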
Note though that this is just a helpful coding guideline to prevent people from needlessly wasting their time; I'm not aware of any plans to eliminate the low-level indexing operations, and I would likely argue against them if someone proposed such a thing. Only a Sith deals in absolutes. The indexing methods are there for good reasons, and they aren't going to go away:

  1. I think we can build out a rich library of standard algorithms that removes the need to code raw loops in the vast majority of cases, but I expect there will always be niche problems that are best solved by rolling your own code from scratch.

    The idea is that these should be rare, because they are so much more difficult to implement correctly. I'm not just talking about implementing the raw loop itself: every time we write a raw loop, we should also write a bunch of custom testing code (including benchmarks if there is a perf concern). Every one of these loops adds a large amount of complexity to the codebase, which increases the cost of changing the code later, and elevates the risk of introducing new bugs whenever we do so. It also prevents the codebase from making use of potential future improvements to the standard algorithm implementations.

  2. There is also no problem writing raw loops outside production code -- the ability to do so when necessary is a great skill to have, and working out a correct loop is always a fun puzzle.

    However, in production code, I do strongly believe it's better to use standard algorithms where they exist. (It usually isn't a responsible use of resources to build an industrial laser cutter to open a single packet of chips, no matter how fun it would be.)

  3. It is okay to reinvent the wheel when your job is to design wheels in a wheel factory. The standard library codebase is an algorithms factory; as engineers working on it, it is our business to write raw algorithms, so that our customers don't have to.

    Of course, the stdlib engineers should by all means eat their own dog food -- how else would we know if it tastes good? But I think measuring the quality of the stdlib codebase itself by how pedantic it is about layering would be a costly and counter-productive mistake. We can (and do!) eat our own dog food outside the core stdlib. (Unfortunately, I find that working on the stdlib has tainted me with an insatiable hunger for low-level understanding and optimization -- which is very undesirable & completely impractical in most other software engineering contexts.)

To put it another way -- the stdlib tends to err on the side of Knuth, rather than McIlroy, but the reason for this is that we want to enable more people to use McIlroy-style solutions in more contexts.

> Very few people can obtain the virtuoso services of Knuth (or afford the equivalent person-weeks of lesser personnel) to attack nonce problems such as Bentley’s from the ground up. But old Unix™ hands know instinctively how to solve this one in a jiffy. (So do users of SNOBOL and other programmers who have associative tables readily at hand -- for almost any small problem, there’s some language that makes it a snap.) The following shell script was written on the spot and worked on the first try. It took 30 seconds to handle a 10,000-word file on a VAX-11/750™.

I think you and I both want the same thing -- that Swift should allow people to solve complex problems in a jiffy. We just have an aesthetic disagreement about the internal design of one of the tiny tools that get us there. I don't want to belittle the differences, but as these things go, they are very minor.

The variants you recommended are still raw loops, so in my eyes they have the same complexity as the original code. They also break the symmetry I like so much, and they aren't backed by a well-known paper, so I declined to take them. If there is a technical reason we should go with these variants (say, they lead to better code generation, or smaller code, etc.), then please do submit a PR demonstrating these benefits. I don't think there are any, but I can easily be mistaken.

(Hoare's work is 60 years old at this point, and I'm sure there has been more recent research since. This may be a fruitful path to attacking my bias as a Hoare fanboi. :wink:)
