Require Sequence.underestimatedCount be O(1)

Currently the default implementation of underestimatedCount property for a Collection is to simply return count. Problem with that is count is allowed to be O(n) operation on some collections and can have a side-effect of iterating over the whole collection, which defeats the whole purpose of underestimatedCount.

The proposed change makes underestimatedCount strictly O(1), which requires a certain pessimization of it's return value for Collection instances that are only forward or bidirectional. That is when such a collection is passed to a method generic over Sequence, append(contentsOf:) for example, the call to reserveCapacity might not result in allocation of enough memory up font.

RandomAccessCollection instances guarantee constant time count and therefore still can provide a default implementation of underestimatedCount in terms of count.

Implementation is available at: apple/swift#14994.

UPD: This seems particularly important when working with Strings. Since String conforms to Collection but it's count requires the traversal, so when str.append(contentsOf: anotherString) is called in generic context, anotherString will be iterated over twice unnecessarily.

UPD 2: swift-evolution proposal is available at Add an underestimatedCount proposal by moiseev · Pull Request #819 · apple/swift-evolution · GitHub

12 Likes

I've already brought this up for Collection, including a bug report.

Some of the viewers of that bug want to keep the property as O(n) for legacy compatibility for Collection users, although that by definition breaks legacy compatibility for pure-Sequence users. Since the property is supposed to be an under-estimate, I agree that we should sacrifice accuracy for speed.

2 Likes

My thinking is: relying on underestimatedCount was a mistake if anyone did it. And why would anyone do it in the first place when there's a much more discoverable property called count?

Delegating to isEmpty does not seem like the right thing to do. The naïve implementation can be just count == 0, which in combination with a linear count would result in the same problem we're trying to avoid.

+1. Without this, I cannot use underestimatedCount in any performance-sensitive context (which are all the contexts I currently work in) . The only workaround is to constrain all my generics to RandomAccessCollection (which has O(1) count), but that's not possible when implementing more generic API.

Long term, I think a more sophisticated estimate would make sense.

enum EstimatedCount {
  case unknown
  case infinite
  case exactly(Int)
  case underestimate(Int)
  case overestimate(Int)
}
extension EstimatedCount {
  func isAtLeast(_ n: Int) -> Bool { ... }
  func isAtMost(_ n: Int) -> Bool { ... }
}
4 Likes

+1

I think in general, performance guarantees follow your generic constraints. So count is potentially O(n) unless you add the RandomAccessCollection, constraint as @Michael_Ilseman notes. The whole purpose of underestimatedCount is to provide a version of count which is guaranteed O(1) in non-constrained contexts, with the tradeoff that it might not be accurate.

The only other thing I would change about this property (potentially more source-breaking) is to add the requirement that underestimatedCount must be at least 1 if the Sequence is not empty.

How would you tell that it's not empty? Iterating over a Sequence is consuming and this is pretty much the only way I can think of to tell whether there's anything in there: makeIterator().next() != nil.

1 Like

I was going to say use isEmpty, but I see that's only on Collection, not Sequence.

Can we at least make it a requirement for Collection? The same way we change the complexity if it conforms to RandomAccessCollection?

Otherwise your Collection-constrained code needs to account for somebody implementing this with return 0. That means that any time you use underestimatedCount, you instead have to write max(underestimatedCount, isEmpty ? 0:1) or you might create a bug.

See my reply above:

Can you please elaborate on the "bug" thing?

That would only be a valid implementation of isEmpty if count also happens to be O(1). Otherwise you already broke the performance requirements, and when you (the programmer) becomes less naïve, you can fix it and see performance wins from code that was using the protocol correctly.

Many would assume that an underestimatedCount of 0 means the Collection is empty (not knowing that it comes from Sequence, where such things cannot be easily tested). That might mean you skip processing altogether, allocate a buffer which can hold 0 items (if you gated by isEmpty before processing), or calculate a size for batch operations which is some fraction of 0 (i.e. 0).

Why would many assume such a thing? underestimatedCount is very clearly underestimated, and 0 is an underestimation.

2 Likes

You generally would use underestimatedCount only when you're using a Sequence. This is why the current notes of making Collection.underestimatedCount forward to count is a big mistake; if you're not calling the equally accessible count instead then you intentionally want speed over accuracy. Or your reference to the collection is only as a Sequence and the current version of Collection.underestimatedCount will secretly screw you over.

Delegating to isEmpty is the right thing to do. This property must be O(1), while count can be at least O(n); any implementation of isEmpty that calls count is broken right then and there (modulo that specific type following RandomAccessCollection) and you need to file a bug report for that. But the most naïve implementation of isEmpty is the existing one, which just calls startIndex and endIndex, which both have to be O(1). The developer shouldn't be calling count within isEmpty unless s/he knows that count is faster than the two index properties.

1 Like

You're right that when somebody does that, it's clearly a bug as 0 is still technically an underestimate.

Still, a non-empty collection returning 0 is still counter-intuitive. I've seen it. It's easy to work-around, but I can also appreciate the view that return 0 is a crappy estimation given that you can check isEmpty in constant-time.

It's documented as O(1). If you can satisfy that using count, all power to you. If not, find another way or drop down to Sequence.

I do not disagree with using isEmpty. Let me explain what situation I am trying to avoid.

Standard library provides a default implementation of underestimatedCount that's documented to be O(1), and internally calls isEmpty. Now enters the 3rd party developed Collection implementation that, for some reason, violates the "countisEmpty is O(1)" rule, and does not provide it's own implementation of underestimatedCount (rightfully so, because why would you care about the Sequence API in a Collection type). Voila! The default implementation provided by the standard library now violates it's own set of rules.

This, arguably artificial, situation is easily avoidable by just returning 0 all the time.

UPD: Corrected count to isEmpty.

1 Like

That is not a rule for Collection.

Yeah... I meant isEmpty. Sorry =)

I'm not sure if I understand you here. If you've violated isEmpty being O(1), can you really fault Swift for breaking other guarantees?

Then that third-party type does not actually conform to Collection, and its implementor was lying when they added that conformance.

Protocols are emphatically *not* bags of syntax. They have semantic requirements, and a type must meet those requirements in order to conform.

Collection.isEmpty must be O(1) or else the type does not conform.

2 Likes

It's artificial, but it's also hugely difficult to debug because the compiler can't help you with complexity requirements.

Given @CTMacUser's recent demonstration of difficulties with conforming to complex protocols and finding unintentional behavior from default implementations, it seems a terribly bad idea to intentionally add to the number of potential footguns.