Collection underestimatedCount() does _?


(Will Stanton) #1

Hello,

I came across `underestimatedCount()` and wonder what it does. I’m trying to become more familiar with Swift collections so (perhaps) I can write better-performing code!

/// Returns a value less than or equal to the number of elements in
/// `self`, *nondestructively*.
///
/// - Complexity: O(N).
public func underestimateCount() -> Int
/// Returns the number of elements.
///
/// - Complexity: O(1) if `Index` conforms to `RandomAccessIndexType`;
/// O(N) otherwise.
public var count: Self.Index.Distance { get }

I might have missed something since a lot of the results are for tests <https://github.com/apple/swift/search?utf8=✓&q=underestimatedCount&type=Code>
But why would `underestimatedCount` ever be used instead of `count`? Don’t most collections have O(1) `count` anyway?

Regards,
Will Stanton


(Dmitri Gribenko) #2

Hi Will,

This API comes from Sequence, which does not have a count that you can
inspect without possibly consuming the sequence.

Dmitri

···

On Fri, Mar 18, 2016 at 10:53 PM, Will Stanton via swift-users <swift-users@swift.org> wrote:

I might have missed something since a lot of the results are for tests <https://github.com/apple/swift/search?utf8=✓&q=underestimatedCount&type=Code>
But why would `underestimatedCount` ever be used instead of `count`? Don’t most collections have O(1) `count` anyway?

--
main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
(j){printf("%d\n",i);}}} /*Dmitri Gribenko <gribozavr@gmail.com>*/


(Will Stanton) #3

Hello Dmitri,

Thank you – wasn’t aware and jumped to CollectionType’s redeclaration!

File https://bugs.swift.org/browse/SR-991 since I thought the documentation could say that right there.

Regards,
Will Stanton

···

On Mar 19, 2016, at 2:08 AM, Dmitri Gribenko via swift-users <swift-users@swift.org> wrote:

Hi Will,

This API comes from Sequence, which does not have a count that you can
inspect without possibly consuming the sequence.


(Fritz Anderson) #4

Am I right that notionally-infinite sequences (like bytes from a random-number generator) should report Int.max (making friends with zip(_:,_:wink: and laziness), or am I advised not to use Sequence semantics for those at all?

(Contemplating protocol InfiniteSequence, but not asking for it as a future.)

    — F

···

On Mar 19, 2016, at 1:08 AM, Dmitri Gribenko via swift-users <swift-users@swift.org> wrote:

This API comes from Sequence, which does not have a count that you can
inspect without possibly consuming the sequence.


(Brent Royal-Gordon) #5

I might have missed something since a lot of the results are for tests <https://github.com/apple/swift/search?utf8=✓&q=underestimatedCount&type=Code>
But why would `underestimatedCount` ever be used instead of `count`? Don’t most collections have O(1) `count` anyway?

Hi Will,

This API comes from Sequence, which does not have a count that you can
inspect without possibly consuming the sequence.

To add some color:

A sequence merely says that it has a bunch of elements and you can walk through them. It does *not* promise several important things:

* That you can get the elements in the sequence more than once
* That it knows the number of elements in the sequence and can quickly return them
* That you can get the number of elements in the sequence without walking through and counting them one by one

This adds up to an ugly conclusion: With just a sequence, it is impossible to efficiently discover the count, and doing so might destroy the sequence.

That's obviously not so great. In particular, it's not so great when you want to load a sequence into an array. The simple, naive way is to do this:

  // Notionally
  struct Array<Element> {
    init<Seq: SequenceType where Seq.Generator.Element == Element>(_ seq: Seq) {
      self.init()
      
      for elem in seq {
        appendElement(elem)
      }
    }
  }

But this may resize the array several times, which would be very slow. The fastest way to do that will be to pre-allocate the exact right number of elements so you don't have to keep resizing the array:

  // Notionally
  struct Array<Element> {
    init<Seq: SequenceType where Seq.Generator.Element == Element>(_ seq: Seq) {
      self.init()
      
      reserveCapacity(seq.count)
      for elem in seq {
        _appendElementQuicklyBecauseWeAlreadyKnowWeHaveTheCapacity(elem)
      }
    }
  }

But `seq.count` might consume the sequence, leaving you unable to add any elements. What do we do?

Enter `underestimateCount()`. We can always call `underestimateCount()` and size to whatever it says we should be. For sequences with easily-calculated counts, this should give us a size that's just right. For sequences where they can kind of estimate the right count (for instance, if you're decoding a fixed-size buffer of UTF-8 bytes, and you know how many bytes there are but not exactly how many characters they'll decode to), it will get us at least part of the way there. For sequences whose size is completely unknown, it won't help but it won't hurt much either.

So you do this:

  // Notionally
  struct Array<Element> {
    init<Seq: SequenceType where Seq.Generator.Element == Element>(_ seq: Seq) {
      self.init()
      
      let fastCount = seq.underestimateCount()
      reserveCapacity(fastCount)
      
      // We'll have to pull elements out manually.
      let generator = seq.generate()
      
      // Load all the fast elements
      for _ in 0..<fastCount {
        let elem = generator.next()!
        _appendElementQuicklyBecauseWeAlreadyKnowWeHaveTheCapacity(elem)
      }
      
      // Get anything else that's left
      while let elem = generator.next() {
        appendElement(elem)
      }
    }
  }

(If you're looking at Swift 3, remember: SequenceType is now Sequence, Generator is now Iterator, generate() is now makeIterator(), and underestimateCount() is now underestimatedCount().)

And in fact, this is quite similar to some real source code inside Array:

  @warn_unused_result
  internal func _copySequenceToNativeArrayBuffer<
    S : SequenceType
  >(source: S) -> _ContiguousArrayBuffer<S.Generator.Element> {
    let initialCapacity = source.underestimateCount()
    var builder =
      _UnsafePartiallyInitializedContiguousArrayBuffer<S.Generator.Element>(
        initialCapacity: initialCapacity)
  
    var generator = source.generate()
  
    // FIXME(performance): use _initializeTo().
  
    // Add elements up to the initial capacity without checking for regrowth.
    for _ in 0..<initialCapacity {
      builder.addWithExistingCapacity(generator.next()!)
    }
  
    // Add remaining elements, if any.
    while let element = generator.next() {
      builder.add(element)
    }
  
    return builder.finish()
  }

Now, collections *do* promise that you can walk through a sequence more than once, and some collections (ones that promise "random access") promise that `count` can be calculated quickly. But because CollectionType conforms to SequenceType, some code may receive a collection but only know it's a sequence. Fortunately, the standard library provides a default implementation of `underestimateCount` for collections:

    public func underestimateCount() -> Int {
      return numericCast(count)
    }

So you just don't have to worry about this on collections. Implement `count` and `understimateCount` will take care of itself.

HTH,

···

--
Brent Royal-Gordon
Architechies


(Jens Alfke) #6

No; linked lists and trees, which are pretty popular data structures, are generally O(n) to count.

—Jens

···

On Mar 19, 2016, at 1:53 AM, Will Stanton via swift-users <swift-users@swift.org> wrote:

Don’t most collections have O(1) `count` anyway?


(Fritz Anderson) #7

Okay, so… maybe I am unfairly treating your summary as the whole story… returning to the example of the contents of /dev/random, or quadrature data from a wireless receiver, which are supposed to be neither finite nor reproducible, but are indisputably one-thing-after-another — sequences in the common sense of the word …

Am I to gather

1: that the standard API requires that
every Sequence type can be instantiated so as to replicate the contents of another? The whole value of Sequence to the examples is that it should be impossible.

2: that the standard API requires of every Sequence that it should serve the expectation of the supposedly-uncoupled API of Array that it is free to buffer the entire contents of any Sequence of the same Element? Need it or not? Bad news for memory and performance.

The answer may be that if the user knows enough to build a Sequence for which laziness is indispensable, she should have better sense than to rely on those two things, required or not.

But a lifetime of Law and debugging have given me an exceptionally dirty mind. If those really are API, I can avoid them, but I can't expect others — including the standard library — to.

    — F

···

On Mar 19, 2016, at 2:53 PM, Brent Royal-Gordon via swift-users <swift-users@swift.org> wrote:

I might have missed something since a lot of the results are for tests <https://github.com/apple/swift/search?utf8=✓&q=underestimatedCount&type=Code>
But why would `underestimatedCount` ever be used instead of `count`? Don’t most collections have O(1) `count` anyway?

Hi Will,

This API comes from Sequence, which does not have a count that you can
inspect without possibly consuming the sequence.

To add some color:

A sequence merely says that it has a bunch of elements and you can walk through them. It does *not* promise several important things:

* That you can get the elements in the sequence more than once
* That it knows the number of elements in the sequence and can quickly return them
* That you can get the number of elements in the sequence without walking through and counting them one by one

This adds up to an ugly conclusion: With just a sequence, it is impossible to efficiently discover the count, and doing so might destroy the sequence.

That's obviously not so great. In particular, it's not so great when you want to load a sequence into an array. The simple, naive way is to do this:

   // Notionally
   struct Array<Element> {
       init<Seq: SequenceType where Seq.Generator.Element == Element>(_ seq: Seq) {
           self.init()
           
           for elem in seq {
               appendElement(elem)
           }
       }
   }

But this may resize the array several times, which would be very slow. The fastest way to do that will be to pre-allocate the exact right number of elements so you don't have to keep resizing the array:

   // Notionally
   struct Array<Element> {
       init<Seq: SequenceType where Seq.Generator.Element == Element>(_ seq: Seq) {
           self.init()
           
           reserveCapacity(seq.count)
           for elem in seq {
               _appendElementQuicklyBecauseWeAlreadyKnowWeHaveTheCapacity(elem)
           }
       }
   }

But `seq.count` might consume the sequence, leaving you unable to add any elements. What do we do?

Enter `underestimateCount()`. We can always call `underestimateCount()` and size to whatever it says we should be. For sequences with easily-calculated counts, this should give us a size that's just right. For sequences where they can kind of estimate the right count (for instance, if you're decoding a fixed-size buffer of UTF-8 bytes, and you know how many bytes there are but not exactly how many characters they'll decode to), it will get us at least part of the way there. For sequences whose size is completely unknown, it won't help but it won't hurt much either.

So you do this:

   // Notionally
   struct Array<Element> {
       init<Seq: SequenceType where Seq.Generator.Element == Element>(_ seq: Seq) {
           self.init()
           
           let fastCount = seq.underestimateCount()
           reserveCapacity(fastCount)
           
           // We'll have to pull elements out manually.
           let generator = seq.generate()
           
           // Load all the fast elements
           for _ in 0..<fastCount {
               let elem = generator.next()!
               _appendElementQuicklyBecauseWeAlreadyKnowWeHaveTheCapacity(elem)
           }
           
           // Get anything else that's left
           while let elem = generator.next() {
               appendElement(elem)
           }
       }
   }

(If you're looking at Swift 3, remember: SequenceType is now Sequence, Generator is now Iterator, generate() is now makeIterator(), and underestimateCount() is now underestimatedCount().)

And in fact, this is quite similar to some real source code inside Array:

   @warn_unused_result
   internal func _copySequenceToNativeArrayBuffer<
     S : SequenceType
   >(source: S) -> _ContiguousArrayBuffer<S.Generator.Element> {
     let initialCapacity = source.underestimateCount()
     var builder =
       _UnsafePartiallyInitializedContiguousArrayBuffer<S.Generator.Element>(
         initialCapacity: initialCapacity)
   
     var generator = source.generate()
   
     // FIXME(performance): use _initializeTo().
   
     // Add elements up to the initial capacity without checking for regrowth.
     for _ in 0..<initialCapacity {
       builder.addWithExistingCapacity(generator.next()!)
     }
   
     // Add remaining elements, if any.
     while let element = generator.next() {
       builder.add(element)
     }
   
     return builder.finish()
   }

Now, collections *do* promise that you can walk through a sequence more than once, and some collections (ones that promise "random access") promise that `count` can be calculated quickly. But because CollectionType conforms to SequenceType, some code may receive a collection but only know it's a sequence. Fortunately, the standard library provides a default implementation of `underestimateCount` for collections:

     public func underestimateCount() -> Int {
       return numericCast(count)
     }

So you just don't have to worry about this on collections. Implement `count` and `understimateCount` will take care of itself.


(Will Stanton) #8

Hello Brent,

Thanks kindly for the flair!

You gave cases for which `underestimatedCount()` is used:

For sequences with easily-calculated counts, this should give us a size that's just right. For sequences where they can kind of estimate the right count (for instance, if you're decoding a fixed-size buffer of UTF-8 bytes, and you know how many bytes there are but not exactly how many characters they'll decode to), it will get us at least part of the way there. For sequences whose size is completely unknown, it won't help but it won't hurt much either.

CC’ing swift-dev, I’m wondering:
Was there a reason `underestimatedCount()` was chosen for Sequence instead of `estimatedCount()`. I suppose an overestimate would be bad for iterating `0..<estimatedCount()`, but I couldn’t find any existing loops over `0..<underestimatedCount()`.

Does anything depend on an underestimate of Sequence count? And if not, would it be better to be less restrictive when asking a Sequence for a size estimate?
Since the exact count is returned for collections, the name also isn’t very precise.

Regards,
Will Stanton

···

On Mar 19, 2016, at 3:53 PM, Brent Royal-Gordon via swift-users <swift-users@swift.org> wrote:

I might have missed something since a lot of the results are for tests <https://github.com/apple/swift/search?utf8=✓&q=underestimatedCount&type=Code>
But why would `underestimatedCount` ever be used instead of `count`? Don’t most collections have O(1) `count` anyway?

Hi Will,

This API comes from Sequence, which does not have a count that you can
inspect without possibly consuming the sequence.

To add some color:

A sequence merely says that it has a bunch of elements and you can walk through them. It does *not* promise several important things:

* That you can get the elements in the sequence more than once
* That it knows the number of elements in the sequence and can quickly return them
* That you can get the number of elements in the sequence without walking through and counting them one by one

This adds up to an ugly conclusion: With just a sequence, it is impossible to efficiently discover the count, and doing so might destroy the sequence.

That's obviously not so great. In particular, it's not so great when you want to load a sequence into an array. The simple, naive way is to do this:

  // Notionally
  struct Array<Element> {
    init<Seq: SequenceType where Seq.Generator.Element == Element>(_ seq: Seq) {
      self.init()
      
      for elem in seq {
        appendElement(elem)
      }
    }
  }

But this may resize the array several times, which would be very slow. The fastest way to do that will be to pre-allocate the exact right number of elements so you don't have to keep resizing the array:

  // Notionally
  struct Array<Element> {
    init<Seq: SequenceType where Seq.Generator.Element == Element>(_ seq: Seq) {
      self.init()
      
      reserveCapacity(seq.count)
      for elem in seq {
        _appendElementQuicklyBecauseWeAlreadyKnowWeHaveTheCapacity(elem)
      }
    }
  }

But `seq.count` might consume the sequence, leaving you unable to add any elements. What do we do?

Enter `underestimateCount()`. We can always call `underestimateCount()` and size to whatever it says we should be. For sequences with easily-calculated counts, this should give us a size that's just right. For sequences where they can kind of estimate the right count (for instance, if you're decoding a fixed-size buffer of UTF-8 bytes, and you know how many bytes there are but not exactly how many characters they'll decode to), it will get us at least part of the way there. For sequences whose size is completely unknown, it won't help but it won't hurt much either.

So you do this:

  // Notionally
  struct Array<Element> {
    init<Seq: SequenceType where Seq.Generator.Element == Element>(_ seq: Seq) {
      self.init()
      
      let fastCount = seq.underestimateCount()
      reserveCapacity(fastCount)
      
      // We'll have to pull elements out manually.
      let generator = seq.generate()
      
      // Load all the fast elements
      for _ in 0..<fastCount {
        let elem = generator.next()!
        _appendElementQuicklyBecauseWeAlreadyKnowWeHaveTheCapacity(elem)
      }
      
      // Get anything else that's left
      while let elem = generator.next() {
        appendElement(elem)
      }
    }
  }

(If you're looking at Swift 3, remember: SequenceType is now Sequence, Generator is now Iterator, generate() is now makeIterator(), and underestimateCount() is now underestimatedCount().)

And in fact, this is quite similar to some real source code inside Array:

  @warn_unused_result
  internal func _copySequenceToNativeArrayBuffer<
    S : SequenceType
  >(source: S) -> _ContiguousArrayBuffer<S.Generator.Element> {
    let initialCapacity = source.underestimateCount()
    var builder =
      _UnsafePartiallyInitializedContiguousArrayBuffer<S.Generator.Element>(
        initialCapacity: initialCapacity)
  
    var generator = source.generate()
  
    // FIXME(performance): use _initializeTo().
  
    // Add elements up to the initial capacity without checking for regrowth.
    for _ in 0..<initialCapacity {
      builder.addWithExistingCapacity(generator.next()!)
    }
  
    // Add remaining elements, if any.
    while let element = generator.next() {
      builder.add(element)
    }
  
    return builder.finish()
  }

Now, collections *do* promise that you can walk through a sequence more than once, and some collections (ones that promise "random access") promise that `count` can be calculated quickly. But because CollectionType conforms to SequenceType, some code may receive a collection but only know it's a sequence. Fortunately, the standard library provides a default implementation of `underestimateCount` for collections:

    public func underestimateCount() -> Int {
      return numericCast(count)
    }

So you just don't have to worry about this on collections. Implement `count` and `understimateCount` will take care of itself.

HTH,
--
Brent Royal-Gordon
Architechies


(Brent Royal-Gordon) #9

1: that the standard API requires that
every Sequence type can be instantiated so as to replicate the contents of another? The whole value of Sequence to the examples is that it should be impossible.

No, it's the other way around. It should be possible to copy any (finite) Sequence into an Array, or any other type that declares it can be instantiated from a Sequence. The source of a Sequence could be anything, including data from external sources outside the program's control (like your /dev/random or wireless data examples).

2: that the standard API requires of every Sequence that it should serve the expectation of the supposedly-uncoupled API of Array that it is free to buffer the entire contents of any Sequence of the same Element? Need it or not? Bad news for memory and performance.

I believe there's a default implementation on Sequence as well (which would return 0), so it functions more as an opt-in mechanism to help Array and similar buffering mechanisms estimate the size of the buffer they should allocate. (Remember that greedy `map` and friends all use Array, so they all use this kind of preallocation logic, too.)

The answer may be that if the user knows enough to build a Sequence for which laziness is indispensable, she should have better sense than to rely on those two things, required or not.

Yes. Basically, the standard library doesn't model the difference between an infinite (or otherwise unwise-to-handle-non-lazily) sequence and a finite sequence. It expects you to know whether a particular sequence is going to be too large to handle greedily and refrain from doing so.

I consider this a weakness in the standard library and would prefer to see a more specific protocol hierarchy, perhaps along the lines of:

  /// Represents any series of elements which can be iterated over. Something which is
  /// merely Iterable may not be finite and, in any case, is probably not wise to use in
  /// greedy algorithms. It also may not be possible to iterate over more than once.
  protocol Iterable {
    associatedtype Iterator: InteratorProtocol
    func makeIterator() -> Iterator
    
    associatedtype Subset: Iterable where Subset.Iterator.Element == Iterator.Element
  }
  extension Iterable {
    // only APIs which are either lazy or operate on the start of the series
  }
  
  /// Represents any finite series of elements which can be iterated over. Something which
  /// is merely a Sequence may not be possible to iterate over more than once.
  protocol Sequence: Iterable {
    func underestimatedCount() -> Int
    associatedtype Subset: Sequence where Subset.Iterator.Element == Iterator.Element
  }
  extension Sequence {
    // adds greedy and tail-operating APIs, including `count`
  }
  
  /// Represents any finite, repeatable series of elements which can be iterated over and
  /// looked up by index.
  protocol Collection: Sequence {
    associatedtype Index: ForwardIndex
    associatedtype Iterator = IndexingIterator<Self>
    
    var startIndex: Index { get }
    var endIndex: Index { get }
    subscript(position: Index) -> Iterator.Element { get }
    
    associatedtype Subset: Collection where Subset.Iterator.Element == Iterator.Element = Slice<Self>
    subscript(bounds: Range<Index>) -> Subset { get }
  }

The `for` loop would take an `Iterable` (since you can exit early from a `for` loop), but most APIs that are currently constrained to `Sequence` would not change to `Iterable`, unless they happened to operate lazily or only on the start of the series. That would keep you from accidentally passing an infinite, or might-as-well-be-infinite, sequence to a function that expected to be able to read the whole thing.

I might write a swift-evolution proposal to this effect someday, but right now the Collection protocols are being reworked and I doubt the relevant people have the bandwidth to work on Sequence too.

···

--
Brent Royal-Gordon
Architechies


(Brent Royal-Gordon) #10

You gave cases for which `underestimatedCount()` is used:

For sequences with easily-calculated counts, this should give us a size that's just right. For sequences where they can kind of estimate the right count (for instance, if you're decoding a fixed-size buffer of UTF-8 bytes, and you know how many bytes there are but not exactly how many characters they'll decode to), it will get us at least part of the way there. For sequences whose size is completely unknown, it won't help but it won't hurt much either.

CC’ing swift-dev, I’m wondering:
Was there a reason `underestimatedCount()` was chosen for Sequence instead of `estimatedCount()`. I suppose an overestimate would be bad for iterating `0..<estimatedCount()`, but I couldn’t find any existing loops over `0..<underestimatedCount()`.

Does anything depend on an underestimate of Sequence count? And if not, would it be better to be less restrictive when asking a Sequence for a size estimate?
Since the exact count is returned for collections, the name also isn’t very precise.

Actually, the code samples I gave you *do* subtly depend on underestimatedCount being an underestimate, because the initial fast loop force-unwraps the element returned by `next`. If the count is an overestimate, that force unwrap will fail. You can see it in my example:

      // Load all the fast elements
      for _ in 0..<fastCount {
        let elem = generator.next()!
        _appendElementQuicklyBecauseWeAlreadyKnowWeHaveTheCapacity(elem)
      }

And in the actual standard library code:

    // Add elements up to the initial capacity without checking for regrowth.
    for _ in 0..<initialCapacity {
      builder.addWithExistingCapacity(generator.next()!)
    }

These and similar constructs could be turned into `if let`s, but that would add a conditional branch which would make the loop slower, particularly in -Ounchecked mode where safety checks are disabled. In that mode, the force unwrap operator simply *assumes* the value is not `nil`, which makes this code run as fast as possible—it pretty much can just copy or retain the element, assign it into the next slot in the array, and increment the array's count.

···

--
Brent Royal-Gordon
Architechies


(Will Stanton) #11

Hello Brent,

Thanks for pointing it out - obvious now!

Regards,
Will Stanton

···

On Mar 19, 2016, at 6:43 PM, Brent Royal-Gordon via swift-dev <swift-dev@swift.org> wrote:

Actually, the code samples I gave you *do* subtly depend on underestimatedCount being an underestimate, because the initial fast loop force-unwraps the element returned by `next`. If the count is an overestimate, that force unwrap will fail. You can see it in my example:

      // Load all the fast elements
      for _ in 0..<fastCount {
        let elem = generator.next()!
        _appendElementQuicklyBecauseWeAlreadyKnowWeHaveTheCapacity(elem)
      }

And in the actual standard library code:

    // Add elements up to the initial capacity without checking for regrowth.
    for _ in 0..<initialCapacity {
      builder.addWithExistingCapacity(generator.next()!)
    }

These and similar constructs could be turned into `if let`s, but that would add a conditional branch which would make the loop slower, particularly in -Ounchecked mode where safety checks are disabled. In that mode, the force unwrap operator simply *assumes* the value is not `nil`, which makes this code run as fast as possible—it pretty much can just copy or retain the element, assign it into the next slot in the array, and increment the array's count.