First 2 values using higher order functions (preferably)

Hi,

Overview

  • I have an array of values, I would like to get the first 2 true values.
  • I prefer not to iterate through the entire array if the 2 true values are obtained.
  • I know how to do it using a for loop with break. Just wondering if there is a better way to do it.

Questions:

  1. Is it possible to use map / filter or some other higher order functions?
  2. Or this is an overkill and it is better to use for loop with break?
  3. Or any other better solution?

Code (my attempt but iterates through entire array)

let values = [true, false, false, true, true, false, true]

func isValid(_ value: Bool) -> Bool {
    print("isValid called")
    return value
}


let firstTwoTrueValues = values.filter { isValid($0) }.prefix(2)

//I would like to get the first 2 true values and stop iterating

print(firstTwoTrueValues)

Output

isValid called
isValid called
isValid called
isValid called
isValid called
isValid called
isValid called
[true, true]

One might think that

let firstTwoTrueValues = values.lazy.filter(isValid).prefix(2)
print(firstTwoTrueValues)

does the trick, but it doesn't, it iterates over the entire array as well. This has been investigated on Stack Overflow

where also a solution is presented:

The foolproof way of preventing all indexing operations on a given lazy filter collection is to simply use a lazy filter sequence instead – this can be done by just wrapping the collection you wish to perform lazy operations on in an AnySequence

In your case that would be

let firstTwoTrueValues = AnySequence(values).lazy.filter(isValid).prefix(2)
print(Array(firstTwoTrueValues))
3 Likes

@anon9791410 Thanks a lot

I tried it but still isValid called gets called 6 times, does that mean that 6 iterations are done though true is available at index 0 and index 4.

Or am I mistaken?

@Martin thanks a lot

That works perfectly.

So every time we use lazy should we be using it with AnySequence or is this swift bug that lazy doesn't work as expected in normal sequences?

As I understand it, this is a particular problem of filtering a lazy collection. Perhaps can others give a more authoritative answer.

1 Like

strange indeed:

let x: [Bool] = Array(values.lazy.filter { isValid($0) }.prefix(2))
// isValid called 10 times
print(x) // [true, true]

let y: [Bool] = Array(values.lazy.map { isValid($0) }.prefix(2))
// isValid called 2 times
print(y) // [true, false]
2 Likes

There's another way to do this in the latest version of Swift without using AnySequence.

let opaqueValues: some Sequence<Bool> = values
let firstTwoTrueValues = opaqueValues.lazy.filter(isValid).prefix(2)
print(Array(firstTwoTrueValues))

For very large collections, this may be slightly faster.

5 Likes

@JuneBash thanks a lot!!

Thank you so much for this solution.

I was surprised by the need for this but this is the way I would do it. :+1:
values as any Sequence<Bool> will not work with lazy. as some is needed to eliminate the need for two lines, but nobody has ever mentioned plans for it being added to the language.

Swift doesn't have the necessary mechanisms for generalizing as some to a function, so for every protocol+associatedtype combination you need to use it with, you need to write a new property. :woozy_face:

public extension Sequence {
  /// A workaround for the language not supporting `as some Sequence<Element>`.
  var asSomeElementSequence: some Sequence<Element> { self }
}

values.asSomeElementSequence.lazy.filter(isValid).prefix(2)

There is no fundamental Swift bug here. Things are largely working as expected, with one minor exception.

Each step of this method chain is eagerly executed to completion, before the next one starts.

This statement is therefore expected to start by invoking isValid precisely seven times in filter, to produce a filtered version of the entire input array.

The prefix operation is then run on the resulting new array, visiting the first three indices. (At offsets 0, 1, and 2 from the start, with 2 being the end index of the resulting slice.)

This is working exactly as designed!

(Eager evaluation is pretty useful, too -- if you have a large amount of data, and you can spare the memory to hold all those intermediate temporary arrays, such methods chains can have better throughput than any of the lazy alternatives. And at the end of it, you get all results nicely packaged in a direct Array value -- which can be repeatedly accessed with no further computing.)

This is not true.

The addition of lazy switches the meaning of this method chain expression: instead of each step executing to completion on after the other, this is creating a chain of on-the-fly collection transformations, each doing as little work as possible to respond to each request.

The prefix method stops iterating the contents of its input collection as soon as it finds the full extent of the slice that it needs to return:

  public __consuming func prefix(_ maxLength: Int) -> SubSequence {
    _precondition(
      maxLength >= 0,
      "Can't take a prefix of negative length from a collection")
    let end = index(startIndex,
      offsetBy: maxLength, limitedBy: endIndex) ?? endIndex
    return self[startIndex..<end]
  }

This does not iterate over the entire input collection, unless the prefix reaches the end of it.

In the specific case of the original code sample, the bounds for the returned collection slice are expected to be at offsets 0 and 2 in the filtered collection. These positions correspond to offsets 0 and 4 in the original array. (I.e, the first and the third true value.)

To find these bounds, this method does the following steps:

  1. Invoke isValid once to find the position of startIndex (at offset 0 in the original array).
  2. Invoke isValid four more times to find the index at offset 2 from this start index. (It's at offset 4 in the original array.)
  3. Invoke isValid once more to find startIndex again.

The code needs to visit array items at offsets between 0 and 4 (inclusive), and that is precisely what it is doing. Altogether, it invokes isValid six times while doing this. This is only one invocation more than the minimum number necessary to find the bounds of the returned slice, as item 0 is visited twice.

Items 5 and 6 are never visited.

A real (but very minor) performance issue is that prefix should only compute startIndex once. (It is probably possible to fix that.) Other than that, the current implementation is doing the minimum possible amount of work to get the desired result.

(Somewhat regrettably, FilteredCollection does not precompute its startIndex during its creation. That said, some people (mistakenly) disagree with the idea of doing nontrivial work while creating a lazy transformation, and in any case, this is not easily changed at this point, as that type is now frozen.)

Using the sequence-based, lazy prefix transformation does indeed eliminate the need to find an endIndex -- it can stop at the last element that the prefix needs. (This is in addition to avoiding calculating startIndex twice.)
In the case of this code sample, this variant visits the four array items between offsets 0 and 3 -- it does not need to visit item 4 and it doesn't visit item 0 twice.

Avoiding looking at a single additional index (or two) is rarely, if ever, going to be worth worrying about.

My recommendations:

  • If you have a seven-element array of booleans that you need to process once, then it simply does not matter how inefficiently your code is filtering it. Scan the whole array four times. Do a clever quadratic algorithm. Copy items into a temporary Set, then sort the results. Encode the array into JSON and spawn a subprocess to process it. Better yet, send it over a satellite connection to a remote machine. Figure out a way to use Morse code or Unicode grapheme breaking! Go wild!

  • If you have more than seven items, and you think your code is running too slow, remember to profile it before making changes (or blaming it on your dependencies). The code may indeed be slow, but for me, the bottleneck tends not to be where I expect it. (If you do end up making changes, remember to also profile the result, to see if there was an actual improvement.)

  • It is usually a very bad idea to try solving hypothetical performance issues by reaching for AnySequence. Doing that has a distinct tendency to turn imagined performance problems into real ones.

  • The lazy collection transformations aren't general drop-in replacements to the eager ones -- they produce different behavior. Passing closures with side effects to these lazy transformations (such as the print statements in this code sample) will cause major headaches and confusion, and, eventually, bugs. Do not do that.

  • In general terms, it can be a good idea to reach for the least capable tool that still fits the use case. So if you only need the results to iterate over them once, then it can indeed be technically better to generate them as a lazy sequence rather than a lazy collection: the latter is capable of far more interesting things, and subsequently it can sometimes be slightly slower. However, this does not mean that it is good practice to force every lazy expression to be limited to Sequence -- that might be a good trick to know about in the exceptionally rare cases when the difference actually matters. (Use a profiler.)

11 Likes

Excellent analysis from @lorentey as always.

Here's my totally unscientific take. Without changing the original example substantially I've changed Bools to Ints (with odd ints for true and even ints for false to match the original example) – that way it is easier to see what "isValid" is being called for.

let values = [1, 2, 4, 5, 7, 8, 9]

func isValid(_ value: Int) -> Bool {
    let result = (value & 1) != 0
    print("isValid called \(value) -> \(result)")
    return result
}

extension Sequence {
    func sequentialPrefix(_ n: Int) -> [Element] {
        var it = makeIterator()
        var elements: [Element] = []
        for _ in 0 ..< n {
            guard let element = it.next() else { break }
            elements.append(element)
        }
        return elements
    }
}

let firstTwoTrueValues = values.lazy.filter { isValid($0) }.sequentialPrefix(2)
print(firstTwoTrueValues)

Outputs:

isValid called 1 -> true
isValid called 2 -> false
isValid called 4 -> false
isValid called 5 -> true
[1, 5]

i.e. it does the minimum number of isValid calls to fill the resulting array of two.

(Instead of returning array I should probably return something else that would allow the subsequent chained lazy operation – never implemented a proper lazy type myself so there'll be a learning curve).

1 Like

@lorentey Thank you so much for that awesome explanation.

I really liked the technical explanation and the recommendations.

Increasing the number of elements in the original array values to 12 proved what you had explained, that lazy doesn't iterate through the entire array (it stopped with 6 again).

My real case:

  • I would have around 200 values and isValid would actually be a nil check.
  • It really wouldn't matter much and just would add more complexity.

I really found the following interesting:

Just curious (last line), does that mean eager evaluation (non-lazy) caches the results so that no further computation is required if repeatedly accessed?