`String.count`: computed vs. stored property

(I wasn’t able to find a previous discussion about this and maybe that is because I have some misunderstanding. Apologies in advance.)

If I understand correctly, String.count is a computed property with O(n) time complexity. Why not make it a stored property that is updated whenever the string is mutated?

Most strings are either immutable, or changed without concern for the count. Paying the cost of calculating the count when it won't be accessed is worse than an O(n) algorithm when it's explicitly requested. Additionally, the implementation can cache the value and simply invalidate it when mutations occur.

Paying the cost of calculating the count when it won't be accessed is worse than an O(n) algorithm when it's explicitly requested.

What is the cost aside from incrementing/decrementing the count when a character is added/removed?

Let’s consider other standard library types. I believe Set and Dictionary store their count rather than iterating through the whole hash table on each access. What is different about String?

There is no requirement that characters be added one at a time. Consider the trivial case of let s = a + b. The implementation would have to know how long a and b are, even though no one may care about their lengths, or about the length of s. In addition, s could be shorter than the combined lengths of a and b, due to normalization of combining characters. String could make no assumptions (except in the ASCII case).
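A minimal sketch of the let s = a + b case described above (my own example, not from the post): a combining accent forms its own Character when standing alone, but joins the previous one after concatenation, so the count of the result can be smaller than the sum of the counts.

```swift
let a = "e"
let b = "\u{301}" // U+0301 COMBINING ACUTE ACCENT, a degenerate cluster on its own
let s = a + b     // "é"
print(a.count + b.count) // 2
print(s.count)           // 1 — the accent joined the preceding cluster
```

So even appending a one-Character string is not guaranteed to increment the count by one.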

And String is free to cache the value, as I mentioned above.


That's because in the case of Set/Dictionary, you don't have really complex interactions between adjacent elements, as you do with unicode code points. One element is one element, regardless of its neighbours before/after a mutation. This is not the case with String.
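To illustrate that neighbour-dependence concretely (again my own example): two regional-indicator scalars are each a single Character on their own, yet concatenating them produces a single flag Character, because the pair forms one grapheme cluster.

```swift
let u = "\u{1F1FA}" // REGIONAL INDICATOR SYMBOL LETTER U
let s = "\u{1F1F8}" // REGIONAL INDICATOR SYMBOL LETTER S
print(u.count, s.count)      // 1 1
let flag = u + s             // "🇺🇸"
print(flag.count)            // 1 — the pair merges into one grapheme cluster
print(flag.unicodeScalars.count) // 2 — but still two scalars
```

Nothing comparable can happen to the elements of a Set or Dictionary.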

Related: I have always wondered why some properties (like .count) come with this tribal knowledge (sometimes documented) of being cached, instead of being declared as “private(set) lazy var”.

Your general point is basically right.

However, technically String does not do what Unicode refers to as normalization (converting between "é" and "é", i.e. between "\u{E9}" and "\u{65}\u{301}"). Canonical normalization, if it were done, would not cause s.count to change, but rather it would change s.unicodeScalars.count. Both “é” strings have exactly one Character, but different numbers of Unicode.Scalars.
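That distinction can be checked directly (my example): Swift compares Strings by canonical equivalence, so the precomposed and decomposed forms are equal and each has one Character, yet their scalar views differ in length.

```swift
let precomposed = "\u{E9}"       // "é" as a single scalar (U+00E9)
let decomposed = "\u{65}\u{301}" // "e" followed by U+0301 COMBINING ACUTE ACCENT
print(precomposed == decomposed)           // true — String == uses canonical equivalence
print(precomposed.count, decomposed.count) // 1 1 — one Character each
print(precomposed.unicodeScalars.count,
      decomposed.unicodeScalars.count)     // 1 2 — different scalar counts
```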

The phenomenon that affects let s = a + b is grapheme cluster breaking, which leaves the scalars and s.unicodeScalars.count alone but affects s.count. String counts groups of scalars called extended grapheme clusters by Unicode and Character by Swift. In let s = "e" + "́", each starting string had one cluster or “Character”, but since the accent is supposed to be considered as clustered with the previous scalar, it merely inserts itself into the last cluster instead of appending a new cluster. Thus 1 cluster (of 1 scalar) + 1 cluster (of 1 scalar) became 1 cluster (of 2 scalars), which in terms of String.count appears as “1 + 1 → 1”.
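The “1 + 1 → 1” behaviour above can be observed at the Character level too (a sketch, using Character’s unicodeScalars view): the result has one Character, and that single Character contains both scalars.

```swift
let combined = "e" + "\u{301}"       // one cluster of two scalars
print(combined.count)                // 1
print(combined.unicodeScalars.count) // 2
if let first = combined.first {
    print(first.unicodeScalars.count) // 2 — both scalars live in one Character
}
```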

Because of this difference:

struct LazyWrapper {
    init() {}
    var array: [Int] = []
    private(set) lazy var count = array.count
}
var lazy = LazyWrapper()
print("Lazy start: \(lazy.count)") // 0
lazy.array += Array(repeating: 1, count: 100)
print("Lazy modified: \(lazy.count)") // 0 ✗

struct CachingWrapper {
    init() {}
    var array: [Int] = [] {
        didSet {
            cache = Cache() // This line changes everything.
        }
    }
    private class Cache {
        var count: Int?
    }
    private var cache = Cache()
    var count: Int {
        if let cached = cache.count {
            return cached
        } else {
            let result = array.count
            cache.count = result
            return result
        }
    }
}
var caching = CachingWrapper()
print("Caching start: \(caching.count)") // 0
caching.array += Array(repeating: 1, count: 100)
print("Caching modified: \(caching.count)") // 100 ✓

Never compute a lazy var from something that is mutable. It will only be computed once and after that it may fall out of date.


The requirement on count (from Collection) is that it's O(n) or better. Some particular Collections document that they can do it faster than O(n), usually because it's precomputed. If it's not documented that it's cached, you shouldn't rely on it.
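To make the contract concrete (a hypothetical minimal type, not from the standard library): a type that supplies only the required Collection members inherits the default count, which steps through the indices one at a time, i.e. O(n). Only conformances like RandomAccessCollection, or an explicit stored count, make it faster.

```swift
// Hypothetical Collection providing only the required members;
// `count` falls back to the default O(n) index walk.
struct Halved: Collection {
    let values: [Double]
    var startIndex: Int { values.startIndex }
    var endIndex: Int { values.endIndex }
    func index(after i: Int) -> Int { values.index(after: i) }
    subscript(i: Int) -> Double { values[i] / 2 }
}

let h = Halved(values: [2, 4, 6])
print(h.count) // 3 — correct, but computed by walking the indices
```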


I see, so it is just tribal knowledge then. Thank you for taking time to explain this to me!

I'm not sure what you mean. Documentation is not tribal knowledge, and String's count being cached is not guaranteed if it's not in the documentation.

Sorry for not being more clear. When I said “tribal knowledge”, what I meant was that I saw this mentioned again and again on Stack Overflow, Twitter, etc. So I asked and now I understand why “count” is not declared as “lazy” — it is not part of the contract. Thank you again for taking time to explain this to me. All is clear now!
