That's true. You can indeed get this loop to run in quadratic time, and the growth can be quick enough to cause actual problems. I was being grumpy -- apologies!
What led to my dismissive grumpiness is the suggestion that the stdlib is somehow not doing the right thing here. By "nothing to mitigate", I mean that I do not think this is a problem within the standard library -- rather, it is "merely" an issue with how the sample code uses the library's APIs. The only reasonable mitigation is to fix the loop not to repeatedly evaluate an expression whose result never changes:
```swift
let first = string[string.startIndex]
for character in string {
    if character == first { print("same!") }
}
```
There is no good way to mitigate issues of this sort on the library side*.
* E.g., here are some bad ideas for library-level solutions that would make things worse:
- Precalculating `Character` boundaries only speeds things up by a constant factor -- extracting a `Character` is still going to cost time proportional to its length. Nevertheless, this is a valid way to speed up `index(offsetBy:)` and `distance(from:to:)` calculations, enabling speedy navigation within the string (see the sketch after this list). The `BigString` type within `AttributedString`'s current implementation pushes this idea further, also allowing more scalable editing operations.
- Replacing the `String` with an `Array<Character>` would "fix" this particular loop, but it would have extraordinary memory overhead, and (worse!) it would render the lower-level string views impractically slow. Those views are just as semantically important as the character view, and they need to be kept as fast as possible. (`String` is expected to be able to quickly produce a contiguous buffer of its code units in its native encoding (UTF-8 or -16).)
- Truncating extracted grapheme clusters to some arbitrary maximum scalar count would mean that Swift would no longer implement the Unicode standard. E.g., Internet culture has assigned meaning to ridiculously long sequences of composing accents: Zalgo text is being used in actual human communication. Processing such text in a lossy way would potentially create more problems than it solved.
- Caching a handful of previously vended `Character`s in some concurrent data structure within the `String` storage, on the off chance that they will be accessed repeatedly.
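To make the first bullet concrete, here is a minimal sketch of what precalculating `Character` boundaries could look like. The `characterBoundaries(of:)` helper and the sample string are invented for illustration, not stdlib API; the point is only that offset-based navigation becomes an array lookup, while extracting each `Character` still costs time proportional to its length:

```swift
// Hypothetical helper (not a stdlib API): walk the string once and
// remember every Character boundary, so that offset-based navigation
// later becomes a constant-time array lookup.
func characterBoundaries(of string: String) -> [String.Index] {
    var boundaries = Array(string.indices)
    boundaries.append(string.endIndex)
    return boundaries
}

let text = "Hello, 👨‍👩‍👧‍👦!"
let boundaries = characterBoundaries(of: text)

// index(offsetBy:) walks the string from a starting index; with the
// precalculated table, jumping to an offset is a single array subscript.
// Extracting the Character itself still takes time proportional to its length.
let seventh = text[boundaries[7]]   // the family emoji
print(seventh)
```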
It seems relatively obvious to me that an expression with a constant result that gets repeatedly evaluated in a loop can cause a performance defect. I don't think such mistakes are particularly subtle, or hidden.
`Collection`'s performance "guarantees" are (at best) vague suggestions, not requirements; loops like this should therefore induce distress in performance-oriented readers, whether or not `items` is a known container type:
```swift
for item in items {
    if item == items[items.startIndex] { ... }
}
```
(Consider that collection types such as `LazyFilterCollection` have similar "hidden" complexity gotchas; in that case, the non-constant operation is accessing `startIndex`: the filter needs to invoke the filter predicate an arbitrary number of times to find the first matching item.)
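For illustration, here is a hedged sketch of that gotcha; the data and predicate are invented, but the shape of the problem is the same as in the `String` loop above:

```swift
let numbers = Array(1 ... 10_000)

// The filter is lazy: no element is examined until it is needed.
let large = numbers.lazy.filter { $0 > 9_000 }

// Evaluating `large.startIndex` is not a stored-property access -- it has
// to run the predicate from the front of `numbers` until the first match.
// Keeping that expression inside the loop repeats the scan on every pass.
for value in large {
    if value == large[large.startIndex] { print("same!") }
}

// Hoisting the constant expression performs the scan only once.
let first = large[large.startIndex]
for value in large {
    if value == first { print("same!") }
}
```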
I don't think we can ever fully avoid these issues by tweaking the library's abstractions. All the library can do is to set the right expectations: for example, by properly explaining the cost of extracting `Character`s out of a `String`. It's true, the stdlib is not doing a good enough job there:
- The documentation could explain the nature of `String` and `Character` better. `String` is not at all like `Array`: it isn't a container of individual `Character` values. Forcing a `String` to materialize one of its elements is not a trivial operation; it is better to remember the result of the subscript than to unnecessarily repeat this extraction process.
- Perhaps `String`'s indexing subscript should also tell more about its actual complexity. The tricky part is that there is a huge amount of potential complexity hidden behind the subscript, and it's difficult to figure out precisely what to say without misleading people. This is the same sort of problem as trying to usefully document the expected complexity of `Set.insert`: it feels like it should be possible, but all attempts so far have failed, because they invariably proved oversimplified, overcomplicated, or both at the same time. (Granted, `String`'s subscript would likely not be anywhere near as tricky a challenge.)
An example of a deeper problem that would need code-level mitigation in the library would be if we found such a carelessly written loop in one of the standard generic algorithms. (Oops! Of course we do have examples of that. It would be great to fix these, and it is probably okay to do so, but these are core algorithms, and as always, Hyrum's Law makes even small changes risky.)
An even worse problem would be if someone found a core operation that is universally expected to run in linear time, but actually turns quadratic -- such as used to be the case with loops that collected results in a `Dictionary`, before we enabled per-instance hash seeding. Finding that issue was an unpleasant surprise even to the people who implemented `Dictionary`.
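For reference, a hedged sketch of the kind of loop meant here -- accumulating results into a `Dictionary` -- with invented data; this is the everyday pattern that everyone expects to take roughly linear time:

```swift
// Invented data; the point is the shape of the loop, which is universally
// expected to take time roughly proportional to the number of elements.
let words = ["the", "quick", "brown", "fox", "the", "fox"]

var counts: [String: Int] = [:]
for word in words {
    counts[word, default: 0] += 1
}
print(counts)   // e.g. ["the": 2, "fox": 2, "quick": 1, "brown": 1]
```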
A particularly bad case of this latter issue would be if someone were able to construct a stdlib collection value that fits in n units of memory but a simple loop that iterated over it required n² time to run. Neither `Collection` nor `Sequence` has any guarantees on the speed of iteration, but it is universally expected that directly iterating over an in-memory container type will take time that is roughly proportional to its memory footprint. (The horrible news, of course, is that we have at least one known example of such a case in the current stdlib. Finding it is left as an exercise for the reader.)