Is String's subscript O(n) when given a lot of combining code points?

FWIW, I'm having a hard time reproducing this locally:

$ swift repl <<EOF
import Foundation

let problematic = "a" + String(repeating: "\u{0301}", count: 20_000_000) + "x"

let start = Date()
print("Hello, World!".contains(problematic))
let end = Date()

print(end.timeIntervalSince(start))
EOF
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'lldb'
problematic: String = "á́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́́\u{cc}"...
start: Foundation.Date = 2023-10-17 05:21:24 GMT
false
end: Foundation.Date = 2023-10-17 05:21:24 GMT
0.12429308891296387

That last line -- the 0.124... -- is the number of seconds the contains check took, plus the additional overhead of piping this through the REPL instead of running a precompiled binary.

1 Like

Wait, did adding import Foundation change it from the Swift.String method to the NSString method? I can’t imagine so, given that I’m on Linux, but it’s worth checking, I guess.

Edit: Nope, changing it to Swift.String.contains("Hello, World!")(problematic) did not significantly affect the timing.

1 Like

That's true. You can indeed get this loop to run in quadratic time, and the growth can be quick enough to cause actual problems. I was being grumpy -- apologies!

What led to my dismissive grumpiness is the suggestion that the stdlib is somehow not doing the right thing here. By "nothing to mitigate", I mean that I do not think this is a problem within the standard library -- rather, it is "merely" an issue with how the sample code uses the library's APIs. The only reasonable mitigation is to fix the loop so that it does not repeatedly evaluate an expression whose result never changes:

let first = string[string.startIndex] // extract the Character once, outside the loop
for character in string {
   if character == first { print("same!") }
}
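
To make the difference concrete, here is a rough benchmark sketch (a hedged illustration, not part of the original post; the counts are arbitrary and the timings machine-dependent). The pathological input makes the first Character a single enormous grapheme cluster followed by many one-scalar Characters, so evaluating string[string.startIndex] inside the loop pays the full cluster-extraction cost on every iteration:

```swift
import Foundation

// One enormous grapheme cluster ("a" + 10,000 combining acute accents),
// followed by 10,000 single-scalar Characters.
let string = "a" + String(repeating: "\u{0301}", count: 10_000) +
    String(repeating: "x", count: 10_000)

// Quadratic: re-extracts the huge first Character on every iteration.
var start = Date()
for character in string where character == string[string.startIndex] {
    // ...
}
print("naive:  ", Date().timeIntervalSince(start))

// Linear: extract the Character once, outside the loop.
start = Date()
let first = string[string.startIndex]
for character in string where character == first {
    // ...
}
print("hoisted:", Date().timeIntervalSince(start))
```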

There is no good way to mitigate issues of this sort on the library side*.

* E.g., here are some bad ideas for library-level solutions that would make things worse:

  • Precalculating Character boundaries only speeds things up by a constant factor -- extracting a Character still costs time proportional to its length. Nevertheless, this is a valid way to speed up index(offsetBy:) and distance(from:to:) calculations, enabling speedy navigation within the string. The BigString type within AttributedString's current implementation pushes this idea further, also allowing more scalable editing operations.

  • Replacing the String with an Array<Character> would "fix" this particular loop, but it would have extraordinary memory overhead, and (worse!) it would render the lower-level string views impractically slow. Those views are just as semantically important as the character view, and they need to be kept as fast as possible. (String is expected to be able to quickly produce a contiguous buffer of its code units in its native encoding (UTF-8 or -16).)

  • Truncating extracted grapheme clusters to some arbitrary maximum scalar count would mean that Swift would no longer implement the Unicode standard. E.g., Internet culture has assigned meaning to ridiculously long sequences of composing accents: Zalgo text is being used in actual human communication. Processing such text in a lossy way would potentially create more problems than it solved. :thinking:

  • Caching a handful of previously vended Characters in some concurrent data structure within the String storage, on the off chance that they will be accessed repeatedly.

It seems relatively obvious to me that an expression with a constant result that gets repeatedly evaluated in a loop can cause a performance defect. I don't think such mistakes are particularly subtle, or hidden.

Collection's performance "guarantees" are (at best) vague suggestions, not requirements; loops like this should therefore induce distress in performance-oriented readers, whether or not items is a known container type:

for item in items {
   if item == items[items.startIndex] { ... }
}

(Consider that collection types such as LazyFilterCollection have similar "hidden" complexity gotchas; in that case, the non-constant operation is accessing startIndex: the filter needs to invoke the filter predicate an arbitrary number of times to find the first matching item.)
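
A minimal sketch of that LazyFilterCollection gotcha (not from the original post; the numbers are arbitrary):

```swift
let numbers = Array(1...10_000)
// The first match is halfway in, so computing startIndex runs the
// predicate about 5,000 times.
let filtered = numbers.lazy.filter { $0 >= 5_000 }

// Hidden quadratic: each of the ~5,000 iterations recomputes startIndex,
// which re-runs the predicate from the beginning of the base collection.
for value in filtered where value == filtered[filtered.startIndex] {
    // ...
}

// Hoisting restores linear behavior: the scan happens once.
if let first = filtered.first {
    for value in filtered where value == first {
        // ...
    }
}
```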

I don't think we can ever fully avoid these issues by tweaking the library's abstractions. All the library can do is to set the right expectations: for example, by properly explaining the cost of extracting Characters out of a String. It's true, the stdlib is not doing a good enough job there:

  1. The documentation could explain the nature of String and Character better. String is not at all like Array: it isn't a container of individual Character values. Forcing a String to materialize one of its elements is not a trivial operation; it is preferable to remember the result of the subscript than to unnecessarily repeat this extraction process.
  2. Perhaps String's indexing subscript should also tell more about its actual complexity. The tricky part is that there is a huge amount of potential complexity hidden behind the subscript, and it's difficult to figure out precisely what to say without misleading people. This is the same sort of problem as trying to usefully document the expected complexity of Set.insert: it feels like it should be possible, but all attempts to do so far have failed, because they invariably proved oversimplified, overcomplicated, or both at the same time. (Granted, String's subscript would not be likely to be anywhere as tricky a challenge.)

An example of a deeper problem that would need code-level mitigation in the library would be if we found such a carelessly written loop in one of the standard generic algorithms. (Oops! Of course we do have examples of that. It would be great to fix these, and it is probably okay to do so, but these are core algorithms, and as always, Hyrum's Law makes even small changes risky.)

An even worse problem would be if someone found a core operation that is universally expected to run in linear time, but actually turns quadratic -- such as used to be the case with loops that collected results in a Dictionary, before we enabled per-instance hash seeding. Finding that issue was an unpleasant surprise even to the people who implemented Dictionary.

A particularly bad case of this latter issue would be if someone was able to construct a stdlib collection value that fits in n units of memory but a simple loop that iterated over it required superlinear time to run. Neither Collection nor Sequence has any guarantees on the speed of iteration, but it is universally expected that directly iterating over an in-memory container type will take time that is roughly proportional to its memory footprint. (The horrible news, of course, is that we have at least one known example of such a case in the current stdlib. Finding it is left as an exercise for the reader.)

8 Likes

Beware, in this form it still runs in O(N*M) time (where N is the string length and M the length of the long character), albeit with a slightly better constant factor up front (1.3 seconds vs 2.6 seconds).

Scratching this, see below.

1 Like

The simple fix in this case would be:

1 Like

Intriguing. I get 21.85147202014923 by pasting the same thing in the terminal and running it. And 21.845985054969788 by compiling it from a file and running it with this command:

swiftc -O string-contains.swift && ./string-contains

This is with macOS Sonoma on an iMac M1, Swift 5.9. What are you using?

1 Like

This is certainly a valid position to take. And indeed that makes it O(n+c), same as O(n) since c is a fraction of n.

Unfortunately the same can't be done with all algorithms. String.contains, for instance, has to iterate over the search string repeatedly, and so it suffers from this same problem—at least in my tests. Fixing this one would probably mean moving away from Character-based iteration in its implementation.
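
As a hedged illustration of what "moving away from Character-based iteration" could look like (bytewiseContains is a hypothetical helper, not a stdlib API): if both strings are known to be in the same normalization form, a naive byte search over the UTF-8 views sidesteps repeated grapheme-cluster extraction entirely. This is a sketch, not a drop-in replacement for String.contains:

```swift
// Hedged sketch: compares raw UTF-8 bytes, so it assumes both strings
// use the same normalization form, and it can match in the middle of a
// grapheme cluster -- neither of which String.contains would allow.
func bytewiseContains(_ haystack: String, _ needle: String) -> Bool {
    guard !needle.isEmpty else { return true }
    let h = Array(haystack.utf8)
    let n = Array(needle.utf8)
    guard h.count >= n.count else { return false }
    for start in 0...(h.count - n.count) {
        // elementsEqual stops at the first mismatching byte.
        if h[start ..< start + n.count].elementsEqual(n) { return true }
    }
    return false
}
```

Worst case this is still O(n·m), but each step is a byte comparison rather than a grapheme-cluster extraction; a production version would also use a real search algorithm (e.g. two-way or Boyer–Moore) rather than this naive scan.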

Right now the complexity of that subscript in String is documented as O(1). That's a little beyond merely not being clear.

3 Likes

I’m using Swift 5.9 on x86-64 Ubuntu 22.04, in WSL. It’s possible that this is an Intel vs ARM thing, or a Linux vs Mac thing, or both. I have a good CPU but it can’t be that much better.

1 Like

FWIW, even this loop alone:

    for character in string {
    }

takes 1.3 seconds on my M1 Pro, with -O

1 Like

Some more data points, still with this test:

macOS Big Sur, iMac Late 2014, 3.5 GHz Intel Core i5, Swift 5.4
REPL: 0.28845298290252686
Non-optimized build: 0.10442197322845459
Optimized build: 0.10431301593780518

Sounds like there's a big regression in Sonoma on M1 compared to this. Also:

macOS High Sierra, Mac Mini Mid 2011, 2 GHz Intel Core i7, Swift 4.0.3
REPL: 0.278517007827759
Non-optimized build: 0.10437905788421631
Optimized build: 0.000819087028503418

Ok, this last result is pretty unbelievable. 0.0008 sounds like the optimizer constant-folded the whole thing, so I'll ignore the optimized version (but how nice that it could do that!). As for the rest, I suppose the Unicode tables might have grown a bit too between Big Sur and Sonoma, but not that much.

2 Likes

I’m a little surprised the entire loop didn’t get yeeted by the optimizer, but I suppose that’s neither here nor there.

1 Like

So far it looks like either a Big Sur vs Sonoma thing or an Intel vs ARM thing. We haven’t isolated those variables yet, have we?

If it is the latter — arch-dependent instead of a regression in macOS 14 — I’m impressed that it’s that much worse on ARM, and also worried because every modern iPhone is ARM.

1 Like

Could also be an "import Foundation vs not" thing.

FWIW, the import Foundation version of the loop alone (with nothing inside) takes ~2 seconds on x86, -O (tested on godbolt).

1 Like

For the same code I piped to the REPL upthread, I’m getting 1.2–1.6 seconds on SwiftFiddle depending on the run. Slower, but still nowhere near the 21 seconds @michelf saw.

1 Like
let problematic = "a" + String(repeating: "\u{0301}", count: 20_000_000) + "x"
print("Hello, World!".contains(problematic))

M2 MacBook Air, Sonoma, Xcode 15 (Swift 5.9):
Seven seconds with native Swift strings, twenty seconds with Foundation imported.

iMac Pro (10-core), Sonoma, Xcode 15 (Swift 5.9):
Fifteen seconds with native Swift strings, twenty-six seconds with Foundation imported.

It of course makes no difference what optimisation settings are used since the code is trivial in itself and all the time is spent in library code.

Importing Foundation augments String in ways that typically (but not always) make it much slower. You cannot get around this using Swift.String…: the augmentation is internal to the Swift stdlib, not merely a matter of which API you're actually calling. It occurs even if you don't actually use anything from Foundation (the augmentation appears to happen at library load time).

I'm not entirely surprised to hear that it's much faster on other platforms, especially when Foundation is involved, as there's legacy code on Apple's platforms. I don't see any obvious performance inversion between x86 and ARM myself, though, so I'm not sure how to explain the results others have reported.

5 Likes

Very interesting that importing Foundation more than doubles the runtime… but still doesn’t explain the discrepancy. If anyone in this thread has an Intel Mac running Sonoma or an M1/M2 Mac running Big Sur (I don’t have either), that would really help to isolate the issue.

1 Like

Looking at the time profiles, this looks like potentially another case of Swift Stdlib String not having a proper specialisation for one of its methods, in this case contains.

It's actually relying on the generic StringProtocol.contains(_:) implementation, which is essentially just doing BidirectionalCollection<>.firstRange<A>(of:). And that method is really inefficient - it does a massive amount of Array allocations, for a start (why?!). It also spends a lot of time in ZSearcher.search(_:in:), which has a more varied call subtree but clearly is doing a bunch of retain-release traffic, yet more array allocations (mainly under Unicode._InternalNFD.Iterator.next()), and a bunch of Unicodey stuff.

Also worth noting that actually creating the string takes about 250ms. That part doesn't contain anything super obviously egregious, although I do see some memmoves in there and quite a lot of buffer allocation, which is suspicious since the size of the resulting string's buffer is deterministic, isn't it? (the encoded size of the repeated string times count, plus the two end characters)
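
If repeated buffer growth really is where that 250ms goes, one hedged mitigation (my sketch, not something the profiler confirmed) is to reserve the final byte count up front with String.reserveCapacity; the arithmetic assumes the string stays in UTF-8, where U+0301 encodes as two bytes:

```swift
// Hypothetical alternative construction that reserves the final
// capacity up front instead of letting the buffer grow repeatedly.
var problematic = ""
problematic.reserveCapacity(1 + 2 * 20_000_000 + 1) // "a" + accents + "x"
problematic += "a"
for _ in 0 ..< 20_000_000 {
    problematic += "\u{0301}"
}
problematic += "x"
```

Whether this actually beats String(repeating:) plus concatenation depends on how much of the time is reallocation versus copying; profiling would tell.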

5 Likes

On MacBook Pro 13-inch, 2018, 2,3 GHz Intel Core i5.
Sonoma 14.0.
swift-driver version: 1.87.1 Apple Swift version 5.9 (swiftlang-5.9.0.128.108 clang-1500.0.40.1)
Target: x86_64-apple-macosx14.0

Code:

let problematic = "a" + String(repeating: "\u{0301}", count: 20_000_000) + "x"
print("Hello, World!".contains(problematic))

swiftc -O main.swift -o main
time ./main
false
./main 17,92s user 2,16s system 97% cpu 20,603 total

And when import Foundation added:

time ./main
false
./main 29,44s user 1,95s system 97% cpu 32,061 total

2 Likes

That certainly sounds plausible.

String-searching is a notoriously difficult problem with many, many algorithms that feature different tradeoffs in space, time, and requirements.

1 Like

So, before Unicode, characters were all fixed size (and we didn't know we needed a thing like Zalgo text); then Unicode came along and allowed characters of arbitrary length (arguably for no good reason other than "it could be used for something later on"); then this feature was used in Zalgo text, and now of course we can't undo the feature, as that would break Zalgo text... Is Zalgo text about the only thing that wants arbitrary-length characters? :thinking:

1 Like