Is String's subscript O(n) when given a lot of combining code points?

A recent discussion made me wonder about the time complexity of reading one character in a string. Characters (grapheme clusters in Unicode parlance) can be an aggregate of many combining code points, and Unicode does not specify an upper limit on how many can be combined into one character. So it stands to reason that if a character is formed from a lot of combining code points, extracting it from the string may require reading an arbitrarily long stretch of data.

And so, the following code becomes unexpectedly quadratic O(n^2) when the string starts with one long combining character:

for character in string {
    // checking if character is the same as the first character
    if character == string[string.startIndex] { print("same!") }
}

Let's consider an input of 100 characters: the first one is made of 100 code points (one base code point and 99 combining ones), and the other 99 are plain single-code-point characters. With this input, the outer loop iterates 100 times (once for each character), and the inner loop (hidden inside the subscript) has to walk over 100 code points to form the first character every time. Total iterations of the inner loop: 100 * 100 = 10,000.
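For concreteness, such an input can be built like this (using U+0302, a combining circumflex; any combining mark would do):

// One Character made of 100 code points: a base "a" plus 99 combining marks,
// followed by 99 ordinary single-code-point characters (100 characters total).
let firstCharacter = "a" + String(repeating: "\u{302}", count: 99)
let string = firstCharacter + String(repeating: "x", count: 99)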

Is there some mitigation for that in String's implementation? Because this looks like a possible DoS attack vector to me if string algorithms, or generic collection algorithms used with strings, are written with the assumption that accessing a character is constant time, O(1).

8 Likes

Collection's subscript is documented as O(1), so that is indeed a very reasonable assumption for people to have.

String indexes can store their character stride, so subscripting them is O(1); but they might not, in which case the subscript will have to calculate it. So I think you may be right, and that String might not be a valid Collection in that it does not fulfil its performance guarantees.

startIndex in particular never knows its character stride; and we would need to store that information along with the code-units so that we can continue returning startIndex itself in O(1). I don't believe String's storage class (the one which holds big strings where this may be an issue) is @frozen, so we have room to add storage for that. Of course, all mutating operations would need to check and possibly update that value.

I've had to do similar things when faced with the same problem: when you have variable-length elements that are parsed on demand from some underlying storage, Swift's Collection model effectively requires that indexes carry both the position and the length of the element they refer to, and that you cache the length of the first element. I'm sure many collection authors have encountered this.
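In rough terms, such an index ends up shaped like this (an illustrative sketch with made-up names, not String's actual layout):

// The index stores its position plus an optional cached stride of the
// element it points to; only the position participates in comparisons.
struct VariableLengthIndex: Comparable {
    var offset: Int     // position in the underlying storage
    var stride: Int?    // cached element length, if already computed

    static func == (lhs: Self, rhs: Self) -> Bool { lhs.offset == rhs.offset }
    static func < (lhs: Self, rhs: Self) -> Bool { lhs.offset < rhs.offset }
}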

5 Likes

Even if the index stored the character stride, characters aren't substrings so wouldn't you still need an O(n) memcpy?

3 Likes

That's right. I mentioned that in the other thread about startIndex in standard library collections:

As you say, Character needs to copy its underlying data, and there can be a varying amount of data from character to character, so not all characters will take the same amount of time to access via the subscript.

However, it is still independent of the length of the entire string; the same character won't take more time to access if it were part of a longer string. That's what (I think) we mean by O(1) in this context.

4 Likes

If you're going by that choice for n, then live-calculating character stride should be OK as well, since that's also bounded by the length of a single character.

2 Likes

Yes, that's a good point. Even if every index knew its character stride, simply forming 100 copies of a 100-byte character would require copying 10,000 bytes, since they do not share storage with the underlying String.

2 Likes

All characters together are bounded by the length of the entire string, meaning if you iterate over all characters you don't get O(n^2) behaviour even though there's a nested loop.

But one character can make up the entire string, so reading one character of unreasonable length is at worst O(n). It's unlikely to occur in practice outside of maliciously crafted strings, but it's certainly possible to make a couple-of-megabytes character (assuming there is no enforcement of a shorter byte limit).

I ran an experiment in the Swift REPL with a gigantic one-character string:

let s = "a" + String(repeating: "\u{302}", count: 20_000_000)
s.unicodeScalars.count // fast!
s.count         // slow! (about one second)
s == "a"        // fast!
s.first == "a"  // slow! (about one second)
s == "e"        // slow! (about half a second)
s.first == "e"  // slow! (about one second)
s == "é"        // slow! (about half a second)
s.first == "é"  // slow! (about one second)
s == "/"        // slow! (about half a second)
s.first == "/"  // slow! (about one second)

Interestingly, s == "a" remains fast, but none of the others do. I wonder what kind of shortcut it takes that isn't taken for "e" or "/".

And as a consequence:

"hello world!".contains(s) // about 10 seconds
6 Likes

I wonder if this is "a bug rather than a feature"... In its current form you could have the whole "War and Peace" (and some more) in a single character.

This is a serious concern. Looks unfixable :smiling_face_with_tear: All APIs that deal with the Character type and claim to be O(1) are at least O(n) in the degenerate case of n Unicode code points per character.

For me this one is still O(n), just with a smaller constant factor; make it 200_000_000 to see.

Wow, that's an interesting anomaly, worth investigating.

2 Likes

I suppose the only possible fix would be to impose a size limit on Character (even though Unicode says they're unlimited). If the limit were set to 50 code points, then s[startIndex] would need to read at most 50 code points and copy at most 50 code points to create the Character, making it constant-time. Longer characters would then be unrepresentable as a Character, so they would become a � (and thus all compare equal even when they're not supposed to :cry:), but the individual code points would remain unchanged inside the string.

Note that advancing the index by one character would still have to skip all the combining code points regardless of that limit, so character boundaries do not change. There is no expectation of O(1) when advancing, so that part is fine.
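As a sketch, a capped lookup could work like this (hypothetical code, not an existing API):

// Hypothetical capped extraction: return the first character, or U+FFFD if
// it spans more than `maxScalars` Unicode scalars. The copy is bounded by
// the cap; finding the character boundary still walks the whole cluster,
// as noted above.
func cappedFirstCharacter(of s: String, maxScalars: Int = 50) -> Character? {
    guard !s.isEmpty else { return nil }
    let boundary = s.index(after: s.startIndex)
    let scalars = s.unicodeScalars[s.startIndex..<boundary]
    if scalars.prefix(maxScalars + 1).count > maxScalars {
        return "\u{FFFD}"  // over the limit: degrade to the replacement character
    }
    return s[s.startIndex]
}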

The biggest drawback is the loss of correctness in the rare cases where characters exceed the limit. The other is backward compatibility. One way to mitigate that would be a separate subscript in StringProtocol to get the full character, but that makes it difficult to use with generics. Another would be a runtime setting to change the threshold globally or task-locally. With this latter approach we could pick a default behavior while still allowing applications to set their own limit; the limit could be set lower when parsing untrusted input, for instance.

A configurable limit could also help in writing more efficient string algorithms, not just in mitigating attacks. String.contains, for instance, could check whether one of the two strings is ASCII and, if so, set the limit to the lowest allowed value. The "hello world!".contains(s) in my example above would then run instantly in O(n) where n = 12, possibly with no heap allocation, instead of spending 10 seconds copying a multi-megabyte character for each equality test, O(n*m) with n = 12 and m = 20,000,000.
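The kind of cheap check that would enable this could look like the following (a simplified sketch; it ignores some grapheme-breaking rules such as ZWJ and spacing-mark sequences):

// Does the character starting at `i` equal a given ASCII scalar? An ASCII
// character can only match a single-scalar cluster, so if the next scalar
// extends the cluster we can reject without reading the whole thing.
func matchesASCII(_ ascii: Unicode.Scalar, in s: String, at i: String.Index) -> Bool {
    let scalars = s.unicodeScalars
    guard scalars[i] == ascii else { return false }      // first scalar already differs
    let next = scalars.index(after: i)
    guard next < scalars.endIndex else { return true }   // last character of the string
    return !scalars[next].properties.isGraphemeExtend    // reject "e" + combining marks
}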

2 Likes

If you know that the left string's character is "h", and you start reading the right string's character and see it is "e" + combining ^ + combining ^ + ..., do you really need to read the right string's character in full to see that the two characters are different?

2 Likes

If you're reading a Character, it has to be read whole. If you're reading the code points directly from the string, you could stop as soon as you can determine there's no possible equivalence. Ideally you'd read the bytes of the two characters' code points in lockstep and make that determination at every step. That would probably be a good fix for the bad algorithmic complexity. It's already possible to do this when operating on two strings, if you're willing to dive into Unicode combining logic or to limit the optimisation to characters in the ASCII range.
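For the ASCII-limited case, the lockstep idea might look like this (a sketch; a real implementation would also need to verify that a full match ends on a character boundary in the haystack):

// Walk both UTF-8 sequences together and bail out at the first differing
// byte, never materializing a Character.
func asciiNeedleMatches(_ needle: String, at start: String.Index, in haystack: String) -> Bool {
    var n = needle.utf8.makeIterator()
    var h = haystack.utf8[start...].makeIterator()
    while let a = n.next() {
        guard let b = h.next(), a == b else { return false }  // early exit on mismatch
    }
    return true
}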

But it's also a very different and less convenient programming model from what we have now, where you extract two Characters fully and then compare them. It won't replace, nor can it fix, the current API that has O(n) issues when extracting a Character, which is pretty much only a problem with malicious input, not with normal text.

3 Likes

For the sake of completeness, this is quick in the above test case:

s.utf8.count

:+1:

With this size-capped character in mind, would it be possible:

  1. when a Character value is passed into a function as a parameter, returned from it as a return value, or used as an inout parameter, to allocate it purely on the stack? No dynamic allocation would be needed if it's limited to, say, 200 bytes.
  2. when a Character is stored in an Array or other similar structure, to allocate it on the heap? Treating it as a fixed-size 200-byte value would be quite wasteful when you store characters en masse.

or is this type of optimisation impossible in Swift?

1 Like

I think we should be clearer with the terms and phrases we're using; loose phrasing may give people a false impression, especially as you've mentioned that this may have security implications.

To be precise: extracting a Character from a String is not O(n). It does depend on the length of the character, but not the length of the string beyond that character. The length of the string is the salient n because extraction is an operation that one performs on a string (subscripting using an index), not on a character.

For any Character, if I keep that character constant, and subscript it from a String that contains 1/2/4/8 other characters, the time to extract that character using its index does not change. In that respect it is O(1).

This highlights an important point when discussing algorithmic complexity - it pertains to scaling, not absolute performance. The issue here is that an individual Character may be surprisingly expensive to extract in absolute terms.

That means that this is not, in fact, quadratic:

In this example, if you double the number of characters in string, execution time should scale linearly (with the outer loop), even if the first character is one of these pathological cases consisting of hundreds or thousands of code-points.

I hacked together a little script to test it. The first character is a pathological case like in your example: "a" + String(repeating: "\u{302}", count: 20_000) (I shortened it to 20K scalars rather than 20M), and the rest of the string is a string of "x"s. Here are my results:

Command: swiftc -O main.swift && ./main

Tail Length, Time (relative to TL=1)
1, 1.0
2, 1.1209763106021202
4, 1.450454106412358
8, 2.119564980202834
16, 3.3017124757373306
32, 6.015886260665994
64, 11.696052411165708
128, 23.113525354604906
256, 45.930820600690446
512, 91.53868614042064
1024, 182.80684011048763
2048, 365.3415016794289
4096, 730.4373593487122

When I plot these figures, they show exactly what we expect - linear execution time, from the outer loop. I eyeballed it as roughly fitting y=0.176x:

[plot: execution time vs. tail length, roughly linear, fitting y = 0.176x]
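For reference, main.swift is roughly along these lines (a simplified sketch, not the exact script):

import Dispatch

// A fixed pathological first character, followed by tails of doubling length.
let head = "a" + String(repeating: "\u{302}", count: 20_000)

var strings = [(Int, String)]()
for n in 0...12 {
    let tailLength = 1 << n
    strings.append((tailLength, head + String(repeating: "x", count: tailLength)))
}

for (tailLength, string) in strings {
    let start = DispatchTime.now().uptimeNanoseconds
    var matches = 0
    for character in string where character == string[string.startIndex] {
        matches += 1
    }
    let elapsed = DispatchTime.now().uptimeNanoseconds - start
    print(tailLength, elapsed, matches)  // print `matches` so the loop isn't optimized away
}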

8 Likes

The actual complexity is O(firstCharacterSize * tailLength). If you only increase one of the two variables then growth is indeed linear, but when both are increased together it becomes quadratic, O(n^2), in the total length.

If you were increasing the size of the tail and the size of the combining character together, you'd get a different curve. Here's what I get after such a tweak:

Total String Length (2 * 2^n + 1), Time (relative to TL=1)
3, 1
5, 0.11738926279543
9, 0.217352741691449
17, 0.619554442531434
33, 1.83257682475087
65, 6.26942140136693
129, 22.5517817081442
257, 78.8247508738978
513, 304.120780508165
1025, 1181.0229039495
2049, 4621.93697500913
4097, 18319.514008452
8193, 73046.5274690875
16385, 291843.283142902
32769, 1166561.86414149

To get these results, I changed the first part of your script to this:

func pathological(_ length: Int) -> String {
    "a" + String(repeating: "\u{302}", count: length) + String(repeating: "x", count: length)
}

var strings = [(Int, String)]()
for n in 0...14 {
    let length = 1 << n
    strings.append((length, pathological(length)))
}
2 Likes

When you have two variables to change, you should name them separately. You have a loop body that does O(firstCharacter) work and you loop it O(fullLength) times; that's an O(N * M) operation, even when M affects N.

6 Likes

True. I should have made that clearer. Sorry for the confusion.

1 Like

With all of the above in mind, shouldn't places like these:

    /// - Complexity: O(1)
    public subscript(i: Substring.Index) -> Character { get }
    ...
    /// - Complexity: O(1)
    @discardableResult
    @inlinable public mutating func removeLast() -> Character

be changed to this instead:

    /// - Complexity: O(M), where M is the number of Unicode code points in the resulting character
4 Likes

Not sure how useful this is, but I'd like to point out that Unicode describes an algorithm for generating "stream-safe text" that's intended to address a very similar issue that can arise from string normalization. A slightly different algorithm could be used to prevent extended grapheme clusters from becoming arbitrarily large.
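For reference, the stream-safe format caps runs of combining scalars at 30 by inserting U+034F COMBINING GRAPHEME JOINER. A rough sketch (simplified: it uses the Grapheme_Extend property as a stand-in for the spec's non-starter test):

// Insert a CGJ whenever a run of more than 30 combining scalars would
// otherwise accumulate. Note that CGJ does not end a grapheme cluster,
// which is why capping cluster size would need a modified algorithm.
func streamSafe(_ s: String) -> String {
    var scalars = String.UnicodeScalarView()
    var run = 0
    for scalar in s.unicodeScalars {
        if scalar.properties.isGraphemeExtend {
            run += 1
            if run > 30 {
                scalars.append("\u{034F}")  // CGJ resets the run
                run = 1
            }
        } else {
            run = 0
        }
        scalars.append(scalar)
    }
    return String(scalars)
}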

4 Likes

Extracting a Character out of a String has the same complexity as copying its Unicode scalars to a freshly allocated new String. (A Character is just a String with a guaranteed count of 1.)

This requires visiting each Unicode scalar in the Character.

So accessing a Character within a String value takes time proportional to its length in scalars.

No, this isn't a problem.

No, this is not a quadratic operation. The loop costs O(c*n) time, where n is the number of Unicode scalars in the entire string, and c is the number of scalars in its first character. c is a constant value.

There is nothing to mitigate here.

6 Likes

But I'm purposefully choosing c to be half the scalar length of the string: c = n/2. And so the whole thing becomes O((n/2)*n), same as O(n^2).

2 Likes