A recent discussion made me wonder about the time complexity of reading one character in a string. A character (a grapheme cluster in Unicode parlance) can be an aggregate of many combining code points, and Unicode does not specify an upper limit on how many can be combined into one character. So it stands to reason that if a character is formed from a lot of combining code points, extracting that one character from the string may require reading an arbitrary amount of data.
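To make this concrete, here is a tiny illustration; the specific combining code points (U+0301 COMBINING ACUTE ACCENT and U+0327 COMBINING CEDILLA) are just an arbitrary choice:

let cluster = "e\u{0301}\u{0327}"    // "e" followed by two combining marks
print(cluster.count)                 // 1 character
print(cluster.unicodeScalars.count)  // 3 code points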
And so, the following code becomes unexpectedly quadratic, O(n^2), when the string starts with one character made of many combining code points:
for character in string {
    // check whether this character is the same as the first character
    if character == string[string.startIndex] { print("same!") }
}
Let's consider an input of 100 characters: the first one is made of 100 code points (one base code point and 99 combining ones), and the other 99 characters are each a single non-combining code point. With this input, the outer loop iterates 100 times (once per character), and the inner loop (hidden inside the subscript) has to walk 100 code points to form the first character every time. Total iterations of the inner loop: 100 * 100 = 10,000.
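A rough sketch of how such an input could be built (the variable names and the use of U+0301 as the combining code point are just illustrative, not anything special):

// One Character made of 100 code points, followed by 99 plain characters.
let longFirstCharacter = "a" + String(repeating: "\u{0301}", count: 99)
let string = longFirstCharacter + String(repeating: "b", count: 99)

print(string.count)                 // 100 characters
print(string.unicodeScalars.count)  // 199 code points

for character in string {
    // If the subscript really has to re-scan the 100 code points of the first
    // grapheme cluster on every iteration, this loop performs on the order of
    // 10,000 code point reads instead of roughly 200.
    if character == string[string.startIndex] { print("same!") }
}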
Is there some mitigation for that in String's implementation? Because this looks like a possible DoS attack vector to me if string algorithms (or generic collection algorithms used with strings) are written with the assumption that accessing a character is constant time, O(1).