Is String's subscript O(n) when given a lot of combining code points?

This is quite a misunderstanding of Unicode history. I suggest reading up on Unicode, UCS-2, and UTF-16 before continuing to speculate about breaking changes.

2 Likes

Zalgo is currently the only thing that needs truly "arbitrary length" characters. However, absent absolute guidance from Unicode, any upper bound that you might choose as "the biggest character anyone will ever legitimately need" is either small enough that it will eventually be violated by perfectly reasonable Unicode sequences, or so large as to not be of any benefit for optimization purposes (we're already up to ~40-byte characters in perfectly normal emoji, IIRC, and it's already possible to back yourself into perverse behaviors using only characters smaller than that).
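
For a sense of scale, here is a quick illustration (a base letter plus a pile of combining accents; any combining marks would do) of how a single Character can grow without bound:

// One base letter plus 100 combining acute accents is still a single
// grapheme cluster -- i.e. one Character -- but it occupies 201 UTF-8 bytes.
let zalgo = "Z" + String(repeating: "\u{0301}", count: 100)
print(zalgo.count)                            // 1
print(Character(zalgo).unicodeScalars.count)  // 101
print(zalgo.utf8.count)                       // 201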

9 Likes

If supporting arbitrary-length characters is a goal, then the programming model where you copy the whole character around is a bad one. It's pretty rare to keep a Character variable alive for a long time. In most cases, I think it'd be better if the Character was a slice. Ideally a slice with a lazily calculated upper bound, so it's O(1) to create one from an index in the string. And it should be possible to manually detach the slice-Character from the underlying string when needed.

This should allow most generic parsing and searching algorithms to work well with it with a minimal amount of specialization.

But alas, Character is defined in a way that makes this hard to change:

@frozen
public struct Character: Sendable {
  @usableFromInline
  internal var _str: String
  // ...
}

Maybe then we should add a String.slicing view (with CharacterSlice as the element), which would be better suited for parsing and searching in strings in the presence of Zalgo or maliciously sized characters.
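
A minimal sketch of what such a view could look like, with Substring standing in for the proposed CharacterSlice (the names here are hypothetical, not existing API):

// Hypothetical sketch: a view whose elements are character-sized Substrings,
// so accessing an element shares storage with the base string instead of
// copying the character's bytes out.
extension String {
  struct SlicingView: Collection {
    let base: String
    var startIndex: String.Index { base.startIndex }
    var endIndex: String.Index { base.endIndex }
    func index(after i: String.Index) -> String.Index { base.index(after: i) }
    subscript(i: String.Index) -> Substring {
      base[i ..< base.index(after: i)]  // one grapheme cluster, zero copies
    }
  }
  var slicing: SlicingView { SlicingView(base: self) }
}

Because the view is read-only, generic non-mutating algorithms could run on it without ever forming a standalone Character.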

3 Likes

If Character was a slice, then it would hold a strong reference to its original string storage. This would prevent said storage from getting deallocated as long as the Character is alive, and it would induce copy-on-write copies when someone tries to mutate the String value that the Character came from.

This would trade off one performance issue for a set of different, potentially more serious ones. (Who expects the elements of a collection to hold on to the entire collection?)
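
A small sketch of that effect using today's Substring, which has the same storage-sharing behaviour a slice-based Character would have:

var str = String(repeating: "x", count: 1_000_000)
// The character-sized slice shares str's storage...
let firstChar: Substring = str[..<str.index(after: str.startIndex)]
// ...so this append can no longer happen in place: the storage is not
// uniquely referenced, and the whole megabyte of data gets copied first.
str.append("!")
_ = firstChar  // keeping the slice alive keeps the old storage alive too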

But, as long as we know what we're doing, accessing the contents of a string by slicing off character-sized Substrings rather than extracting any actual Characters can indeed be a useful tool.

// Look, ma, no allocations!
var i = str.startIndex
while i < str.endIndex {
  let j = str.index(after: i)
  defer { i = j }
  let char = str[i ..< j]
  print(char)
}

(Keep in mind that this only eliminates copying of string data. index(after:) will still visit each scalar in the Character it is jumping over.)

7 Likes

This reminds me that a lot of these exact topics were raised during the discussion of SE-0163, which added BidirectionalCollection conformance to String back in Swift 4. In particular, the lack of O(1) seeking was a point of contention.

Indeed, and this is already a notable issue for string slices in general (Substring etc), and really all types of slices or views (e.g. ArraySlice).

I wonder if the weak reference machinery could be leveraged here? A weak reference alone isn't sufficient of course, because you do need part of the original object, but if the weak reference holder were given an opportunity to extract what it needs before the original goes away…?

It'd certainly make the implementation of Character and slices more interesting, having to support both modes, and I'm sure there's some significant coordination required to avoid regressive behaviour (trading off one large object for thousands of maybe smaller ones that add up to more). Still, could be a fun challenge.

2 Likes

We are using withXXX elsewhere for a similar purpose; something similar could potentially be used here:

str.withQuickAccess { accessor in
    var i = str.startIndex
    while i < str.endIndex {
        let j = str.index(after: i)
        let char = accessor[i ..< j]
        print(char)
        i = j
    }
}

I feel like a lot of this discussion could have been avoided if String's operations had been documented as O(e), where 'e' is the size of the element being accessed. This isn't true for most Collections of variably-sized items (like [[Int]]), but it is true for String, and it could be true of others, so one possibility is that the documentation for Collection is updated to note that the size of the element in question can affect certain operations like subscript.

6 Likes

Improved documentation would help a little, but I think the remaining problems are still significant - e.g. the "footgun" aspect that can be weaponised for DoS attacks.

There's a lot about Unicode text handling, and consequently Swift's most basic string type, that not only can be leveraged by malicious parties but is easy for most developers to get wrong.

I know some folks think it's intrinsically complicated and everyone handling text needs to "just" learn all the pitfalls and do all the work to mitigate them, but I hope that's not really the best possible approach.

2 Likes

I don't think that anyone serious believes this. A number of people (myself included) believe that subscripting and character-by-character access are not generally very good idioms for most text processing, and that we should build out String's API so that people don't reach for those operations all the time.

11 Likes

Text needing special handling is definitely not what Swift tells its users when String automatically inherits every algorithm that works with Collection and friends.

But I like very much my earlier idea of providing a "slicing" view (with CharacterSlice as the element type). If you need to fix a performance problem with some obscure input nobody thought of before, switching your algorithm to use the slicing view is much easier than rewriting it in a way that avoids forming any Character. The one important pitfall slicing has, insidious copy-on-write, could be helped by not making the view mutable, thus preventing mutation algorithms from working on it.

1 Like

For most writing systems, text simply isn't linear. Writing was developed when there were chisels and pens used on two-dimensional tablets and paper. An overarching approximately one-dimensional flow of thought was useful for organizing a document, but there was no benefit to array-like strictness of one unit of meaning after another. The approximate one-dimensional flow is strong enough that the thing can be approximated in a linear manner to represent it in a bit stream, but it still needs a sort of mark-up to stack units of meaning out sideways, perpendicular to flow. Aside from compromises for legacy compatibility, a unit of meaning is adopted as a Unicode scalar (below where order matters), and the group of them that forms a unit of the linear flow (where order becomes meaningful) is a grapheme cluster. Hence the internal size of a cluster is necessarily unbounded.

An English string is simultaneously a sequence of letters and a sequence of words. Imagine if someone implemented a string view that vended words, and then became annoyed that the thing became inefficient if people supplied a string containing too long a word. Imagine he came back to Unicode and said that the world would be better if words had a strict length limit. The longest word he has ever seen is 67 letters, so a cap of 100 ought to work for the real world. A string with more than 100 characters in a row without a space should be considered malformed. Unicode should declare that any compliant processor should fix malformed strings by inserting a space at the 100th character. How would you respond? Because this is roughly equivalent to the suggestion of capping grapheme cluster size, but stays within the realm familiar to English speakers so as to be easier to consider. (That is a rhetorical question for each reader; none need actually respond.)

Exactly this.

13 Likes

This is generally true for any function that returns a value, isn't it? It needs to construct/copy the value it returns, which depends on the size of the value -- if you have a function which returns an Int, and one which returns an (Int, Int, Int), the latter should be assumed to take ~3x as long to return a value as the former does. I think that's straightforward enough (IMO) that it doesn't need explicit documentation -- otherwise, it would have to be on almost every generic function.

For values like Array with indirect storage, I think it's important to emphasise that the value's size is not the same as its capacity, and that Array has a constant size regardless of its capacity. I guess that might be worth explicit documentation?
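
For instance, the Array value itself is just a single reference to its heap buffer, so its size is fixed no matter how many elements it holds:

// The Array struct stores one reference to its storage buffer, so the value's
// size is constant regardless of count or capacity.
print(MemoryLayout<[Int]>.size)       // 8 (on 64-bit platforms)
print(MemoryLayout<[[String]]>.size)  // also 8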

2 Likes

The difference here is that different instances of Character take different amounts of time to construct. That's not true of Int or (Int, Int, Int). It is true of Array, though, unless there's an existing Array you can share, so maybe you're right that this isn't about the element so much as the collection.

(As an example, if you had a Collection of Arrays backed by a single large buffer, accessing elements would have to construct a new Array, taking O(e) time like String does. And if you had a Set of Characters, access would take true O(1) time because of copy-on-write storage sharing.)
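
A hypothetical sketch of the first case -- fixed-size chunks carved out of one flat buffer, where the subscript has to materialise a fresh Array on every access:

// Hypothetical example: each element is an [Int] chunk copied out of the
// shared buffer, so the subscript costs O(chunkSize) per access.
struct ChunkedView: Collection {
  let storage: [Int]
  let chunkSize: Int
  var startIndex: Int { 0 }
  var endIndex: Int { storage.count / chunkSize }
  func index(after i: Int) -> Int { i + 1 }
  subscript(i: Int) -> [Int] {
    Array(storage[i * chunkSize ..< (i + 1) * chunkSize])  // O(chunkSize) copy
  }
}

let view = ChunkedView(storage: Array(0 ..< 1_000), chunkSize: 100)
print(view[3])  // copies 100 elements out of the shared buffer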

4 Likes

Let's consider three examples, all claiming to be O(1):

Array:
/// - Complexity: Reading an element from an array is O(1). ...
subscript(index: Int) -> Element

the specific case of, say, an array of arrays is optimised via the COW machinery, making the time/space complexity O(1), although in general "the whole thing" could be copied, making the complexity O(size of element), or even O(?), depending entirely on what exactly happens during element copying. Here the (type of) element is what defines the resulting complexity.
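
Concretely, reading an element out of an array of arrays only copies a reference to shared copy-on-write storage, not the inner elements:

let inner = Array(repeating: 0, count: 1_000_000)
let outer = [inner, inner, inner]
// O(1): the element value shares inner's storage; no million-element copy
// happens unless one of the values is mutated later.
let element = outer[1]
print(element.count)  // 1000000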

String:
/// - Complexity: O(1)
subscript(r: Range<String.Index>) -> Substring { get }

here the time/space complexity is always O(1)

String:
/// - Complexity: O(1)
mutating func removeLast() -> Character

here as we discussed time and space complexity is O(size of character).

In the above examples, the claimed complexities of O(1) are in fact "O(?)", "O(1)", and "O(m)" respectively, and that is not obvious at first glance.

2 Likes
/// - Complexity: O(1)

These are "feel-good" annotations, not promising anything in particular, other than a very vague "it'll be fast" -- as in, "feel free to invoke it in a loop if you need to do that".

But an O(1) operation can still take a month to execute. (Or a century. A million centuries!) Just because an operation has O(1) complexity, we do not get a free pass to repeatedly invoke it on the same inputs.

You can tell these promises aren't very tangible by the way they don't specify what units they're using to measure "complexity". They don't tell you this, because doing that would admit that the units vary from type to type, and from function to function.

Complexity is measured by the count of primitive operations that a function performs; what those primitive operations are depends on the abstraction level on which the construct happens to sit.


A simple example would be Int.+, i.e., arithmetic addition of two Int values. We can (probably!) promise that it takes O(1) time to add together two 64-bit integers -- the addition can compile down to a single instruction, and executing that instruction will usually take some finite maximum number of CPU cycles. So for operations on the lowest layers of abstraction, complexity can in fact be expressed in terms of actual elapsed time.

(Note that this is heavily assuming that the integer values don't need to get loaded from / stored to a memory address, and that the instruction is located in physical memory. Estimating time can get really complicated if that's not the case, as we'll see next.)


A more complicated example is Array's subscript getter. That is a far, far more complex operation than integer addition, even though Array is (conceptually) one of the simplest possible data structures. Its indexing subscript getter measures its complexity by the number of times it needs to access or copy array elements.

This does give us a good indication of how complex the operation is -- the vague "O(1)" tells us not to be afraid to use it as often as necessary, and that is good advice.

But "O(1)" has practically nothing to with the upper limit of how much time it can take to execute array[i]:

  • The selected unit of complexity very nicely avoids having to deal with the case when the Array happens to be backed by a bridged NSArray instance. NSArray subclasses (and proxy classes pretending to be one) can do whatever weird thing they want to do, including getting blocked for an arbitrary length of time on storage access or networking events. (Array still isn't lying -- it's performing a constant number of accesses, even in this case. An "access" can simply include dynamically calling out to some random Objective-C method.)

  • Copying an item that is (or contains) a strong reference includes incrementing its reference count; that operation involves an atomic compare-exchange loop that can, in theory, spin for an arbitrary number of iterations. (In practice it usually succeeds in a few tries. But it might take a million! This does make it tricky to promise any particular strict upper bound on elapsed time -- which is why people sometimes get so worked up about whether certain atomic operations are "wait-free" or not. swift_retain isn't wait-free.)

  • Worse, memory is a ridiculously complicated construct in our machines. Even ignoring virtual memory, accessing a byte at a random offset inside a contiguous memory region of size n is better approximated as taking O(log(n)) time, not O(1). We just don't usually need to remember this embarrassing fact because caching works so well. (Until we encounter an algorithm where it doesn't, and suddenly everything slows to a crawl.)

  • Further, the system that runs our code will sometimes suspend execution in the middle of an array access; usually there is no strict upper time limit on how long it will take to resume execution. (This point is easy to handwave away by only measuring time while our process is running, but if we're actually limited by some real-life deadline (e.g. because we need to fill the next buffer's worth of audio data in time before the last one finishes playing), then it can easily become a real problem.)

So while let value = array[i] will perform O(1) element accesses/copies, it can still take an arbitrary amount of time to do that. There is no constant upper bound on its time complexity.

To sum it up: "Complexity: O(1)" does not mean what we generally take it to mean, even in the simple case of Array.subscript. There is sleight of hand trickery here, but it is there for a good reason -- explaining all of this directly in the API reference docs would not be at all helpful. (Even if we omitted the last two points as nitpicky nonsense, as they arguably are.)


String's indexing subscript getter also has "O(1) complexity". However, String's (default) character view sits on a much (and I mean much) higher abstraction level than Array, so its primitive operations are correspondingly more complicated, and they have an even weaker relation to elapsed time.

The primitive unit of operation in this case is finding the next grapheme cluster boundary after a previous one. (I.e., it involves visiting a single Character's worth of Unicode scalars.) In strings containing human text, the average number of scalars this operation needs to visit is typically some low value close to (but definitely higher than) 1. But there is no upper limit on how many scalars would need to be visited -- there is no upper bound to a Character's length.

The "O(1)" promise simply gives us a free pass to iterate over the characters in a string without overly worrying about how much time it will take to do so. The cost of one full iteration pass (as measured in memory accesses) will be proportional to the length of the string in bytes; and this should align with what a reasonably well-informed person would expect, given String's API.

(Note that String's other views have very different complexity units.)


The last interesting example to look at is Dictionary's keying subscript getter. Its complexity is surprisingly difficult to understand/explain. The primitive operations in this case are calculating the hash of a key, and comparing two keys for equality. (Accessing keys and values matters too, of course, but they are significantly cheaper than these two operations. The cost of hashing/comparing keys also includes the costs of accessing them.)

Note that the complexity of these latter two operations is inherently linear in the "size" of the key: this is actually very similar to how String measures things in terms of variable-sized grapheme clusters. The units once again do not (necessarily) have a constant size!

Dictionary's lookup complexity is particularly difficult to explain, because it is a probabilistic data structure. Its efficiency relies on hashing, which (if done properly [very big if!]) allows it to have O(1) expected complexity. The word "expected" here has the specific meaning used by probability theory, as in "expected value" -- and as usual, it is tied to specific assumptions about the distribution of the input (which get lifted by employing randomly seeded universal hashing).

If we applied the methods we used to analyze Array/String indexing to Dictionary lookups, we'd get that the complexity of a Dictionary lookup is O(count). There can be no better upper bound, because no matter what hash function we use, we're in effect sorting a (potentially) infinite number of different keys into a finite number of buckets. There will always be an arbitrarily large number of different keys that share the same bucket, and if we construct a dictionary from some of those clashing keys, then it will always exhibit this worst-case O(count) lookup performance.

However, as long as hashing is implemented properly, these cases will be guaranteed to be incredibly rare: the expected value of key comparisons during a Dictionary lookup is O(1), no matter what set of keys we use. (If the Key type broke its hashes, then this analysis goes out the window, and we revert to O(n) complexity -- we'll be able to easily produce keys that actually exhibit worst-case performance.)
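
A quick sketch of the "broken hashes" case, using a key type whose hash function is deliberately degenerate (the names are made up for the example):

// Every BadKey hashes to the same value, so all keys land in one collision
// chain and lookups degrade from expected O(1) to O(count) key comparisons.
struct BadKey: Hashable {
  let value: Int
  func hash(into hasher: inout Hasher) {
    hasher.combine(0)  // constant hash: Dictionary's worst case
  }
}

var dict: [BadKey: Int] = [:]
for i in 0 ..< 10_000 {
  dict[BadKey(value: i)] = i  // each insertion probes an ever-longer chain
}
print(dict[BadKey(value: 9_999)] ?? -1)  // still correct, but O(count) comparisons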

Even if everything works as well as possible, that is, even if we find the key we want on the first try, it will still take a Key.== invocation to verify this -- and that comparison is generally going to cost time proportional to the size of the key. If the Dictionary has variable-sized keys (e.g. String keys), then looking one key up can still cost an arbitrarily large amount of time, even if we measure time by storage accesses, like Array does. But this cost is effectively irrelevant in practice: of course we need to compare at least one pair of keys to find the one we are looking for! That's part and parcel of looking anything up in a general-purpose container.

I think it would not be unreasonable to have Dictionary.subscript document its complexity in the same infuriatingly over-simplified way as Array does:

- Complexity: Expected to be O(1), if `Key` correctly conforms to 
    `Hashable`. Otherwise O(`count`).

For String.subscript, perhaps it'd be useful to include a couple of words about the granularity of our units. What couple of words, though? E.g., would this be enough?

- Complexity: O(1) `Character` extractions.

(Is there a better alternative? Copying and pasting something like the multiple paragraphs of really badly explained gibberish above would only serve to mislead and confuse people even more.)

12 Likes

Translation lookaside buffers are a truly marvelous trick until they suddenly aren't anymore, and I'm so glad I'm almost never in that situation.

5 Likes

Great and thorough explanation, thank you!

This would still miss the complexity of the hash calculation and the subsequent key comparisons to find the relevant bucket, which for variable-sized keys would be O(keySize), no? The best-case scenario would thus be O(keySize) + O(1) == O(keySize + 1) == O(keySize). Or is this too obvious to spell out explicitly?

I like this one.

I think a big difference is Dictionary and Hashable have a strategy to prevent degradation from malicious input: varying the hash value at runtime.
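
For example, Swift's Hasher is seeded with a per-process random value, so hash values (and thus bucket assignments) change from launch to launch and colliding keys can't be precomputed offline:

// String's hashValue differs on every run because of the random per-process
// hash seed (the standard library only disables this for its own testing).
print("hello".hashValue)  // a different value on each launch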

String and Character do not have such an affordance. You have to tell users of this subscript to beware somehow. "Complexity: O(1) Character extractions" isn't telling you much unless you document the character extraction process as O(m) somewhere. And where would you document the character extraction process if not in the subscript that performs this character extraction?

3 Likes

Dictionary can be maliciously manipulated in these examples, though, by just using really, really long keys. The hashing algorithm is O(keyLength), and then every comparison within a bucket could potentially have to check a lot of the key, too. This is the part that we ignore when we say dictionary access is O(1).

1 Like