Shorthand for Offsetting startIndex and endIndex

@QuinceyMorris I like your suggestions.

In case of the label skip I would suggest to use drop which has precedent in dropFirst and dropLast.

Thanks for the explanation, but that doesn't clarify it for me. Does this wrapping up

subscript(_ i: Int) -> Character {

    return self[self.index(s.startIndex, offsetBy: i)]    
}

suddenly make the algorithm O(n)?

I don't see how this can meddle with having Int. Are both representations equally indexed?

I know. I want to know why String does.

Yes, calculating a string index from the start of the string is O(n). Because you have to look at all the bytes that come before the character you want. Once you have an index though things get a little faster since you don't have to compute that index again. But it still remains that you'll have to compute the index.

1 Like

Is converting a String to Array an expensive operation?

Now I really am confused.

You wrote subscripts on Collections should be O(1). That means String subscripts are O(1), since @xwu confirmed String is a Collection. So

str[str.index(str.startIndex, offsetBy: i)] is O(1).

Then the below is O(1) as well.

subscript(_ i: Int) -> Character {

    return self[self.index(s.startIndex, offsetBy: i)]    
}

@ckeithray That should be at least Theta(n)

It is a bit confusing. Doing the actual indexing is O(1), but computing the Index still takes O(n).

Which is why Swift strings make these index calculations be done explicitly by the user.

1 Like

Take a look at this excerpt from the Advanced Swift book (a good read if you're interested).

Nicely explained. Let's put it in the context of the code that @anthonylatsis writes:

let idx = str.index(str.startIndex, offsetBy: i) is O(N).
str[idx] is O(1).

This is why it's so important to have an explicit label, or some other indication, if providing a convenience subscript that performs both of these steps in one.

7 Likes

@xwu
So basically this is to prevent writing

str[i] // O(n)
str[i + 1] // O(n)

instead of

let index = str.index(str.startIndex, offsetBy: i) // O(n)

str[index] // O(1)
str[index.next()] // O(1)

although if i + 1 were some j instead, it would still be O(n) in both cases. And iterating over a string also remains O(n).

This requires traversing the string from the beginning until i + 1.

This isn't O(1), because the next character could be any amount of distance away, but it doesn't require traversing the string from the beginning, only from index.

Due to grapheme clusters? Then O(m), m is the length of the cluster. As for everything else, understood. Thanks to everyone

2 Likes

Am I right in saying that for the various StringProtocol.*View, if they weren't views over a type (which might have a backing storage of different encoding) they could be RandomAccessCollection? Basically, could they just be seen as Arrays?
Of course in this case an element wouldn't be a "character" depending on the encoding.
Is there any way to access this performance gain without losing StringProtocol conformance?

With this example you expose an issue of performance (total of 2N+1 as opposed to N+2). But I think that the main reason for this, more than avoiding performance losses, was to avoid unexpected performance costs. i.e. avoid letting code look like O(1) when it's actually O(n)

They're all arrays of fixed-size code units, but that's not useful directly, because a code unit is meaningless in isolation. At a minimum, you want a sequence of code units (of variable length) that represents a code point (i.e. a code point is an array of code units). Alternatively, you want to find a code unit before or after a given code unit that is the first code unit of a code point.

In the case of UnicodeScalarView, one code unit is one code point. That seems easier, but that still doesn't get you very far, because Characters are arrays of code points (of variable length) — or arrays of arrays of code units.

Thus, the only real use of the views is for forming or serializing strings, or for performance-sensitive operations where you drop down to the code point or code unit level, ignoring or otherwise dealing with grapheme boundaries.

Yes. For a concrete Unicode string, any algorithm that treats it as an array of anything risks explosive increase in complexity. Hence the extreme lengths that String goes to, to avoid looking like an array.

IIRC, the first Unicode standard was released about 1990, so it's taken 28 years to get a grip on this (and we're not really there yet, if String is undergoing another round of usability changes).

1 Like

You mention convenience functions that the user can add at will, such as, takeFirst. I’d argue that this would be a necessity to include along with the other methods.Some others I think should be included:

subscript(takeFirst:)
subscript(takeLast:)
subscript(skipFirst:)
subscript(skipLast:)

All of this will increase the surface API of Sequence and Collection even if suffix/prefix/drop are removed.

Having all of these as subscripts will also make it difficult to discover, due to limitations in code completion I’ve seen, in a few IDE’s.

Yes. I waffled on this because it's a decision to be taken whether the convenience methods should follow a consistent naming convention, or whether they should keep the existing (drop/prefix/suffix) names where they overlap or both. I think consistent is better, and the old names seem to be approaching deprecation.

The question of API surface area seems to me to be part of SE-0132, if it gets revived.

If autocompletion counts as discovery, there could be a fourth set of API that implement the subscripts as methods (so .skipFirst(_:take:) for [skipFirst:take:]). It would be nice if they could have an attribute like "deprecated", with a fixit to replace them with the subscript, except that they would still appear in code completion.

Edit: Oops, no need for another set of methods. The subscripts are already equivalent to the underlying slice methods with the same parameter keywords.

My only concern would be the mutable case:

  x [dropFirst: 2, take: 3] = []

which could be read as suggesting that the first two elements are being dropped, and the next 3 replaced. Other than that, I agree that "drop" has familiarity in its favor.

But that was the point I was trying to make: if someone is working with strings that are known to be ASCII-only... is the only way to deal with this performantly to actually deal with [UInt8]? Should there be something like struct ASCIIString: StringProtocol, RandomAccessCollection?

String can and does provide its own optimizations for common use cases.

The trap with the reasoning here is that everyone working with strings in every programming language knows that they're dealing with [insert charset here] only, then things break when what they know isn't true.

The question that Swift has you answer is: are you working with strings-as-strings, or only strings-as-sequences-of-bytes? If the former, then use String, since you very much want things not to break down when you receive unanticipated Unicode input; if the latter, and you really don't care what the represented characters are, then use Data or something else.

Also, you don't know what's in the String instance backing store, so to avoid all conversions you need to stay away from String completely.

1 Like

A desire for solutions to this is mentioned in the "### Unmanaged Strings" section of State of String: ABI, Performance, Ergonomics, and You!.

Sorry for the lack of rendered Markdown, it's from pre-forum days. Perhaps an admin (CC @Nicole_Jacque) can edit it to apply formatting?

1 Like

Well, I tried...