Shorthand for Offsetting startIndex and endIndex

I was thinking about this kind of approach yesterday, so I'm glad to see someone suggested it. I feel like it's a reasonable compromise. The biggest issue I had with this approach is thinking of names for the params/labels. With take, for example, it is not immediately obvious to me what it does.

Another thing it doesn't really address is the brevity that some people want. It seems some people really do want powerful slicing expressions like you have in more script-oriented languages such as Python or Perl.

Perl does have * to refer to the end of an array.

my @alphabet = 'A' .. 'Z';
say @alphabet[*-1];  # OUTPUT: «Z␤» 
say @alphabet[*-2];  # OUTPUT: «Y␤» 
say @alphabet[*-3];  # OUTPUT: «X␤» 

It also allows that to be used in the slicing syntax.
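
For contrast, a rough sketch of the equivalent end-relative access in Swift today:

let alphabet = Array("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
print(alphabet[alphabet.count - 1])   // Z
print(alphabet[alphabet.count - 2])   // Y

let word = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
print(word[word.index(word.endIndex, offsetBy: -1)])   // Z
print(word[word.index(word.endIndex, offsetBy: -3)])   // X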

Can somebody be so kind as to explain why we can't have proper Int-based subscripts on String, like on Array, instead of discussing how to simplify
s[s.index(s.startIndex, offsetBy: 7)...s.index(s.startIndex, offsetBy: 11)]?

P.S. According to the String Manifesto, String will become a Collection of Characters again (it hasn't yet, has it?), which makes the question even more pressing for me.

Someone correct me if I'm wrong, but it has to do with Unicode correctness and algorithmic complexity. Because of the way Collection is defined, subscripting is expected to be O(1). But to be Unicode-correct, you have to traverse the string, which can be O(n). So the idea is that if we had Int-based subscripts on String, you would suddenly have O(n) behavior on a subscript.

It also has to do with what a "character" really is, because Unicode defines code points that combine to form a single grapheme. What happens when you have a String that is actually composed of multiple Unicode code points, but they compose to a single grapheme?

That being said, I'm also in the camp that thinks there should be some easier way to express slicing up a string that doesn't revolve around a ton of index calculation.

An extreme case:

let zalgo = "Ḩ̷͕̺͍͉̇͛́̀͑̍̕͢e̢̢̘̫̺͖͓̥͎̐̈̀̄̚͞ c̵̛̗̯̫̫͓̪̓͛̂͘ͅo̢̡̳͕̰̗̻͇̽̿̈͗̀̐̿̌͠m̢̪͕̮̱̘͇̈̉͗͆̀̕͘e̳͕͓͚͎̪̟̬̰͆̓͑̃̿̊ͅs̵͇̳͓̥̼̠͌͒̎̄͒̅̿͢"
print(zalgo.count) // 8
print(zalgo.utf16.count) // 112

It has. Collections can have opaque indices.

Previously, I had skip/keep but that looked a bit weird in a subscript on the LHS of an assignment, because it replaces the thing you're "keeping". Before that I had skip/next, which reads better, but is slightly odd in the end-relative case, where it means "next to the left". There probably isn't any single word that is totally clear on its own.
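
As a rough get-only sketch (these labels are hypothetical, not the actual pitch), the labeled form might read like this:

extension Collection {
    // "skip n elements from the front, then keep the next m"
    subscript(skip skipCount: Int, keep keepCount: Int) -> SubSequence {
        let start = index(startIndex, offsetBy: skipCount, limitedBy: endIndex) ?? endIndex
        let end = index(start, offsetBy: keepCount, limitedBy: endIndex) ?? endIndex
        return self[start..<end]
    }
}

let word = "Hello, Swift"
print(word[skip: 7, keep: 5])   // Swift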

If we can cover the basic functionality (and meet the basic requirements) in a straightforward, consistent way, I don't see a problem adding "convenience" API on top that introduces operators for brevity, or uses Python-style signed offsetting. I just wasn't much in favor of those being the only choices.

I believe the ultimate goal is to be able to complete SE-0132.


@QuinceyMorris I like your suggestions.

In the case of the label skip, I would suggest using drop, which has precedent in dropFirst and dropLast.
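
For reference, the existing precedent reads like this:

let word = "Hello, Swift"
print(word.dropFirst(7))   // Swift
print(word.dropLast(7))    // Hello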

Thanks for the explanation, but that doesn't clarify it for me. Does wrapping it up like this

subscript(_ i: Int) -> Character {
    return self[self.index(self.startIndex, offsetBy: i)]
}

suddenly make the algorithm O(n)?

I don't see how this conflicts with having Int subscripts. Are both representations indexed in the same way?

I know. I want to know why String does.

Yes, calculating a string index from the start of the string is O(n), because you have to look at all the bytes that come before the character you want. Once you have an index, though, things get a little faster, since you don't have to compute that index again. But it still remains that you have to compute the index in the first place.


Is converting a String to Array an expensive operation?

Now I really am confused.

You wrote subscripts on Collections should be O(1). That means String subscripts are O(1), since @xwu confirmed String is a Collection. So

str[str.index(str.startIndex, offsetBy: i)] is O(1).

Then the below is O(1) as well.

subscript(_ i: Int) -> Character {
    return self[self.index(self.startIndex, offsetBy: i)]
}

@ckeithray That should be at least Θ(n).

It is a bit confusing. Doing the actual indexing is O(1), but computing the Index still takes O(n).

Which is why Swift strings require these index calculations to be done explicitly by the user.


Take a look at this excerpt from the Advanced Swift book (a good read if you're interested).

Nicely explained. Let's put it in the context of the code that @anthonylatsis wrote:

let idx = str.index(str.startIndex, offsetBy: i) is O(n).
str[idx] is O(1).

This is why it's so important to have an explicit label, or some other indication, if providing a convenience subscript that performs both of these steps in one.
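
For instance, a minimal sketch (the offset label here is illustrative, not standard library API):

extension String {
    subscript(offset offset: Int) -> Character {
        // The label makes the cost visible: index(_:offsetBy:) walks the string, O(n).
        return self[index(startIndex, offsetBy: offset)]   // the subscript itself is O(1)
    }
}

let str = "Hello, Swift"
print(str[offset: 7])   // S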


@xwu
So basically this is to prevent writing

str[i] // O(n)
str[i + 1] // O(n)

instead of

let index = str.index(str.startIndex, offsetBy: i) // O(n)

str[index] // O(1)
str[str.index(after: index)] // O(1)

although if i + 1 were some j instead, it would still be O(n) in both cases. And iterating over a string also remains O(n).

This requires traversing the string from the beginning until i + 1.

This isn't O(1), because the next character could be any distance away, but it doesn't require traversing the string from the beginning, only from index.
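
In code, using the standard index(after:):

let str = "abcdef"
let i = str.index(str.startIndex, offsetBy: 3)   // O(n): walks from the start
let j = str.index(after: i)                      // steps from i, not from the start
print(str[i], str[j])                            // d e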

Due to grapheme clusters? Then it's O(m), where m is the length of the cluster. As for everything else, understood. Thanks, everyone.


Am I right in saying that, for the various StringProtocol.*View, if they weren't views over a type (which might have backing storage in a different encoding), they could be RandomAccessCollection? Basically, could they just be seen as Arrays?
Of course, in this case an element wouldn't be a "character", depending on the encoding.
Is there any way to access this performance gain without losing StringProtocol conformance?

With this example you expose a performance issue (a total of 2N+1 as opposed to N+2). But I think the main reason for this, more than avoiding performance losses, was to avoid unexpected performance costs, i.e. to avoid letting code look like O(1) when it's actually O(n).

They're all arrays of fixed-size code units, but that's not useful directly, because a code unit is meaningless in isolation. At a minimum, you want a sequence of code units (of variable length) that represents a code point (i.e. a code point is an array of code units). Alternatively, you want to find a code unit before or after a given code unit that is the first code unit of a code point.

In the case of UnicodeScalarView, one code unit is one code point. That seems easier, but that still doesn't get you very far, because Characters are arrays of code points (of variable length) — or arrays of arrays of code units.
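
A concrete illustration with a flag emoji, which is one grapheme cluster built from two code points:

let flag = "🇨🇦"
print(flag.count)                  // 1  (one Character, i.e. one grapheme cluster)
print(flag.unicodeScalars.count)   // 2  (two regional-indicator code points)
print(flag.utf16.count)            // 4  (each scalar needs a surrogate pair)
print(flag.utf8.count)             // 8  (each scalar takes 4 bytes in UTF-8)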

Thus, the only real use of the views is for forming or serializing strings, or for performance-sensitive operations where you drop down to the code point or code unit level, ignoring or otherwise dealing with grapheme boundaries.
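
For example, a quick sketch of an ASCII check that works purely at the code unit level, ignoring grapheme boundaries:

func isASCII(_ s: String) -> Bool {
    // Every UTF-8 code unit of an ASCII-only string is below 0x80.
    return s.utf8.allSatisfy { $0 < 0x80 }
}

print(isASCII("Hello"))   // true
print(isASCII("Héllo"))   // false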

Yes. For a concrete Unicode string, any algorithm that treats it as an array of anything risks explosive increase in complexity. Hence the extreme lengths that String goes to, to avoid looking like an array.

IIRC, the first Unicode standard was released about 1990, so it's taken 28 years to get a grip on this (and we're not really there yet, if String is undergoing another round of usability changes).


You mention convenience functions that the user can add at will, such as takeFirst. I’d argue that these would be a necessity to include along with the other methods. Some others I think should be included (a rough sketch follows the list):

subscript(takeFirst:)
subscript(takeLast:)
subscript(skipFirst:)
subscript(skipLast:)
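
A rough sketch of how a couple of these might be implemented (the labels are illustrative only):

extension Collection {
    subscript(takeFirst n: Int) -> SubSequence {
        return prefix(n)
    }
    subscript(skipFirst n: Int) -> SubSequence {
        return dropFirst(n)
    }
}

let word = "Hello, Swift"
print(word[takeFirst: 5])   // Hello
print(word[skipFirst: 7])   // Swift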

All of this will increase the API surface of Sequence and Collection, even if suffix/prefix/drop are removed.

Having all of these as subscripts will also make them difficult to discover, due to limitations in code completion I’ve seen in a few IDEs.