Subscripting a string should be possible or have an easy alternative

itaiferber · December 8, 2024, 5:53pm

Agreed, and I would say that there's a lot more that could be done beyond documentation — I'd be delighted if, for example, both Xcode and swift-lsp would offer snippets in their autocomplete suggestion lists for common string operations, to surface these types of alternatives non-intrusively, right at the use site. (e.g., typing <string var>.index could offer a list of alternatives to indexing before suggesting String.index(after:)/.index(before:)/whatever, as could <string var>.startIndex and .endIndex)

Built-in linting that could detect common misuse patterns would also be stellar (and not just for String).

But, that's a non-trivial amount of work, to say the least, so an effort like this would really need to be championed by someone with the time and dedication.

louzell · December 8, 2024, 6:27pm

Despite everyone telling you you are wrong, I think you are right! I have been using this extension for several years. It's fine, and contrary to the sentiment in here, the world hasn't exploded.

itaiferber · December 8, 2024, 6:50pm

The world certainly won't explode, but you're at high risk for writing accidentally-quadratic (or -cubic) algorithms, which pass most cursory small-scale testing and grind to a halt in real-world scenarios with non-trivial input.

The point is that there are just-as-easy to read, write, and understand forms of these same operations, without the same pitfalls — so why take the risk?
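To make the risk concrete, here is a minimal sketch. It assumes the integer-subscript extension is built on index(_:offsetBy:), as such extensions usually are; the two function names are made up for illustration.

```swift
// Hypothetical helpers: both count the "#" characters in a string.

// Looks like a plain linear loop, but every index(_:offsetBy:) call walks
// from the start of the string, so the loop as a whole does O(n^2) work.
func countHashesByOffset(_ str: String) -> Int {
    var total = 0
    for i in 0..<str.count {
        let idx = str.index(str.startIndex, offsetBy: i)
        if str[idx] == "#" { total += 1 }
    }
    return total
}

// Direct iteration visits each Character exactly once: O(n).
func countHashesByIteration(_ str: String) -> Int {
    var total = 0
    for ch in str where ch == "#" {
        total += 1
    }
    return total
}

let sample = "#..#.#"
print(countHashesByOffset(sample), countHashesByIteration(sample))  // 3 3
```

Both give the same answer on small inputs, which is exactly why the slow version tends to survive cursory testing.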

louzell · December 8, 2024, 7:03pm

you're at high risk for writing accidentally-quadratic (or -cubic) algorithms, which pass most cursory small-scale testing and grind to a halt in real-world scenarios with non-trivial input

Surely you have prod metrics in place that tell you when any of your deps are starting to show strain, and you do this well before you are suddenly at risk of exposing a quadratic algo to users pumping in an n sufficiently high to warrant improvement.

Edit to add: I do find the critique fair though :)

Edit to add more: thinking more about it, I wouldn't want to ship something that accidentally led to poor algo perf in the stdlib either. I was sympathetic to the OP's post, which is why I commented in the first place. Both are true. The string API is good, and the string API is frustrating.

The point is that there are just-as-easy to read, write, and understand forms of these same operations, without the same pitfalls — so why take the risk?

I have not seen these. What are they?

itaiferber · December 8, 2024, 8:20pm

Yeah, I would say that any API that requires you to monitor performance in production (quite possibly only noticing once customers have already had a poor experience) is not one worth promoting.

All of the APIs on

which String conforms to, among others. Taking just your extension as an example:

str[i...] == str.dropFirst(i) (Sequence.dropFirst(_:))
str[..<i] == str.prefix(i) (Sequence.prefix(_:))
str[...i] == str.prefix(i + 1) (Sequence.prefix(_:))
str[i..<(i+n)] == str.dropFirst(i).prefix(n), since these functions compose directly
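A quick sketch to check these equivalences against manual index arithmetic. The sample string and the values of i and n are arbitrary choices for illustration.

```swift
let str = "Hello, world"
let i = 7
let n = 3

// The manual index arithmetic that an integer-subscript extension would hide.
let start = str.index(str.startIndex, offsetBy: i)
let end = str.index(start, offsetBy: n)

// The same slices through the Sequence/Collection APIs.
let tail = str.dropFirst(i)               // like str[i...]
let head = str.prefix(i)                  // like str[..<i]
let middle = str.dropFirst(i).prefix(n)   // like str[i..<(i+n)]

print(tail)                          // world
print(middle)                        // wor
print(middle == str[start..<end])    // true
print(head == str[..<start])         // true

// Unlike a subscript, these clamp to the string's bounds instead of trapping.
print(String(str.prefix(100)) == str)   // true
print(str.dropFirst(100).isEmpty)       // true
```

The clamping behaviour in the last two lines is a difference from subscripting rather than an exact equivalence, so it is worth keeping in mind when out-of-range values should be treated as bugs.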

Other useful functions:

Pretty much everything meaningful you can do with an Index is already leveraged by a higher-level operation that doesn't require you to think about it, or assume constant-time index calculations.

nnnnnnnn · December 8, 2024, 8:44pm

I'm so glad you're having fun tackling Advent of Code, and hope you enjoy what using Swift can bring to your experience!

As some of the other commenters have noted, we ask "strings" to do a lot of different jobs when programming:

text for user display
text for debugging or logging purposes
structured data like JSON, XML, or CSV
semi-structured data like HTML
non-textual data like binary formats

Swift strings are designed to be Unicode correct, which leans strongly toward the "text" uses, so it's normal to find a bit of a mismatch when using strings for data, like the Advent of Code inputs.

The tools I use for processing AoC inputs as strings are:

split(separator:): There are versions that accept a character, a substring, or a regex
prefix(_:), suffix(_:), dropFirst(_:), dropLast(_:): These all return substrings by chopping off the start or end of a string (or other collection)
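As a rough illustration of how far these tools go without touching an index, here is a small sketch. The input lines are invented and only resemble typical puzzle input.

```swift
// Made-up input line in the style of a puzzle instruction.
let line = "move 3 from 1 to 7"

// split(separator:) yields Substrings; Int(_:) accepts them directly,
// and compactMap drops the words that are not numbers.
let words = line.split(separator: " ")
let numbers = words.compactMap { Int($0) }
print(numbers)   // [3, 1, 7]

// Chopping fixed-width decoration off both ends of a token.
let tagged = "id=42;"
let value = tagged.dropFirst(3).dropLast(1)
print(value)     // 42
```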

Whenever I find myself reaching for indexing directly into a string, I just convert it to an array. For example, this converts an input string into a 2-dimensional array that can be accessed with [row][col] integer subscripts:

let grid = input
    .split(separator: "\n")
    .map(Array.init)

Each of the elements of the 2D array is a Character, so you can use all your regular comparisons (e.g. grid[0][0] == "#") or convert part of the grid back to a string easily (let firstLine = String(grid[0])).
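Here is a small sketch of that grid in use. The three-line input and the bounds-checked `cell` helper are assumptions added for illustration, not part of the approach described above.

```swift
let input = """
#..
.#.
..#
"""

// Same conversion as above, with the element type spelled out.
let grid: [[Character]] = input
    .split(separator: "\n")
    .map { Array($0) }

// Hypothetical helper: returns nil instead of trapping when a
// coordinate falls outside the grid (handy for neighbour checks).
func cell(_ grid: [[Character]], row: Int, col: Int) -> Character? {
    guard grid.indices.contains(row),
          grid[row].indices.contains(col) else { return nil }
    return grid[row][col]
}

print(grid[1][1] == "#")                     // true
print(cell(grid, row: 3, col: 0) == nil)     // true
print(String(grid[0]))                       // #..
```

After the one-time conversion, every lookup is a constant-time array access.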

carlynorama · December 9, 2024, 12:29am

What we're operating on in AOC are NOT "Strings" in the modern way. They're just base 128 data mapped to the roman alphabet for readability. (not even the 255 of extended ASCII)

A previous year I started writing a whole [UInt8] based type. It was fun! This year so far I'm just using that extension that would be a no-go for production and moving on... because something else is fun this year.

Next year I might parse everything straight into an enum. Who knows!

I will say that with Swift embedded being a thing it seems like there are already some ideas about how to make working with Base 256 data more comfy.

what I've said before on that topic specifically: Embedded Swift - #141 by carlynorama
signs that thinking is being done: SE-0243: Codepoint and Character Literals - #341 by Ben_Cohen

Lots of good ideas in this thread! Gonna try them all!

louzell · December 9, 2024, 2:15am

Thanks! I added a link to this reference from my stackoverflow answer, as I think it's a top google result for working with swift strings. 550k views on the question.

aliali · December 9, 2024, 2:38am

That stack overflow thread makes me feel more sane. Seems like a common frustration there.

I would like to think I am not a beginner to swift, been using it daily for 3 years (my professional career, maybe that is a beginner?). I am just frustrated at this part of it. Below quote summarises it quite well.

Yes, I understand that a character (i.e. extended grapheme cluster) can take multiple bytes. My frustration is why we have to use the verbose index-advancing method to access the characters of a string. Why can't the Swift team just add some overloads to the Core Library to abstract it away. If I type str[5], I want to access the character at index 5, whatever that character appears to be or how many bytes it takes. Isn't Swift all about developer's productivity?

I also understand to do this correctly it may take O(n) and string operations in swift should be done very carefully.

I don't know if there is a solution here, or a compromise Swift/I are willing to accept, or even acknowledgement that this type of verbose string indexing is ~~weird~~ cumbersome. Maybe we are very happy to be different and stand on this hill knowing that we are better?

AlexanderM · December 9, 2024, 3:20am

I'm conflicted here. I know this is a really common pain point, and I would like for the tools to meet those peoples' needs. At the same time, Swift's approach is correct, robust, and really underappreciated. I wish those users had a better awareness that this area has complex trade-offs, and understood the rationale behind Swift's take on it.

I don't want to be an out-of-touch Principal Skinner, but I'm still trying to teach others that these concepts are not as simple as they've been misled to think from other languages.

Fundamentally, this boils down to a trade off of:

Correctness
Performance
Familiarity/Ergonomics

Pick any 2.

Most other languages inherit C's notion of a String being a contiguous sequence of fixed-size code units (be they 8 bits, 16, w/e). This hits 2 and 3, but is completely and irreparably incorrect when it comes to handling foreign languages. Funny enough, it was ultimately Emoji (of all things, silly little pictograms!) that made devs care about Unicode correctness.

I think that the majority of the drive behind String indexing comes from two main factors:

Other languages where string indices are the only way to iterate a String, e.g. C. If you can just natively iterate a String (e.g. for c in string { ... }), this need for integer indices goes away completely.
Programming problems like interview questions and Advent of Code, which perform rather rare/unusual transformations on simplistic string inputs, which don't have any of the complexities of real human text, which Swift's strings are optimized for.
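For reference, a minimal sketch of what native iteration covers, including the two cases where people most often reach for an integer index: needing the position, and comparing neighbouring characters. The sample word is arbitrary.

```swift
let word = "hello"

// Plain iteration: no indices involved.
for ch in word {
    print(ch)
}

// When the position matters, enumerated() supplies a running offset.
for (offset, ch) in word.enumerated() {
    print(offset, ch)
}

// Adjacent pairs, without ever writing word[i] and word[i + 1].
let pairs: [String] = zip(word, word.dropFirst()).map { "\($0)\($1)" }
print(pairs)   // ["he", "el", "ll", "lo"]

// Does any character repeat immediately?
let hasDouble = zip(word, word.dropFirst()).contains { $0 == $1 }
print(hasDouble)   // true
```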

My frustration is why we have to use the verbose index-advancing method to access the characters of a string.

This is the exact wrong takeaway, I'm afraid.

This is not merely a "you're holding it wrong" situation. If you're trying to do greeting[0] but can't, the solution isn't to write some more boilerplated version of the same flawed idea (greeting[greeting.startIndex]). It's entirely the wrong tool, regardless of how it's held/spelled.

In this example, it's to just use greeting.first (which even handles empty strings safely).
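A short sketch of the difference; the sample values are arbitrary.

```swift
let greeting = "Hello"
let empty = ""

// first and last return optionals, so the empty case is handled
// by the type system instead of by a runtime trap.
if let first = greeting.first {
    print(first)               // H
}
print(greeting.last == "o")    // true
print(empty.first == nil)      // true

// By contrast, empty[empty.startIndex] would trap at runtime.
```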

To me, the most frustrating irony of this situation is that even in other languages, if you want to handle non-English text correctly, you need to renounce their standard library (and the given indexing operator), and use third-party libraries that do it correctly. And to no surprise, those libraries employ the same techniques that Swift encourages to begin with.

David_Smith · December 9, 2024, 6:43am

FWIW if someone comes up with a way to make integer string indexing not be a footgun, I’ll happily write the stdlib implementation and swift evolution proposal for it myself. Our objections are purely practical ones, which we’d love to be able to drop.

Unfortunately after a decade or so of it being a frequently requested change, we have put a considerable amount of thought into it without finding an acceptable answer.

aliali · December 9, 2024, 6:58am

Thanks everyone. Makes sense. The last two replies, I think, summarise the responses in the thread quite well.

taylorswift · December 9, 2024, 7:15am

in my mind, the missing ladder rung is far smaller than the earnest debates over the unicode-correctness of String indices would suggest.

for these kinds of coding challenges, a “simple” [UInt8] array of ASCII characters is exactly the right kind of abstraction to use — String is just not the right tool for the job when solving these puzzles.

what Swift doesn’t have (and makes it an outlier among C-family languages) is a sensible literal syntax for expressing a UInt8. the best we can do is UInt8(ascii: "a"), which is hardly ergonomic. so instead, we give users Character and Unicode.Scalar which support the literal syntax, but impose a lot of unnecessary complexity for these ASCII-limited coding challenges.
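a minimal sketch of the [UInt8] approach, assuming the input is known to be ASCII-only (the sample input is made up). it shows both the upside (plain integer subscripts) and the literal-syntax friction:

```swift
let input = "a1b2"

// One O(n) copy up front; after that, integer subscripts are constant-time.
let bytes = Array(input.utf8)          // [UInt8]

// No UInt8 character literal exists, so comparisons go through UInt8(ascii:),
// which takes a Unicode.Scalar and traps if the scalar is not ASCII.
print(bytes[2] == UInt8(ascii: "b"))   // true

// Digit parsing is plain byte arithmetic.
let digit = Int(bytes[1] - UInt8(ascii: "0"))
print(digit)                           // 1

// Converting a slice back to a String when needed.
let head = String(decoding: bytes[0..<2], as: UTF8.self)
print(head)                            // a1
```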

CharlesS · December 10, 2024, 4:50am

You think that's bad, try subscripting Data sometime. With Data, you actually can subscript it with Int, but it usually doesn't do what you expect.

Consider this example:

import Foundation

let data1 = Data([1, 2, 3, 4, 5])
let data2: Data = data1.suffix(3) // yes, `Data`'s slice type is also `Data`

print(Array(data2)) // `[3, 4, 5]`

Given this, what would you expect data2[0] to get you? Trick question: the answer is the program will crash! This is because even though the type is Data, the fact that data2 is a slice of data1 means that all its subscripts are actually relative to data1, not data2. data2[2] is even more fun; it won't crash, but it'll silently give you the wrong result, returning 3 instead of 5 like you'd intuitively expect. So Data does let you use integer subscripts, but you mustn't ever use them, and the compiler will absolutely not enforce this.
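For completeness, a minimal sketch of ways to read from the slice that don't depend on guessing its base offset, using the same values as the example above:

```swift
import Foundation

let data1 = Data([1, 2, 3, 4, 5])
let data2 = data1.suffix(3)

// The slice keeps the parent's indices, so it does not start at 0.
print(data2.startIndex)                   // 2

// Index relative to startIndex rather than assuming a zero base.
print(data2[data2.startIndex])            // 3
print(data2[data2.startIndex + 2])        // 5

// Or sidestep indices entirely.
print(data2.first == 3)                   // true

// Copying into an Array gives zero-based indices again.
let bytes = Array(data2)
print(bytes[0], bytes[2])                 // 3 5
```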

Give me String's subscript behavior any day.

johnno1962 · December 10, 2024, 11:00am

The complexity of Swift's String abstraction is a known issue for people coming to Swift from other languages and is an uninviting aspect of Swift. I get that allowing "just" integer indexes would be inviting potential performance regressions but I wonder if we can't find a middle way that allows you to work with indexes reasonably conveniently with the right amount of friction.

I wrote a small package which seeks to simplify indexing while not losing out on all that "unicode correctness".

StringIndex - Reasonable indexing into Swift Strings

An experimental package to explore what can be done about Swift's dystopian string indexing. At the moment, you have to perform this memorable dance to get the 5th character of a String:

let fifthChar: Character = str[str.index(str.startIndex, offsetBy: 4)]

This package defines addition and subtraction operators for the String.Index type returning a temporary enum which conveys the offset and index to subscript operators on StringProtocol which advances by the offset lazily (when it knows the String being indexed). The result of this is you can now get the same result by typing:

let fifthChar: Character = str[.start+4]

There are also range operators and subscripts defined so you can use the following to remove the leading and trailing characters of a string for example:

let trimmed: Substring = str[.start+1 ..< .end-1]

Or you can search in a String for another String and use the index of the start or the end of the match:

let firstWord: Substring = str[..<(.first(of:" "))]
let lastWord: Substring = str[(.last(of: " ", end: true))...]

You can search for regular expression patterns:

let firstWord: Substring = str[..<(.first(of:#"\w+"#, regex: true, end: true))]
let lastWord: Substring = str[(.last(of: #"\w+"#, regex: true))...]

etc..

oscbyspro · December 10, 2024, 1:04pm

In my opinion, Swift would have a much easier time explaining String's indexing approach if text.utf8[123] actually were a thing because then you could say that String is a proper zero-based random access collection of UTF-8 bytes and that its Character element type is a different thing. I suspect that most indexing frustrations come from being handed a known ASCII String and thus knowing (better than the compiler) that there's no meaningful distinction between index and offset.
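For comparison, here is a sketch of what reaching a byte by offset looks like today: the UTF-8 view is indexed by String.Index, not Int, so an integer offset has to be converted first. The sample text is arbitrary and assumed to be ASCII.

```swift
let text = "hello, world"
let utf8 = text.utf8

// text.utf8[7] does not compile: the view's Index type is String.Index.
// An integer offset has to be turned into an index first.
let idx = utf8.index(utf8.startIndex, offsetBy: 7)
let byte = utf8[idx]
print(byte == UInt8(ascii: "w"))   // true

// The common workaround: copy the bytes once, then subscript with Int.
let bytes = Array(text.utf8)
print(bytes[7] == byte)            // true
```

Note that the view is not a RandomAccessCollection, so the offset conversion is not documented as constant-time even where it happens to be fast in practice.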

xwu · December 10, 2024, 4:32pm

This absolutely isn't ruled out; first we need to be able to express lifetime-dependent [Mutable]Span types (or something in that vein) to vend a safe API.

David_Smith · December 10, 2024, 8:18pm

Tragically, even ASCII text has precisely one allowed multi-byte character: CRLF. This has thwarted many promising standard library optimizations.

(But yes, your claim holds given you might also know the text has no CRLFs)

David_Smith · December 10, 2024, 8:19pm

Reminds me a lot of swift-evolution/proposals/0265-offset-indexing-and-slicing.md at main · swiftlang/swift-evolution · GitHub (which is a compliment, I liked that proposal)

tim1724 · December 10, 2024, 8:35pm

In Unicode that's a single character.

In ASCII as originally formulated it's not.

If I were making a SimpleASCIIStringForAdventOfCode type I would treat [0x0d, 0x0a] as a two character string.
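A quick sketch showing both views of the same two bytes in Swift:

```swift
let crlf = "\r\n"

// As a String, CR followed by LF forms one extended grapheme cluster.
print(crlf.count)                  // 1

// At the scalar and byte level it is still two units.
print(crlf.unicodeScalars.count)   // 2
print(Array(crlf.utf8))            // [13, 10]

// So the Character count and the byte count of ASCII text can differ.
let line = "a\r\nb"
print(line.count, line.utf8.count) // 3 4
```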