Subscripting a string should be possible or have an easy alternative

I'm not sure you're wrong but I'd wager people would find it easier to learn about Unicode with a more concrete type that commits to a representation. The situation at the moment makes more of a meal of it than is necessary.

Well, it would be nice if Swift's internal use of UTF-8 were fixed and documented, so you know that getting the UTF-8 representation is efficient. That might also help people understand the problem (together with a note that using 32-bit integers instead would be space-inefficient in most cases), and this (I just looked it up) indeed seems to be missing from the Swift book.
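
To put a rough number on the space argument, here is a small sketch comparing the UTF-8 size of a string with what a fixed 32 bits per Unicode scalar would cost. The sample text is made up for illustration, and the second figure is a back-of-the-envelope calculation, not a measurement of any real storage.

```swift
let sample = "Hello, world! 😉"

// Bytes needed when stored as UTF-8 (1 to 4 bytes per Unicode scalar).
let utf8Bytes = sample.utf8.count

// Bytes needed if every Unicode scalar took a fixed 32 bits.
let utf32Bytes = sample.unicodeScalars.count * MemoryLayout<UInt32>.size

print(utf8Bytes, utf32Bytes)
```

For mostly-ASCII text the fixed-width figure comes out close to four times the UTF-8 one.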

i think String is exactly the right abstraction to use in “real” code. the problem we have is that it’s awful for puzzles and playgrounds, which are often the first experience people have with Swift.

this means potential new users have a terrible first experience with Swift, and unless they are externally compelled to use Swift (e.g., App Store), they will be unlikely to continue with the language. and thus, we continue to lose market share on Linux, Windows, and everywhere else that Swift does not benefit from an artificial monopoly.

What's wrong with:

string.components(separatedBy: delimiter)

Or, if there is only one instance of the delimiter:

if let delimRange = string.range(of: delimiter) { 
  let prefix = string[..<delimRange.lowerBound]
  let suffix = string[delimRange.upperBound...]

  ...
} 

Or, as others have mentioned, regular expressions.
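
For anyone who wants to paste the above into a playground, a self-contained version might look like this. The input string and the variable names are placeholders of mine; the last line shows a standard-library-only variant that does not need Foundation.

```swift
import Foundation

let line = "key=value=more"
let delimiter = "="

// Every piece between delimiters (Foundation).
let parts = line.components(separatedBy: delimiter)

// Split at the first delimiter only (Foundation's range(of:)).
var head = line[...]
var tail: Substring = ""
if let delimRange = line.range(of: delimiter) {
    head = line[..<delimRange.lowerBound]
    tail = line[delimRange.upperBound...]
}

// Standard library only: at most one split, yielding Substrings.
let pieces = line.split(separator: "=", maxSplits: 1)

print(parts, head, tail, pieces)
```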

The thing preventing all Strings from being documented as natively UTF-8 is that a String is allowed to actually be an NSString under the hood. This is why we have makeContiguousUTF8().

Add findCharacter(at: …) and findString(from: …, until: …) to provide the corresponding features, and keep those awkward names as a reminder that this is not a simple or efficient operation. Or just use arrays of characters.

We're slowly chipping away at how often bridged Strings show up in practice at least. For example[1], prior to iOS 18/Sequoia/etc, bridging an ASCII NSMutableString would produce a wrapped String, but it now produces a native one.

And makeContiguousUTF8 is cheap if it's already in the native representation.
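
As a rough illustration of how that is typically used (the sample string is arbitrary, and a string literal like this one is native already, so the call has nothing to convert here):

```swift
var text = "Häl😉o"

// Cheap when the string is already native UTF-8; a bridged string is converted once.
text.makeContiguousUTF8()

// With contiguous storage, withUTF8 can hand out the bytes directly.
let byteCount = text.withUTF8 { buffer in buffer.count }

print(text.isContiguousUTF8, byteCount)
```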


  1. I recognize this is Apple-specific, but so are bridged NSStrings.

How often (if ever) can it be optimized down to a no-op for native strings?

It's all inlinable and ultimately boils down to if (_object & 0x1000_0000_0000_0000) == 0. So, any time LLVM could normally eliminate a branch like that (e.g. constant propagating the set bits, or hoisting it out of a loop).

I really appreciate that you're playing around with ideas in this space.

Could have, but almost never is. That's the problem. I share Florian's assessment:

Worse yet, even well meaning and diligent developers who write tests for their string-handling code might still miss it, because their test data might be simplistic ASCII/BMP and not contain any problematic characters in problematic spots that would break their algorithms.
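
A small sketch of how that can happen. Both functions are hypothetical helpers of mine, not anything from the standard library: one truncates by UTF-16 code units, the other by Characters.

```swift
// Hypothetical helper: truncates by UTF-16 code units, the way fixed-width string code often does.
func naiveTruncate(_ text: String, to length: Int) -> String {
    let units = Array(text.utf16.prefix(length))
    // Decoding repairs a split surrogate pair with U+FFFD instead of trapping.
    return String(decoding: units, as: UTF16.self)
}

// Truncates by Characters (grapheme clusters), which never splits one.
func truncate(_ text: String, to length: Int) -> String {
    String(text.prefix(length))
}

// ASCII-only test data cannot tell the two apart.
print(naiveTruncate("hello", to: 3) == truncate("hello", to: 3)) // true

// One emoji in the wrong spot can.
print(naiveTruncate("a😉b", to: 2) == truncate("a😉b", to: 2))     // false
```

A test suite that only ever feeds in inputs like "hello" would pass for both versions.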

This limits the available character set, and you cannot add accents to your characters. Better to use arrays of “Character” (so effectively arrays of small Strings) and add a “print” function and whatever else you need. You might call this “SimpleString”, and it can have simple subscripts. It is not efficient, but if it is only there to soften the shock of a first encounter with Swift, why not? If this is successful, you might even add the '…' notation for simple strings to the language.
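
Something along these lines, as a minimal sketch. The type name follows the suggestion above; everything else (the stored property, the two subscripts) is just one way it could look, not an existing API.

```swift
// Hypothetical type: a string stored as an array of Characters.
struct SimpleString: CustomStringConvertible {
    var characters: [Character]

    init(_ string: String) { characters = Array(string) }

    // Lets print() show the text rather than the array.
    var description: String { String(characters) }

    var count: Int { characters.count }

    // O(1) integer subscripts, paid for with one array element per Character.
    subscript(position: Int) -> Character {
        get { characters[position] }
        set { characters[position] = newValue }
    }

    subscript(range: Range<Int>) -> SimpleString {
        SimpleString(String(characters[range]))
    }
}

var word = SimpleString("cafe\u{301}!")
print(word.count)   // 5
print(word[3])      // é
word[4] = "?"
print(word)         // café?
print(word[1..<4])  // afé
```

Because the elements are Characters, the accented letter stays in one piece.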

If it is clear that the “simple” alternative is inefficient, people will learn to do it “right”.

Or it will make them turn away from Swift just as much as String’s API does, saying “string processing in Swift is terribly inefficient.” :wink:

That’s why I think it’s not worth it. Simple, inefficient alternatives already exist: create an array of whatever base units you want/need to manipulate, and work with that. String makes this pretty easy.
The only complaint I’ve received when I suggest this to people (usually for puzzles or other little programs with a need for quickly-written code more than high performance) is a variant of: “but then I’m not really manipulating strings”, to which I reply by reminding them that in the other languages they came from, “string” is a synonym for “array of 16-bit (or even 8-bit) units”.
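
For puzzle-style code, that could look roughly like the following. The input and the isPalindrome helper are invented for the example.

```swift
let input = "level"

// Pick the unit the puzzle cares about and index it with plain integers.
let characters = Array(input)              // [Character]
let scalars = Array(input.unicodeScalars)  // [Unicode.Scalar]
let bytes = Array(input.utf8)              // [UInt8]

// Typical puzzle code: compare mirrored positions by integer index.
func isPalindrome<T: Equatable>(_ units: [T]) -> Bool {
    guard units.count > 1 else { return true }
    return (0..<units.count / 2).allSatisfy { units[$0] == units[units.count - 1 - $0] }
}

print(isPalindrome(characters), bytes[0]) // true 108

// Convert back to a String when done.
let back = String(characters.reversed())
```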

Sometimes, I think we would have seen a lot fewer complaints about String if it had been named something like Text instead, just to avoid clashing with people’s mental model of a string as an array of fixed-size units and to highlight the fact that this type is a higher-level abstraction. But then maybe the complaints would have turned into “Swift does not even have a string type!” :roll_eyes:.

Maybe you have fun with this:

import Foundation

public struct IndexedString: CustomStringConvertible {
    
    private let s: String
    private let indices: [Int]
    private let utf8: [UInt8]
    
    public var description: String { s }
    
    public var count: Int { indices.count }
    
    public init(_ s: String) {
        self.s = s
        // indices[i] is the UTF-8 byte offset just past the i-th Character.
        var index = 0
        indices = s.map{ index += $0.utf8.count; return index }
        utf8 = Array(s.utf8)
    }
    
    // UTF-8 byte offset at which the Character at `position` starts.
    private func byteOffset(_ position: Int) -> Int {
        position > 0 ? indices[position - 1] : 0
    }
    
    public subscript(position: Int) -> IndexedString {
        return self[position..<(position + 1)]
    }
    
    public subscript(range: Range<Int>) -> IndexedString {
        // Using the start offset of upperBound keeps empty ranges such as 0..<0 valid.
        return IndexedString(String(decoding: utf8[byteOffset(range.lowerBound)..<byteOffset(range.upperBound)], as: UTF8.self))
    }
    
    public subscript(range: ClosedRange<Int>) -> IndexedString {
        return self[range.lowerBound..<(range.upperBound + 1)]
    }
    
    public func replacing<Replacement>(_ regex: some RegexComponent, with replacement: Replacement, maxReplacements: Int = .max) -> Self where Replacement : Collection, Replacement.Element == Character {
        IndexedString(s.replacing(regex, with: replacement, maxReplacements: maxReplacements))
    }
}

// usage:

let s = IndexedString("Häl😉y\u{301}o")
print(s) // prints "Häl😉ýo"
for i in 0..<s.count {
    print("\(i): \(s[i])") // prints "0: H", "1: ä", "2: l", "3: 😉", "4: ý", and "5: o"
}
print(s[1..<3]) // prints "äl"
print(s[1...3]) // prints "äl😉"
print(s.replacing(/[a-z]/, with: "x")) // prints "Häx😉ýx"

Please do not forget that even when staying comfortably within the limits of the BMP, some characters are composed of a sequence of several units (be they 16-bit or 32-bit): a base plus one or more combining parts. There is no need to reach for emojis or ancient scripts outside of the BMP in order to find characters that can be broken by string processing based on fixed-size units. Common writing systems used every day by lots of people can get you there, and often enough we still see them mistreated by basic, old-style string processing.
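
A minimal illustration with an ordinary accented letter, written here in decomposed form on purpose:

```swift
// "é" written as a base letter plus U+0301 COMBINING ACUTE ACCENT; both are in the BMP.
let accented = "cafe\u{301}"

print(accented.count)                 // 4 Characters
print(accented.unicodeScalars.count)  // 5 scalars
print(accented.utf16.count)           // 5 UTF-16 units

// Keeping the first four fixed-size units drops the accent.
var cut = ""
cut.unicodeScalars.append(contentsOf: accented.unicodeScalars.prefix(4))
print(cut == "cafe")                  // true

// Keeping the first four Characters keeps it.
print(String(accented.prefix(4)) == accented) // true
```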

I'll not leave this thread to wither on the vine but summarise my own conclusions.

It was informative for me to develop the "String16" type I mentioned, as it helped firm up my evaluation of Swift's string abstraction, which is that it is essentially sound. It was probably a mistake to introduce it into this thread though, as people started to fixate on its simple UTF-16 data representation as if that was all it was. In combination with a smart index type and tapping into the "ICU", it can be as Unicode-correct as you like. For all-BMP strings it does give you an entry-level API where integer indexes are character indexes.

But this was all by the by as the real problem Swift really needs to solve is the difficulty working with the String.Index type. The StringIndex package I introduced solves this in a simplistic, performant and correct manner. You can't use integer subscripts with String but relative ones such as .start+n which is to my mind a sufficient reminder to users this is something a little different and an O(n) operation. No feet were harmed in the making of this package.
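
To make the contrast concrete without putting words in the package's mouth: the sketch below shows what the standard library asks for today, next to a hypothetical labelled offset subscript. The subscript(offset:) extension is my own illustration of the general idea and is not the StringIndex package's API.

```swift
// Hypothetical convenience: an explicitly labelled offset subscript.
// It still walks the string from the start, so it stays O(n).
extension String {
    subscript(offset offset: Int) -> Character {
        self[index(startIndex, offsetBy: offset)]
    }
}

let greeting = "Häl😉o"

// Standard library today: build a String.Index, then subscript with it.
let third = greeting[greeting.index(greeting.startIndex, offsetBy: 2)]

print(third, greeting[offset: 3]) // l 😉
```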

The similar formal proposal SE-0265 goes further and looks to upstream this behaviour into the collection protocols which seems more difficult than it needs to be and perhaps as a result 5 years later we are still waiting for it to come back to review. This was all discussed in my original thread way back then and we are no further along.

I don't think it's an exaggeration to say this is the #1 thing that needs to be fixed in Swift to improve many people's initial impression of the language. We have a Stack Overflow answer on indexing viewed 500,000 times, which means practically all Swift programmers have had to consult it at least once. I don't know what more I can do to try to inject urgency into this conversation other than swear or commit some other minor code violation. The thread title phrases the still unanswered question perfectly: "Subscripting a string should be possible or have an easy alternative". Solving this problem in a timely manner is not beyond the combined wit of the people represented here, but it seems we're choosing not to solve it to "teach people a lesson".

I do not see how newcomers expecting simple indexing would be better off with those solutions. And note that you can change the scope of substrings via the “drop” methods, giving you offsets already with the standard String methods.
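
For example (the sample string is arbitrary):

```swift
let phrase = "Häl😉o world"

// Characters at offsets 2, 3 and 4, without touching String.Index directly.
let middle = phrase.dropFirst(2).prefix(3)

// Everything except the first and last Character.
let inner = phrase.dropFirst().dropLast()

print(middle) // l😉o
print(inner)  // äl😉o worl
```

Both results are Substrings that share storage with the original string.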

Have you seen my code above? Unlike what I wrote before, I now think this subscripting could indeed be added to String: the subscripts could fill the indices lazily on first use (and the indices would be invalidated if the String changes). So those subscripts are inefficient on first use and then cost some space, but the performance of subsequent subscript uses on the same String should be quite OK. Why not? (Documentation is king.)
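
A rough sketch of that lazy idea as a separate wrapper, only to show the mechanics. The type name and its members are hypothetical, and a real implementation inside String would look different.

```swift
// Hypothetical wrapper sketching the "fill the indices on first use" idea.
struct CachedIndexString {
    // Changing the text invalidates the cached indices.
    var text: String { didSet { cache = nil } }
    private var cache: [String.Index]?

    init(_ text: String) { self.text = text }

    subscript(position: Int) -> Character {
        mutating get {
            if cache == nil {
                // O(n) once: remember where every Character starts.
                cache = Array(text.indices)
            }
            return text[cache![position]]
        }
    }
}

var cached = CachedIndexString("Häl😉o")
print(cached[3])  // 😉 (first use builds the cache)
print(cached[1])  // ä (served from the cache)
cached.text = "xyz"
print(cached[2])  // z (cache rebuilt after the change)
```

The cost is one stored index per Character once the cache has been built.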

I couldn't quite get the point you were getting at with your example. Perhaps you could explain? If you really need performant random access to characters the simplest option is to convert your string into an actual array.

var chars = Array<Character>("my string")

What I'm interested in is the programmability/simplicity of the model, leaving the string as is.

By my count we only get one out of three at the moment. I know it sounds like I'm obsessed with subscripting, but that quaint old concept served us well until Swift came along. I believe it can be rehabilitated to get familiarity/ergonomics back, which counts for a lot for people coming to the language. The means of discovery for the assorted collection/slice methods is not great. The full set of things you can do with subscripts/slices removes the necessity to know perhaps a dozen collection methods, and you still have the problem of preparing the Index to use anyway.

Leave the String as String because:

  • You would have to use 32 bit integer arrays, which are quite inefficient for most texts.
  • Grapheme clusters i.e. “real” characters (which could be composed of several Unicode codepoints) are not (!) handled by such an array, the indexing is then cheating so to speak.
  • You would like to treat those arrays as texts/strings so you have to replicate many String methods via new implementations.
  • You would have to do a lot of conversions between code that uses different string implementations.

My thinking is that Swift solves the String issue actually quite well, it is mainly a problem for newcomers, along those lines:

If the “lazy” subscripting I described above is not added to String, then maybe make a separate IndexedString package to be used in those cases (or add it to Foundation). String methods then only have to be “translated” and not re-implemented (see my code, but yes, you would want to add some specialized methods), and an IndexedString always has the normal String accessible. But I would actually add it to String: it only costs one optional array (the indices), which is nil if not used (and there has to be the UTF-8 array, but I guess this should already be there in most cases?). That’s not much.