Subscripting a string should be possible or have an easy alternative

johnno1962 · December 10, 2024, 8:35pm

Very interesting. That's spookily similar to my original pitch and pre-dates it. It's a shame it didn't survive the review phase. We could really do with a solution along those lines or something similar.

hryde · December 10, 2024, 9:53pm

The following works well under two conditions:

Your strings aren't outrageously long
You do a lot of random access into the resulting array of Character

var randomAccess = Array<Character>( yourString)

// Randomly access "randomAccess[i]" to your heart's content

var stringAgain = String( randomAccess )

Best of all, you can modify the characters in the Array with no problems (unlike attempting to change the data at some string index).

Just so you know, each element of the Character array consumes quite a bit of memory (pointer, plus storage to hold the actual character data, plus other overhead). Although the conversion to/from Array/String is O(n), it's not especially fast (though probably much faster than the O(n**2) code that attempts to use arbitrary string indices on a String variable).

If you're only doing a few operations using random access to characters in the string, converting to and from an Array will swamp the performance of the random access; go ahead and use the Index functions at that point. However, if you're doing a ton of random accesses, this quickly amortizes the cost of the string/array conversion.

Of course, while the data is in Array form, you don't get to use cool string functions, like the regular expression stuff. You do get to use Sequence and Collection methods, though.
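As a rough sketch of the round trip described above (the sample text and the names `original`, `characters` and `result` are made up for illustration), the conversion is paid once in each direction and everything in between is plain integer indexing:

```swift
// Illustrative only: convert once, index freely, convert back.
let original = "héllo wörld"

// O(n) conversion to an array of Character values.
var characters = Array(original)

// Constant-time read and write by integer offset.
let fifth = characters[4]
characters[0] = "J"

// O(n) conversion back to a String.
let result = String(characters)

// The same read using String.Index has to walk from the start each time.
let fifthViaIndex = original[original.index(original.startIndex, offsetBy: 4)]

print(fifth, fifthViaIndex, result)
```

Whether this pays off depends on how many accesses happen between the two conversions, as noted above.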

sspringer · December 11, 2024, 8:33am

In the cases where I thought I needed to directly access something like the 5th character of a string, there was always a better solution. For example, before accessing the 5th character you might first want to make sure that the string has a certain prefix, and the search for that prefix should already give you a range that you can use to access the next character. So in my applications, where I do a lot of string processing, I never actually need direct character indexing.
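A small sketch of that idea, using a hypothetical `splitKeyValue` helper and a made-up "key=value" input: the search for the delimiter already returns a `String.Index`, so the character after it can be reached without any integer offsets.

```swift
// Hypothetical helper: split "key=value" at the first "=".
// The index returned by the search is reused to reach what follows it.
func splitKeyValue(_ line: String) -> (key: Substring, value: Substring)? {
    guard let separator = line.firstIndex(of: "=") else { return nil }
    let valueStart = line.index(after: separator)
    return (line[..<separator], line[valueStart...])
}

if let pair = splitKeyValue("name=Zoë") {
    // The first character after the delimiter, with no offset arithmetic.
    print(pair.key, pair.value, pair.value.first ?? "?")
}
```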

Nobody1707 · December 11, 2024, 5:33pm

If you just need a pure ASCII byte string for coding challenges, this is probably enough. At worst you might need to add append and +.

@preconcurrency import Foundation

struct ASCII: ExpressibleByUnicodeScalarLiteral, Equatable, Hashable {
  public private(set) var codepoint: UInt8
  init(_ codepoint: UInt8) {
    precondition(codepoint < 128)
    self.codepoint = codepoint
  }
  init(unicodeScalarLiteral scalar: Unicode.Scalar) {
    precondition(scalar.isASCII)
    codepoint = UInt8(ascii: scalar)
  }
}

extension ASCII: TextOutputStreamable {
  public func write<Target>(to target: inout Target) where Target : TextOutputStream {
    Unicode.Scalar(codepoint).write(to: &target)
  }
}

struct ASCIIString: Equatable, Hashable {
  var storage: [ASCII]
  public init(_ characters: ASCII...) {
    storage = characters
  }
}

extension ASCIIString: ExpressibleByStringLiteral {
  public init(stringLiteral value: StringLiteralType) {
    storage = value.utf8.map { ASCII($0) }
  }
}

extension ASCIIString: TextOutputStreamable {
  public func write<Target>(to target: inout Target) where Target : TextOutputStream {
    for ascii in storage {
      ascii.write(to: &target)
    }
  }
}

extension ASCIIString: MutableCollection, RandomAccessCollection {
  public var startIndex: Int { 0 }
  public var endIndex: Int { storage.count }
  public var count: Int { storage.count }
  public subscript(index: Int) -> ASCII {
    get {
      storage[index]
    }
    set {
      storage[index] = newValue
    }
  }
}

johnno1962 · January 6, 2025, 4:31pm

I'm going to post even if this will likely be dismissed as the ramblings of Principal Skinner.

This thread is a classic and it reminds us that 10 years on we still have a String type that is quite difficult to use for anything other than passing Strings around and fairly basic String manipulations. This isn't just a case of RTFM.

The problem seems to be that we have an abstraction which, in practice, is more complex to use than the complexity it was trying to encapsulate, and somehow, in trying to solve a very complex problem, we have made solving trivial problems quite demanding for new users. I can see how abstracting to such a degree helps Apple spare its users implementation details, but abstraction is not an end in itself.

To be able to talk about how things could have been different, I have experimented with a hypothetical "devil's reference implementation" that is literally an array of 16 bit shorts: johnno1962/String16 on GitHub ("Workable String type?").

It's worth remembering in terms of "correctness" this representation in itself represents 99.1% of the characters on the web i.e. those before Unicode 2.0 extended unicode scalars outside the "Basic Multilingual Plane" (16 bits).

https://en.wikibooks.org/wiki/Unicode/Character_reference/1F000-1FFFF

If you want to extend to full "Character" correctness in the Swift sense, the implementation includes a separate index type you can use instead of raw integers, which will correctly segment characters using IBM's ICU library. Indexes are integers and performant; it's just that some of them are not valid character boundaries when you enforce this using the index type. The key thing is you get to opt into the complexity/correctness trade-off and the potential performance variability, and understand what is going on.

This is all by the by as things are not about to change with Swift's String type but I do wish there was an admission there is a problem with usability and an effort to do something about it such as SE-0265 or my own library. I think the original poster summed the current situation up masterfully...

sspringer · January 6, 2025, 5:15pm

I very much do not think that’s right. The problem is complex and Swift, in my opinion, has a good solution to it. The good old ASCII world is gone: I have to handle a lot of texts in different languages, with different space characters, mathematical characters and combining characters. 16 bits just are not enough. The character ranges that common operations give me are perfect in most cases, regular expressions can work on codepoints instead of characters if you want, and in some cases you may resort to the codepoint view.

If anyone needs a simpler representation, isn’t there a simpler String implementation for embedded contexts?

As mentioned earlier in this thread, most times when you think you need simple indexing there is actually a solution without it which makes more sense. I would say look at realistic examples in that light to see if you really need simple indexing.

ksluder · January 6, 2025, 5:37pm

If I recall correctly, this is exactly where the open design problems remain. Swift needs a data type that is easy to search within for patterns of bytes that are actually defined in terms of characters, and that is easy to extract Strings from, but which does not become an attractive alternative to String for general-purpose text manipulation and presentation.

johnno1962 · January 6, 2025, 6:15pm

8 bits are enough to represent all characters if you use UTF-8, which is efficient in terms of space but slow in terms of access. UTF-16 is a historical compromise where you can address most characters directly, even if it is neither constant length nor efficient.
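To make the encoding trade-off concrete, here is a minimal sketch (the `codeUnitCounts` helper and the sample characters are illustrative, not from the thread) that counts how many units each of String's views needs for the same single character:

```swift
// Illustrative helper: size of one string in each of its views.
func codeUnitCounts(_ s: String) -> (characters: Int, scalars: Int, utf16: Int, utf8: Int) {
    (s.count, s.unicodeScalars.count, s.utf16.count, s.utf8.count)
}

// U+0041 (ASCII), U+00E9 (Latin-1), U+20AC (BMP), U+1F600 (outside the BMP).
for sample in ["A", "\u{E9}", "\u{20AC}", "\u{1F600}"] {
    print(sample, codeUnitCounts(sample))
}
```

Only the last sample needs two UTF-16 units (a surrogate pair), which is the case where direct 16 bit addressing stops lining up with characters.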

I'm trying to tease out the components of what makes a string. To manipulate it other than through an API you need to give it a representation or model, which doesn't have to literally be an array. From memory, NSString/Java had it about right. The point I'm trying to make about validating non-linear indexes past unicode scalars is that it's a separate problem. With the existing String API, the "unicode safety belt" (straitjacket) isn't optional when it could have been, though some might not accept that.

florianpircher · January 6, 2025, 6:58pm

I don’t believe developers can judge whether the strings that their software is dealing with are in the Basic Multilingual Plane subset of Unicode or in full Unicode. I see so much broken string handling, even in apps that claim to be Unicode savvy. Either you are dealing with a Unicode string or you are using a different encoding standard entirely.

It’s easy to pretend the Basic Multilingual Plane is enough and you can have a simpler API than String while still supporting the world’s languages. But the complexity is real and it does not go away. Supporting this subset is not supporting Unicode and it offloads the complexity to the end users who now have to deal with string encoding instead of the developer.

The UTF8Span could serve this purpose:

johnno1962 · January 6, 2025, 7:41pm

I'm not saying that's necessary. It's what you mean by "dealing with" that counts. Say you were writing software to parse a URL or an email address. It's the delimiters (characters your code searches for) that count and pretty much all of those are in BMP (or ASCII for that matter). What's extracted from between the delimiters can be any combination of characters in all their splendour and either an integer or unicode aware index range would yield the same result.
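A minimal sketch of that argument, with a made-up address and two hypothetical helpers: one looks for the ASCII delimiter in the UTF-8 bytes, the other in the Character view, and both slice out the same text even though it contains a non-BMP character.

```swift
// Made-up input: the local part contains U+00FC and U+1F600.
let address = "j\u{FC}rgen\u{1F600}@example.com"

// Hypothetical helper: search the raw UTF-8 bytes for "@".
func localPartByByte(_ s: String) -> Substring? {
    guard let at = s.utf8.firstIndex(of: UInt8(ascii: "@")) else { return nil }
    return s[..<at]
}

// Hypothetical helper: search the Character view for "@".
func localPartByCharacter(_ s: String) -> Substring? {
    guard let at = s.firstIndex(of: "@") else { return nil }
    return s[..<at]
}

print(localPartByByte(address) ?? "", localPartByCharacter(address) ?? "")
```

This only holds when the delimiter stands alone as its own character; a delimiter followed by a combining mark would be found by the byte search but not by the Character search.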

I don't think anybody has invented a URL http😀//johnholdsworth.com that needed non-BMP character matching. You can do useful and correct work on a string e.g. JSON parsing without unicode strictness at every stage. Making the "unicode safety belt" optional lowers the least common denominator of using the API making it simpler to use and document.

florianpircher · January 6, 2025, 7:56pm

That gets difficult to communicate to developers when both fully Unicode-compliant and somewhat Unicode-compliant APIs are accessible on the same String type. How do you differentiate these APIs, making it clear which is OK to be used in what context?

I suspect that as soon as easier, not fully Unicode-compliant alternatives are introduced, many developers will gravitate towards them, learn that that’s how string handling is done in Swift, and exclusively use the APIs without considering the complexity that they have offloaded onto their users.

sspringer · January 6, 2025, 9:33pm

Java’s handling of Unicode never considered letters (= grapheme clusters) and had problems once 32 bit codepoints were introduced (because of its internal representation via 16 bit integers), so Java did not correctly handle Unicode.

Was speaking of a representation via equally sized numbers for codepoints.

sspringer · January 6, 2025, 9:44pm

You can do it already today via e.g. myText.contains(#/^https?:///#.matchingSemantics(.unicodeScalar)). No grapheme clusters are considered in this case. Indeed, simpler methods (without regular expressions) should also have such versions. See my own package FastReplace, which contains simpler methods.
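A self-contained sketch of that call, assuming Swift 5.7 or later (and, on Apple platforms, an OS version that ships the regex runtime). The pattern is built from a string here instead of a literal, and the names `scheme` and `hasScheme` are illustrative:

```swift
// Build the pattern at run time and switch it to Unicode-scalar semantics,
// so matching is done on scalars instead of grapheme clusters.
let scheme = try! Regex("^https?://").matchingSemantics(.unicodeScalar)

// Hypothetical helper wrapping the check from the post above.
func hasScheme(_ text: String) -> Bool {
    text.contains(scheme)
}

print(hasScheme("https://example.com/caf\u{E9}"))
print(hasScheme("ftp://example.com"))
```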

johnno1962 · January 6, 2025, 9:45pm

But a Java program could have through a third party library. My question generally is: was it advisable or even necessary to bake Unicode complexity into a fundamental type of the Swift language itself, rather than delegate it to, say, a separate smart index type? I'm largely just thinking out loud given things aren't going to change, but working through things in these terms might lead us to a simplification we can make to the existing model to mitigate the usability problems.

sspringer · January 6, 2025, 9:46pm

Java's internal representation of Strings is inaccessible. And it is a problem that some methods in the Java library just do not work correctly with 32 bit codepoints.

Yes, there might be something smart. But you do not want to have such a complex thing for Strings that you do not operate on in that way, so this should be a lazy representation. And then, if you just operate once on a String, you gain practically nothing (it would even be slower in many cases).

ksluder · January 6, 2025, 10:00pm

Yes. Because the decades-long status quo has been subtle-to-catastrophic breakage for application users in various locales (see "Mojibake" on Wikipedia).

Swift was intentionally designed to advance the state of the art beyond this point, just like the previous generation of languages/frameworks (Objective-C, Java, wide-char support in Windows and [Visual] C++) added UTF-16 support to advance the state of the art beyond ANSI codepages.

johnno1962 · January 6, 2025, 10:17pm

Not sure Mojibake has much to do with maintaining unicode correct indexes. If the character encoding is wrong all bets are off.

I'm aware of that. I'm quibbling with the implementation.

ksluder · January 6, 2025, 10:18pm

One way to get mojibake is to break a UTF-16 surrogate pair or UTF-8 byte sequence that encodes a character beyond the BMP.
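As a small illustration of a broken surrogate pair (the variable names are made up, and this shows only one possible outcome): cutting the UTF-16 form of a non-BMP character in half and decoding the remainder yields the replacement character U+FFFD, because `String(decoding:as:)` repairs invalid input.

```swift
// U+1F600 needs two UTF-16 code units (a surrogate pair).
let emoji = "\u{1F600}"
let units = Array(emoji.utf16)

// Keep only the high surrogate, then decode it back into a String.
let torn = String(decoding: units.prefix(1), as: UTF16.self)

// Decoding both units together restores the original character.
let intact = String(decoding: units, as: UTF16.self)

print(units.count, torn == "\u{FFFD}", intact == emoji)
```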

johnno1962 · January 6, 2025, 10:24pm

I agree, I'd refer to that as "tearing" a character, but I think the risks of that are overblown given how limited the means to initialise an index are. Then there are plenty of subtle problems when you deal with zero-width joiners, where searching for even a unicode scalar can match inside a longer franken-character. I'm not saying it isn't complicated. I'm just casting around for any opportunity to make it simpler, or Strings will continue to be Swift's elephant in the room.

sspringer · January 6, 2025, 10:31pm

As already said, and excuse the repetition, real use cases should be the guidance. Maybe simpler versions of some String methods in Foundation (like .matchingSemantics(.unicodeScalar) for regular expressions) could be all you need (compare my package mentioned above). Many times I thought I would need simpler indexing, and then I recognized that I don’t, and ended up with code that was actually better. So excuse me for not being convinced of the need so far. People will have to learn about Unicode anyway to be able to avoid common pitfalls, and the corresponding documentation in the Swift book is not too bad, I think.