Removing CharacterSet characters from a string seems hard

tera · April 12, 2024, 3:48pm

You mean that when I see the construct like string.fetchCharacter(atOffset: 5) or s.dropFirst(7).prefix(5) I must immediately recognise that it's not a good idea to have it in the loop, while when I see a simpler construct like string[i] I will fail to do so? How about a more explicit string[dontDoThisInLoop: i] – obviously a placeholder name, but the idea is to name it in such a way that it is immediately obvious that it would be a bad idea to call it in a loop, without having to decipher that?

There are two problems with warnings:

if warnings are set to be errors – you can't ignore them.
if you really want to do it the way you do it (e.g. you know that the loop will only have two iterations max), there must be a way to suppress the warning. What could that suppression be? This?

for i in 0 ..< string.count {
    let c = (string[i])
    ...
}

FlorianPircher · April 12, 2024, 4:29pm

No, I don’t think syntax can perfectly signal the runtime complexity of an API access. But I do think there is a general assumption about what different access patterns entail:

subscript: generally O(1), sometimes O(log(n))
property: generally O(1) or O(log(n)), sometimes O(n)
function: no guarantee, might be O(1), might be much more expensive

There are exceptions, but I would not use a property for something that takes O(n²) time or a subscript that is O(n).

O(n) for a property access already feels like a stretch. This causes confusion for Swift programmers that assume string.count is O(1) – me, for quite some time – and read the property repeatedly instead of caching the value. Maybe the compiler can optimize that, but I would not want to rely on such optimizations.

I am sure that if it were spelled string.count(), people would be more hesitant to make the call over and over again and instead read the value once and then use the cached value.
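As a rough illustration of the caching point, here is a minimal sketch that reads count in a loop condition versus reading it once. The variable names are made up for the example.

```swift
let text = String(repeating: "é", count: 1_000)

// Re-reads count (a scan of the whole string) on every pass through the loop.
var i = 0
while i < text.count { i += 1 }

// Reads count once and compares against the cached Int afterwards.
let length = text.count
var j = 0
while j < length { j += 1 }
```

Both loops do the same number of iterations; only the cost of evaluating the condition differs.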

Similarly, subscript notation is near constant-time in almost every language I can think of, including Swift. A verbose subscript label might help, but finding a good name is no easy task. I feel like almost everyone can agree that array access with an optional return value (array[safe: index] or array[optional: …] or any of many other proposed names) would be a great feature, but the lack of consensus on the label name has prevented the feature from landing in the language.
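For readers who have not seen the optional-returning subscript, it is commonly written as a small extension like the sketch below. The safe label is only one of the proposed names and nothing like it exists in the standard library.

```swift
extension Collection {
    /// Hypothetical label: returns nil instead of trapping when the index is out of bounds.
    subscript(safe index: Index) -> Element? {
        indices.contains(index) ? self[index] : nil
    }
}

let numbers = [10, 20, 30]
print(numbers[safe: 1] as Any)   // Optional(20)
print(numbers[safe: 7] as Any)   // nil
```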

I would be somewhat OK with something like

string.character(atOffset: 5)
string.substring(from: 3 ..< 7)

but it’s not much of an improvement over what we have already (the family of collection methods):

string.dropFirst(4).first
string.dropFirst(9).prefix(12)

I mostly think that if you need integer indices into a string, you should probably reconsider if that is the best solution to your string processing task. For example, when parsing some text file, most formats are not based on Unicode extended grapheme clusters, so if some spec tells you to “skip four characters and then read until the next space character”, that spec is probably talking about bytes or UTF-16 units and about the ASCII space and not the Unicode space character property.
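A sketch of the "skip four, then read until the next space" case, assuming the spec means bytes and the ASCII space. The sample line is invented for the example.

```swift
let line = "ABCDpayload rest of line"

// Work on the UTF-8 code units rather than on Characters.
let bytes = line.utf8
let afterHeader = bytes.dropFirst(4)                        // skip four bytes
let field = afterHeader.prefix { $0 != UInt8(ascii: " ") }  // read up to the ASCII space

// Decode the byte slice back into a String (invalid UTF-8 would be repaired, not rejected).
let value = String(decoding: field, as: UTF8.self)
print(value)   // payload
```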

And for those times where integer indices into strings are the best solution, you can still do it with a single line of code in Swift.
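For reference, the single-line forms look roughly like this. The sample string is made up, and each access is O(n) because the index is found by walking from the start.

```swift
let string = "Hello, wörld! 🎬"

// Character at integer offset 8, via an explicit index computation.
let c = string[string.index(string.startIndex, offsetBy: 8)]

// The same access via the collection methods; first is nil if the offset is out of range.
let d = string.dropFirst(8).first

// Characters at offsets 7 ..< 12.
let slice = string.dropFirst(7).prefix(5)
```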

I do want to acknowledge that the name “dropFirst” does not help its case; it's a bad name that needs some time to get used to if you are not familiar with this style of programming. It sounds so much more dangerous than what it actually does.

tera · April 12, 2024, 4:39pm

I'd not assume that among these: x[y] or x.y or x(y) the first would be the fastest and the last would be the slowest. To me any of those could take an arbitrary amount of time and if I don't know their complexity upfront I would consult with the documentation before jumping to conclusions.

On that we agree. dropFirst sounds like a mutating function to me; droppingFirst would be a more accurate name for a non-mutating function.
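A small sketch of the naming point: dropFirst leaves the original alone, while removeFirst is the mutating counterpart.

```swift
var greeting = "Hello"

// dropFirst is non-mutating: it returns a Substring and leaves greeting untouched.
let tail = greeting.dropFirst(2)
let unchanged = greeting

// removeFirst changes the string in place.
greeting.removeFirst(2)
```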

Nobody1707 · April 12, 2024, 5:05pm

No, originally String did conform to Collection, but the core team was worried that this was an attractive nuisance, so they removed that conformance. They then added it back when they revised String for Swift 4, because they decided removing that conformance was a mistake.

I thought String.count kept a cache of the value internally, and was only O(n) the first time you called it or if you mutated the string. Is it really O(n) every time?

tera · April 12, 2024, 7:10pm

Yes, String.count scans the string every time:

import Foundation

// Time two consecutive reads of count on the same large string.
let s = String(repeating: " ", count: 300_000_000)
var start = Date()
let a = s.count
let elapsed1 = Date().timeIntervalSince(start)
start = Date()
let b = s.count
let elapsed2 = Date().timeIntervalSince(start)
print(elapsed1, elapsed2)

outputs:

This is in sharp contrast with NSString:

import Foundation

// Same measurement for NSString.length on a string of the same size.
let s = NSString(data: Data(repeating: 0x20, count: 300_000_000), encoding: NSASCIIStringEncoding)!
var start = Date()
let a = s.length
let elapsed1 = Date().timeIntervalSince(start)
start = Date()
let b = s.length
let elapsed2 = Date().timeIntervalSince(start)
print(elapsed1, elapsed2)

outputs:

I'd highly recommend this article @Karl linked:

It discusses tradeoffs of string implementation in various languages.

QuinceyMorris · April 12, 2024, 7:21pm

Before this discussion goes too far down this road…

The reason there isn't any compact integer-offset-based string slicing syntax in Swift is not primarily because of performance considerations. There have been at least two major efforts in the past to formulate syntax that people would communally find acceptable.

It's a no-brainer when you just look at something like myString[3] or even myString[i]. It gets hard when you start to consider what more general syntax such simple cases could evolve into, perhaps in the future when source-breaking revisions to the simple syntax would be ruled out.

In particular, integer-offset subscripts are string-start-relative offsets, and most people also want string-end-relative offsets. That leads to:

1. There is no consensus about syntax for string-end-relative offsets. Should they be negative Ints syntactically or semantically?
2. Should there be a dedicated type for integer offsets, for type safety reasons?
3. How do you represent ranges of offsets in a dedicated type, when a mixture of start- and end-relative offsets is specified? Such ranges don't meet the Comparable requirements of the range types we have.

Basically, there's never been true consensus on #1 or #2, though it's possible to believe that consensus could be reached. #3 didn't have a solution in the past (although there might be something that could be done to solve it now).

So, in the absence of a way forward on #1, #2 and #3, the question is whether we should go ahead with syntax for just the simple start-relative offset cases.

In that context, both performance considerations and future syntax extension considerations have historically led to the decision not to proceed with this language feature.
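To make the dedicated-type question a bit more concrete, here is one possible sketch. Offset and index(at:) are hypothetical names invented for this example and are not taken from any of the past proposals.

```swift
/// Hypothetical offset type, not a standard library API.
enum Offset {
    case fromStart(Int)
    case fromEnd(Int)
}

extension String {
    /// Resolves an offset to an index, or nil if it is negative or falls outside the string.
    func index(at offset: Offset) -> Index? {
        switch offset {
        case .fromStart(let n) where n >= 0:
            return index(startIndex, offsetBy: n, limitedBy: endIndex)
        case .fromEnd(let n) where n >= 0:
            return index(endIndex, offsetBy: -n, limitedBy: startIndex)
        default:
            return nil
        }
    }
}

let s = "abcdef"
// A mixed start/end pair can only be ordered once it is resolved against a concrete string.
if let i = s.index(at: .fromStart(1)), let j = s.index(at: .fromEnd(2)), i <= j {
    print(s[i..<j])   // bcd
}
```

The i <= j check has to happen at the point of use, which is the difficulty with fitting such pairs into the existing range types.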

tera · April 12, 2024, 7:48pm

We don't have end relative subscripts with arrays, why would we want it with strings? Or is this feature so handy that we do need it for arrays as well?

Is this optimisation possible with String?

    private var prevOffset: Int?
    private var prevIndex: String.Index!

    subscript(_ offset: Int) -> Character {
        if offset == prevOffset { }
        else if let prevOffset, offset == prevOffset - 1 {
            prevIndex = index(before: prevIndex)
        } else if let prevOffset, offset == prevOffset + 1 {
            prevIndex = index(after: prevIndex)
        } else {
            prevIndex = index(startIndex, offsetBy: offset)
            // † see the comment below
        }
        prevOffset = offset
        return self[prevIndex]
    }

so the naive code people use in other languages would be quick:

for i in 0 ..< string.count {
    string[i]
}

There might be complications with this approach IRT using a single string across threads / tasks; are they solvable or are they show stoppers?

Time complexity of this subscript would be funny: "O(1) when index is within +/- 1 of the previously used index, otherwise O(index)".

† - for the "fallback" branch a slightly better implementation could find the closest offset among the three (0, prevOffset, count) and use the relevant index (startIndex, prevIndex, endIndex) to scan to the required offset in a minimal number of steps.
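The idea can be tried outside the standard library with a wrapper. The following is a minimal sketch under a few assumptions: OffsetCachedString is a hypothetical name, it is a class so that the getter can update the cache, it walks from the previously used position in either direction (which covers the ±1 cases), and it is not thread-safe.

```swift
/// Hypothetical wrapper, not part of the standard library. Not thread-safe.
final class OffsetCachedString {
    let base: String
    private var prevOffset = 0
    private var prevIndex: String.Index

    init(_ base: String) {
        self.base = base
        self.prevIndex = base.startIndex
    }

    /// O(1) when offset is within ±1 of the previous offset,
    /// otherwise proportional to the distance from the previous offset.
    /// Traps on an out-of-range offset, like any other subscript.
    subscript(_ offset: Int) -> Character {
        // Walk from the cached position instead of from startIndex.
        prevIndex = base.index(prevIndex, offsetBy: offset - prevOffset)
        prevOffset = offset
        return base[prevIndex]
    }
}

let cached = OffsetCachedString("a\u{0300}🏆💩🎬")
var collected = ""
for i in 0 ..< cached.base.count {
    collected.append(cached[i])   // each step moves the cached index by one
}
```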

QuinceyMorris · April 12, 2024, 8:54pm

We sort of do have it, for some special cases: last, popLast, dropLast(n), suffix(n).

The true ask here isn't for subscripting, but for a compact syntax for slicing strings. Subscript syntax just happens to be compact and relatively popular in other languages, so people tend to reach for it as the "obvious" suggestion.
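For reference, the end-relative special cases mentioned above look like this on a String (sample value made up):

```swift
var name = "Swift 🎬"

let lastCharacter = name.last          // the final Character, or nil for an empty string
let lastThree = name.suffix(3)         // Substring with the last three Characters
let withoutLastTwo = name.dropLast(2)  // Substring without the last two Characters
let popped = name.popLast()            // mutating: removes and returns the last Character
```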

bdkjones · April 12, 2024, 9:14pm

This warms my heart because I’m an old Objective-C guy and everyone knew it was non-performant to repeatedly access a property in the bounds of a for in loop—we always cached the count value before the loop and it was a great way to spot a developer who didn’t understand the language/compiler well.

I just carried that practice over into Swift, so I’ve never fallen prey to this particular pitfall.

Jon_Shier · April 12, 2024, 9:17pm

In Obj-C that's because properties were still messages, so added overhead. An actual stored property in Swift shouldn't have that behavior, so something like Array.count should be fine to repeatedly access. It's only String that's weird in that count is computed and not then cached.

tera · April 12, 2024, 9:28pm

for i in string.count only calls count once, be it Swift or Objective-C. You are probably thinking of the older for (int i = 0; i < string.length; i++) – that indeed calls length repeatedly. Which is compensated by NSString.length being instant...
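On the Swift side, this can be checked with a small sketch. CountingBox is a hypothetical helper that just counts how often its count property is read.

```swift
/// Hypothetical helper that records how often count is read.
final class CountingBox {
    private(set) var reads = 0
    let items = [1, 2, 3, 4]
    var count: Int {
        reads += 1
        return items.count
    }
}

let forIn = CountingBox()
for _ in 0 ..< forIn.count {}      // the range is built once, before iteration starts

let manual = CountingBox()
var i = 0
while i < manual.count { i += 1 }  // the condition is re-evaluated on every pass

print(forIn.reads, manual.reads)   // 1 5
```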

bdkjones · April 12, 2024, 9:32pm

Yep! That’s just way more to type on an iPhone. And careful: as soon as I’m done with this war on String, I’m gonna start laying into the fact that Swift made pointers exactly as hard as every first-year comp sci student always thought they were. 87 different kinds and the one you need for a given C API is never the one you have.

(Compared to what Swift did to pointers, String is absolutely wonderful!)

Jumhyn · April 12, 2024, 9:41pm

And nearly as hard as they actually are! :)

taylorswift · April 12, 2024, 10:58pm

i find when people complain that Strings in Swift are unsatisfactory, what they often really mean is that the particular set of String APIs available to them are unsatisfactory. and this happens because many of the “convenient” operations are not implemented in the standard library, but rather in Foundation, or even worse, in random third-party libraries.

moreover, a lot of the useful stuff that is available in the standard library isn’t documented as belonging to String, but rather to protocols such as RangeReplaceableCollection or BidirectionalCollection. so they are hard to discover because developers do not realize that these tools are also available on String.

i think this has become more confusing in recent years, because we now have additional colonies of String APIs that live in the standard library but not in the Swift module (e.g. extensions on BidirectionalCollection vended by _StringProcessing), and moreover, we now have two frameworks that are named Foundation but vend different API.

this really shouldn’t be thought of as an inherent tradeoff of Unicode correctness. it’s perfectly possible to provide APIs that are both correct and convenient, and many of these APIs already exist. they are just hard to discover and not organized into places that are easy to remember where to look.
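A few examples of the protocol-provided operations mentioned above, all from the standard library and usable on String directly (sample text made up):

```swift
var text = "Hello, World!"

// From RangeReplaceableCollection:
text.removeAll { $0 == "l" }
text.insert(contentsOf: ">> ", at: text.startIndex)

// From BidirectionalCollection:
let lastCharacter = text.last
let reversed = String(text.reversed())

// From Collection / Sequence:
let firstComma = text.firstIndex(of: ",")
let words = text.split(separator: " ")
```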

tera · April 13, 2024, 12:33am

Yet another example of Swift string superiority. Same "à🏆💩🎬" string, the task is to add "." after every character.

Code

// Swift
let a = "a\u{0300}🏆💩🎬"
for c in a {
    print(c, terminator: ".")
}
print()

// Kotlin
fun main() {
	val a = "a\u0300🏆💩🎬"
    for (c in a) {
        print(c)
        print(".")
    }
    println()
}

// Python
def main():
    a = "a\u0300🏆💩🎬"
    for c in a:
        print(c, end=".")
    print()
main()

// C#
using System;

public class Program
{
    public static void Main()
    {
		string a = "a\u0300🏆💩🎬";
		foreach (char c in a) {
			Console.Write(c);
			Console.Write(".");
		}
		Console.WriteLine("");
    }
}

Results:

Swift:  à.🏆.💩.🎬.
Kotlin: a.̀.?.?.?.?.?.?.
Python: a.̀.🏆.💩.🎬.
C#:     a.̀.�.�.�.�.�.�.

FWIW Python is quite close to Swift in correctness.
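The differences in the results line up with the unit each language iterates over. The same string can be inspected through Swift's views to see those units:

```swift
let a = "a\u{0300}🏆💩🎬"

print(a.count)                 // 4  extended grapheme clusters (what Swift's for-in visits)
print(a.unicodeScalars.count)  // 5  code points (what Python's for-in visits)
print(a.utf16.count)           // 8  UTF-16 code units (what Kotlin and C# chars are)
print(a.utf8.count)            // 15 UTF-8 bytes
```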

Those who claim that Swift string handling is overly complex and praise other languages' string APIs: please tell me, what am I doing wrong in those other languages? How am I supposed to know the proper start / end offsets of the individual characters? Am I supposed to know the intimate details of UTF-8 (or another encoding if that's used) to see where characters end or what? You claim that other languages' string API is easier, so this task should not be too hard, right? That's a genuine question, I'd really like to know.

BTW, it might not be such a bad idea to create a space here ("community projects" category?) that discusses an alternative String API, i.e. if we forgot what we have now in the standard library and started from scratch based on what we now know, what would the "ideal" API be? Maybe count / hash and subscripts would be O(1), or strings would always be stored in some canonical form, so that we could compare two strings with memcmp, and so on. So long as we don't push this design upon the standard library people and don't assume any of it will end up in the standard library, it should be fine (and if there is anything useful that could be taken into the standard library, that's even better).

bdkjones · April 13, 2024, 3:29am

There's no question String's Unicode compliance is excellent. It's best-in-class. The problem is that it's an aircraft carrier when, 90% of the time, all you need is a canoe.

An Analogy:

Consider the task of adding a border to a rounded rectangle in SwiftUI. This is the first thing a human reaches for:

RoundedRectangle(cornerRadius: 4)
    .border(.blue, width: 2)

And it produces something like this:

So the "correct" way has been this gibberish:

RoundedRectangle(cornerRadius: 4)
    .overlay(
        RoundedRectangle(cornerRadius: 4)
            .stroke(.blue, lineWidth: 4)
    )

That is, objectively, awful. And that's what using String often feels like.

And there's a bunch of explanations and reasons and technical jargon for why this pattern exists and the other one produces the effect in the image but none of that matters because the construct is just flat-out frustrating for the HUMANS who use it.

I UNDERSTAND all of the technical arguments for why String is the way that it is. But—like fixing this SwiftUI nonsense—there must exist a better compromise between supporting every nook and cranny of Unicode on the one hand and not forcing developers to google every little operation on the other.

tera · April 13, 2024, 3:58am

I'd love to hear your comment on my previous example-1, example-2 and example-3.

Or do those fall into the rare 10%? BTW, what do you do with those other languages' APIs to deal with that 10%? Do you switch to a "real" API? (That's not a rhetorical question, I'd really like to know.)

vns · April 13, 2024, 7:26am

I have strong doubts about this claim.

First, differing encodings have been an issue for a very long time, forcing developers to deal with them in unpleasant ways in languages that lack any tools for this. You had (and still have) to understand every nuance of an encoding to work with a string properly, and Unicode is still not the only thing you may get.

Second, in the modern world it is extremely hard to find places where such Unicode processing isn’t needed: everything has emojis now, everybody writes in different languages, and most of the time we as developers deal with text whose full range of content we do not actually know. If you are writing iOS apps specifically, you need such Unicode processing 90% of the time: user input might contain all of that, your localization strings will contain it, whatever comes from the web might contain it, and so on. I cannot count how many times Swift's String API has prevented me from making a mistake.

And in those actually rare cases when you effectively need to process just ASCII, you can opt out to a C-like byte array.

P.S. In a world with autocomplete (even when it struggles to work in Xcode) it is fairly easy to find the appropriate API, at least on Apple platforms. I agree with @taylorswift's take above that the String API in general is fragmented across platforms, and I wish it were more consistent, but on Apple platforms that is not noticeable most of the time.

bdkjones · April 13, 2024, 8:19am

I don't write for iOS. But I'm happy to share an actual example from one of my apps. Suppose we're looking for the string @import in the text of a file. Because our search term is 100% ASCII, we don't have to care about Unicode at all. The text can have multibyte characters, some é characters represented by the single-codepoint variation and other é characters represented by the base e + diacritic mark codepoint combination, etc.

We can treat the text as a dumb sequence of bytes, walk through it with pointer arithmetic, and discover @import. We're not modifying the text, sending it anywhere, or displaying it—we just need to know if this stupid phrase exists.

The app in question predates Swift and the overhead of NSString and Objective-C actually mattered (especially on PowerPC!) so this was just plain C. Fast, simple.
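The same byte-level approach can be approximated in Swift through the UTF-8 view. This is a naive sketch, not the original C code: containsBytes is a hypothetical helper, and it relies on the fact that in UTF-8 an ASCII byte never occurs inside a multibyte sequence.

```swift
/// Hypothetical helper: naive byte-wise search, O(n·m) in the worst case.
func containsBytes(of needle: String, in haystack: String) -> Bool {
    let pattern = Array(needle.utf8)
    let bytes = Array(haystack.utf8)
    // An empty pattern always matches; a pattern longer than the text never does.
    guard !pattern.isEmpty, bytes.count >= pattern.count else { return pattern.isEmpty }
    for start in 0 ... bytes.count - pattern.count {
        if bytes[start ..< start + pattern.count].elementsEqual(pattern) {
            return true
        }
    }
    return false
}

let css = "/* café */ @import \"a.css\";"
print(containsBytes(of: "@import", in: css))   // true
```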

Modern Times

I generally just drop to String.UnicodeScalarView when I have to do anything complex with String. I find it more pleasant to work with and I just gave up worrying about performance because on modern hardware it's just a non-issue.

A Recap

Everyone:

Seems to generally admit that String is convoluted and non-obvious—it takes a lot of experience/knowledge to use it correctly.
Agrees that Swift gets Unicode right better than any other language.
Agrees there are good technical reasons why String is the way that it is.

I'm not here to argue the technical merits of String. If I had a better approach for it, I'd have submitted a proposal.

I'm here because the OP's title was spot-on: "This seems hard". That reputation matters and it gets categorically dismissed by Swift. We (the community) have a blindspot! I think that's because the community is made up of engineers and engineers generally dismiss things that aren't discrete and quantifiable. Touchy-feely subjective impressions don't register.

And I worry about it because I have many years invested into hundreds of thousands of lines of Swift and, just three days ago, a new CTO at one Hollywood client for whom I build a custom app said: "What would it take to rewrite this as a web app?"

Swift has headwinds. The APIs shouldn't add more.

FlorianPircher · April 13, 2024, 9:07am

If you want to know whether a string contains another string, use

string.contains("@import")

No need to think about encoding, whether the query is ASCII or not, no need for manual iteration, especially not using integer indices. Also, Swift indexing is safe and thus requires bounds checking which might be a noticeable slowdown in a loop. Fast string search is not a trivial algorithm, so I would expect a manually written loop to not be as performant as a system provided function anyway.

The issue that OP had was less related to String and more about CharacterSet. Foundation does not offer ready-made sets of Characters for URL components, which is more of an issue of Foundation than Swift the language. Many other programming languages also don’t have ready-made character sets for URL components in their standard library, so Swift is not more difficult in that regard.
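For the task in the thread title, one possible sketch is to filter the string's Unicode scalars, since CharacterSet is a set of Unicode scalars rather than of Characters. removingCharacters(in:) is a hypothetical helper name, not a Foundation API.

```swift
import Foundation

extension String {
    /// Hypothetical helper: returns a copy without the scalars contained in set.
    func removingCharacters(in set: CharacterSet) -> String {
        let kept = unicodeScalars.filter { !set.contains($0) }
        return String(String.UnicodeScalarView(kept))
    }
}

let cleaned = "a/b?c=d e".removingCharacters(in: CharacterSet(charactersIn: "/?= "))
print(cleaned)   // abcde
```

Because the filter works per scalar, removing a scalar that is part of a larger grapheme cluster changes that Character rather than removing it as a whole.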

And I would disagree with your first point. It is significantly more difficult to use strings correctly in almost every other language, even if it is often easier to use them incorrectly.