Is Substring a necessary type?

Good example. Just note that if Strings behaved like current Substrings (so that assigning one to another wouldn't incur a hidden conversion) and such conversions were always explicit:

string.compacted

instances like these would be very visible and obvious.

I guess the fear then is that people would forget to call "compacted"... It is indeed quite tricky to make this ergonomic, and it's interesting to see that different languages have landed on different solutions.

I'd like to echo the OP's question because, put simply, I wish there were no Substring type in Swift. It seems like a fork in the woods where we took the wrong path and gave undue weight to engineering concerns over what has to be the more important consideration: keeping things as simple as possible.

My first impression of the Substring type from the relevant WWDC session was that it was an act of genius, but over the years the real cost to the Swift language has become apparent. First, in terms of complexity: Strings forked into two types loosely bound together by a protocol, and the ergonomics of using them in an API have only recently improved. Even from an engineering point of view, I would be surprised if the Substring abstraction has ever made the code in the Swift corpus faster. The Substring type occupies 100 bytes, containing a reference to the storage and a pair of indexes, quite likely more bytes than many of the strings it represents, not considering the overhead of locking the shared reference etc.

I feel individual performance regressions could have been addressed by developers revising their algorithms when they occur, rather than by adding so much complexity to such a critical API. All this said, this isn't a decision that is going to change any time soon, but I have to say Substring has long been a part of Swift where I feel we made a misstep.

1 Like

[one sentence removed here]. Substring does not require taking any locks at all. Copying a String does potentially take a lock though (inside malloc).

5 Likes

This is counterintuitive, given that the documentation highlights the case where Substring retains the original backing buffer of a much larger string.
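
For anyone who hasn't seen that documented case, here is a minimal sketch of it. The variable names are made up for illustration; `Substring.base` and the `String(_:)` conversion are standard library API.

```swift
// A large string whose storage we would like to release eventually.
let document = String(repeating: "lorem ipsum ", count: 10_000)

// Slicing does not copy: `firstWord` is a view into `document`'s storage,
// so holding on to it keeps the whole buffer alive.
let firstWord: Substring = document.prefix(5)

// Converting to String copies only these five characters into storage
// owned by `kept`, which no longer depends on the large buffer.
let kept = String(firstWord)

print(kept)                  // lorem
print(firstWord.base.count)  // the slice can still reach the entire original string
```

The explicit `String(firstWord)` is the point at which the copy happens, which is roughly what the hypothetical "compacted" earlier in the thread would make visible.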

Hmm. I was distracted and now I can't remember what my reasoning was here. I'm going to go ahead and delete that part of the post at least until I figure out what I meant :sweat_smile:

1 Like

The problem is that this is quietly circular. As someone who's written a fair amount of Java string manipulation, I'd like to write it straightforwardly using slicing, but the performance impact of allocating an object for every slice and potentially copying the underlying buffer makes that untenable, so I have to pass around a trio of a String, an offset, and a length instead. The design of the language and library shapes the client code that the library then finds itself optimizing for.

11 Likes

Exactly! This is a very important point about API design and about writing code that is optimized for it. We may think the ideal solution is to break the old API and start over based on the lessons learned, but that is rarely possible. Even if we manage to do that, we end up with a new API whose limitations and bottlenecks we don't yet fully understand. Then we start improving the performance of the newly generated client code by optimizing the library for it, and the cycle keeps repeating itself.

3 Likes

One of the biggest issues with Substring (which isn't really Substring's fault) is that so little functionality lives on StringProtocol. Quite a bit lives on String, and still more is imported from NSString and Foundation. If we could update APIs to work with StringProtocol and make some StringProtocol a viable API, Substring would largely blend into the background.
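
As a rough illustration of what "living on StringProtocol" buys you, here is a small sketch. The `wordCount` property is a hypothetical helper, not an existing API; because it is declared on the protocol, it is available on both String and Substring without any conversion.

```swift
extension StringProtocol {
    /// Hypothetical helper: the number of whitespace-separated words.
    var wordCount: Int {
        split(whereSeparator: { $0.isWhitespace }).count
    }
}

let sentence = "is Substring a necessary type"

// Called on a String.
let wholeCount = sentence.wordCount

// Called on a Substring, with no String(...) conversion in between.
let sliceCount = sentence.dropFirst(3).wordCount

print(wholeCount, sliceCount)
```

Whether the generic dispatch this implies is acceptable is a separate question, and one the replies below disagree on.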

5 Likes

If there's an entity that shouldn't exist, it's StringProtocol. The overhead of going through a protocol abstraction is almost never worth it. If something should process both Strings and Substrings, it should only take Substring—that'll be far more efficient.

4 Likes

The point is to allow String and Substring to be treated identically most of the time. There's little reason various string operations should exist only on one or the other. I suppose it doesn't really matter how that's achieved, but in a world with String and Substring, more unification is needed, not less.

1 Like

I'm not a particular fan of the split, to be clear, and I wish we'd had more time in the early days to flesh out the relationship between these "owner" and "borrower" pairings that tend to evolve when working with large values. But I don't think a protocol or generics should be the tool to do so, since that's going to cost you either code duplication via specialization or the abstraction overhead of going through dynamic dispatch, and is also likely to lose the benefit of the "borrower" type being able to share memory when you instantiate a generic to use only the "owner" type.

From an implementation perspective, the best thing to do is to put operations that make sense on both types only on the borrower (Substring), and have clients pass string[...] when they don't already have a Substring. It would be great to figure out ways to make that more ergonomic in the language. I imagine this problem will only become more prevalent as we introduce move-only types, safe buffer views, and more variations of array types, since we'll have the same problem of a bunch of different "owner" types, which own contiguous buffers with varying representations, memory management policies, and so on, that have a common "borrower" type in the buffer view type that can point into contiguous parts of any of them.
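
A minimal sketch of that pattern, assuming a made-up `countVowels(in:)` operation: it is declared once, on Substring only, and a caller holding a String passes `string[...]`.

```swift
// Hypothetical operation, defined only on the "borrower" type.
func countVowels(in text: Substring) -> Int {
    let vowels: Set<Character> = ["a", "e", "i", "o", "u"]
    return text.filter { vowels.contains($0) }.count
}

let owner = "castles in the air"

// A whole String is viewed as a Substring via the unbounded-range subscript.
let inWhole = countVowels(in: owner[...])

// An existing slice passes straight through.
let inPrefix = countVowels(in: owner.prefix(7))

print(inWhole, inPrefix)
```

The `owner[...]` at the call site is the ergonomic cost being discussed; there is only one concrete implementation, so nothing has to be specialized or dispatched dynamically.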

17 Likes

Perhaps you are right. Let me try still. :slight_smile:

What if, instead of two distinct types, we had a single "String" type that was capable of holding either a String or a Substring representation? There's just a handful of operations that currently generate substrings: subscript, drop, prefix, suffix, perhaps a few others. All of those could require an additional parameter saying whether to alias or compact the string (perhaps along with a "default" option that selects one of the two behaviours automatically). Plus there would be an explicit "compacted" method to convert a "substring" string into a standalone string. Maybe not too ergonomic:

let slice: String = string[from..<to, .alias] // Substring
.... work with Substring
let copy: String = slice.compacted // copy Substring into String

However, wouldn't it solve all the cases that the current String/Substring/StringProtocol split solves?
Provided the "compact" parameter is required, the conversion (or lack of one) would always be visible, at the cost of being (hopefully) only mildly annoying.

Then every codepath through String would be bifurcated, and you’d probably give the branch predictor a Very Bad Time.

2 Likes

Folks, I need to be clear here that there is no path to Swift removing Substring. I'm happy to let this conversation continue, but please be aware that you're just building castles in the air.

6 Likes

This kind of relates to my original question, which was how String and Substring will scale in the future. In general, when you design an API, the goal is to keep the types as opaque as possible, so that the internals can be updated as computers change over time. One example is that C++ standard library strings went from a reference-counted buffer to a non-reference-counted one, together with SSO, because that was found to be faster on newer systems. With String and Substring you kind of expose the memory management more, and the question was whether it is possible to merge the two types to make it more opaque.

Another analogy is memory management which is a moving target today. Should the API expose weak references or should you make just the reference completely opaque? There are languages that have solved the cyclic reference problem.

It's certainly the case that the ship has sailed on Substring, but if we're going to have a 32-byte value type (I don't know where I misremembered 100 bytes from in the previous post) representing String slices, plus a StringProtocol, it's interesting to meditate on what might have been, given this string: "fits inside thirty characters".

With that many bytes, I'd wager the majority of substrings could have been represented inside the memory of the struct itself without having to allocate or reference count: a sort of Shortstring, if you like, along the lines of the optimisation mentioned with respect to NSString. If the string slice is too long, it could revert to storing a Substring-like representation referring to the original storage. You lose the ability to recover the index into the original string from the slice, though one wonders how often that feature was used or known about.

It might be interesting to benchmark such a solution as a nearly ABI-compatible alternative to Substring to see if it offers a performance advantage worth pursuing.

enum Shortstring: StringProtocol {
    case short(length: UInt8,
               bytes: (CChar, CChar, CChar, CChar, CChar,
                       CChar, CChar, CChar, CChar, CChar,
                       CChar, CChar, CChar, CChar, CChar,
                       CChar, CChar, CChar, CChar, CChar,
                       CChar, CChar, CChar, CChar, CChar,
                       CChar, CChar, CChar, CChar, CChar,
                       CChar))
    case long(substring: Substring)
    ...
}

But, at the end of the day I'm not sure even I'm convinced.
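
For reference, the index-recovery feature that an inline Shortstring would give up looks roughly like this. It is only a sketch with made-up variable names; the relevant fact is that a Substring shares its indices with the String it was sliced from.

```swift
let line = "key=value"

// Slice everything after the "=" sign.
let equals = line.firstIndex(of: "=")!
let value = line[line.index(after: equals)...]

// `value.startIndex` is a valid index into `line`, so the slice's position
// in the original string can be recovered without searching again.
let offset = line.distance(from: line.startIndex, to: value.startIndex)

// The same index can be used to slice the original string.
let beforeValue = line[..<value.startIndex]

print(value, offset, beforeValue)
```

A slice that had copied its bytes inline would have no base string to index into, so code like the last two statements would stop working for short slices.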

1 Like

FWIW we do have an inline representation like that for up to 15 bytes on any String
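
A quick way to look at the sizes being discussed is to print the layout of the value types themselves. This is only a sketch: the numbers are platform-dependent, and on 64-bit platforms I would expect String to be 16 bytes (which is where the 15 inline bytes fit) and Substring to be 32.

```swift
// Sizes of the value types themselves, not of any heap storage they reference.
// Assumption: results depend on the platform; 64-bit targets are the case
// discussed in this thread.
let stringSize = MemoryLayout<String>.size
let substringSize = MemoryLayout<Substring>.size
let indexSize = MemoryLayout<String.Index>.size

print("String:", stringSize)
print("Substring:", substringSize)
print("String.Index:", indexSize)
```

The difference between the two is accounted for by the pair of indices a Substring stores alongside its base string.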

7 Likes

Interesting! I hadn't dared hope that would be true :100: