Is SubString a necessary type?

Like some languages Swift has for some reason decided to have a SubString struct/class in order to make slices from a String type. This is similar to C++ string_view, however in C++ string_view serves as an escape hatch from the null terminated string. In C++ if a new string from a sub string of another string is created, then a new allocation is required because of the trailing zero (disregard SSO in this discussion).

Swift doesn't have this requirement with trailing zeroes in strings so what purpose does SubString serve? Since memory is reference counted the string buffer can referenced in several string structures in Swift. Basically a reference counted buffer that can be shared and a offset/length into that would be enough. That might slightly increase the size of the String but I'd say it is worth it.

Some might say, what if you just cherry pick a few words out of a 10 MB string, then that buffer would still be retained. This particular case is unusual and such case you could simply copy to a new buffer with a duplicate method.

In short, if you can remove SubString and use String directly for sub strings, then we have better orthogonality in the standard library. string_view in C++ is kind of annoying, you have two different types to care about and all the conversion between them. It is also not always clear when to use either one (sometimes it is beneficial to pass them parameters and so on).

Moved to the "Using Swift" category as the topic doesn't have to do with the ongoing development/implementation work of the standard library. Thanks!

Looks like a valid question about standard library design to me.

Perhaps it's a rudiment of some original idea, and if to do it today we could have just "String" - String that behaves like Substring just lazily compacting itself when possible:

var string_10mb = .... some 10MB string
let str: String = string_10mb.range[100..<200]
// here "str" length is 100 and it points to the original 10MB buffer
string_10mb = "hello"
// does "str" still point to 10MB buffer?

I believe that at the end of the fragment "str" could be compacted so that it now points to a shorter memory buffer.

This is not rare, so it's important to handle it well.

That's what String(subString) does.

Having an explicit Substring makes it so the "when should a copy happen?" question left up to the author. There's no one superior design (e.g. Java's "always copy when slicing" and C's "always alias when slicing" both have their own fairly obvious issues.)

Yes. And that's a memory leak.

6 Likes

I mean does it have to be this way necessarily? Or is the world possible where the COW machinery on line "string_10mb = "hello" line compacts "str" buffer as well? I understand it might be tricky as it involves two or more variables, but is this impossible?

Swift String accurately models Unicode strings. This makes it far more complicated than c/c++ strings. SubString is necessitated by String's conformance to Collection (more specifically BidirectionalCollection) protocol. To write algorithms that work with both String and SubString, you write a generic function that accepts any type that conforms to StringProtocol.

I highly recommend reading more of Swift's documentation around these topics before delving too deep in Swift string processing. If you are translating something from C/C++, you most probably benefit from a deep refactoring or using one of String's low level views, such as String.UTF8View obtained from utf8 property of any String. Also, take a look at the new built-in regex facilities of Swift 5.7.

EDIT: For high performance and low-level manipulation of UTF8 bytes look at withUTF8() methods that are available on both.

6 Likes

String Manifesto

Substrings

When implementing substring slicing, languages are faced with three options:

  1. Make the substrings the same type as string, and share storage.
  2. Make the substrings the same type as string, and copy storage when making the substring.
  3. Make substrings a different type, with a storage copy on conversion to string.

We think number 3 is the best choice. A walk-through of the tradeoffs follows.

For the detailed rationale, see the full document.

10 Likes

Just to add a bit to what @Karl quoted:

The same choices also apply to other collection types. The associated slice type of each Collection is called SubSequence in standard library for historical reasons. Standard library provides a generic Slice<...> type that is used by many collections (which is the default if you don't explicitly define your own slice type). String uses its own custom slice type instead of generic Slice<String> for performance reasons.

5 Likes

To answer the question, some languages define the characters in a string as immutable. As soon the characters are changed, the buffer is renewed. In that case you would have a new buffer for "hello", and the 10MB buffer would have been released.

NSString picks yet another option not mentioned here, which is to heuristically decide whether to copy or reference the original storage based on the relative lengths of the two strings. This has the advantage of not requiring developers to be aware of the problem, but the disadvantage of not always being the optimal choice.

8 Likes

OTOH, sometimes it's better not to compact - for both memory and performance reasons:

var string_10mb = .... some 10MB string
let str: XXXX = string_10mb.range[100 ..< 10MB - 100]
// here "str" length is almost 10MB and it points to the original 10MB buffer
string_10mb = "hello"

at this point it would be better to continue with the original 10MB buffer in "str" instead of copying the most bytes of it into a newly allocated buffer just under 10MB and getting rid of the original. In that sense it makes sense to have explicit control whether to do copy or not.

Alternative design would be some explicit parameter:

let compacted: String = original.range[100 ..< 10MB - 100, compact: true]

or never compact unless you do it explicitly:

let slice: String = original[100...200]
let compacted: String = String(slice)

let slice: String = original[100...200]
let compacted: String = slice.compacted

Worth noting that Data has chosen a different strategy (SubData is Data).

1 Like

Data's chosen tradeoff is having more internal representations (it can be a slice internally), which means more index-related confusion, larger generated code, and more branches at runtime, but neatly sidesteps the issue from this thread.

3 Likes

Thank you, interesting read.

Important: Long-term storage of Substring instances is discouraged. A substring holds a reference to the entire storage of a larger string, not just to the portion it presents, even after the original string's lifetime ends. Long-term storage of a Substring may therefore prolong the lifetime of large strings that are no longer otherwise accessible, which can appear to be memory leakage.

What the document admits is that SubString has the exact same problem as it must keep a reference to the original buffer, just like if we would have a single String type. It has be that way otherwise SubString would have been unsafe like string_view in C++.

When assigning a Substring to a longer-lived variable (usually a stored property) explicitly of type String, a type conversion will be performed, and at this point the substring buffer is copied and the original string's storage can be released.

This is of course the key why even consider SubString. Almost no programmer will put SubString in storage classes and the magic point is when SubString is converted to String so that original buffer pointed by SubString goes away. The conversion point is what makes this model work.

It's certainly possible, but it would have too many trade-offs to be worthwhile.

For this "compaction" to work (misnomer, it probably won't be in-place compaction, it'll probably be just copying out into a right-sized buffer), the deinitializer of the String would need to modify every SubString that was sliced from it. This requires knowing what all those SubStrings are, which it currently doesn't.

SubStrings hold reference to the buffer of the String they're sliced from (which is only ever singular), but not the other way around (which would need an array of references, because one String's buffer can provide backing storage for many SubStrings). This means slicing would involve an array allocation, which then also raises questions about synchronization, and it gets really complex/slow really fast.

4 Likes

Java has a lot of experience in this department. So much so, that they decided to replace the implementation of .substring():

If you would want the same behaviour with just a single String type, you could have a flag in the String class indicating if the String was sliced. Then if you do an assignment to a new String then the string will be copied to a new buffer.

This approach has real downsides for performance because you can end up with a "secretly slow" String: a String for whom a bunch of algorithms will silently incur substantially greater performance cost. This already happens a bit: a String bridged from an NSString can be "secretly slow" as well. Adding more places for this to happen doesn't seem to be valuable.

3 Likes

Do you mean the greater cost in terms of having a bigger O-big of N complexity where N is the logical string length (what count returns on a string or a substring)? How that'd be possible? (I am not talking about NSString bridged strings here which is a peculiar kingdom of its own with special ruling.)

As an arbitrary hypothetical, one might have this kind of structure:

let secretlySlowString: String  // derived from some earlier code
for index in secretlySlowString.indices {
    someBackgroundWork(index, secretlySlowString)
}

If someBackgroundWork makes the mistake of assigning its String argument to a new String variable, it will need to copy the sliced String. The copy of the sliced string is linear in the String length, and the number of loop iterations is (approximately) linear in the String length, so this algorithm becomes silently quadratic in the length of the String.

This kind of silently quadratic code is super scary. Even clearly-quadratic code is tricky, because as Bruce Dawson points out,"O(n^2) is the sweet spot of badly scaling algorithms: fast enough to make it into production, but slow enough to make things fall down once it gets there". Code that is not obviously quadratic (because of hidden copies), and that is in fact only quadratic depending on a not-easily-checked property of its input type, is even more insidious because it can be nearly impossible to debug, and it's highly likely to be missed in local and unit testing. What are the odds that you're going to remember to performance test all your String algorithms with sliced Strings?

12 Likes

BTW, Swift's String storage is actually null-terminated. It's an implementation detail, but it allows us to bridge to a C string in constant time/without allocating. NSString bridging means it can't be a formal guarantee, but it's still useful because C interop is still important.

If even native String storage might/might not be null-terminated based on whether it was a slice or not, we would have yet another class of "secretly slow" strings, as @lukasa calls them.

5 Likes