IMO, SubSequence is actually the odd one, especially since they should be substrings (wiki), array slices, and subsequences (wiki).
IIRC, SubSequence has been with Sequence since the very beginning. So maybe it was created before the API naming guidelines took their current shape.
* I've seen "You can perform many string operations on a substring" there. It would be a lot more informative to know what String can do that Substring can't.
E.g. currently, you can't use a Substring wherever you can use a String. Why?
Just because a type presents the same interface as another type doesn't mean it has the same behavior. Substring exists as a performance optimization. This is the part of the documentation I wanted you to read:
When you create a slice of a string, a Substring instance is the result. Operating on substrings is fast and efficient because a substring shares its storage with the original string. The Substring type presents the same interface as String, so you can avoid or defer any copying of the string’s contents.
For example, imagine you have a 100-character string and you access characters 0..<90:
let slice = largeString.prefix(90) // the first 90 characters, as a Substring
slice, a Substring, now shares the same memory as largeString. Without this optimization, you would have to copy most of the storage of largeString into slice, which would result in almost twice the memory usage. This is why Substring exists.
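A minimal sketch of the scenario above, assuming a made-up 100-character string. The names `largeString`, `end`, `sameStart`, and `copy` are only for illustration. Since String is indexed by String.Index rather than Int, the range is built from indices here:

```swift
// Hypothetical 100-character string, used only for illustration.
let largeString = String(repeating: "a", count: 100)

// Build the range from String.Index values; String has no Int-range subscript.
let end = largeString.index(largeString.startIndex, offsetBy: 90)
let slice = largeString[largeString.startIndex..<end]   // a Substring

// A Substring uses the same indices as the string it was sliced from.
let sameStart = slice.startIndex == largeString.startIndex

// Converting to String copies just the sliced characters into new storage.
let copy = String(slice)

print(type(of: slice), slice.count, copy.count, sameStart)
```

The conversion on the second-to-last statement is the point at which the copy that slicing deferred actually happens.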
You have it backwards. String is a specialized memory-optimization of Substring.
I'm saying it's not worth a type.
But it's possible that it is a type because a type is the only marking system that Swift offers to assert that the Substring has been minimized. I want documentation on that.
String is currently serving as its own subsequence, allowing substrings to share storage with their "owner". This can lead to memory leaks when small substrings of larger strings are stored long-term (see here for more detail on this problem). Introducing a separate type of Substring to serve as String.Subsequence is recommended to resolve this issue, in a similar fashion to ArraySlice.
and
A new type, Substring, will be introduced. Similar to ArraySlice it will be documented as only for short- to medium-term storage:
Important
Long-term storage of Substring instances is discouraged. A substring holds a reference to the entire storage of a larger string, not just to the portion it presents, even after the original string’s lifetime ends. Long-term storage of a substring may therefore prolong the lifetime of elements that are no longer otherwise accessible, which can appear to be memory leakage.
Aside from minor differences, such as having a SubSequence of Self and a larger size to describe the range of the subsequence, Substring will be near-identical from a user perspective.
The introduction of Substring came about from practical experience of code holding on to Strings permanently, when those Strings were really slices of much larger data, effectively wasting memory.
But it's possible that it is a type because a type is the only marking system that Swift offers to assert that the Substring has been minimized
Yep, this is it. The API between String and Substring is meant to be as close to identical as possible (so much so that most operations on String-like objects should likely be using StringProtocol itself, which abstracts over the two), but the type of String indicates to you that it is the owner of its entire buffer, whereas Substring is not (i.e., direct storage of Substrings should be a code smell that indicates that you really want conversion to String to "minimize" the slice).
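To make that concrete, here is a small sketch under stated assumptions: `countVowels`, `Setting`, and the sample `line` are hypothetical names invented for this example. The helper is generic over StringProtocol so it accepts both types, while the stored property is declared as String so that a slice has to be converted before it is kept around:

```swift
// Hypothetical helper: generic over StringProtocol, so it takes String and Substring alike.
func countVowels<S: StringProtocol>(_ text: S) -> Int {
    return text.filter { "aeiouAEIOU".contains($0) }.count
}

let line = "name=Swift"
let equalsIndex = line.firstIndex(of: "=")!
let value = line[line.index(after: equalsIndex)...]   // Substring sharing line's storage

let vowelsInSlice = countVowels(value)   // called with a Substring
let vowelsInLine = countVowels(line)     // called with a String

// Hypothetical long-lived model: the String type forces an explicit conversion,
// so the stored value owns only its own characters.
struct Setting {
    let value: String
}
let stored = Setting(value: String(value))

print(vowelsInSlice, vowelsInLine, stored.value)
```

Writing `Setting(value: value)` without the `String(...)` conversion would not compile, which is the type-level reminder being described.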
The full rationale behind this perhaps isn't spelled out as clearly as it could be in the Substring docs, but it does at least say
Important
Don’t store substrings longer than you need them to perform a specific operation. A substring holds a reference to the entire storage of the string it comes from, not just to the portion it presents, even when there is no other reference to the original string. Storing substrings may, therefore, prolong the lifetime of string data that is no longer otherwise accessible, which can appear to be memory leakage.
To add to @xwu's comment, this is expected. Substrings are mutable through RangeReplaceableCollection, and have value semantics — when they are mutated, they make a copy of the string slice they're holding on to (if not uniquely held) and are now a slice of that string.
The fact that the base isn't inherently held onto by anything else other than substring.base isn't an issue.
the "here" portion of the base in the last example is a waste, as there is no way to use it. this waste can be quite big. a better implementation would drop everything from the base but the slice itself upon mutation.
That's a good point — I believe this may be caused by Substring being implemented on top of Slice<String> directly without additional handling on mutation, so changes are applied to the underlying string first, and then the substring is reformed atop the underlying string without getting rid of characters beyond the slice boundaries. I can't think of a case off the top of my head where keeping the full underlying mutated string is necessary, but @Michael_Ilseman or @David_Smith might know better. It seems like a worthwhile optimization to consider, but for now, this isn't semantically incorrect, at least.
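A rough sketch of the behaviour under discussion, using a made-up string rather than the example from earlier in the thread. It only demonstrates the value semantics; what `base` holds after the mutation is an implementation detail, so it is printed for inspection without claiming a particular result:

```swift
let original = "hello, world"

var sub = original.prefix(5)   // Substring "hello", sharing storage with original
sub.append("!")                // mutates sub only; original is left untouched

print(sub)
print(original)

// Inspect what the mutated substring is now a slice of.
// No specific output is claimed here.
print(sub.base)

// Converting to String keeps only the characters the slice presents.
let compacted = String(sub)
print(compacted)
```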
I think Java used to (until around 2012 or thereabouts) implement String like our Substring type. That is, a string would hold a character buffer combined with an offset and a length. Calling .substring on a string would return a String type (but again, similar to our Substring) with the same buffer, but a different length and offset.
They changed that because the overwhelming number of string manipulations weren't parsers, scanners and other cases where that optimisation mattered. Their String is now like our String.
Swift opted for a middle ground. The Swift project realised that although Java made the right decision when they changed their String, there are still cases where keeping the old "lens" type made sense. So Swift got two distinct types.
It has a different set of tradeoffs. For the most part, the inconvenience of dealing with two distinct types is mitigated by type inference and function overloads, but it sometimes surfaces to the user/programmer. Like it did for the OP.
I still think it is far preferable to having a single String type with Substring semantics.
As the Java team learnt the hard way.
indeed. swift solution seems a reasonable compromise, but not a silver bullet. example that would've benefited from "Substring semantics baked into String":
var s = "very long string here ..."
while s != "" {
    s.removeLast()
    do_something(s)
}
with "substring semantics" this loop could have run without memory allocation / string copying, just adjusting the length leaving the same base of the string. ditto for the head removal or both ends removal.
if substring semantics was baked into the String itself (so there was no need for a separate Substring type) i'd expect to see some "compact" API to convert String to its trimmed form:
s.removeLast(), etc
s.compact() // opt-in on an as needed basis
if i was designing the thing i'd probably go with this latter approach: just one small API method to surface (and for the users to know about when to use and when to not) instead of the heavier Substring / StringProtocol approach.
That's a fair preference. But I'm not convinced that this approach wouldn't be less optimal in practice. It would be easy, especially for newcomers, to forget the .compact() calls, or it could quickly become noisy to include them everywhere to avoid memory leaks.
But I guess it boils down to different trade-offs.
It does if you use actual slicing operations (dropLast() over removeLast()), or operate on a slice (var s = "string"[...]). And that works for all collections.
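The earlier loop could be rewritten along those lines roughly as follows. This is a sketch, not a benchmark: `doSomething` is a hypothetical stand-in for `do_something`, made generic over StringProtocol so it accepts the slice, and the claim that no copying happens is the post's, not something this snippet measures:

```swift
// Hypothetical stand-in for do_something; returns the length so the loop has an observable effect.
func doSomething<S: StringProtocol>(_ s: S) -> Int {
    return s.count
}

let text = "very long string here"
var s = text[...]              // a Substring covering the whole string
var lengths: [Int] = []

while !s.isEmpty {
    s = s.dropLast()           // slicing operation: yields a shorter Substring
    lengths.append(doSomething(s))
}

print(lengths.count, text)
```

The same shape works for head removal with `dropFirst()`, and for other collections via their own slice types.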
isn't the situation irt memory leaks and noise exactly the same now with Substring? extra noise due to "String(substring)" here and there, and this warning in the documentation which is very easy to miss / forget for newcomers:
IME, I often find heavy string manipulation code (one that warrants Substring) and other, lighter string usages to be separated relatively cleanly. Meaning that I'd only need to do String(substring) or .compact only when crossing between those two areas. In that regard, I do enjoy having type-level information to remind me to trim unused string portions.