String Essentials
Hi all, I want to share some thoughts on a few more areas of string usability improvements. These additions could alleviate a lot of frustration when working with String, and they all have at least some basic functionality that can be added in the near term.
String Cleaning
A basic ease-of-use API present in many programming languages is the ability to strip leading and/or trailing whitespace from a string. This is a frequently encountered issue.
See Rosetta Code’s survey of implementations for how this is expressed in dozens of programming languages (which is conspicuously missing an example for Swift).
For example, Rust calls it trim, defaults to whitespace (but supports parameterizing over a pattern), and has separate functions for from-start and from-end.
Swift has yet to decide what such a Pattern type would look like, but it could be part of regex support (though we likely need to accommodate partial matches). For now, we can add the whitespace one as that addresses the majority of usage. We can also figure out how Swift should surface the distinction between from-start, from-end, and both.
This should be available on String and its views. In the future, a pattern-taking variant could make sense for all BidirectionalCollections. This was discussed previously at String hygiene.
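As a rough illustration of the whitespace-only variant, here is a minimal sketch. The method names below are hypothetical (they are not existing or proposed standard library API), and the sketch assumes Character.isWhitespace is an acceptable definition of whitespace.

```swift
extension StringProtocol {
    /// Hypothetical name: drops whitespace from the start only.
    func trimmingLeadingWhitespace() -> SubSequence {
        drop(while: { $0.isWhitespace })
    }

    /// Hypothetical name: drops whitespace from the end only.
    func trimmingTrailingWhitespace() -> SubSequence {
        // If every Character is whitespace (or the string is empty), return an empty slice.
        guard let last = lastIndex(where: { !$0.isWhitespace }) else {
            return self[startIndex..<startIndex]
        }
        return self[...last]
    }

    /// Hypothetical name: drops whitespace from both ends.
    func trimmingWhitespace() -> SubSequence {
        guard let first = firstIndex(where: { !$0.isWhitespace }),
              let last = lastIndex(where: { !$0.isWhitespace })
        else {
            return self[startIndex..<startIndex]
        }
        return self[first...last]
    }
}

let padded = "  hi there \n"
let both = String(padded.trimmingWhitespace())             // "hi there"
let leading = String(padded.trimmingLeadingWhitespace())   // "hi there \n"
let trailing = String(padded.trimmingTrailingWhitespace()) // "  hi there"
```

Returning a slice rather than a new String keeps the operation allocation-free; whether that is the right default is one of the open design questions.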
Find, Replace and Split
An unfortunate omission in the standard library is substring find, replace, and split. We should add a basic findRange operation on e.g. Collection that finds ranges of indices for a given Collection. This can serve as the basis for generic find, replace, split, and other functionality. Since it’s generic on Collection, it would be available on String as well as all scalar and code unit views. (Alternatively, we could put it on BidirectionalCollection, which excludes Set and Dictionary and is closer to what we’d normally consider to have a meaningful order.)
This is useful for all of String’s views: operating on a view of Characters gives find/replace semantics honoring Unicode canonical equivalence, while operating on the Unicode scalar or code unit views would honor literal equivalence.
This was discussed in Range based Collection mutations, which should be revived. And of course, if/when Swift gets patterns, those too.
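To make the idea concrete, here is a minimal sketch of a findRange-style operation using a naive search. The exact name, label, and signature are assumptions for illustration, as is the choice to return only the first match; a real implementation would likely use a smarter algorithm.

```swift
extension Collection where Element: Equatable {
    /// Hypothetical signature: the range of the first occurrence of `needle`, or nil.
    func findRange<C: Collection>(of needle: C) -> Range<Index>?
    where C.Element == Element {
        var start = startIndex
        while true {
            // Try to match `needle` beginning at `start`.
            var i = start
            var j = needle.startIndex
            while j != needle.endIndex {
                // Ran out of elements: no later start position can match either.
                if i == endIndex { return nil }
                if self[i] != needle[j] { break }
                formIndex(after: &i)
                needle.formIndex(after: &j)
            }
            if j == needle.endIndex { return start..<i }
            // A mismatch implies `start` is before `endIndex`, so advancing is safe.
            formIndex(after: &start)
        }
    }
}

let haystack = "cafe\u{301} au lait"   // "e" followed by a combining acute accent
let needle = "caf\u{E9}"               // precomposed "é"

// Character view: canonically equivalent text matches.
let characterMatch = haystack.findRange(of: needle)
// Unicode scalar view: literal comparison, so no match.
let scalarMatch = haystack.unicodeScalars.findRange(of: needle.unicodeScalars)
```

Here characterMatch is non-nil while scalarMatch is nil, which is the canonical-versus-literal distinction described above.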
Retrieval of Characters and Substrings from offsets
Unadorned subscripts in Swift imply efficient access to elements. However, String is a collection of Characters (extended grapheme clusters), and getting the nth Character is a linear-time operation whose behavior is Unicode-version specific. Due to this, strings only expose index-based subscripts. However, this is unwieldy for casual usage, and String should provide some mechanism to get a character or substring from offsets, without implying O(1).
This should also be available on all of String’s views, if not Collections in general.
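For comparison, here is what retrieval looks like today next to one possible labeled subscript. The subscript(offset:) spelling and its optional return type are assumptions made for this sketch, not a settled design; the argument label is what signals that access is not O(1).

```swift
extension Collection {
    /// Hypothetical: the element `offset` positions from the start, or nil if out of bounds.
    /// Linear time for collections that are not random access.
    subscript(offset offset: Int) -> Element? {
        guard offset >= 0,
              let i = index(startIndex, offsetBy: offset, limitedBy: endIndex),
              i != endIndex
        else { return nil }
        return self[i]
    }
}

let text = "Hello, world"

// Today: form an index explicitly, then subscript with it.
let viaIndex = text[text.index(text.startIndex, offsetBy: 7)]   // "w"

// With the hypothetical subscript, on String and on its views.
let viaOffset = text[offset: 7]                  // Optional("w")
let scalar = text.unicodeScalars[offset: 7]      // Optional("w") as a Unicode.Scalar
let codeUnit = text.utf8[offset: 7]              // Optional(119)
let outOfBounds = text[offset: 100]              // nil
```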
Call for Users (and Authors): Offset indexing pitches one formulation of this using offset-based indexing and subscripts. There are open questions to answer with this approach, including the optionality of some return types. Concrete usage is needed to figure this out.
An advantage of this approach is that by extending all Collections, slices of random access collections (which are themselves random access and could be passed generically) can be operated on using 0-based offsets rather than arithmetic with the startIndex.
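The slice case is easy to trip over with today's API, because a slice shares its indices with the base collection. The sketch below shows the current behavior and a hypothetical generic helper, element(of:atOffset:), that does the startIndex arithmetic once.

```swift
let numbers = [10, 20, 30, 40, 50]
let tail = numbers[2...]   // ArraySlice; its startIndex is 2, not 0

// `tail[0]` would trap at run time: 0 is not a valid index into this slice.
let firstOfTail = tail[tail.startIndex]        // 30
let secondOfTail = tail[tail.startIndex + 1]   // 40

/// Hypothetical helper: 0-based access that works for a collection or any of its slices.
func element<C: RandomAccessCollection>(of collection: C, atOffset offset: Int) -> C.Element {
    collection[collection.index(collection.startIndex, offsetBy: offset)]
}

let sameAsFirst = element(of: tail, atOffset: 0)     // 30
let fromBase = element(of: numbers, atOffset: 0)     // 10
```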
Many alternatives and discussion can be found at Shorthand for Offsetting startIndex and endIndex.
Extension: Negative offset means from endIndex
A small extension to any offset model would be using negative offsets to represent offsets from the endIndex (at least for BidirectionalCollections). This can improve usability and allow offset-based retrieval without calculating the full count.
Negative offsets do introduce some bug potential, which we should weigh against their usefulness, and we should see if we can better protect against it. Negative offsets start at -1 and not 0, so they can present off-by-one errors for the unfamiliar. It’s also easier to accidentally form empty or negative ranges using integer literals, since whether a range is valid depends on the length of the string. Non-literal integer expressions which produce negative numbers have a good chance of being a programmer bug, and negative offsets would hide them (unless we limit negative offsets to literals).
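A minimal sketch of how negative offsets could be resolved for a BidirectionalCollection, walking backwards from endIndex so the full count is never computed. The method name element(atOffset:) and the choice to return nil when out of bounds are assumptions for illustration.

```swift
extension BidirectionalCollection {
    /// Hypothetical: non-negative offsets count from the start,
    /// negative offsets count back from the end (-1 is the last element).
    func element(atOffset offset: Int) -> Element? {
        if offset >= 0 {
            guard let i = index(startIndex, offsetBy: offset, limitedBy: endIndex),
                  i != endIndex
            else { return nil }
            return self[i]
        } else {
            // Walks back from endIndex; nil if that would pass startIndex.
            guard let i = index(endIndex, offsetBy: offset, limitedBy: startIndex)
            else { return nil }
            return self[i]
        }
    }
}

let word = "abc"
let last = word.element(atOffset: -1)      // Optional("c")
let first = word.element(atOffset: -3)     // Optional("a")
let tooFar = word.element(atOffset: -4)    // nil

// The hazard described above: a computed offset that goes negative
// because of a bug silently selects an element from the end.
let position = 0
let probablyABug = word.element(atOffset: position - 1)   // Optional("c"), not an error
```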
Alternative: New Range Types
This is the approach taken at Proposal Offsetting subscript by Letanyan. This approach has the advantage of supporting ranges whose end point is negative, implying offset-from-end, which could be convenient.
It does come with the greater complexity cost of introducing even more Range types and operators. A similar idea is found at Alternate design for RelativeRange · GitHub, which makes the index being offset from explicit.
Alternative: IndexOffset
As an alternative to new range types, this introduces a new IndexOffset type, which avoids adding new operators. There are many other variations in that thread; my apologies to any others not explicitly called out here.
Alternative: Slicing DSL
Slicing is a common operation that already pretty heavily utilizes overloading, types, and custom operators in the standard library. This complexity sometimes results in ambiguous subscript / range error messages. At some point, it could make sense to design specific syntax for slicing collections. If a slicing DSL is designed and gains traction soon enough, it could obviate the need for the offset-based subscript overloads.
See Dave Abrahams’s example for one formulation of this.
Alternative: character/substring methods
String could have something like func character(atOffset: Int) -> Character and func substring(fromOffset: Int, to: Int) -> Substring, as mentioned here. These would be pretty simple and avoid subscript confusion, but would not be available on String’s views (unless we add scalar(atOffset:)/codeUnit(atOffset:), etc.), and wouldn’t improve other collections such as slices of RandomAccessCollections.
This approach would not have the ability to propagate mutation through an accessor, which is possible with a subscript. That being said, these could easily be deprecated in favor of some future ultimate solution.
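For reference, the two methods could be implemented in a few lines on top of existing index APIs. This sketch assumes that `to` is an exclusive end offset and that out-of-bounds offsets trap, as they do for index-based subscripts; neither is specified above.

```swift
extension String {
    /// Sketch of the method named above. Linear time; traps if `offset` is out of bounds.
    func character(atOffset offset: Int) -> Character {
        self[index(startIndex, offsetBy: offset)]
    }

    /// Sketch of the method named above. Assumes `end` is exclusive.
    func substring(fromOffset start: Int, to end: Int) -> Substring {
        let lower = index(startIndex, offsetBy: start)
        // Continue from `lower` rather than walking from the start a second time.
        let upper = index(lower, offsetBy: end - start)
        return self[lower..<upper]
    }
}

let greetingText = "Hello, world"
let fifth = greetingText.character(atOffset: 4)               // "o"
let lastWord = greetingText.substring(fromOffset: 7, to: 12)  // "world"
```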
String initializers
Like SE-0245, String should provide an initializer that can invoke a given user’s closure on uninitialized UTF-8 capacity. Afterwards, String will still have to validate, and potentially error-correct, the contents.
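For readers on newer toolchains: an initializer of this shape, String(unsafeUninitializedCapacity:initializingUTF8With:), is available in Swift 5.3 and later (on Apple platforms it also requires a recent enough OS). A minimal usage sketch, in which the closure writes bytes into the buffer and returns how many it initialized:

```swift
let helloBytes: [UInt8] = [0x68, 0x65, 0x6C, 0x6C, 0x6F]   // "hello" in UTF-8

let fromBuffer = String(unsafeUninitializedCapacity: helloBytes.count) { buffer in
    // `buffer` is uninitialized storage with at least the requested capacity.
    for (i, byte) in helloBytes.enumerated() {
        buffer[i] = byte
    }
    // Report the number of bytes written; String then validates (and repairs) them.
    return helloBytes.count
}
```

The closure is where a C API or a custom encoder would write directly into the string's storage, avoiding an intermediate copy.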
String could also provide a UTF-8 initializer with control over how to handle encoding validity. There are 3 approaches:
- Throwing initializer (none currently exist, but we could throw UTF8ValidationResult)
- Failable initializer (currently available only for C-strings)
- Perform error-correction on initialization (normal decoding initializer)
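The three approaches can be sketched side by side. String(decoding:as:) is the existing error-correcting initializer; the throwing and failable initializers below, their labels, and the InvalidUTF8 error type are all hypothetical stand-ins. The sketch validates by checking that the repaired string round-trips to the same bytes, which is simple but not how an efficient implementation would do it.

```swift
/// Hypothetical error type, standing in for something like UTF8ValidationResult.
struct InvalidUTF8: Error {}

extension String {
    /// Hypothetical throwing initializer.
    init(validatingUTF8Bytes bytes: [UInt8]) throws {
        let decoded = String(decoding: bytes, as: UTF8.self)
        // Repair replaces invalid sequences, so any change in the bytes means invalid input.
        guard decoded.utf8.elementsEqual(bytes) else { throw InvalidUTF8() }
        self = decoded
    }

    /// Hypothetical failable initializer.
    init?(utf8IfValid bytes: [UInt8]) {
        guard let valid = try? String(validatingUTF8Bytes: bytes) else { return nil }
        self = valid
    }
}

let validBytes: [UInt8] = [0x68, 0x69]           // "hi"
let invalidBytes: [UInt8] = [0x68, 0x69, 0xFF]   // 0xFF never appears in valid UTF-8

let repaired = String(decoding: invalidBytes, as: UTF8.self)   // "hi\u{FFFD}"
let failable = String(utf8IfValid: invalidBytes)               // nil
let accepted = try? String(validatingUTF8Bytes: validBytes)    // Optional("hi")
```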
View conformances
String’s views should have many of the same conformances that String itself enjoys, such as Comparable, which would provide literal semantics (as opposed to String’s, which honors Unicode canonical equivalence). Views should be:
- Comparable
- Hashable
- TextOutputStreamable
- ExpressibleByStringLiteral
- (maybe) TextOutputStream
- (maybe) LosslessStringConvertible
- (maybe) ExpressibleByStringInterpolation
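The difference between String's semantics and the literal semantics a view would have can already be observed with existing Sequence algorithms. This sketch only uses current API (elementsEqual and lexicographicallyPrecedes) to show what == and < on the views would presumably mean:

```swift
let precomposed = "caf\u{E9}"      // "é" as a single scalar
let decomposed = "cafe\u{301}"     // "e" followed by a combining acute accent

// String comparison honors Unicode canonical equivalence.
let stringsEqual = precomposed == decomposed

// The views compare literally, so the same two strings differ.
let scalarsEqual = precomposed.unicodeScalars.elementsEqual(decomposed.unicodeScalars)
let utf8Equal = precomposed.utf8.elementsEqual(decomposed.utf8)

// Literal ordering: U+0065 ("e") sorts before U+00E9 ("é") at the fourth scalar.
let decomposedFirst = decomposed.unicodeScalars
    .lexicographicallyPrecedes(precomposed.unicodeScalars)
```

Here stringsEqual is true, scalarsEqual and utf8Equal are false, and decomposedFirst is true.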