String Essentials

Michael_Ilseman · March 22, 2019, 1:32am

String Essentials

Hi all, I want to shared some thoughts on a few more areas of string usability improvements. These additions could alleviate a lot of frustration when working with String, and they all have at least some basic functionality that can be added in the near-term.

String Cleaning

A basic ease-of-use API present in many programming languages is the ability to strip leading and/or trailing whitespace from a string. This is a frequently encountered issue.

See Rosetta Code’s survey of implementations for how this is expressed in dozens of programming languages (which is conspicuously missing an example for Swift).

For example, Rust calls it trim, provides the default of whitespace (but supports parameterized over a pattern), and has separate functions for from-start and from-end.

Swift has yet to decide what such a Pattern type would look like, but it could be part of regex support (though we likely need to accommodate partial matches). For now, we can add the whitespace one as that addresses the majority of usage. We can also figure out how Swift should surface the distinction between from-start, from-end, and both.

This should be available on String and its views. In the future, a pattern-taking variant could make sense for all BidirectionalCollections. This was discussed previously at String hygiene.

Find, Replace and Split

An unfortunate omission in the standard library is substring find, replace, and split. We should add a basic findRange operation on e.g. Collection that finds ranges of indices for a given Collection. This can serve as the basis for generic find, replace, split, and other functionality. Since it’s generic on Collection, it would be available on String as well as all scalar and code unit views. (Alternatively, we could put it on BidirectionalCollection, which excludes Set and Dictionary, which is closer to what we’d normally consider to have a meaningful order.)

This is useful for all of String’s views: operating on a view of Characters gives find/replace semantics honoring Unicode canonical equivalence), while operating on the Unicode scalar or code unit views would honor literal equivalence.

This was discussed in Range based Collection mutations, which should be revived. And of course, if/when Swift gets patterns, those too.

Retrieval of Characters and Substrings from offsets

Unadorned subscripts in Swift imply efficient access to elements. However, String is a collection of Characters (extended grapheme clusters), and getting the nth Character is a linear-time operation whose behavior is Unicode-version specific. Due to this, strings only expose index-based subscripts. However, this is unwieldy for casual usage, and String should provide some mechanism to get a character or substring from offsets, without implying O(1).

This should also be available on all of String’s views, if not Collections in general.

Call for Users (and Authors): Offset indexing pitches one formulation of this using an offset-based indexing and subscript. There’s open questions to answer with this approach, including the optionality of some return types. Concrete usage is needed to figure this out.

An advantage of this approach is that by extending all Collections, slices of random access collections (which are themselves random access and could be passed generically) can be operated on using 0-based offsets rather than arithmetic with the startIndex.

Many alternatives and discussion can be found at Shorthand for Offsetting startIndex and endIndex

Extension: Negative offset means from endIndex

A small extension to any offset model would be using negative offsets to represent offsets from the endIndex (at least for BidirectionalCollections). This can improve usability and allow offset-based retrieval without calculating the full count.

They do introduce some bug potential, which we should weighed against their usefulness and we should see if we can better protect against them. Negative offsets start at -1 and not 0, so they can present off-by-one errors for the unfamiliar. It’s also easier to accidentally form empty or negative ranges using integer literals, which is dependent on the length of the string. Non-literal integer expressions which produce negative numbers have a good chance of being a programmer bug, and negative offsets would hide them (unless we limit negative offsets to literals).

Alternative: New Range Types

This is the approach taken at Proposal Offsetting subscript by Letanyan. This approach has the advantage of supporting ranges whose end point is negative, implying offset-from-end, which could be convenient.

It does come with greater complexity cost of introducing even more Range types and operators. A similar idea is found at Alternate design for RelativeRange · GitHub, which makes the index being offset from explicit.

Alternative: IndexOffset

As an alternative to new range types, this introduces a new IndexOffset type, which avoids adding new operators. There are many other variations in that thread, my apologies to any others not explicitly called out here.

Alternative: Slicing DSL

Slicing is a common operation that already pretty heavily utilizes overloading, types, and custom operators in the standard library. This complexity sometimes results in ambiguous subscript / range error messages. At some point, it could make sense to design specific syntax for slicing collections. If a slicing DSL is designed and gains traction soon enough, it could obviate the need for the offset-based subscript overloads.

See Dave Abraham’s example for one formulation of this.

Alternative: character/substring methods

String could have something like func character(atOffset: Int) -> Character and func substring(fromOffset: Int, to: Int) -> Substring, as mentioned here. These would be pretty simple and avoid subscript confusion, but would not be available on String’s views (unless we add scalar(atOffset:)/codeUnit(atOffset:), etc.), and wouldn’t improve other collections such as slices of RandomAccessCollections.

This approach would not have the ability the propagate mutation through an accessor, which is possible with a subscript. That being said, these could easily be deprecated in favor of some future ultimate solution.

String initializers

Like SE-0245, String should provide an initializer that can invoke a given user’s closure on uninitialized UTF-8 capacity. Afterwards, String will still have to validate, and potentially error-correct, the contents.

String could also provide a UTF-8 initializer with control over how to handle encoding validity. There are 3 approaches:

Throwing initializer (none currently exist, but we could throw UTF8ValidationResult)
Failable initializer (currently available only for C-strings)
Perform error-correction on initialization (normal decoding initializer)

View conformances

String’s views should have many of same conformances that String itself enjoys, such as Comparable, which would provide literal semantics (as opposed to String’s which honors Unicode canonical equivalence). Views should be:

Comparable
Hashable
TextOutputStreamable
ExpressibleByStringLiteral
(maybe) TextOutputStream
(maybe) LosslessStringConvertible
(maybe) ExpressibleByStringInterpolation

IOOI · March 22, 2019, 2:23pm

Let me ask a possibly stupid question: was NSString's/NSMutableString's API a consideration or do we want to start from scratch?

nuclearace · March 22, 2019, 3:08pm

Yes, these are some much needed things. Are there any immediate concerns with tackling this now? It doesn't seem like anything should be holding these additions up short of an implementation and proposal.

I think the best way to solve most of these problems is just to go with Joe's suggestion and use a method for these kinds of operations. A substring method built on some kind of particularly powerful Collection.slice method side steps a lot of the concerns I think most people have with a subscript based solution.

jrose · March 22, 2019, 3:25pm

NSString / NSMutableString is worth looking to for reference, but I think we're working under different constraints (value types, non-interchangeable indexes, Characters instead of UTF-16 code units) that make it a possible inspiration and source of ideas rather than a clear starting point that we should port over verbatim.

Joe_Groff · March 22, 2019, 4:44pm

I think that, if the syntactic and design issues could be solved, this would be the ideal solution. It would be great to refine the RangeExpression protocol into something that can also abstractly express more complex relative index computations. That would give us something that works for all collections, as you noted, but also something that could work across collection operations, not just subscripting, since other range operations like replaceSubrange could use the same protocol to gain the same expressivity.

fswarbrick · March 22, 2019, 8:56pm

I have what I consider to be a decent implementation for trim/trimming. It's on BidirectionalCollection and then specialized as necessary for StringProtocol, String and Array. I'd be happy for it to be used as the base for a stdlib implementation.

public enum TrimOption {
    case both, leading, trailing
}

extension BidirectionalCollection {
    public typealias ElementEvaluation = (Element) throws -> Bool
    public typealias ElementEvaluationMethod = (Element) throws -> () throws -> Bool
    public typealias ElementProperty = KeyPath<Element, Bool>
    
    public var lastIndex: Index { return index(before: endIndex) }

    public func trimming(_ trimOpt: TrimOption = .both, where predicate: ElementEvaluation) rethrows -> SubSequence {
        let leftIndex = (trimOpt == .trailing) ? startIndex : try firstIndex { !(try predicate($0)) } ?? startIndex
        let rightIndex = (trimOpt == .leading) ? lastIndex : try lastIndex { !(try predicate($0)) } ?? lastIndex
        return self[leftIndex ... rightIndex]
    }
    
    @inline(__always)    
    public func trimming(_ trimOpt: TrimOption = .both, where predicate: ElementEvaluationMethod) rethrows -> SubSequence {
        return try trimming(trimOpt) { try predicate($0)() }
    }
    
    @inline(__always)    
    public func trimming(_ trimOpt: TrimOption = .both, while property: ElementProperty) -> SubSequence {
        return trimming(trimOpt) { $0[keyPath: property] }
    }
}

extension BidirectionalCollection where Element: Equatable {
    @inline(__always)    
    public func trimming(_ trimOpt: TrimOption = .both, while trimElement: Element) -> SubSequence {
        return trimming(trimOpt) { $0 == trimElement }
    }
}

extension BidirectionalCollection where Self == SubSequence {
    @inline(__always)    
    public mutating func trim(_ trimOpt: TrimOption = .both, where predicate: ElementEvaluation) rethrows {
        self = try trimming(trimOpt, where: predicate)
    }
    
    @inline(__always)    
    public mutating func trim(_ trimOpt: TrimOption = .both, where predicate: ElementEvaluationMethod) rethrows {
        self = try trimming(trimOpt, where: predicate)
    }
    
    @inline(__always)    
    public mutating func trim(_ trimOpt: TrimOption = .both, while property: ElementProperty) {
        self = trimming(trimOpt, while: property)
    }
}

extension BidirectionalCollection where Self == SubSequence, Element: Equatable {
    @inline(__always)    
    public mutating func trim(_ trimOpt: TrimOption = .both, while trimElement: Element) {
        self = trimming(trimOpt, while: trimElement)
    }
}

extension StringProtocol {
    @inline(__always)    
    public func trimming(_ trimOpt: TrimOption = .both) -> SubSequence {
        return trimming(trimOpt, while: " ")
    }
}

extension StringProtocol where Self == SubSequence {
    @inline(__always)    
    public mutating func trim(_ trimOpt: TrimOption = .both) {
        self = trimming(trimOpt, while: " ")
    }
}

extension String {
    @inline(__always)    
    public mutating func trim(_ trimOpt: TrimOption = .both, where predicate: ElementEvaluation) rethrows {
        self = String(try trimming(trimOpt, where: predicate))
    }
    
    @inline(__always)    
    public mutating func trim(_ trimOpt: TrimOption = .both, where predicate: ElementEvaluationMethod) rethrows {
        self = String(try trimming(trimOpt, where: predicate))
    }
    
    @inline(__always)    
    public mutating func trim(_ trimOpt: TrimOption = .both, while property: ElementProperty) {
        self = String(trimming(trimOpt, while: property))
    }
    
    @inline(__always)    
    public mutating func trim(_ trimOpt: TrimOption = .both, while trimElement: Character = " ") {
        self = String(trimming(trimOpt, while: trimElement))
    }
}

extension Array {
    @inline(__always)    
    public func trimming(_ trimOpt: TrimOption = .both, where predicate: ElementEvaluation) rethrows -> Array {
        return Array(try trimming(trimOpt, where: predicate) as ArraySlice)
    }
    
    @inline(__always)    
    public func trimming(_ trimOpt: TrimOption = .both, where predicate: ElementEvaluationMethod) rethrows -> Array {
        return Array(try trimming(trimOpt, where: predicate) as ArraySlice)
    }
    
    @inline(__always)    
    public func trimming(_ trimOpt: TrimOption = .both, while property: ElementProperty) -> Array {
        return Array(trimming(trimOpt, while: property) as ArraySlice)
    }
    
    @inline(__always)    
    public mutating func trim(_ trimOpt: TrimOption = .both, where predicate: ElementEvaluation) rethrows {
        self = Array(try trimming(trimOpt, where: predicate))
    }
    
    @inline(__always)    
    public mutating func trim(_ trimOpt: TrimOption = .both, where predicate: ElementEvaluationMethod) rethrows {
        self = Array(try trimming(trimOpt, where: predicate))
    }
    
    @inline(__always)    
    public mutating func trim(_ trimOpt: TrimOption = .both, while property: ElementProperty) {
        self = Array(trimming(trimOpt, while: property))
    }
}

extension Array where Element: Equatable {
    @inline(__always)    
    public mutating func trim(_ trimOpt: TrimOption = .both, while trimElement: Element) {
        self = Array(trimming(trimOpt, while: trimElement))
    }
}

I'll follow below with some use cases.

fswarbrick · March 22, 2019, 9:25pm

First, some "helper" methods:

extension Character {
    public func isSame(_ c: Character) -> Bool {
        return self == c
    }
    public var isVowel: Bool {
        return Set(["a","e","i","o","u"]).contains(lowercased())
    }
}

Examples of usage:

let ss = "  trim me  ".trimming(.leading)
print("'\(ss)'")  // 'trim me  '

let sp = " ? ".trimming(while: \.isWhitespace)
print(sp)  // ?

var q = "XqY".trimming(while: "Y")
print(q)  // Xq
q.trim(where: Character.isSame("X"))
print(q)  // q

var s = "  trim me"
s.append("aXX")
print(s) //   trim meaXX
s.trim(while: "X")
print(s)  //   trim mea
s.trim(while: \.isVowel)
print(s)  //   trim m
s.trim(while: \.isWhitespace)
print(s)  // trim m
s.trim { $0 == "t" }
print(s)  // rim m

Michael_Ilseman · March 27, 2019, 1:36am

We should be able to develop the simple collection-taking findRange at any time. We’ll want to do a Pattern approach in the future and that could check for a trivial pattern and forwards on to the simple collection-taking findRange.

The also sounds good to me, unless we can figure out the DSL approach soon. I do think we will want this on UnicodeScalarView and probably the code unit views.

I agree, and I’m actually fine with Dave’s syntax (though I expect that to be highly controversial). Your thoughts on the syntax?

Interesting. Could you elaborate more on what you’d want to be able to express and how RangeExpression cannot meet that? Maybe instead of a slicing DSL we have a range DSL.

I think the pressing need would be for the whitespace special case for [Sub]String (likely based on Character.isWhitespace assuming it reconciles degenerate graphemes), UnicodeScalarView (using Unicode.Scalar.Properties.isWhitespace), and maybe code unit views (if useful) as aliases for the USV solution.

Beyond that, let’s see if a Pattern abstraction covers all the others well.

For the mutating trim operation, I wonder if it would be best on RangeReplaceableCollection, which can delete prefixes or suffixes in-place.

Joe_Groff · March 27, 2019, 3:00pm

The issue with RangeExpression today is that it includes both contains and relative(to:) as requirements. contains makes sense for open-ended ranges, but not really for relative-offset ranges, since you wouldn't know whether a range like ++1 ... --2 contains a particular index without knowing what collection those offsets are relative to. Similarly, contains requires a Collection whose index matches the RangeExpression's bound type, but for an abstract relative range, you also don't necessarily know what the specific index type is. ++1 ... --2 could just as well apply to String indexes as array indexes. We'd need a more abstract protocol to cover these cases, which maybe RangeExpression can be a refinement of.

As for syntax, I think Dave's syntax is fine too, and people will hate any new thing until they get used to it.

xwu · March 27, 2019, 3:05pm

Could we not say that they are, for these purposes, arbitrarily large ranges, such that ++n is an index that precedes --m for any ‘n’ and ‘m’? Like an open-ended range, but with the opening smack in the center.

Joe_Groff · March 27, 2019, 3:07pm

You could, perhaps. You'd still need the more generic form of contains to be able to apply the expression to a collection, though.

Michael_Ilseman · April 18, 2019, 1:27am

Another example of this unintuitive behavior is in:

Michael_Ilseman · April 23, 2019, 12:09am

Barring any new concrete use cases from the community, I think we should explore the longer-term solution with a DSL for constructing offset ranges.

Coming back around to this. With ABI and source stability, what can we realistically achieve here? We can't change RangeExpression itself or have it inherit from a more general protocol. And we can’t have it retroactively conform to a new protocol.

RangeExpression is currently:

public protocol RangeExpression {
  associatedtype Bound: Comparable

  func relative<C: Collection>(
    to collection: C
  ) -> Range<Bound> where C.Index == Bound

  func contains(_ element: Bound) -> Bool

The only users of RangeExpressions in the standard library are Collection, MutableCollection, and RangeReplaceableCollection. These are versions of subscript, replaceSubrange, and removeSubrange defined in extensions that are generic over RangeExpression and just pass it through relative(to:).

RangeExpression itself provides ~=, which uses contains. I don’t see anything using the Comparable constraint on the Bound associated type and there’s no default definition for contains, but maybe I missing something there. There is a TODO on Range.overlaps to put it on RangeExpression, but I don’t know how we’d provide a default implementation for an any-to-any overlap check since we don't know all potential conformers.

So, perhaps something like:

public protocol RelativeRangeExpression {
  func relative<C: Collection>(to collection: C) -> Range<C.Index>
}

And we have 4 new overloads in extensions for RelativeRangeExpression? What would be the advantage of this protocol over just using a concrete type such as @beccadax's RelativeRange?

enum RelativeBound<Bound: Comparable> {
    case from(Bound, offset: Int)
    case fromStart(offset: Int)
    case fromEnd(offset: Int)
}
struct RelativeRange<Bound: Comparable> {
    let lowerBound: RelativeBound<Bound>
    let upperBound: RelativeBound<Bound>
}

These offsets are relative to other indices (by default, start/endIndex). So this would read idx1[++|--]n < idx2[++|--]m iff idx1 < idx2.

Hole-in-the-middle is interesting, but what does this actually allow us to express? Or is this a technique for conforming to the existing RangeExpression?

I assume you mean relative(to:) in these responses (which is the one that takes a Collection).

Why do we need an untyped ++1...--2? In RelativeRange above, we'd require type context for the pure-offset representation (effectively phantom-typed).

xwu · April 23, 2019, 11:07pm

I wouldn’t call it a ‘technique’ so much as semantics that are already precedented by open-ended ranges, but yes.

Michael_Ilseman · April 23, 2019, 11:29pm

Another example use for offsetting an index, from @Ray_Fix's post

  mutating func appendInterpolation<T: BinaryFloatingPoint>(_ value: T, precision: Int = 3) {
    var result = String(describing: value)
    defer { appendInterpolation(result) }

    guard precision > 0 else { return }

    let string = result
    guard let index = string.firstIndex(of: ".") else { return }

    let offset = precision+1
    let last = string.index(index, offsetBy: offset,
                            limitedBy: string.endIndex) ?? string.endIndex

    result = String(string[..<last])

    let fraction = string[index..<last].count

    if fraction < offset {
      result += String(repeating: "0", count: offset-fraction)
    }
  }

Michael_Ilseman · April 23, 2019, 11:36pm

I suppose we'd end up also having PartialRelativeRangeThrough, etc....

Even if these relative ranges conformed to RangeExpression, we'd also be providing overloads for the range operators:

extension Comparable {
  public static func ..< (minimum: Self, maximum: RelativeBound<Self>) -> Range<Self> {}
  public static func ..< (minimum: RelativeBound<Self>, maximum: Self) -> Range<Self> {}

  public static func ... (minimum: Self, maximum: RelativeBound<Self>) -> ClosedRange<Self> {
  public static func ... (minimum: RelativeBound<Self>, maximum: Self) -> Range<Self> {}
}

At what point would an IndexExpression make sense?

Michael_Ilseman · April 24, 2019, 2:33am

So I used @xwu's semantics with @beccadax's RelativeBound and @dabrahams's syntax here

This gives us:

let str = "abcdefghijklmnopqrstuvwxyz"
let idx = str.firstIndex { $0 == "n" }!

print("[i]")
print(str[--14]) // m
print(str[idx--100]) // a
print(str[idx++1]) // o

print("a..<b")
print(str[++1 ..< --2]) // bcdefghijklmnopqrstuvwx
print(str[idx--2 ..< --2]) // lmnopqrstuvwx
print(str[idx ..< --2]) // nopqrstuvwx
print(str[idx--2..<idx]) // lm
print(str[idx--2..<idx++3]) // lmnop
print(str[--4 ..< --2]) // wx

print("a...b")
print(str[idx--2 ... --2]) // lmnopqrstuvwxy
print(str[idx ... --2]) // nopqrstuvwxy
print(str[idx--2...idx]) // lmn
print(str[idx--2...idx++3]) // lmnopq
print(str[--4 ... --2]) // wxy

print("..<b")
print(str[..<(idx++2)]) // abcdefghijklmno
print(str[..<(++20)]) // abcdefghijklmnopqrst
print(str[..<(--20)]) // abcdef

print("...b")
print(str[...(idx++2)]) // abcdefghijklmnop
print(str[...(++20)]) // abcdefghijklmnopqrstu
print(str[...(--20)]) // abcdefg

print("a...")
print(str[(idx++2)...]) // pqrstuvwxyz
print(str[(++20)...]) // uvwxyz
print(str[(--20)...]) // ghijklmnopqrstuvwxyz

kiel · October 27, 2020, 10:49pm

Is the standard library still missing a native (i.e.: not a bridge to NSString) trim whitespace API? I couldn't find anything on the String documentation.

xwu · October 28, 2020, 12:16am

github.com/apple/swift-algorithms

Add BidirectionalCollection.trimming

apple:main ← karwa:main

opened 08:15AM - 08 Oct 20 UTC

karwa

+224 -0

### Description Adds a '.trimming' method to BidirectionalCollection, returning a slice with matching elements removed from the start and end. ### Detailed Design ```swift extension BidirectionalCollection { /// Returns a `SubSequence` formed by discarding all elements at the start and end of the Collection /// which satisfy the given predicate. /// /// e.g. `[2, 10, 11, 15, 20, 21, 100].trimmimg(where: { $0.isMultiple(of: 2) })` == `[11, 15, 20, 21]` /// /// - parameters: /// - predicate: A closure which determines if the element should be omitted from the resulting slice. /// @inlinable public func trimming(where predicate: (Element) throws -> Bool) rethrows -> SubSequence } ``` ### Documentation Plan No existing guides need updating AFAIK. I haven't written a guide for this feature: is it necessary for every API to include this parallel documentation? ### Test Plan Tests are included. ### Source Impact Additive. ### Checklist - [X] I've added at least one test that validates that my change is working, if appropriate - [?] I've followed the code style of the rest of the project - [X] I've read the [Contribution Guidelines](CONTRIBUTING.md) - [X] I've updated the documentation if necessary

kiel · October 28, 2020, 12:23am

Ah, excellent! When will we be able to use this?