Unify String/Substring UnicodeScalarView initializers in StringProtocol

Abstract

Herein I propose a very small, non-intrusive addition to StringProtocol that unifies existing initializers on String and Substring. It would make it possible to write generic extensions across both types that are possible today only with unwieldy and unnecessary encoding/decoding. The short version: there is no generic way to instantiate an arbitrary StringProtocol from its own SubSequence, despite all conforming types providing the corresponding initializer.

Motivating Example

Let us consider the case of implementing a simple lexing primitive over String and Substring. The following consume method matches a prefix, returns the matched prefix as an independent String, and reassigns self to the remaining characters.

Figure 1

fileprivate extension String {
    @discardableResult
    mutating func consume(while pred: (Character) -> Bool) -> String? {
        let match = prefix { pred($0) }
        
        guard match.endIndex > startIndex else { return nil }
        
        defer { self = String(self.suffix(from: match.endIndex)) }
        return String(match)
    }
}

fileprivate extension Substring {
    @discardableResult
    mutating func consume(while pred: (Character) -> Bool) -> String? {
        let match = prefix { pred($0) }
        
        guard match.endIndex > startIndex else {
            return nil
        }
        
        defer { self = self.suffix(from: match.endIndex) }
        return String(match)
    }
}

It seems natural and ideal to instead implement this once over StringProtocol (Figure 2). However, the assignment to self cannot be made generic. More precisely, there is no generic way to instantiate a StringProtocol from its own SubSequence. One roundabout solution is to encode the unicodeScalars of the suffix and then use a cString/decoding initializer. This adds needless codec overhead, as StringProtocol requires a UnicodeScalarView anyway.
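For illustration, the roundabout route might look like the sketch below. It is only one possible variant: it goes through the suffix's utf8 view rather than unicodeScalars, and relies on the init(decoding:as:) requirement that StringProtocol already has. The method name consumeViaDecoding is hypothetical.

```swift
extension StringProtocol {
    // Hypothetical name; a sketch of the encode/decode workaround only.
    @discardableResult
    mutating func consumeViaDecoding(while pred: (Character) -> Bool) -> String? {
        let match = prefix(while: pred)

        // Nothing matched: leave self untouched.
        guard match.endIndex > startIndex else { return nil }

        let matched = String(match)
        let rest = suffix(from: match.endIndex)

        // Re-decode the remaining UTF-8 code units into a fresh Self.
        // This is the needless codec round trip described above.
        self = Self(decoding: rest.utf8, as: UTF8.self)
        return matched
    }
}
```

This compiles for both String and Substring, but every call pays for re-validating and copying the remainder, even when Self is Substring and a plain slice would do.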

Figure 2

fileprivate extension StringProtocol {
    @discardableResult
    mutating func consume(while pred: (Character) -> Bool) -> String? {
        let match = prefix { pred($0) }
        
        guard match.endIndex > startIndex else { return nil }
        
        // No such initializer.
        defer { self = Self(self.suffix(from: match.endIndex).unicodeScalars) }
        return String(match)
    }
}

Suggested Resolution

Note, however, that both String and Substring already expose the following initializers (Figure 3 is a sketch, not intended to accurately depict the implementation of String or Substring):

Figure 3

public struct String {
    public init(_ content: Substring.UnicodeScalarView)
    // ...
}

public struct Substring {
    public init(_ content: Substring.UnicodeScalarView)
    // ...
}

The above issue can then be remedied very easily by the addition of a single initializer to StringProtocol:

Figure 4

protocol StringProtocol {
    // ...
    init<SubStr>(_ content: SubStr.UnicodeScalarView) where SubStr == SubSequence, SubStr: StringProtocol
}

This addition is entirely painless and requires no additional implementation. The generic parameter is there to avoid modifying the constraints on StringProtocol itself, as I do not believe there is a way for StringProtocol to refine the constraints on SubSequence, which it inherits from parent protocols.

This protocol requirement is already satisfied by both String and Substring. By making this change, the sample shown in Figure 2 “just works”.

In the meantime, this can be worked around in one’s own code like so:

protocol AugmentedStringProtocol: StringProtocol {
    init<SubStr>(_ content: SubStr.UnicodeScalarView) where SubStr == SubSequence, SubStr: StringProtocol
}
// Initializers already exist, no need to implement anything.
extension String: AugmentedStringProtocol {}
extension Substring: AugmentedStringProtocol {}

Edit

Pursuant to further discussion, a better alternative might be to add ... where SubSequence: StringProtocol to the declaration of protocol StringProtocol, and declare the above initializer without the generic parameter.

I’m in favor of exposing this initializer. However, I’d be a lot happier if we could have StringProtocol add the requirement that SubSequence : StringProtocol (which I’m not sure if we can actually do today in Swift 4.1, but if not, we should fix whatever’s blocking it). That way we can get rid of the generic parameter on the initializer.

I agree, requiring SubSequence: StringProtocol would be far, far cleaner.

I just did a little testing, and it appears this pattern does in fact work:

protocol P { associatedtype T }
protocol Q: P where T: Q { }

So, it appears that it is possible to simply amend StringProtocol to require that SubSequence also conforms to StringProtocol. Since the only two types conforming to StringProtocol are String and Substring, and both have SubSequence == Substring, this should also be pretty non-intrusive. I’m not sure whether there would be any conflicts within the internal implementation details of strings, though.
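To make the analogy concrete, here is a small sketch of the same pattern with two conforming types. Piece and Whole are hypothetical stand-ins for Substring and String, and nestedType is a hypothetical helper that only shows the constraint can be relied on in generic code.

```swift
protocol P { associatedtype T }
protocol Q: P where T: Q { }

// Hypothetical stand-ins: Piece plays the role of Substring,
// Whole the role of String. Both use Piece as their T.
struct Piece: Q { typealias T = Piece }
struct Whole: Q { typealias T = Piece }

// Because T: Q, generic code may look through T to T.T.
func nestedType<X: Q>(of _: X.Type) -> Any.Type {
    return X.T.T.self
}
```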

Very relevant, as recursive protocol requirements and constraints will be implemented in Swift 5.

The above pattern doesn’t require any recursive protocol requirements/constraints. It could work today in Swift 4.1. Amending my previous example:

protocol AugmentifiedStringProtocol: StringProtocol where SubSequence: StringProtocol {
    init(_ content: SubSequence.UnicodeScalarView)
}

extension String: AugmentifiedStringProtocol {}
extension Substring: AugmentifiedStringProtocol {}

This works today in Swift 4.0.3 as well as the most recent beta release of 4.1. You can paste the above verbatim in the REPL.
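With that in place, the generic consume from Figure 2 might be written roughly as below. This is a sketch, not a tested standard library change: the protocol declaration is repeated so the snippet stands alone, and it leaves out the where clause on the assumption that a toolchain whose StringProtocol already constrains SubSequence is in use.

```swift
// Repeated from above so the snippet is self-contained. The
// SubSequence: StringProtocol constraint is assumed to come from
// StringProtocol itself here.
protocol AugmentifiedStringProtocol: StringProtocol {
    init(_ content: SubSequence.UnicodeScalarView)
}

extension String: AugmentifiedStringProtocol {}
extension Substring: AugmentifiedStringProtocol {}

extension AugmentifiedStringProtocol {
    @discardableResult
    mutating func consume(while pred: (Character) -> Bool) -> String? {
        let match = prefix(while: pred)

        // Nothing matched: leave self untouched.
        guard match.endIndex > startIndex else { return nil }

        let matched = String(match)

        // Rebuild Self directly from the suffix's scalar view,
        // with no encode/decode round trip.
        self = Self(suffix(from: match.endIndex).unicodeScalars)
        return matched
    }
}
```

A single implementation then serves both a var of type String and a var of type Substring, which is the goal of the pitch.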

Yes, but I meant a modified version in the Standard Library.

protocol StringProtocol {
    associatedtype SubSequence: StringProtocol
}

As far as I understand you want this as a general case, and it is reasonable.

I actually just noted with additional warnings turned on that in the AugmentifiedStringProtocol example above, the where SubSequence: StringProtocol bit is redundant. It’s already in the declaration of StringProtocol in the standard library as of 4.1.

public protocol StringProtocol : BidirectionalCollection, Comparable, 
                                 ExpressibleByStringLiteral, Hashable, 
                                 LosslessStringConvertible, TextOutputStream, 
                                 TextOutputStreamable 
    where Self.Element == Character, Self.SubSequence : StringProtocol

It is not in 4.0.3 however. This means that in 4.1, my suggested change amounts only to adding the following to StringProtocol:

init(_ content: SubSequence.UnicodeScalarView)

No need for any changes to constraints, or usage of generics, or implementation!

Drop that line in (and document appropriately) and it should be good to go.

Oh, right, my bad. SubSequence comes from BidirectionalCollection, so there isn’t a need for what I wrote.

Yes, apparently. However, I want to note there are other identical members of String and Substring that aren’t present in StringProtocol. For instance,

convenience init<S>(_ elements: S) where S : Sequence, Character == S.Element

init(_ content: Substring.CharacterView)

and some subscripts as well. I am not very experienced with string processing, yet I have the feeling this is deliberate and there were reasons the core team refrained from making these members required in StringProtocol. There could be some rationale, explaining why these members are undesirable as general requirements, although they are present in String and Substring.

I suspect most divergences were accidental. Now is a great time to address this!

You see, the fact that these members are present in the only two Standard Library types conforming to StringProtocol doesn’t mean that, in the eyes of the core team, they are necessarily valid as general requirements for the set of possible types StringProtocol is meant to generalize.

Though of course, I am not sure. This is only an assumption, given that the two types do obviously share a number of identical members.

Maybe it would be more correct for String and Substring to conform to a more concrete protocol refining StringProtocol, and for that protocol to rightfully require these shared members.

But of course! It would be interesting to hear from someone who is actually aware of the reasons.