(Sub)String.withUTF8 should be added to StringProtocol. SE-0247 deferred this question, saying "For now, we’re not sure it’s worth the extra witness table entries at this point". However, if you're writing a parser that works at the UTF-8 level, this means you either have to create your own protocol (so no witness table savings) or copy/paste your entry points (including their documentation). It's kind of annoying.
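For reference, the kind of shim you end up writing today might look something like this (a sketch of the workaround, not a stdlib declaration; the helper name is mine):

extension StringProtocol {
    // Hypothetical shim: funnel any StringProtocol value into the concrete
    // Substring.withUTF8 that SE-0247 did add. A real protocol requirement
    // would avoid the potential copy Substring.init makes for bridged or
    // otherwise non-native strings.
    func withUTF8Bytes<R>(
        _ body: (UnsafeBufferPointer<UInt8>) throws -> R
    ) rethrows -> R {
        var substring = Substring(self)
        return try substring.withUTF8(body)
    }
}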
I'd like to see us add support for "shared strings". The ABI already supports this, and there is a proof-of-concept PR. I haven't had time to write a proposal for it yet (so feel free to write your own, anybody). This is again really useful for parsers that deconstruct a String into constituent substrings (perhaps performing some transformations along the way, like normalising the case), as these would all be able to share a single allocation.
One thing that's worth considering is whether this makes sense as an API on String or on Substring. It was an explicit feature of String's design that it shouldn't be self-slicing, and that users should be aware when storing a String that is backed by a potentially much larger allocation - hence Substring. It may make sense to limit this API to the latter, for similar reasons.
Like many of the String APIs discussed in this thread, the removing methods can be generalized to collections. I came up with a quick gist. I could generalize it further to sequences, but then you couldn't easily check whether an affix was actually found and truncated.
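A sketch along those lines (the names and the Optional-returning design are my own, not necessarily the gist's): returning nil when the affix isn't present lets the caller tell whether anything was removed.

extension Collection where Element: Equatable {
    /// Returns the collection without the given prefix, or nil if the prefix
    /// isn't present.
    func removing<P: Sequence>(prefix: P) -> SubSequence? where P.Element == Element {
        var index = startIndex
        for expected in prefix {
            guard index != endIndex, self[index] == expected else { return nil }
            formIndex(after: &index)
        }
        return self[index...]
    }
}

extension BidirectionalCollection where Element: Equatable {
    /// Returns the collection without the given suffix, or nil if the suffix
    /// isn't present.
    func removing<S: BidirectionalCollection>(suffix: S) -> SubSequence? where S.Element == Element {
        var index = endIndex
        var suffixIndex = suffix.endIndex
        while suffixIndex != suffix.startIndex {
            guard index != startIndex else { return nil }
            formIndex(before: &index)
            suffix.formIndex(before: &suffixIndex)
            guard self[index] == suffix[suffixIndex] else { return nil }
        }
        return self[..<index]
    }
}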
See another recent post about Sequence.firstDelta(against:).
The last parameter restricts what part of the entire string to look at. We can replace that in a general RangeReplaceableCollection version by working with a sub-sequence.
We still need the first two parameters for the searched-for and replacement sequences.
We will add a parameter for a custom equivalence predicate. For an RRC that has an Equatable element type, there will be an overload that omits the predicate parameter.
Looking at the various flags of NSString.CompareOptions, many of them (caseInsensitive, literal, diacriticInsensitive, widthInsensitive, and forcedOrdering) can instead be handled by a common normalization step on the source and searched-for collections and/or by a customized predicate (a short example follows below).
The backwards and anchored flags become function parameters.
(regularExpression is too much to handle right now. And numeric doesn't apply since it's an ordering, not equivalence, directive.)
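To make the normalization point concrete, here is what collapsing caseInsensitive and diacriticInsensitive into an up-front step looks like, using Foundation's existing folding(options:locale:) purely as an example:

import Foundation

// Normalize both sides once; afterwards a plain Equatable-element search
// (or a trivial == predicate) is all that's needed.
let source = Array("Àpple Păe".folding(options: [.caseInsensitive, .diacriticInsensitive], locale: nil))
let target = Array("apple".folding(options: [.caseInsensitive, .diacriticInsensitive], locale: nil))
print(source.starts(with: target)) // true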
Initially I saw that an in-place replacement method would have the problem that the first replacement invalidates all the indices, breaking subsequent searches. So I decided to ditch an in-place method and make search-and-replace return a new copy. This relaxes the general requirement from RRC to a plain Collection. Then I decided to split search-and-replace into separate searching and replacing methods. My prototype for the replacing method turned out to be:
extension Collection {
    /// Returns a copy of this collection into a collection of the given type,
    /// but where the subsequences of this collection indicated by the given
    /// list of index ranges are replaced by mapped versions of themselves, in
    /// the given direction, using the given mapping function.
    public func replaceSubranges<S, T, U>(
        listedBy list: S, backwards: Bool, into type: T.Type,
        withResultsOf transform: (SubSequence) throws -> U
    ) rethrows -> T
    where S: Sequence, S.Element: RangeExpression, S.Element.Bound == Index,
          T: RangeReplaceableCollection, T.Element == Element,
          U: Sequence, U.Element == Element

    /// Copies this collection into an array, except that subsequences of this
    /// collection indicated by the given list of index ranges are replaced in
    /// the given direction by mapped versions of themselves, using the given
    /// mapping function.
    @inlinable
    public func replaceSubranges<S: Sequence, U: Sequence>(
        listedBy list: S, backwards: Bool,
        withResultsOf transform: (SubSequence) throws -> U
    ) rethrows -> [Element]
    where S.Element: RangeExpression, S.Element.Bound == Index, U.Element == Element
}
And the searching can be expressed as a sequence of ranges, packed into a new type:
/// A collection of the locations of all the occurrences of a target
/// subsequence within some base collection, according to an
/// element-equivalence function.
///
/// - Warning: `startIndex` is computed on the fly, making it O(*n*) instead of
///   O(1), where *n* is the length of the base collection. Since collection
///   iteration repeatedly tests against the last-reachable end-point, prefer
///   forward iteration when possible; `endIndex` is much quicker to compute.
public struct MatchingRangesCollection<Base: Collection, TargetElement> { /*...*/ }

extension MatchingRangesCollection: LazyCollectionProtocol { /*...*/ }
extension MatchingRangesCollection.Index: Hashable where Base.Index: Hashable {}
extension MatchingRangesCollection: BidirectionalCollection where Base: BidirectionalCollection { /*...*/ }

extension Collection {
    /// Returns the index ranges for each subsequence of this collection that
    /// is equivalent to the given sequence, using the given predicate to
    /// compare elements.
    public func matchingRanges<S: Sequence>(for target: S, by areEquivalent: @escaping (Element, S.Element) -> Bool) -> MatchingRangesCollection<Self, S.Element>
}

extension Collection where Element: Equatable {
    /// Returns the index ranges for each subsequence of this collection that
    /// is equal to the given sequence.
    @inlinable
    public func matchingRanges<S: Sequence>(for target: S) -> MatchingRangesCollection<Self, Element> where S.Element == Element
}
Then I tried to write a convenience method on Collection to search-and-replace by combining the matching-range collection with the replacement method. But I realized the method didn't need to accommodate escaping closures, and couldn't anyway, because escaping would forbid throwing predicates. So I started over.
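Roughly, the attempted convenience method and the constraint that killed it (a sketch assuming the declarations above were implemented; the method name is mine):

extension Collection {
    // Hypothetical composition of matchingRanges(for:by:) and
    // replaceSubranges(listedBy:backwards:withResultsOf:), assuming the range
    // collection's Element is Range<Index>. Because the lazy range collection
    // has to store the predicate, `areEquivalent` must be @escaping and
    // non-throwing - which is exactly what rules out a throwing predicate.
    func replacingMatches<S: Sequence, U: Sequence>(
        of target: S,
        withResultsOf transform: (SubSequence) throws -> U,
        by areEquivalent: @escaping (Element, S.Element) -> Bool
    ) rethrows -> [Element] where U.Element == Element {
        try replaceSubranges(
            listedBy: matchingRanges(for: target, by: areEquivalent),
            backwards: false,
            withResultsOf: transform
        )
    }
}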
I moved the main searching code to some helper functions:
extension Collection {
    /// Returns the index range for the earliest subsequence of this collection
    /// that is equivalent to the given sequence, using the given predicate to
    /// compare elements.
    public func firstRange<C: Collection>(equaling target: C, by areEquivalent: (Element, C.Element) throws -> Bool) rethrows -> Range<Index>?
}

extension Collection where Element: Equatable {
    /// Returns the index range for the earliest subsequence of this collection
    /// that equals the given sequence.
    @inlinable
    public func firstRange<C: Collection>(equaling target: C) -> Range<Index>? where C.Element == Element
}

extension BidirectionalCollection {
    /// Returns the index range for the last subsequence of this collection
    /// that is equivalent to the given sequence, using the given predicate to
    /// compare elements.
    @inlinable
    public func lastRange<C: BidirectionalCollection>(equaling target: C, by areEquivalent: (Element, C.Element) throws -> Bool) rethrows -> Range<Index>?
}

extension BidirectionalCollection where Element: Equatable {
    /// Returns the index range for the last subsequence of this collection
    /// that equals the given sequence.
    @inlinable
    public func lastRange<C: BidirectionalCollection>(equaling target: C) -> Range<Index>? where C.Element == Element
}
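For concreteness, a naive O(n·m) body for the first of these could look like the following (my sketch, not the prototype's actual implementation):

extension Collection {
    func firstRange<C: Collection>(
        equaling target: C,
        by areEquivalent: (Element, C.Element) throws -> Bool
    ) rethrows -> Range<Index>? {
        var start = startIndex
        while true {
            // Try to match the whole target beginning at `start`.
            var cursor = start
            var targetIndex = target.startIndex
            while targetIndex != target.endIndex {
                guard cursor != endIndex,
                      try areEquivalent(self[cursor], target[targetIndex])
                else { break }
                formIndex(after: &cursor)
                target.formIndex(after: &targetIndex)
            }
            if targetIndex == target.endIndex { return start..<cursor }
            if start == endIndex { return nil }
            formIndex(after: &start)
        }
    }
}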
Then I wrote a search-and-replace method with an unconditional replacement value:
extension Collection {
    /// Returns a copy of this collection into a collection of the given type,
    /// except subsequences that match the given collection according to the
    /// given predicate are replaced with copies of another given collection.
    public func replacingOccurrences<T, R, RR>(
        of target: T, with replacement: R, onlyAsPrefix: Bool, into type: RR.Type,
        by areEquivalent: (Element, T.Element) throws -> Bool
    ) rethrows -> RR
    where T: Collection, R: Collection, R.Element == Element,
          RR: RangeReplaceableCollection, RR.Element == Element

    /// Copies this collection into an array, except subsequences that match
    /// the given collection according to the given predicate are replaced with
    /// copies of another given collection.
    @inlinable
    public func replacingOccurrences<T: Collection, R: Collection>(
        of target: T, with replacement: R, onlyAsPrefix: Bool,
        by areEquivalent: (Element, T.Element) throws -> Bool
    ) rethrows -> [Element] where R.Element == Element
}
And I expanded it for bidirectional collections, where the replacement can be done forwards or backwards; this decides which of a pair of overlapping matches is replaced (a concrete Foundation example follows the declarations below):
extension BidirectionalCollection {
    /// Returns a copy of this collection into a collection of the given type,
    /// except subsequences that match the given collection according to the
    /// given predicate are replaced in a given order with copies of another
    /// given collection.
    @inlinable
    public func replacingOccurrences<T, R, RR>(
        of target: T, with replacement: R, backwards: Bool, onlyAtLeadingAdfix: Bool,
        into type: RR.Type, by areEquivalent: (Element, T.Element) throws -> Bool
    ) rethrows -> RR
    where T: BidirectionalCollection, R: BidirectionalCollection, R.Element == Element,
          RR: RangeReplaceableCollection, RR.Element == Element

    /// Copies this collection into an array, except subsequences that match
    /// the given collection according to the given predicate are replaced in a
    /// given order with copies of another given collection.
    @inlinable
    public func replacingOccurrences<T, R>(
        of target: T, with replacement: R, backwards: Bool, onlyAtLeadingAdfix: Bool,
        by areEquivalent: (Element, T.Element) throws -> Bool
    ) rethrows -> [Element]
    where T: BidirectionalCollection, R: BidirectionalCollection, R.Element == Element
}
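To see the overlapping-matches point with an API that already exists, Foundation's replacingOccurrences shows the difference the search direction makes:

import Foundation

// "aaa" contains two overlapping occurrences of "aa"; the direction decides
// which one gets replaced.
print("aaa".replacingOccurrences(of: "aa", with: "b"))                      // "ba"
print("aaa".replacingOccurrences(of: "aa", with: "b", options: .backwards)) // "ab"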
Then I made overloads for when the element type is Equatable. (But they lock the return type to Array.) Then I made overloads for RangeReplaceableCollection where Self is returned, along with overloads for BidirectionalCollection conformance and/or element-Equatable conformance.
Since my first idea of keeping searching and replacing fully separate fell through, I did make in-place overloads too. They just call the copying versions and then replace self with the result (a sketch follows).
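Roughly what those in-place overloads look like under this approach (a sketch that assumes the copying overloads above are implemented; the spelling is mine):

extension RangeReplaceableCollection where Element: Equatable {
    // Hypothetical in-place overload: delegate to the copying version that
    // returns Self, then replace the entire contents with the result.
    mutating func replaceOccurrences<T: Collection, R: Collection>(
        of target: T, with replacement: R, onlyAsPrefix: Bool = false
    ) where T.Element == Element, R.Element == Element {
        self = replacingOccurrences(
            of: target, with: replacement, onlyAsPrefix: onlyAsPrefix,
            into: Self.self, by: ==
        )
    }
}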
However, with the separate functionality, we could make a regular-expression range-matching reference type. An instance of that type, initialized with a regex pattern and a string, can be fed to the replace routine as the range list and as part of the transformation closure. In the transformation closure, it can read the indices of the current targeted subsequence and apply regex rules for ordinal matches.
The result similarly drops down a few levels to BidirectionalCollection et al., and splits out the special pattern-matching tokens (*, $, etc.) into an enum. I'm a bit out of practice with generics, but I do think this is more in the spirit of Swift and its DSLs than some runtime encoding of the pattern as a string.
enum RegularExpressionElement<Element>: Equatable where Element: Equatable {
    case any              // .
    case zeroOrMore       // *
    case begin            // ^
    case end              // $
    case element(Element) // a character
}

extension BidirectionalCollection where Element: Equatable {
    typealias RegularExpression = ArraySlice<RegularExpressionElement<Self.Element>>

    /* match: search for regexp anywhere in collection */
    func match(regexp: RegularExpression) -> Bool
}
Update: I know that .zeroOrMore(.element("o")) would make more sense, but I wanted to keep as close to the C original as possible for this demonstration.
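For what it's worth, a direct port of Pike's three routines onto this enum could look like the following; this is my sketch, not part of the original post, and it fills in a body for the match(regexp:) declaration above (the typealias is repeated so the snippet stands alone).

extension BidirectionalCollection where Element: Equatable {
    typealias RegularExpression = ArraySlice<RegularExpressionElement<Self.Element>>

    // match: if the pattern is anchored with ^, match only at the start;
    // otherwise try every starting position, including the empty tail.
    func match(regexp: RegularExpression) -> Bool {
        if regexp.first == .begin {
            return matchHere(regexp.dropFirst(), at: startIndex)
        }
        var position = startIndex
        while true {
            if matchHere(regexp, at: position) { return true }
            if position == endIndex { return false }
            formIndex(after: &position)
        }
    }

    // matchHere: does the pattern match the collection starting at `position`?
    private func matchHere(_ regexp: RegularExpression, at position: Index) -> Bool {
        guard let token = regexp.first else { return true }
        if regexp.dropFirst().first == .zeroOrMore {
            return matchStar(token, then: regexp.dropFirst(2), at: position)
        }
        if token == .end, regexp.count == 1 { return position == endIndex }
        if position != endIndex,
           token == .any || token == .element(self[position]) {
            return matchHere(regexp.dropFirst(), at: index(after: position))
        }
        return false
    }

    // matchStar: match zero or more `token`s, then the rest of the pattern.
    private func matchStar(
        _ token: RegularExpressionElement<Element>,
        then rest: RegularExpression, at position: Index
    ) -> Bool {
        var position = position
        while true {
            if matchHere(rest, at: position) { return true }
            guard position != endIndex,
                  token == .any || token == .element(self[position])
            else { return false }
            formIndex(after: &position)
        }
    }
}

With that, the pattern ^fo*d$ is [.begin, .element("f"), .element("o"), .zeroOrMore, .element("d"), .end] as a String.RegularExpression, and "food".match(regexp:) on it returns true.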
Yeah, that’s the big problem. RRC is kind of broken, because it basically resets all of your knowledge about the collection - it should at least return the range of the inserted elements, otherwise you can’t really build any useful in-place modification algorithms on top of it. We really need to fix that.
We can’t really add that to RRC without having a default implementation for backwards compatibility, so we’d need to replace that protocol outright.
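For illustration, the kind of shape being asked for might be (a hypothetical protocol, not an actual stdlib declaration):

// Hypothetical successor protocol: the only change from today's requirement
// is that the replacement reports where the new elements ended up, so a
// caller can resume searching from a still-valid position.
protocol RangeReplaceableCollectionV2: Collection {
    @discardableResult
    mutating func replaceSubrange<C: Collection>(
        _ subrange: Range<Index>, with newElements: C
    ) -> Range<Index> where C.Element == Element
}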
In WebURL, we do low-level string processing at the UTF-8 level, with functions that work with generic Collections of UInt8. Those functions call withContiguousStorageIfAvailable, wrapping the provided buffer in a UnsafeBoundsCheckedBufferPointer and forwarding to the actual implementation. This is great for code-size: all contiguous data sources immediately drop down to a single generic specialization, which includes bounds-checking.
If the collection is not contiguous, we fall back to using it directly, and the compiler will create a bespoke specialization. Generally, we don't expect many non-contiguous collections, so it's not a big deal.
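The dispatch looks roughly like this (simplified; the function names are illustrative, not WebURL's actual internals, and the real code wraps the buffer in a bounds-checked type first):

func parse<Bytes: Collection>(_ utf8: Bytes) -> Bool where Bytes.Element == UInt8 {
    // Fast path: every contiguous source collapses into the single
    // specialization of parseImpl for UnsafeBufferPointer<UInt8>.
    if let result = utf8.withContiguousStorageIfAvailable({ parseImpl($0) }) {
        return result
    }
    // Slow path: each non-contiguous collection gets its own specialization.
    return parseImpl(utf8)
}

func parseImpl<Bytes: Collection>(_ utf8: Bytes) -> Bool where Bytes.Element == UInt8 {
    // ... the actual byte-level parsing would go here ...
    return !utf8.isEmpty
}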
But there is one major exception: String.UTF8View. We know that, at runtime, almost all Strings will be native Swift Strings with contiguous UTF-8 storage - but statically, the compiler still emits a specialization that uses String.UTF8View and stores String.Index values, just in case it isn't.
I just added my own version of StringProtocol.withUTF8, and by using that instead (with no fallback to String.UTF8View), I was able to cut code-size by almost 20% (!!!!). That's just enormous. Using the cmpcodesize util from the compiler repository to test the benchmarks executable:
karl@Karls-MBP Benchmarks % ../../../swift-project/swift/utils/cmpcodesize/cmpcodesize.py before/WebURLBenchmark .build/release/WebURLBenchmark
Title             Section    Old        New        Percent
WebURLBenchmark   __text:    1775009    1425601    -19.7%
The overall executable (which includes google benchmark, argumentparser, etc) went from 5MB to 4.4MB, and WebURL.swiftmodule went from 3.9MB to 3.1MB.
The detailed diff is basically what you might expect. We no longer emit specializations of the entire parser to handle input provided as a bridged String.
Only in old file(s)
[...]
7528 generic specialization <WebURL.ASCII.NewlineAndTabFiltered<Swift.Substring.UTF8View>, WebURL.UnsafePresizedBufferWriter> of WebURL.ParsedURLString.ProcessedMapping.write<A where A1: WebURL.URLWriter>(inputString: A, baseURL: WebURL.WebURL?, to: inout A1) -> ()
10647 generic specialization <Swift.Substring.UTF8View, WebURL.UnsafePresizedBufferWriter> of WebURL.ParsedURLString.ProcessedMapping.write<A where A1: WebURL.URLWriter>(inputString: A, baseURL: WebURL.WebURL?, to: inout A1) -> ()
Total size of functions only in old file: 416146
Total size of functions only in new file: 65078
So yeah - I think it is absolutely worth adding this function to StringProtocol. The code-size wins for low-level generic text processing are very, very, very much worth a single extra witness table entry.