Additional String Processing APIs

Many String processing algorithms can also be applied to Sequences or Collections in general, so suggest we start there. There have been discussions in the past, including an Offset indexing pitch which would be useful, among others.

2 Likes

I think what @Karl and @davedelong meant is that they want some of NSMutableString, NSString, etc's methods reproduced in String. I don't think we need to change anything in Foundation at all for that. We can leave it be, and just add whatever functionality we need into String.

1 Like

I would also like to see more convenient “ASCII” parsing, with the caveat that this often means the structure of a file is ASCII-range characters, but some content such as strings or comments may be UTF-8.

Addendum: in the context of Swift, this isn’t necessarily a String thing; Data or pointer types may be more appropriate, if we had a solution to the zero-cost ergonomic ASCII character literal problem. It just springs to mind here because strings and text-or-text-like-data-processing are closely related.

This would perhaps be reasonable for things like literal search and replace, but case-insensitive/diacritic-insensitive search and capitalization are probably non-starters since there’s no concept of languages or locales in the standard library. For literal searching, generic Collection and Sequence solutions should probably be preferred to string-specific ones.

1 Like

I guess in this case you really need to encode the "if present" part somehow: either by naming the methods removePrefixIfPresent or by using the Set's convention of returning the optional of the element if it was found and removed or nil otherwise.

The caveat I'd also be worried about if the "if present" part is not communicated is that it might be unclear if the method removes the prefix/suffix only if it fully matches or the method is flexible enough to remove some matching part. That is, should "ABCD".removePrefix("ABCZ") remove the first three letters or do nothing at all? How does it communicate this decision?

There is precedent for returning the removed value (For example: removeFirst() -> Character), there is also precedent for not returning anything (removeSubrange(_:)). Personally, I think returning a discardable optional would give developers the most flexibility.

Only a complete match should be acted upon. In general Swift is very explicit, so we should be ok as long as the behavior pattern of the new API fits with hasPrefix/hasSuffix (and there is proper documentation).

1 Like

However, removeSubrange(_:) is unambiguous in the sense that it does not care what characters there are — and the usual convention here, as with arrays, is that we have checked in prior if the range is sound and fits the bounds, otherwise it traps. removeFirst() also traps. So really, to use these methods, you first check the validity of the subrange, then you remove it. The prefix/suffix methods as suggested though seem to me like they would avoid those checks, which suddenly makes them more powerful despite being more specific.

So ideally, you'd make the proposed methods follow the same convention: first you ask the string for the range of a certain prefix (maybe via a method like firstRange(of: String) -> Range<String.Index>?). Then you remove the actual range.

Yes, this functionality should be in the standard library. You shouldn't need Foundation for basic find/replace.

Of course, they are in Foundation right now, because Foundation is used to owning its own String type, as well as all the other basic types, since it is the "standard library" for Objective-C.

Locale-dependent searching is an interesting question. I don't know enough about unicode to say whether our String type requires locales for correctness, but if it does I would be in favour of lifting that to the standard library, too.

1 Like

After a cursory look at the implementation of replacingOccurrences(of:with:options:range:) in swift-corelibs-foundation, it appears that Unicode processing (case folding?) is used. Not sure if this is required because of the case insensitive search option or just needed in general.

Call Stack for replacingOccurrences
  1. StringProtocol.replacingOccurrences
  2. NSString.replacingOccurrences
  3. NSMutableString.replaceOccurrences
  4. CFStringCreateArrayWithFindResults
  5. CFStringFindWithOptions
  6. CFStringFindWithOptionsAndLocale
  7. __CFStringFoldCharacterClusterAtIndex

I will work on a proof of concept replace extension and see what happens.

1 Like

Not sure if this is relevant, but there's things like:

let s = "ß"
print(s.lowercased()) // ß
print(s.uppercased()) // SS
print(s.capitalized)  // Ss

And I guess a whole lot of even more complicated stuff related to other languages than german.

2 Likes

Adding to @Jens's "ß" example, here is one with Turkish alphabet:

let i = "i"
let ı = "ı"
let İ = "İ"
let I = "I"

print(i.uppercased()) // I
print(ı.uppercased()) // I
print(İ.lowercased()) // i
print(I.lowercased()) // i

In Turkish, "i" and "İ" are the same letter, and "ı" and "I" the same. Because of the lack of locale-awareness in String, both "ı" and "i" are capitalised to "I", and 'I" and "İ" to "i".

I'm not sure how much it will help String by giving it locale-awareness though, since many texts contain multiple languages.

Ligatures in fonts are also affected by this, but that probably falls in NSFontManager's domain, not really in String's.

On another mostly unrelated note, this is often a supporting case for case-sensitive file systems.

1 Like

I recently tried to implement a tokenizer in Swift, and it was very awkward to do with NSScanner (as it seems to be aimed more at ad-hoc date parsing and the like where you don't need to store parse offsets etc.), and not obviously supported in String at all. And given it is a bit complicated to perform arithmetics on String.Index correctly, it would be helpful to make such things easier and more efficient.

In particular, a way to iterate over a String and test whether a certain string or sequence from a CharacterSet is present, and to also get their Range in the String would be really helpful. Something like

extension String {
    func extractIf(prefix: String, in range: inout Range) -> (String, Range)?
    func extracLongesttIf(prefixFrom: CharacterSet, in range: inout Range) -> (String, Range)?
}

This would take the search range and update it to exclude the characters just found, so you can just call this repeatedly with a var searchRange = Range(location: 0, length: myString.count) that then is updated by each subsequent call.

1 Like

FWIW, while the general idea of letters that don't have an uppercase version or have one that consists of other characters probably exists in other languages (if you leave aside languages like Japanese that don't even have the concept of case), this example in practice is not really useful:

There are no words that start with a sharp-S, so the capitalized() version will never occur in practice. Also, a few years ago the capital letter for sharp-S was standardized by the German ministries of education of all the states, so a better solution for uppercased()would be to finally update macOS keyboard layouts to allow typing that letter and leave Swift alone in this particular case.

A function that removes on partial prefixes can be composed from the whole-prefix version along with a divergence-testing method.

extractIf could be done with a divergence-testing function too. extractLongestIf could be adapted from self.firstIndex(where: { !prefixFrom.contains($0) }). Neither requires String; they should be added to Collection in general.

One thing that I feel is missing from strings in Swift is parsing.

A primary aspect of parsing that is currently clunky, difficult to use, and lacking in some regards is regular expressions. To start off, you have to import Foundation to be able to do anything with regular expressions in the first place. While there are currently a few methods that allow you to somewhat easily work with regular expressions by using .regularExpression option (replacingOccurrences(of:with:options:) and range(of:options:)), they aren't always sufficient and you often need to work with NSRegularExpression. This is were I largely take issue. I also got some discussion going regarding some of the issues with regular expressions here by the way. Anyways, NSRegularExpression is not a very swifty API and is often a pain to use. Working with capture groups and matches is also not very easy. I feel that a native Swift implementation that provides an API that leverages Swift's features would be quite helpful. Maybe even provide first-class regex support with a unique literal syntax with highlighting :man_shrugging:. @Michael_Ilseman came up with some ideas regarding how this could work in his State of String: ABI, Performance, Ergonomics, and You! doc. And even if it is not feasible to provide custom syntax and whatnot, I feel that a native Swift solution in some capacity is needed.

On a more general note though, I feel that we should look into what can be done to make parsing strings easer. I'm not sure how we can go about this, but, when trying to create a program that tokenized and parsed strings, I found that it got very complicated very quickly. I feel like there must be a better way, though I'm not really sure.

2 Likes

Also, a couple more things:

  • (Sub)String.withUTF8 should be added to StringProtocol. SE-0247 deferred this question, saying "For now, we’re not sure it’s worth the extra witness table entries at this point". However, if you're writing a parser which works on the UTF8 level, this means you either have to create your own protocol (so no witness table savings) or copy/paste your entry-points (including their documentation). It's kind of annoying.

  • I'd like to see us add support for "shared strings". The ABI already supports this, and there is a proof-of-concept PR. I haven't had time to write a proposal for it yet (so feel free to write your own, anybody). This is again really useful for parsers which deconstruct a String in to constituent substrings (perhaps performing some transformations along the way, like normalising the case), as these would all be able to share a single allocation.

    One thing that's worth considering is whether this makes sense as an API on String or on Substring. It was an explicit feature of String's design that it shouldn't be self-slicing, and that users should be aware when storing a String that is backed by a potentially much larger allocation - hence Substring. It may make sense to limit this API to the latter, for similar reasons.

2 Likes

Like many of the various String APIs I've read here, the removing methods can be generalized to collections. I came up with a quick gist. I could generalize it further for sequences, but then you couldn't easily check if an affix was actually found and truncated or not.

See another recent post about Sequence.firstDelta(against:).

Looking at NSString.replacingOccurrences(of: with: options: range:):

  • The last parameter restricts what part of the entire string to look at. We can replace that in a general RangeReplaceableCollection version by working with a sub-sequence.
  • We still need the first two parameters for the searched-for and replacement sequences.
  • We will add a parameter for a custom equivalence predicate. For an RRC that has an Equatable element type, there will be an overload that omits the predicate parameter.
  • Looking at the various flags of NSString.CompareOptions, many of them (caseInsensitive, literal, diacriticInsensitive, widthInsensitive, and forcedOrdering) can be omitted by a common normalization step for the source and search-for collections and/or use of a customized predicate.
  • The backwards and anchored flags will be moved to be function parameters.
  • (regularExpression is too much to handle right now. And numeric doesn't apply since it's an ordering, not equivalence, directive.)

Initially I saw that using an in-place replacement method would have the problem that the first replacement will void all the indices, screwing up subsequent searches. So I decided to ditch an in-place method, and make search-and-replace return a new copy. This changes the general requirements from RRC to a plain Collection. Then I decided to separate search-and-replace into separate searching and replacing methods. My prototype for the replacing method turned out to be:

extension Collection {
    /// Returns a copy of this collection into a collection of the given type,
    /// but where the subsequences of this collection indicated by the given
    /// list of index ranges are replaced by mapped versions of themselves, in
    /// the given direction, using the given mapping function.
    public func replaceSubranges<S, T, U>(
        listedBy list: S, backwards: Bool,  into type: T.Type, withResultsOf transform: (SubSequence) throws -> U
    ) rethrows -> T
        where S: Sequence, S.Element: RangeExpression, S.Element.Bound == Index, T: RangeReplaceableCollection, T.Element == Element, U: Sequence, U.Element == Element

    /// Copies this collection into an array, except that subsequences of this
    /// collection indicated by the given list of index ranges are replaced in
    /// the given direction by mapped versions of themselves, using the given
    /// mapping function.
    @inlinable
    public func replaceSubranges<S: Sequence, U: Sequence>(
        listedBy list: S, backwards: Bool, withResultsOf transform: (SubSequence) throws -> U
    ) rethrows -> [Element]
        where S.Element: RangeExpression, S.Element.Bound == Index, U.Element == Element
}

And the searching can be expressed as a sequence of ranges, packed into a new type:

/// A sequence of the locations of all the occurrances a substring appears in
/// some collection according to an element-equivalence function.
///
/// - Warning: `startIndex` is computed on the fly, making it O(*n*) instead of
///   O(1), where *n* is the length of the base collection.  Since collection
///   iteration repeatedly tests the last-reachable end-point, prefer forward
///   iteration when possible, since `endIndex` is a lot quicker to compute.
public struct MatchingRangesCollection<Base: Collection, TargetElement> { /*...*/ }
extension MatchingRangesCollection: LazyCollectionProtocol { /*...*/ }
extension MatchingRangesCollection.Index: Hashable where Base.Index: Hashable {}
extension MatchingRangesCollection: BidirectionalCollection where Base: BidirectionalCollection { /*...*/ }

extension Collection {
    /// Returns the index ranges for each subsequence of this collection that is
    /// equivalent to the given sequence, using the given predicate to compare
    /// elements.
    public func matchingRanges<S: Sequence>(for target: S, by areEquivalent: @escaping (Element, S.Element) -> Bool) -> MatchingRangesCollection<Self, S.Element>}
}

extension Collection where Element: Equatable {
    /// Returns the index ranges for each subsequence of this collection that is
    /// equal to the given sequence.
    @inlinable
    public func matchingRanges<S: Sequence>(for target: S) -> MatchingRangesCollection<Self, Element> where S.Element == Element
}

Then I tried to write a convenience method on Collection to search-and-replace by combining the matching-range collection with the replacement method. But I realized the method didn't need to accommodate non-local code-blocks, and couldn't because that would forbid throwing predicates. So I started over.

I moved the main searching code to some helper functions:

extension Collection {
    /// Returns the index range for the earliest subsequence of this collection
    /// that is equivalent to the given sequence, using the given predicate to
    /// compare elements.
    public func firstRange<C: Collection>(equaling target: C, by areEquivalent: (Element, C.Element) throws -> Bool) rethrows -> Range<Index>?
}

extension Collection where Element: Equatable {
    /// Returns the index range for the earliest subsequence of this collection
    /// that equals the given sequence.
    @inlinable
    public func firstRange<C: Collection>(equaling target: C) -> Range<Index>? where C.Element == Element
}

extension BidirectionalCollection {
    /// Returns the index range for the last subsequence of this collection that
    /// is equivalent to the given sequence, using the given predicate to
    /// compare elements.
    @inlinable
    public func lastRange<C: BidirectionalCollection>(equaling target: C, by areEquivalent: (Element, C.Element) throws -> Bool) rethrows -> Range<Index>?
}

extension BidirectionalCollection where Element: Equatable {
    /// Returns the index range for the last subsequence of this collection that
    /// equals the given sequence.
    @inlinable
    public func lastRange<C: BidirectionalCollection>(equaling target: C) -> Range<Index>? where C.Element == Element
}

Then wrote a search-and-replace method with an unconditional replacement value:

extension Collection {
    /// Returns a copy of this collection into a collection of the given type,
    /// except subsequences that match the given collection according to the
    /// given predicate are replaced with copies of another given collection.
    public func replacingOccurrences<T, R, RR>(
        of target: T, with replacement: R, onlyAsPrefix: Bool, into type: RR.Type, by areEquivalent: (Element, T.Element) throws -> Bool
    ) rethrows -> RR
        where T: Collection, R: Collection, R.Element == Element, RR: RangeReplaceableCollection, RR.Element == Element

    /// Copies this collection into an array, except subsequences that match the
    /// given collection according to the given predicate are replaced with
    /// copies of another given collection.
    @inlinable
    public func replacingOccurrences<T: Collection, R: Collection>(of target: T, with replacement: R, onlyAsPrefix: Bool, by areEquivalent: (Element, T.Element) throws -> Bool) rethrows -> [Element] where R.Element == Element
}

And expanded it for bidirectional collections where the replacement could be done forwards or backwards. (This decides which of a given pair of overlapping matches is replaced.):

extension BidirectionalCollection {
    /// Returns a copy of this collection into a collection of the given type,
    /// except subsequences that match the given collection according to the
    /// given predicate are replaced in a given order with copies of another
    /// given collection.
    @inlinable
    public func replacingOccurrences<T, R, RR>(
        of target: T, with replacement: R, backwards: Bool, onlyAtLeadingAdfix: Bool, into type: RR.Type, by areEquivalent: (Element, T.Element) throws -> Bool
    ) rethrows -> RR
        where T: BidirectionalCollection, R: BidirectionalCollection, R.Element == Element, RR: RangeReplaceableCollection, RR.Element == Element

    /// Copies this collection into an array, except subsequences that match the
    /// given collection according to the given predicate are replaced in a
    /// given order with copies of another given collection.
    @inlinable
    public func replacingOccurrences<T, R>(
        of target: T, with replacement: R, backwards: Bool, onlyAtLeadingAdfix: Bool, by areEquivalent: (Element, T.Element) throws -> Bool
    ) rethrows -> [Element]
        where T: BidirectionalCollection, R: BidirectionalCollection, R.Element == Element
}

Then I made overloads for when the element type is Equatable. (But they lock the return type as an Array.) Then I made overloads for RangeReplaceableCollection where Self is returned, along with overloads for BidirectionalCollection conformance and/or element-Equatable conformance.

Since my first idea for separate search and replace functionality failed, I did make in-place overloads too. They just call the separate-object versions then replace themselves.

However, with the separate functionality, we could make a regular-expression range-matching reference type. An instance of that type, initialized with a regex pattern and a string, can be fed to the replace routine as the range list and as part of the transformation closure. In the transformation closure, it can read the indices of the current targeted subsequence and apply regex rules for ordinal matches.

You can see them all in a gist.

2 Likes

I've been thinking of a similar thing for a while now, but from a bit of a different angle:

What does Brian Kernighan's Regular Expression Matcher actually mean when implemented in Swift?

The result similarly drops down a few levels to BidirectionalCollection et al, and splits out the special pattern matching tokens (*, $ etc) into an enum. I'm a bit out of practice with generics but I do think this is more in the spirit of Swift an its DSLs than some runtime encoding of the pattern as a string.

enum RegularExpressionElement<Element> : Equatable where Element : Equatable {
    case any              // .
    case zeroOrMore       // *
    case begin            // ^
    case end              // $
    case element(Element) // a character
}

extension BidirectionalCollection where Element : Equatable {
    typealias RegularExpression = ArraySlice<RegularExpressionElement<Self.Element>>

    /* match: search for regexp anywhere in collection */
    func match(regexp: RegularExpression) -> Bool
}

Here's an example from the original String:

"hellooooo!".match("^h..lo*!$")

"hellooooo!".match([.begin, .element("h"), .any, .any, .element("l"), .element("o"), .zeroOrMore, .element("!"), .end])

There's a full implementation here: Reimplementation of Brian Kernighan's Regular Expression Matcher on BidirectionalCollection.

Update: I know that . zeroOrMore(.element("o")) would make more sense, but I wanted to keep as close to the C original as possible for this demonstration.

1 Like

Yeah, that’s the big problem. RRC is kind of broken, because it basically resets all of your knowledge about the collection - it should at least return the range of the inserted elements, otherwise you can’t really build any useful in-place modification algorithms on top of it. We really need to fix that.

We can’t really add that to RRC without having a default implementation for backwards compatibility, so we’d need to replace that protocol outright.

Bumping this to say: we really need this.

In WebURL, we do low-level string processing at the UTF-8 level, with functions that work with generic Collections of UInt8. Those functions call withContiguousStorageIfAvailable, wrapping the provided buffer in a UnsafeBoundsCheckedBufferPointer and forwarding to the actual implementation. This is great for code-size: all contiguous data sources immediately drop down to a single generic specialization, which includes bounds-checking.

If the collection is ^not^ contiguous, we fall back to using it directly, and the compiler will create a bespoke specialization. Generally, we don't expect many non-contiguous collections, so it's not a big deal.

But there is one major exception: String.UTF8View. We know that, at runtime, almost all Strings will be native Swift Strings, with contiguous UTF-8 storage - but statically, the compiler still emits a specialization using String.UTF8View and storing String.Index-es, just in case it isn't.


I just added my own version of StringProtocol.withUTF8, and by using that instead (with no fallback to String.UTF8View), I was able to cut code-size by almost 20% (!!!!). That's just enormous. Using the cmpcodesize util from the compiler repository to test the benchmarks executable:

karl@Karls-MBP Benchmarks % ../../../swift-project/swift/utils/cmpcodesize/cmpcodesize.py before/WebURLBenchmark .build/release/WebURLBenchmark                                 
Title                              Section             Old             New  Percent
WebURLBenchmark                     __text:        1775009         1425601   -19.7%

The overall executable (which includes google benchmark, argumentparser, etc) went from 5MB to 4.4MB, and WebURL.swiftmodule went from 3.9MB to 3.1MB.

The detailed diff is basically what you might expect. We no longer emit specializations of the entire parser to handle input provided as a bridged String.

Only in old file(s)
[...]
    7528 generic specialization <WebURL.ASCII.NewlineAndTabFiltered<Swift.Substring.UTF8View>, WebURL.UnsafePresizedBufferWriter> of WebURL.ParsedURLString.ProcessedMapping.write<A where A1: WebURL.URLWriter>(inputString: A, baseURL: WebURL.WebURL?, to: inout A1) -> ()
   10647 generic specialization <Swift.Substring.UTF8View, WebURL.UnsafePresizedBufferWriter> of WebURL.ParsedURLString.ProcessedMapping.write<A where A1: WebURL.URLWriter>(inputString: A, baseURL: WebURL.WebURL?, to: inout A1) -> ()
Total size of functions only in old file: 416146
Total size of functions only in new file:  65078

So yeah - I think it is absolutely worth adding this function to StringProtocol. The code-size wins for low-level generic text processing are very, very, very much worth a single extra witness table entry.

7 Likes
Terms of Service

Privacy Policy

Cookie Policy