Additional String Processing APIs

The purpose of this thread is to determine which string manipulation APIs (if any) the community needs that are not currently available.

For example, filtering by character set and string padding have been mentioned in the past.


If you have an API idea, some helpful things to include would be:

  • What are you trying to achieve?
  • What do you envision the new API looking like?
  • How would this be achieved today?
4 Likes

In general, I think it will be great if most of NSString's APIs are moved to String. Some of them might not be exactly appropriate for String. For example, APIs for working with paths in NSString are probably more appropriate if they are reproduced in URL.

6 Likes

Two major issues come to mind:

  • In-place substitution only exists on NSMutableString (NSMS.replaceOccurrences(of: String, with: String, options: NSString.CompareOptions, range: NSRange)). If you start with a Swift String, this means bridging to NSMS, performing the replacements, then bridging back. This is a pretty basic feature, and we should do better.

    I understand that there are designs (and prototypes, even) for generic Collection-based pattern matching and substitution. I think we should make it a priority.

  • ASCII string processing. Currently I'm working on parsing data formats that only accept ASCII strings, and the overheads from the Unicode model are a significant performance drain. We pay for every call to index(after:), even though we check every character and will ultimately fail if any of them are non-ASCII!

    I would really like a way to query in advance whether a string is ASCII (so we can fail early if it isn't). String already sets an internal flag as part of its UTF8 validation, but doesn't expose that information as API. That means we need to check it ourselves, which is an O(n) operation. It seems like it would be simple to expose an isASCII property (similar to isContiguousUTF8) to make this easier.

    I've been thinking about making my own ASCIIString type to eliminate the indexing overheads. Ideally, this would work like String's existing views (.utf8/.utf16/etc), but I don't believe String provides convenient APIs to work at that level (the best I can think of is using String.utf8.withContiguousStorageIfAvailable { ... } to access the storage without Unicode getting in the way, but then my .ascii view is bound to that scope). I also don't believe it's possible to support efficient mutations with a wrapper view, whether using the buffer-pointer or some stdlib-provided view (see the recent discussion about slices).

    So yeah, better support for ASCII strings is something that I'd appreciate.
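As a rough sketch of how the ASCII check has to be done today, the extension below scans the UTF-8 view once. The names isASCIIOnly and parseASCIIDigits are hypothetical, not existing standard library API, and the check is the O(n) workaround described above rather than a read of String's internal flag.

```swift
extension StringProtocol {
    /// Hypothetical helper: true when every UTF-8 code unit is below 0x80.
    /// This is a linear scan, not a constant-time flag lookup.
    var isASCIIOnly: Bool {
        utf8.allSatisfy { $0 < 0x80 }
    }
}

/// Hypothetical example of byte-level parsing: reads a run of ASCII
/// decimal digits, failing early on non-ASCII input, non-digits, or overflow.
func parseASCIIDigits(_ text: String) -> Int? {
    guard !text.isEmpty, text.isASCIIOnly else { return nil }
    var value = 0
    for byte in text.utf8 {
        // 0x30...0x39 are the ASCII codes for "0" through "9".
        guard (0x30...0x39).contains(byte) else { return nil }
        let (scaled, overflow1) = value.multipliedReportingOverflow(by: 10)
        let (sum, overflow2) = scaled.addingReportingOverflow(Int(byte - 0x30))
        guard !overflow1, !overflow2 else { return nil }
        value = sum
    }
    return value
}
```

Iterating over utf8 avoids grapheme breaking on every step, which is the overhead the post complains about, but the up-front scan is still paid on every call.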

5 Likes

I've been missing things around symmetry and complementing existing APIs. For example, we have ways to check if a String begins or ends with another String (hasPrefix(_:) and hasSuffix(_:)), but we don't have an easy way to remove them if they're present, such as:

mutating func removePrefix(_ prefix: String)
func removingPrefix(_ prefix: String) -> String

mutating func removeSuffix(_ suffix: String)
func removingSuffix(_ suffix: String) -> String
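
One possible way to get these four methods today is a small extension built on hasPrefix(_:) and hasSuffix(_:). This is only a sketch under the assumption that nothing happens when the prefix or suffix does not fully match; it is not a proposed final design.

```swift
extension String {
    /// Removes `prefix` from the start of the string if it fully matches.
    mutating func removePrefix(_ prefix: String) {
        guard hasPrefix(prefix) else { return }
        removeFirst(prefix.count)
    }

    /// Returns a copy with `prefix` removed if it fully matches.
    func removingPrefix(_ prefix: String) -> String {
        var copy = self
        copy.removePrefix(prefix)
        return copy
    }

    /// Removes `suffix` from the end of the string if it fully matches.
    mutating func removeSuffix(_ suffix: String) {
        guard hasSuffix(suffix) else { return }
        removeLast(suffix.count)
    }

    /// Returns a copy with `suffix` removed if it fully matches.
    func removingSuffix(_ suffix: String) -> String {
        var copy = self
        copy.removeSuffix(suffix)
        return copy
    }
}
```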

Sometimes I'll need to capitalize the first letter of a string, but not capitalize every word (like capitalized(with:) does):

func capitalizingFirstLetter(with locale: Locale? = nil) -> String
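
A minimal, locale-independent sketch of this idea follows. It drops the Locale parameter so that it needs only the standard library, and it uses uppercased() rather than a true titlecase mapping, so results for characters whose titlecase differs from their uppercase are an acknowledged simplification.

```swift
extension String {
    /// Hypothetical helper: uppercases only the first Character and
    /// leaves the rest of the string untouched.
    func capitalizingFirstLetter() -> String {
        prefix(1).uppercased() + dropFirst()
    }
}
```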
12 Likes

One function I noticed as missing is a Unicode-correct casefolded implementation.
This is different from lowercased as some characters go to their uppercase forms.

Also, a stdlib case-ignoring string wrapper/view would probably be more Unicode-correct than whatever most of us can come up with.

1 Like

Want to +1 these prefix and suffix methods. Had a use case for them recently. Would be nice to see them as part of the standard library.

2 Likes

Hmm, it seems that most substitution (including StringProtocol.replacingOccurrences(of:with:options:range:)) is part of Foundation and not the standard library (probably because String.CompareOptions is a type alias of NSString.CompareOptions).

capitalized(with:) is also in Foundation (because of Locale).

Since evolution does not cover Foundation, changes like this would be separate from a normal pitch.

I have started cataloging the changes suggested (work in progress).

Edit: Sorry, meant to create a general reply not a post reply.

1 Like

Many String processing algorithms can also be applied to Sequences or Collections in general, so I suggest we start there. There have been discussions in the past, including an Offset indexing pitch which would be useful, among others.

2 Likes

I think what @Karl and @davedelong meant is that they want some of the methods of NSMutableString, NSString, etc. reproduced in String. I don't think we need to change anything in Foundation at all for that. We can leave it be, and just add whatever functionality we need into String.

1 Like

I would also like to see more convenient “ASCII” parsing, with the caveat that this often means the structure of a file is ASCII-range characters, but some content such as strings or comments may be UTF-8.

Addendum: in the context of Swift, this isn’t necessarily a String thing; Data or pointer types may be more appropriate, if we had a solution to the zero-cost ergonomic ASCII character literal problem. It just springs to mind here because strings and text-or-text-like-data-processing are closely related.

This would perhaps be reasonable for things like literal search and replace, but case-insensitive/diacritic-insensitive search and capitalization are probably non-starters since there’s no concept of languages or locales in the standard library. For literal searching, generic Collection and Sequence solutions should probably be preferred to string-specific ones.

1 Like

There is precedent for returning the removed value (for example, removeFirst() -> Character), and there is also precedent for not returning anything (removeSubrange(_:)). Personally, I think returning a discardable optional would give developers the most flexibility.

Only a complete match should be acted upon. In general Swift is very explicit, so we should be ok as long as the behavior pattern of the new API fits with hasPrefix/hasSuffix (and there is proper documentation).
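
To make the discardable-optional idea concrete, here is a sketch under stated assumptions: the name removeMatchingPrefix is hypothetical, the removed text is returned as a String, and nil signals that the prefix did not fully match (in which case the string is left unchanged).

```swift
extension String {
    /// Hypothetical variant: removes `prefix` only on a complete match and
    /// returns what was removed, or nil if nothing was removed.
    @discardableResult
    mutating func removeMatchingPrefix(_ prefix: String) -> String? {
        guard hasPrefix(prefix) else { return nil }
        let end = index(startIndex, offsetBy: prefix.count)
        let removed = String(self[..<end])
        removeSubrange(..<end)
        return removed
    }
}
```

Callers that only care about the mutation can ignore the result, while callers that need to know whether anything happened can test it against nil.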

1 Like

Yes, this functionality should be in the standard library. You shouldn't need Foundation for basic find/replace.

Of course, they are in Foundation right now, because Foundation is used to owning its own String type, as well as all the other basic types, since it is the "standard library" for Objective-C.

Locale-dependent searching is an interesting question. I don't know enough about Unicode to say whether our String type requires locales for correctness, but if it does I would be in favour of lifting that to the standard library, too.

1 Like

After a cursory look at the implementation of replacingOccurrences(of:with:options:range:) in swift-corelibs-foundation, it appears that Unicode processing (case folding?) is used. Not sure if this is required because of the case insensitive search option or just needed in general.

Call Stack for replacingOccurrences
  1. StringProtocol.replacingOccurrences
  2. NSString.replacingOccurrences
  3. NSMutableString.replaceOccurrences
  4. CFStringCreateArrayWithFindResults
  5. CFStringFindWithOptions
  6. CFStringFindWithOptionsAndLocale
  7. __CFStringFoldCharacterClusterAtIndex

I will work on a proof of concept replace extension and see what happens.
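
For comparison, a literal (non-folding, locale-free) replacement can be sketched with the standard library alone. The names replacingLiteral and replaceLiteral are hypothetical, matching is by Character equality only, and the scan is deliberately naive rather than an efficient search algorithm.

```swift
extension String {
    /// Hypothetical stdlib-only replacement: every non-overlapping occurrence
    /// of `target` (compared Character by Character) becomes `replacement`.
    func replacingLiteral(_ target: String, with replacement: String) -> String {
        guard !target.isEmpty else { return self }
        var result = ""
        var rest = self[...]
        while let first = rest.first {
            if rest.hasPrefix(target) {
                // Emit the replacement and skip past the matched text.
                result += replacement
                rest = rest.dropFirst(target.count)
            } else {
                // No match here: copy one Character and move on.
                result.append(first)
                rest = rest.dropFirst()
            }
        }
        return result
    }

    /// Mutating counterpart; it rebuilds the string rather than editing storage in place.
    mutating func replaceLiteral(_ target: String, with replacement: String) {
        self = replacingLiteral(target, with: replacement)
    }
}
```

No case folding is involved here, which is consistent with folding being needed only for the case-insensitive and similar options, though that remains an open question in the post above.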

1 Like

Not sure if this is relevant, but there are things like:

let s = "ß"
print(s.lowercased()) // ß
print(s.uppercased()) // SS
print(s.capitalized)  // Ss

And I guess a whole lot of even more complicated stuff related to languages other than German.

2 Likes

Adding to @Jens's "ß" example, here is one with Turkish alphabet:

let i = "i"
let ı = "ı"
let İ = "İ"
let I = "I"

print(i.uppercased()) // I
print(ı.uppercased()) // I
print(İ.lowercased()) // i
print(I.lowercased()) // i

In Turkish, "i" and "İ" are the same letter, and "ı" and "I" are the same letter. Because of the lack of locale-awareness in String, both "ı" and "i" are capitalised to "I", and both "I" and "İ" are lowercased to "i".

I'm not sure how much it will help String by giving it locale-awareness though, since many texts contain multiple languages.

Ligatures in fonts are also affected by this, but that probably falls in NSFontManager's domain, not really in String's.

On another mostly unrelated note, this is often a supporting case for case-sensitive file systems.

1 Like

I recently tried to implement a tokenizer in Swift, and it was very awkward to do with NSScanner (as it seems to be aimed more at ad-hoc date parsing and the like where you don't need to store parse offsets etc.), and not obviously supported in String at all. And given that it is a bit complicated to perform arithmetic on String.Index correctly, it would be helpful to make such things easier and more efficient.

In particular, a way to iterate over a String and test whether a certain string or sequence from a CharacterSet is present, and to also get their Range in the String would be really helpful. Something like:

extension String {
    func extractIf(prefix: String, in range: inout Range<String.Index>) -> (String, Range<String.Index>)?
    func extractLongestIf(prefixFrom: CharacterSet, in range: inout Range<String.Index>) -> (String, Range<String.Index>)?
}

This would take the search range and update it to exclude the characters just found, so you can call it repeatedly with a var searchRange = myString.startIndex..<myString.endIndex that is then updated by each subsequent call.
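
A sketch of how these two methods might behave is shown below. It is an assumption-laden illustration, not the poster's implementation: the second method takes a (Character) -> Bool predicate instead of a CharacterSet so that it needs only the standard library, and both return nil when nothing matches at the start of the range.

```swift
extension String {
    /// If the text at the start of `range` begins with `prefix`, returns the
    /// match and its range, and advances `range` past it.
    func extractIf(prefix: String, in range: inout Range<String.Index>)
        -> (String, Range<String.Index>)? {
        let slice = self[range]
        guard !prefix.isEmpty, slice.hasPrefix(prefix) else { return nil }
        let end = slice.index(slice.startIndex, offsetBy: prefix.count)
        let found = range.lowerBound..<end
        range = end..<range.upperBound
        return (String(self[found]), found)
    }

    /// Returns the longest run at the start of `range` whose characters all
    /// satisfy `isMember`, and advances `range` past it.
    func extractLongestIf(matching isMember: (Character) -> Bool,
                          in range: inout Range<String.Index>)
        -> (String, Range<String.Index>)? {
        let slice = self[range]
        let end = slice.firstIndex(where: { !isMember($0) }) ?? slice.endIndex
        guard end > range.lowerBound else { return nil }
        let found = range.lowerBound..<end
        range = end..<range.upperBound
        return (String(self[found]), found)
    }
}

// Usage: tokenize "let x42" by calling the methods repeatedly.
let source = "let x42"
var searchRange = source.startIndex..<source.endIndex
let keyword = source.extractIf(prefix: "let", in: &searchRange)
let space = source.extractLongestIf(matching: { $0 == " " }, in: &searchRange)
let name = source.extractLongestIf(matching: { $0.isLetter || $0.isNumber },
                                   in: &searchRange)
```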

1 Like

FWIW, while the general idea of letters that don't have an uppercase version or have one that consists of other characters probably exists in other languages (if you leave aside languages like Japanese that don't even have the concept of case), this example in practice is not really useful:

There are no words that start with a sharp-S, so the capitalized version will never occur in practice. Also, a few years ago the capital letter for sharp-S was standardized by the German ministries of education of all the states, so a better solution for uppercased() would be to finally update macOS keyboard layouts to allow typing that letter and leave Swift alone in this particular case.

A function that removes partial prefixes can be composed from the whole-prefix version along with a divergence-testing method.

extractIf could be done with a divergence-testing function too. extractLongestIf could be adapted from self.firstIndex(where: { !prefixFrom.contains($0) }). Neither requires String; they should be added to Collection in general.
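
The Collection-level approach could look roughly like the following. All three names (divergence(from:), droppingPrefix(_:), droppingCommonPrefix(with:)) are hypothetical and chosen only for illustration; none of them exists in the standard library.

```swift
extension Collection where Element: Equatable {
    /// Hypothetical divergence test: the index of the first element of `self`
    /// that differs from `other`, or where either sequence ends.
    func divergence<Other: Sequence>(from other: Other) -> Index
        where Other.Element == Element {
        var index = startIndex
        for element in other {
            guard index != endIndex, self[index] == element else { break }
            formIndex(after: &index)
        }
        return index
    }

    /// Whole-prefix removal: drops `prefix` only if all of it matches.
    func droppingPrefix<Prefix: Collection>(_ prefix: Prefix) -> SubSequence
        where Prefix.Element == Element {
        if starts(with: prefix) {
            return dropFirst(prefix.count)
        } else {
            return self[...]
        }
    }

    /// Partial-prefix removal, composed from the divergence test.
    func droppingCommonPrefix<Other: Sequence>(with other: Other) -> SubSequence
        where Other.Element == Element {
        self[divergence(from: other)...]
    }
}
```

Because these are defined on Collection, the same code serves String, Substring, Array, and any other collection of Equatable elements.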

One thing that I feel is missing from strings in Swift is parsing.

A primary aspect of parsing that is currently clunky, difficult to use, and lacking in some regards is regular expressions. To start off, you have to import Foundation to be able to do anything with regular expressions in the first place. While there are currently a few methods that allow you to work with regular expressions fairly easily by using the .regularExpression option (replacingOccurrences(of:with:options:) and range(of:options:)), they aren't always sufficient and you often need to work with NSRegularExpression. This is where I largely take issue. I also got some discussion going regarding some of the issues with regular expressions here, by the way. Anyway, NSRegularExpression is not a very Swifty API and is often a pain to use. Working with capture groups and matches is also not very easy. I feel that a native Swift implementation that provides an API that leverages Swift's features would be quite helpful. Maybe even provide first-class regex support with a unique literal syntax with highlighting 🤷‍♂️. @Michael_Ilseman came up with some ideas regarding how this could work in his State of String: ABI, Performance, Ergonomics, and You! doc. And even if it is not feasible to provide custom syntax and whatnot, I feel that a native Swift solution in some capacity is needed.

On a more general note though, I feel that we should look into what can be done to make parsing strings easier. I'm not sure how we can go about this, but, when trying to create a program that tokenized and parsed strings, I found that it got very complicated very quickly. I feel like there must be a better way, though I'm not really sure.
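
To illustrate the friction with capture groups, here is roughly what extracting them looks like with NSRegularExpression. The helper name captureGroups(in:pattern:) is hypothetical, and representing a group that did not participate in the match as an empty string is an assumption made to keep the sketch short.

```swift
import Foundation

/// Hypothetical helper: returns, for every match, the full match followed by
/// each capture group as a String.
func captureGroups(in text: String, pattern: String) throws -> [[String]] {
    let regex = try NSRegularExpression(pattern: pattern)
    // NSRegularExpression works in NSRange (UTF-16 offsets), so the String
    // range has to be converted on the way in...
    let whole = NSRange(text.startIndex..<text.endIndex, in: text)
    return regex.matches(in: text, options: [], range: whole).map { match in
        (0..<match.numberOfRanges).map { group in
            // ...and every result has to be converted back on the way out.
            guard let range = Range(match.range(at: group), in: text) else {
                return ""
            }
            return String(text[range])
        }
    }
}

let pairs = try! captureGroups(in: "a=1, b=22", pattern: "([a-z])=([0-9]+)")
```

Most of the code is range conversion between NSRange and Range<String.Index>, which is the kind of ceremony a native Swift API could remove.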

2 Likes