String hygiene

Given the need to use something like NSLinguisticTagger to determine if a string is (predominantly) in an RTL language, I would think this kind of analysis is best left out of String.

P.S. Or to put it another way: how do you determine the directionality of a string?

Understood, but do the meanings of "start" and "end" change based on the directionality of the language? It seems to me that "right" and "left" don't. (Even though I think "start" and "end" are more intuitive for LTR languages.)

Understood. I think you were thinking of terminology when you wrote implementation. You raise a good question.

How about ""[...], or String()[...]?

I started the thread "Corner-cases in `Character` classification of whitespace". Could you elaborate there on why you don't want "\u{020}\u{301}" to be whitespace?

I think this should clearly be on StringProtocol (and thus String and Substring). If we go with the non-scalar interpretations of whitespace, then we should definitely have it on all 3 (sigh) UnicodeScalarViews in the stdlib.

I think having an overload that takes a Set<Character> would be very useful. The overload that doesn't take such a set could still be implemented more efficiently.

Should we have such trimmed overloads on BidirectionalCollection? Should we have one that takes a Set<Element> and/or one that takes (Element) -> Bool? This could be pretty simple, implemented in terms of drop(while:) from the front and an equivalent (to be pitched, I suppose) drop-while from the back.

This way we can trim e.g. Arrays in general. It would also provide a better interim story for users who are stuck using [UInt8] instead of String.
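
A minimal sketch of what such a generic trim could look like, assuming a predicate-based overload. The name trimmed(while:) is hypothetical and not an existing standard library API, and the back half is written by hand because there is no drop-while-from-the-back to build on yet:

```swift
extension BidirectionalCollection {
    /// Hypothetical API: returns the slice that remains after removing
    /// leading and trailing elements that satisfy `predicate`.
    func trimmed(while predicate: (Element) throws -> Bool) rethrows -> SubSequence {
        // Front: same effect as drop(while:).
        var start = startIndex
        while start != endIndex, try predicate(self[start]) {
            formIndex(after: &start)
        }
        // Back: walk backwards until an element fails the predicate.
        var end = endIndex
        while end != start {
            let previous = index(before: end)
            guard try predicate(self[previous]) else { break }
            end = previous
        }
        return self[start..<end]
    }
}

// Trimming ASCII spaces from raw bytes.
let bytes: [UInt8] = [0x20, 0x20, 0x68, 0x69, 0x20]
let trimmedBytes = bytes.trimmed(while: { $0 == 0x20 })

// Trimming by an arbitrary condition rather than set membership.
let samples = [0.0, 0.05, 0.4, 0.02, 0.7, 0.01]
let significant = samples.trimmed(while: { $0 < 0.1 })

// A Set can be adapted to the predicate form through its contains method.
let zeros: Set<Int> = [0]
let core = [0, 0, 1, 0, 2, 0].trimmed(while: zeros.contains)
```

Only the ends are affected: interior elements that match the predicate (0.02 and the middle 0 above) are kept.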

"start" and "end" already carry connotations, e.g. startIndex and endIndex. Trimming is similar to chaining str.dropFirst(...).dropLast(...), so I suppose "first" and "last" are also up for grabs. Finally, "leading" and "trailing" may also be good names. Directionality doesn't affect sequence order of the graphemes themselves. It applies more so to rendering, presentation, and other interpretation.

Aren't those variations allocating throwaway strings? We should return a substring referencing the storage of self instead. In my sample above I showed how. The first guard returns a substring pointing to the end of the trimmed string (which contains only characters we're trimming), whereas the second guard points to the beginning of the trimmed string. Both return an empty substring which should still reference self.

  • self[endIndex ..< endIndex]
  • self[startIndex ..< startIndex]

We do it that way because in the first guard we're trimming all characters from left to right and may end up at the last index, hence endIndex. The second guard trims from right to left (e.g. " ".trimmed(from: .end)) and can end up at the start, hence startIndex.


The very first guard that checks for emptiness should return self[...].
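
To make the three guards concrete, here is a sketch of how they could fit together. The TrimSide enum and the trimmed(from:) signature are assumptions for illustration (the sample referred to above is not reproduced here), and whitespace is approximated with Character.isWhitespace:

```swift
// Hypothetical three-way selector; the proposal's actual type may differ.
enum TrimSide { case start, end, both }

extension String {
    // Hypothetical signature: returns a Substring that shares storage with self.
    func trimmed(from side: TrimSide = .both) -> Substring {
        // Guard 0: nothing to trim, return a full-range slice of self.
        guard !isEmpty else { return self[...] }

        var lower = startIndex
        if side != .end {
            // Guard 1: scanning left to right found only whitespace,
            // so the result is the empty slice at endIndex.
            guard let first = firstIndex(where: { !$0.isWhitespace }) else {
                return self[endIndex..<endIndex]
            }
            lower = first
        }

        var upper = endIndex
        if side != .start {
            // Guard 2: scanning right to left found only whitespace,
            // so the result is the empty slice at startIndex.
            guard let last = lastIndex(where: { !$0.isWhitespace }) else {
                return self[startIndex..<startIndex]
            }
            upper = index(after: last)
        }
        return self[lower..<upper]
    }
}

let padded = "  hi  "
let bothEnds = padded.trimmed()
let startOnly = padded.trimmed(from: .start)
let endOnly = padded.trimmed(from: .end)
```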

As a performance/implementation detail, no, this would form a slice of the canonical empty string bit pattern, which is a form of small string. Thus it avoids a retain on the original storage, avoids persisting its lifetime and doesn't count as a use for COW. These are lower level concerns and probably shouldn't be used to justify semantics, though.

I'm not sure what the best behavior guarantees to give for empty slices in general. Perhaps @Ben_Cohen would know.

What guarantees can we make? Empty slices do not have sensical index interchange with their bases, AFAICT. I don't think we have APIs to access the original outer string (or plans to do so). Once the substring is created, it's considered a distinct value so e.g. mutations to either trigger a COW.

There are potential gotchas with either of the definitions (this is true for all slices), but we'd need to pick one:

(swift) let str = "abc"
// str : String = "abc"
(swift) let substr = str[str.startIndex..<str.startIndex]
// substr : Substring = ""
(swift) substr.indices.count
// r0 : Int = 0
(swift) str[substr.startIndex]
// r1 : Character = "a"
(swift) let substr2 = str[str.endIndex..<str.endIndex]
// substr2 : Substring = ""
(swift) substr2.indices.count
// r2 : Int = 0
(swift) str[substr2.startIndex]
// Trap!

That completely thwarts the purpose of this effort, which is to find convenience functions commonly added by third parties. No one is removing Foundation trimming here. We're introducing trimmed() and/or trim() specifically to handle whitespace uses, so that callers no longer need to spell out the full call. The API must take use case into account (which is why I have wavered on a Substring implementation) and add as few affordances as possible. I have introduced left/right trims because they can be subsumed entirely in the default API call. (And yes, I think people have swayed me around to option sets.)

I have been looking at this primarily as a convenience function, which allows a little help for left/right -- a feature that is not in the original brief. I have also looked at those as overrides of the default behavior, not as drivers.

I'm going to update to OptionSet but I do prefer having a single possible override rather than a full feature set. This API is supposed to be as simple as possible and option sets are conceptually more complex, especially since they defy my semantics of "trim but skip the right" or "trim but skip the left".

I agree a generalized version of this would be useful, but I don't like the idea of accepting a Set for that (say I want to trim all values < 0.1 from a [Double]). (Element) -> Bool sounds nicer, but is kind of cumbersome if you do want a set.

As mentioned before, the proposed ContainmentSet would fit well here. Set would naturally conform, and in the future it might become possible to have (Element) -> Bool conform as well. So maybe the right thing here is to first introduce the proposed API on StringProtocol and then later, if we get ContainmentSet or something equivalent, generalize it to live on BidirectionalCollection or wherever?
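
As a rough illustration of that direction, the sketch below assumes a one-requirement shape for ContainmentSet (the actual pitch may define it differently). PredicateSet and trimmed(in:) are hypothetical names, with the wrapper standing in for the not-yet-possible closure conformance:

```swift
// Assumed shape of the proposed ContainmentSet; the real pitch may differ.
protocol ContainmentSet {
    associatedtype Element
    func contains(_ member: Element) -> Bool
}

// Set already has a matching contains(_:), so the conformance is empty.
extension Set: ContainmentSet {}

// Hypothetical wrapper that lets a predicate act as a ContainmentSet today.
struct PredicateSet<Element>: ContainmentSet {
    let predicate: (Element) -> Bool
    func contains(_ member: Element) -> Bool { predicate(member) }
}

extension BidirectionalCollection {
    // Hypothetical generalized trim accepting any ContainmentSet.
    func trimmed<S: ContainmentSet>(in set: S) -> SubSequence where S.Element == Element {
        // If every element is in the set, return the empty slice at the end.
        guard let first = firstIndex(where: { !set.contains($0) }),
              let last = lastIndex(where: { !set.contains($0) }) else {
            return self[endIndex..<endIndex]
        }
        return self[first...last]
    }
}

let spaces: Set<Character> = [" ", "\t"]
let word = "\t hello ".trimmed(in: spaces)
let peaks = [0.01, 0.5, 0.03].trimmed(in: PredicateSet<Double> { $0 < 0.1 })
```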

This particular effort was not started to create a perfect API. It is to respond to a community-sourced need that is not satisfied by the current standard library. That's why I've been wavering between producing a Substring and String. This is really the first effort to push into this community-driven space. As I review the original use-cases, they are meant to produce Strings without coercion, and where efficiency is not a motivating factor.

If we are driven by a particular use-case, how far should we go with that muse? Certainly we want something to be as useful as possible and as Swifty as possible, but if we start returning Substrings, I can foresee many libraries implementing var trimmedAsString because we're not giving people the tool that does what they want and need. Producing a string isn't the most efficient approach nor is it the most general, but it provides tooling that expresses the task common to an overwhelming number of use cases.

In exploring this area, pulling on a single thread is proving to unravel a much bigger problem than we set out to solve.

To recap:

  • Andrew discovered a great demand for String.trim.
  • We did not want this tooling to depend on Cocoa.
  • We wanted to add a simple way to offer start-or-end trimming.
  • We wanted to define what whitespace meant.
  • We wanted to define what a general trimming API would look like.

Maybe the "What Characters are and are not whitespace?" discussion should be expanded to include all pertinent character sets. Perhaps some of that work has already been covered in Character properties. And maybe that's a can that can be kicked down the road as an implementation detail that can be changed without Swift Evolution discussion.

trimmingCharacters(in:) is a part of StringProtocol. We may want to enhance that API to include from: for .start and .end. If StringProtocol offers trimmingCharacters(in:, from: = default), this proposal becomes way simpler because it just has to call it with whatever the character set for newlines and whitespaces becomes. At the same time, trimmingCharacters(in:) returns String. Maybe it shouldn't. Also, strings aren't the only trimmable thing: so are arrays, or really any BidirectionalCollection.
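
For reference, this is how the existing Foundation call behaves today. It has no from: parameter, always trims both ends, and returns a new String:

```swift
import Foundation

let raw = "  \n hello world \t\n"

// Existing Foundation API: trims both ends and returns String, not Substring.
let both: String = raw.trimmingCharacters(in: .whitespacesAndNewlines)

// The same call with a custom character set, here trimming only quotes.
let unquoted = "\"quoted\"".trimmingCharacters(in: CharacterSet(charactersIn: "\""))
```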

Anyway, I don't want to lose sight of our brief: introduce a convenient way to trim a string.

I'm not sure why failing to match NSString's functionality "completely thwarts the purpose of this effort" when the whole reason the creators of these additions are adding these functions is because String didn't provide the API they're used to using from NSString (and perhaps other string APIs) in the first place. I also do not see how increasing the portability of a Swift codebase "completely thwarts the purpose of this effort", either.

The draft proposal certainly covers the majority of the surface area. While what follows is certainly uncommon, the data shows a small handful of additions where people do in fact want to trim new lines or only the characters in a given string.

I think it's reasonable to cover the majority of use cases to the exclusion of the small handful I sympathise with. I've established how I feel, put my arguments forward, I'm confident they've been considered and I thank you for this. I won't beat this any further so the thread can move onto more meaningful discussion.

NSString-style trimming is already in Foundation. It is not part of this proposal.

It doesn't look like anyone replied to the post I made about this, or the questions that I asked, so I still don't understand this at all. Do any Swift standard library methods take an OptionSet? It makes very little sense to me except as an interoperability type for certain C-like enums (e.g. NS_OPTIONS) and perhaps in narrow cases for memory efficiency reasons. If you really want a set (i.e. you think people want to do set operations on them and/or you think being able to pass the empty set to turn the function into a no-op makes sense) then you can just use Set directly. The three-valued enum seemed perfect to me, though.

As for Substring vs String, I still think it's important to match existing String expectations here and return a Substring. The discussion about reconsidering the ergonomics of Substring should happen at a much higher level than a single piece of API. The SE-0163 decision left the question around whether there should be implicit conversion from Substring to String open, pending feedback after the changes were made. There should be enough experience with this now to open a new discussion about that.

Thanks for all your work on this, and sorry for all the back-and-forth changes. I think a proposal that took a three-value enum to determine the ends to trim from and returned a Substring would be great and provably useful. All the more complex overloads, with sets of elements to remove and extensions to other Collections, can wait until the many related design questions are answered.

There are already 50+ posts here, plus 20 more at the whitespace topic, plus more if this gets to proposal and review phases. If we are going to spend hundreds of collective hours on this, we might as well introduce a proper trimming API like every language out there, instead of micro-patching and ending up with a franken-API that is part standard library, part Foundation and part DIY.

I think the minimum is to support these functions:

func trim(_ cutset: Set<Character> = .whitespace) -> String
func trimLeft(_ cutset: Set<Character> = .whitespace) -> String
func trimRight(_ cutset: Set<Character> = .whitespace) -> String
func trim(where: (Character) -> Bool) -> String
func trimLeft(where: (Character) -> Bool) -> String
func trimRight(where: (Character) -> Bool) -> String

The evolution committee can decide on the names, if they have mutable versions, if they return String or Substring, if they can merge into one mega function with nested enums and what have you, but if you are not proposing a full API I think it is a waste of everyone's time and energy.
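
As a sketch of how small that surface could be, the Set-based overloads can simply forward to the predicate-based ones. Two assumptions to flag: Set<Character> has no .whitespace member today, so a deliberately incomplete stand-in is defined here, and the functions return Substring rather than the String shown in the signatures above:

```swift
extension Set where Element == Character {
    // Hypothetical default cutset; deliberately incomplete.
    static var whitespace: Set<Character> { [" ", "\t", "\n", "\r", "\r\n"] }
}

extension String {
    func trimLeft(where shouldTrim: (Character) -> Bool) -> Substring {
        drop(while: shouldTrim)
    }

    func trimRight(where shouldTrim: (Character) -> Bool) -> Substring {
        // Keep everything up to and including the last character we must not trim.
        guard let last = lastIndex(where: { !shouldTrim($0) }) else {
            return self[startIndex..<startIndex]
        }
        return self[...last]
    }

    func trim(where shouldTrim: (Character) -> Bool) -> Substring {
        let head = trimLeft(where: shouldTrim)
        guard let last = head.lastIndex(where: { !shouldTrim($0) }) else { return head }
        return head[...last]
    }

    // The Set-based overloads forward to the predicate versions.
    func trim(_ cutset: Set<Character> = .whitespace) -> Substring {
        trim(where: { cutset.contains($0) })
    }

    func trimLeft(_ cutset: Set<Character> = .whitespace) -> Substring {
        trimLeft(where: { cutset.contains($0) })
    }

    func trimRight(_ cutset: Set<Character> = .whitespace) -> Substring {
        trimRight(where: { cutset.contains($0) })
    }
}

let name = "  hi  ".trim()
let digitsRemoved = "12ab34".trim(where: { $0.isNumber })
let unwrapped = "xxhixx".trim(["x"])
```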

> I don't see why OptionSet is the obvious or natural model here

Because you have two fundamental operations: trim the start or trim the end and they can be combined into a third operation: trim both.

> What does the raw value mean

It doesn't matter.

> If you think this needs to be a set then just use Set<TrimEnd> where TrimEnd is an enum with .start and .end cases

Or just use an OptionSet which provides the same functionality in a smaller footprint.

> the empty set doesn't really make any sense

Yes it does. The empty set is the same as doing no trim operation.

Imagine a user interface where you want to give the user the ability to trim a string from either or both ends. You can give them a checkbox to trim from the left and a checkbox to trim from the right and you can set the actions to insert or remove the relevant option in an OptionSet. Then to actually trim the string

string.trim(from: selectedOptions)

If you use an enum, on checking the checkbox, your actions have to test the existing value of the option enum to determine whether to change it to the value for the trimming end just selected or both ends. On unchecking, if the value is "both" you have to change it to the other end and you also need a way to represent no trimming, which either means a case in your enum for "none" or making it an optional. In either case, you have to wrap the call to trim with a test to find out if you need to do the trim at all

if let trimOption = trimOption
{
     string.trim(from: trimOption)
}

I guarantee that use of OptionSet will make the code using the API generally cleaner.
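
A sketch of the checkbox scenario, assuming a hypothetical TrimSides option set and a trimmed(from:) method that returns Substring (neither name is taken from the proposal):

```swift
// Hypothetical option set; names are illustrative only.
struct TrimSides: OptionSet {
    let rawValue: UInt8

    static let start = TrimSides(rawValue: 1 << 0)
    static let end   = TrimSides(rawValue: 1 << 1)
    static let both: TrimSides = [.start, .end]
}

extension String {
    // Hypothetical trimming method driven by the option set.
    func trimmed(from sides: TrimSides = .both) -> Substring {
        var slice = self[...]
        if sides.contains(.start) {
            slice = slice.drop(while: { $0.isWhitespace })
        }
        if sides.contains(.end) {
            while let last = slice.last, last.isWhitespace {
                slice.removeLast()
            }
        }
        return slice
    }
}

// Checkbox handlers only insert or remove; they never inspect the current value.
var selectedOptions: TrimSides = []
selectedOptions.insert(.start)   // "trim left" checked
selectedOptions.insert(.end)     // "trim right" checked
selectedOptions.remove(.start)   // "trim left" unchecked

let trimmedText = " test ".trimmed(from: selectedOptions)
```

With both boxes unchecked the empty set makes the call a no-op, and a single option such as .start can be passed to the same parameter without a second overload.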

> It doesn't look like anyone replied to the post I made about this, or the questions that I asked

I just did.

Another advantage of OptionSet over an enum and an ordinary Set is that the individual items are of the same type (they are sets containing one element) as the container so you don't need two APIs, one to allow a single option and one to allow multiple options.

Sure, but there's no explosion of options here, there's only three that make sense and will ever make sense.

Okay, well, what does this mean, using the definition from the current proposal:

" test ".trimmed(from: .init(rawValue: 127))

This isn't convincing to me. Should every API in the standard library have a actuallyDoNothing: Bool parameter in case someone wants to connect it to a checkbox in a UI? I don't see the precedent in the standard library here.

It means trim both because both bits are set. Result is "test"
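
That reading can be checked with a small sketch, assuming a hypothetical option set in which .start is bit 0 and .end is bit 1 (the proposal's actual layout is not shown here):

```swift
// Hypothetical layout: bit 0 is start, bit 1 is end.
struct TrimSides: OptionSet {
    let rawValue: UInt8

    static let start = TrimSides(rawValue: 1 << 0)
    static let end   = TrimSides(rawValue: 1 << 1)
}

let odd = TrimSides(rawValue: 127)

let trimsStart = odd.contains(.start)       // true: bit 0 is set
let trimsEnd = odd.contains(.end)           // true: bit 1 is set
let isExactlyBoth = odd == [.start, .end]   // false: five undefined bits are also set
```

So an implementation that asks contains(.start) and contains(.end) would treat this value as "both", while one that compares for equality against [.start, .end] would not.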

I agree with this sentiment. There are very clearly only three states this trimming method can accept; start, end, and both. We aren't expecting to add more cases, let alone new cases that can be combined with others; an OptionSet is overkill. The standard library should not aim to support no-op arguments.

I think this argument was more to showcase how one can abuse an OptionSet type to write less intuitive code, but I think the argument itself is counterintuitive, since that's just a necessary entry point for an OptionSet type due to the semantic requirements of the protocols.

May I quote the stdlib docs:

Option sets all conform to RawRepresentable by inheritance using the
OptionSet protocol. Whether using an option set or creating your own,
you use the raw value of an option set instance to store the instance's
bitfield. The raw value must therefore be of a type that conforms to the
FixedWidthInteger protocol, such as UInt8 or Int. For example, the
Direction type defines an option set for the four directions you can
move in a game.

struct Directions: OptionSet {
    let rawValue: UInt8

    static let up    = Directions(rawValue: 1 << 0)
    static let down  = Directions(rawValue: 1 << 1)
    static let left  = Directions(rawValue: 1 << 2)
    static let right = Directions(rawValue: 1 << 3)
}

Unlike enumerations, option sets provide a nonfailable init(rawValue:)
initializer to convert from a raw value, because option sets don't have an
enumerated list of all possible cases. Option set values have
a one-to-one correspondence with their associated raw values.

In the case of the Directions option set, an instance can contain zero,
one, or more of the four defined directions. This example declares a
constant with three currently allowed moves. The raw value of the
allowedMoves instance is the result of the bitwise OR of its three
members' raw values:

let allowedMoves: Directions = [.up, .down, .left]
print(allowedMoves.rawValue)
// Prints "7"

Option sets use bitwise operations on their associated raw values to
implement their mathematical set operations. For example, the contains()
method on allowedMoves performs a bitwise AND operation to check whether
the option set contains an element.

In our case here we're doing exactly the same, we're creating a type that describes directions for trimming, leading/left/start and trailing/right/end while we will also have the zero version and potentially the default for bothSides (a combination of all the directions).

Just because .init(rawValue: 127) is possible does not mean that OptionSet-conforming types should not exist.
