String hygiene

You could even add a static var full: Trimming { return [.start, .end] } to Trimming which would allow you to use .full as [.start, .end] if you were worried about it

1 Like

I wasn't actually sure what the standard library rule is for returning a String vs a Substring, so I looked into it. Given that drop(while:), dropFirst(), dropFirst(_:), dropLast(), dropLast(_:), subscript(_:), prefix(_:), prefix(through:), prefix(upTo:), prefix(while:), suffix(_:), suffix(from:), split(separator:maxSplits:omittingEmptySubsequences:) and split(maxSplits:omittingEmptySubsequences:whereSeparator:) all return Substring (or an Array thereof), it seems obvious to me that any trim method(s) should also return Substring.

Looking at the various remove methods, it's also clear that the in-place version, trim(from:), probably shouldn't return self.

3 Likes

A few more things:

  • LINE SEPARATOR is U+2028, not U+0301 (made me really confused when I tried to use the \u{020}\u{301}abc\n\u{301} example)
  • Some whitespace characters combine with other whitespace characters. For sure \r\n is a single character in Swift (which the current code does not work on). Because of this, I think it would make more sense to operate on the unicodeScalars view and then adjust the bounds at the end so that they are valid characters. A working example of this is here
  • It would be nice if this was available at least to Substring if not all of StringProtocol. Unfortunately that would break my previous implementation so if anyone knows how to do something similar without using samePosition(in:) that would be great

Most of these return Substring because they come from Sequence/Collection. Semantically, I think trim should maybe also go onto one of these rather than String. That would also make it available on Substring as well.

CharacterSet isn't too bad but it really abides by the concept of "what happens in CharacterSet, stays in CharacterSet". I think the biggest modification I would like to see is the ability to get the characters used out of CharacterSet.

I do wonder though if it would benefit from the added functionality in 211 if and when that gets accepted.

@Erica_Sadun first thank you that the implementation is now Foundation free :slight_smile: Furthermore I support the ideas about returning a Substring instead, this feels more natural and should potentially increase the performance as well. I also support the idea of an OptionSet type instead of an enum here. To me trimmed(from: [.start, .end]) reads better then trimmed(from: .full), but since it's already defaulted to this it will be seldom. I made a few modifications to play around with those ideas.

  1. I'm using an OptionSet type in the implementation TrimmPositions.
  2. I moved the contains function into a closure for better readability and also aligned the guard/else blocks (the provided code follows the 80 char line width).
  3. Non-mutating trimmed functions returns Substring instead (even for the empty string).
public struct TrimPositions : OptionSet {
  public var rawValue: UInt8
  public init(rawValue: UInt8) {
    self.rawValue = rawValue
  }
  public static let start = TrimmPositions(rawValue: 1 << 0)
  public static let end = TrimmPositions(rawValue: 1 << 1)
}

public func trimmed(
  from positions: TrimPositions = [.start, .end]
) -> Substring {
  // Ensure that this implementation does not rely on the NSString 
  // implementation of trimmingCharacters(in: .whitespacesAndNewlines)

  guard !isEmpty else { return self[...] }
  var (trimStart, trimEnd) = (startIndex, index(before: endIndex))

  let isNeitherWhitespaceNorNewlineCharacter: (String.Index) -> Bool = {
    !String.whitespaceAndNewlineCharacters.contains(self[$0])
  }

  let shouldBeTrimmed = positions.contains([.start, .end])

  if shouldBeTrimmed || positions.contains(.start) {
    guard
      let start = indices.first(where: isNeitherWhitespaceNorNewlineCharacter)
    else { return self[endIndex ..< endIndex] /* return an empty substring from the end */ }
    trimStart = start
  }

  if shouldBeTrimmed || positions.contains(.end) {
    guard
      let end = indices.lazy
        .reversed()
        .first(where: isNeitherWhitespaceNorNewlineCharacter)
    else { return self[startIndex ..< startIndex] /* return an empty substring from the start */ }
    trimEnd = end
  }

  return self[trimStart ... trimEnd]
}

public mutating func trim(from positions: TrimPositions = [.start, .end]) {
  self = String(self.trimmed(from: positions))
}

Last but not least I removed -> String from the mutating trim function since it should not return self. I think this is very important since if this is going into stdlib it should mimic other existing APIs like reverse() -> Void or sort() -> Void etc.


What about such an option?

public static let bothSides: TrimPositions = [.start, .end]

Such an option would allow us to express something like string.trimmed(from: .bothSides) instead of string.trimmed(from: [.start, .end]).

And why are we using start/end instead of left/right?


The else bodies from guard both return an empty Substring still referencing to self for cases like the following:

let test = " "
test.trimmed().count == 0
test.trimmed(from: .start).count == 0
test.trimmed(from: .end).count == 0

Edit: I added lazy to the second guard so that reversed does not copy the same array since we don‘t need it anyway.

1 Like

I have no clue about Unicode, but for reference this is what Go considers a white space, which seems to be different from your list (ranges below are inclusive):

0x0009 - 0x000d
0x0020
0x0085
0x1680
0x2000 - 0x200a
0x2028 - 0x2029
0x202f
0x205f
0x3000

They also support trimming using a cutset string or a function:

Maybe some annoying name bikeshedding, but wouldn't be trimmed(_ sides: …) be more appropriate?

I am not a native speaker but for me it seems like trim does not use a preposition and it sounds wrong to "trim from" to me.
-> You trim the start and you trim the end.

Sorry for the annoyance :nerd_face:

I don't think an OptionSet parameter really works here because string.trimmed(from: []) is nonsensical. A simple enum is probably better with start, end, and bothSides cases; exact names TBD. The method definition is simply func trimmed(from directions: TrimDirections = .bothSides) -> String.

That aside, I support the proposal to raise trimming functionality out of Foundation and into the standard library.

2 Likes

string.trimmed(from: []) is nonsensical

It's not nonsensical. It means trim from neither end i.e. don't do any trimming. It's a no-op.

You may argue that it is useless, but I would argue that it is not useless in the same sense that an empty array is not useless. For example, if you have a piece of functionality which allows a user to trim leading, trailing, leading and trailing, or no spaces, then the code will be cleaner with the option set than with the enum.

7 Likes

That is correct, in most cases you don‘t need the empty option but it would be really handy in generic algortihms because it makes it cleaner and requires you to write less code to achieve the same thing.

The problem may just be an implementation detail that doesn't consider "\r\n" to be whitespace. Swift is nearing its 5th major version and we're stabilizing the ABI of a fixed-layout and fairly inlinable String, Character, String.Index, etc. Common sense dictates that the standard library should be able to make up its mind as to what Characters are and are not whitespace :slight_smile:

I'll spin off a separate thread to discuss this distinction (with my views and justification as well, when I've written them all down). It directly influences the Character Properties proposal and other future work (e.g. String.lines()).

I do agree that it would make sense for this API to also be available on the UnicodeScalarView, where it would operate scalar-by-scalar rather than grapheme-by-grapheme. This is very useful for compatibility with many other language's views of strings as unsegmented sequences of code units (or scalars).

5 Likes

I started with left right, considered trailing leading, and ended with start end to better mimic startIndex/endIndex.

I don't mind returning an empty substring but what is the best way to construct that?

I am against using an option set. I think supporting trim from the start or the end of the string is fine but they are never intended to be used simultaneously. The default behavior is "both" or "a full trim".

I decided to include it in the original design because it's easier to have it voted out than voted in.

For my first go at this, I directly copied the white space details from NSString's trimmingCharacters(in:).

Still up for grabs:

  • grapheme-by-grapheme or scalar-by-scalar (I want accented whitespace to be non-whitespace)
  • What types/protocols adopt this: string, substring, stringprotocol, unicodescalarview
  • A standards based decision of what constitutes whitespace, coordinated with character properties and other proposals.

I feel strongly about allowing the caller specify the Set<Character> value because it'd be a regression over NSString's trimmingCharacters(in:) API, API which provides an expectation or precedent for expected behaviour. Failing to match its functionality in this case would increase residual dependence on Foundation and limit the portability of a codebase to other platforms.

Another reason I feel strongly about it is there are times when I've wanted to strip some whitespace to the exclusion of others. More relevant to the present discussion is what counts as whitespace, changes to that definition in the future and perhaps trivially, any concerns about performance and the number of members in the character set.

Also, in my opinion, there's nothing more infuriating than an API lacking simple affordances.

I think whitespace is by far the most common thing to be trimmed from a String, and this opinion is supported by the survey quoted in the proposal, so I think there is a strong case for accepting it even without being able to customise the set of Characters. There's a much more involved decision to be made about how to represent sets of Characters generally (e.g. Set<Character> would be a ridiculous way to represent the set of all emoji), how to specify such sets using Character properties, how this relates to other sets that aren't feasible to represent by enumerating all their elements (e.g. sets of integers with particular properties, this ContainmentSet proposal), etc. Once these decisions are made, the solution can be additively adopted by the trim methods at the same time as the other String methods adopt them.

3 Likes

I started with left right, considered trailing leading, and ended with start end to better mimic startIndex/endIndex.

The main point against left and right is that it has a different meaning depending on the particular script you are using. I assume that's why, in autolayout, we have leading and trailing and not left and right. I think you are correct to follow the prior art in this case.

I am against using an option set. I think supporting trim from the start or the end of the string is fine but they are never intended to be used simultaneously. The default behavior is "both" or "a full trim".

I don't understand this comment at all. Of course start and end can be used simultaneously, that's what "both" means. In fact, you can define "both" in terms of start and end.

struct Trimming: OptionSet
{
    var rawValue: UInt8

    static let start = Trimming(rawValue: 1)
    static let end = Trimming(rawValue: 2)
    static let both: Trimming = [start, end]
    static let neither: Trimming = []
}

OptionSet is the obvious model for this type.

I don't see why OptionSet is the obvious or natural model here. What does the raw value mean and why should it be a bitset? Why would you ever initialise this from the raw value? If you think this needs to be a set then just use Set<TrimEnd> where TrimEnd is an enum with .start and .end cases. I personally don't see the advantage of it being a set at all though, because it's unlikely that you're going to use set operations on it, the empty set doesn't really make any sense, etc. The three-valued enum in the proposal seems much more natural.

+1 for using an enum.

Will whatever implementation terminology (see @avi's comment below) we choose be impacted by RTL languages?