String hygiene

@Erica_Sadun do we really want it to be a in Foundation or do we want to generalize it without any dependencies? If it's meant for the stdlib then the implementation should not rely on Foundation.

3 Likes

It's clear to me from @Michael_Ilseman's example that the NSString implementation is incorrect for Swift anyway, regardless of the dependency issues. " \u{301}" is not whitespace when processed as a single Swift Character, and String functionality should generally operate on the Character level. Are there any exceptions in the current String API, setting aside the unicodeScalars, utf16 and utf8 views? I guess you could say technically some of the edge cases with append, etc. because adjacent Characters may merge.

Can we use CharacterSet or does that fit into "no, that needs a redesign too"?

There is nothing wrong with using CharacterSet if it gives the semantics we want. What I'm looking for is: what semantics do we want, and why?

IMHO this API (like trimmingCharacters(in:)) fits in Foundation, not in the Standard Library.

I wouldn't say that it is incorrect for Swift. I would say that it seems inconsistent with a view of String as a collection of graphemes. These semantics may be perfect for, e.g., String.UnicodeScalarView.trimmed().

Degenerate graphemes are a reality that we have to deal with, and we (unfortunately) cannot have a nice algebraic String because of them. But, it does seem odd for String to have a common-use API that produces them.

I don't think I communicated my point well. My example was meant to demonstrate some corner cases that we need to have a strategy for, even if a random package's implementation doesn't care about the details. I see (at least) 3 potential results, each of which has some upsides and downsides. The question is what the result should be.

I’ll use \u{020} to represent a space explicitly.

“\u{020}\u{301}abc\n\u{301}”.trimmed() could yield:

  1. “abc”, under the view that String APIs operate on graphemes and a grapheme leading with whitespace is considered whitespace.
  2. “\u{020}\u{301}abc\n\u{301}”, under the view that String APIs operate on graphemes and a grapheme with a combining character is not whitespace.
  3. “\u{301}abc\n\u{301}”, which results in a degenerate leading grapheme.

I feel like #3 makes sense for a trimmed() on the UnicodeScalarView or the the code unit views. For String, we need to figure out which of these 3 (or perhaps an unknown 4th option) is the least harmful behavior.

1 Like

I've updated the gist. Is this more like you want? I avoided CharacterSet but that can be easily re-introduced.

(I went with I believe #2, but can again adjust accordingly)

1 Like

I'm wondering if the trimming functions should return a String or Substring. For example, if an algorithm was trimming a lot of strings, might memory usage increase undesirably?

I've no data for a warranted concern, just throwing the question out there. I would gladly accept the response that returning a Substring would be a case of premature optimisation. Perhaps we could find some data from usage of the Core Foundation APIs or the standard libraries of other general purpose languages?

If those sources show undesirable memory usage, maybe the proposal should include a range API for locating the desired text and the trimming functions use those?

1 Like

Another thought: parameterise the Set<Character> so users can trim specific characters, such as spaces but not new lines? Obviously, it could still use Trim.whitespaceAndNewlineCharacters by default.

I wrote it both ways for the non-mutating version and could support either consensus.

A few more thoughts.

The Trim enum has option semantics. If that's correct, maybe it should inherit from OptionSet instead? In that case, the .full case/option is redundant because full is just the set with start and end options specified. Specifying [.start, .end] better corresponds to the concept of trimming, too (full feels awkward, like an attempt to say bothSides (or bothEnds) but with less characters).

Trivially, parameterising the ends of the string to trim as a OptionSet saves you the cost of instantiating both [.start, .full] and [.end, .full] during the course of the implementation. Again, of course, the default value for this endsToTrim (or whatever) parameter would obviously be [.start, .end].

(It just occurred to me: my sincerest apologies if this is not the right place to give feedback about the trimming gist, but not sure where else).

I added the directionality for @sbromberger. My first implementation of the side-trim used option sets. Then I started using the type and I hated how it looks, so I re-did everything to not use option sets, justifying the readability for the two specialized use cases: start and end. I defaulted it out with "full" so that can be elided for normal use.

Compare:

string.trimmed()
string.trimmed(from: .start)
string.trimmed(from: .end)

with

string.trimmed(from: [.start])
string.trimmed(from: [.end])

and possibly this horror[1]:

string.trimmed(from: [.start, .end])

[1] I apply the "Lattner Rule of Gross™" here. cc @Chris_Lattner3

2 Likes

OptionSet doesn't seem to require the use of the array literal notation for a single option. For example, I'd expect string.trimmed(from: .start) and string.trimmed(from: [.start]) to compile and achieve the same thing.

If the default is [.start, .end], there'd be less chance of seeing such horrors at the call site.

6 Likes

Quick question, is there any reason trimmed couldn't return a Substring? It only removes from the front and end of the string, so is there any time it would have to modify the original string?

2 Likes

You could even add a static var full: Trimming { return [.start, .end] } to Trimming which would allow you to use .full as [.start, .end] if you were worried about it

1 Like

I wasn't actually sure what the standard library rule is for returning a String vs a Substring, so I looked into it. Given that drop(while:), dropFirst(), dropFirst(_:), dropLast(), dropLast(_:), subscript(_:), prefix(_:), prefix(through:), prefix(upTo:), prefix(while:), suffix(_:), suffix(from:), split(separator:maxSplits:omittingEmptySubsequences:) and split(maxSplits:omittingEmptySubsequences:whereSeparator:) all return Substring (or an Array thereof), it seems obvious to me that any trim method(s) should also return Substring.

Looking at the various remove methods, it's also clear that the in-place version, trim(from:), probably shouldn't return self.

3 Likes

A few more things:

  • LINE SEPARATOR is U+2028, not U+0301 (made me really confused when I tried to use the \u{020}\u{301}abc\n\u{301} example)
  • Some whitespace characters combine with other whitespace characters. For sure \r\n is a single character in Swift (which the current code does not work on). Because of this, I think it would make more sense to operate on the unicodeScalars view and then adjust the bounds at the end so that they are valid characters. A working example of this is here
  • It would be nice if this was available at least to Substring if not all of StringProtocol. Unfortunately that would break my previous implementation so if anyone knows how to do something similar without using samePosition(in:) that would be great

Most of these return Substring because they come from Sequence/Collection. Semantically, I think trim should maybe also go onto one of these rather than String. That would also make it available on Substring as well.

CharacterSet isn't too bad but it really abides by the concept of "what happens in CharacterSet, stays in CharacterSet". I think the biggest modification I would like to see is the ability to get the characters used out of CharacterSet.

I do wonder though if it would benefit from the added functionality in 211 if and when that gets accepted.

@Erica_Sadun first thank you that the implementation is now Foundation free :slight_smile: Furthermore I support the ideas about returning a Substring instead, this feels more natural and should potentially increase the performance as well. I also support the idea of an OptionSet type instead of an enum here. To me trimmed(from: [.start, .end]) reads better then trimmed(from: .full), but since it's already defaulted to this it will be seldom. I made a few modifications to play around with those ideas.

  1. I'm using an OptionSet type in the implementation TrimmPositions.
  2. I moved the contains function into a closure for better readability and also aligned the guard/else blocks (the provided code follows the 80 char line width).
  3. Non-mutating trimmed functions returns Substring instead (even for the empty string).
public struct TrimPositions : OptionSet {
  public var rawValue: UInt8
  public init(rawValue: UInt8) {
    self.rawValue = rawValue
  }
  public static let start = TrimmPositions(rawValue: 1 << 0)
  public static let end = TrimmPositions(rawValue: 1 << 1)
}

public func trimmed(
  from positions: TrimPositions = [.start, .end]
) -> Substring {
  // Ensure that this implementation does not rely on the NSString 
  // implementation of trimmingCharacters(in: .whitespacesAndNewlines)

  guard !isEmpty else { return self[...] }
  var (trimStart, trimEnd) = (startIndex, index(before: endIndex))

  let isNeitherWhitespaceNorNewlineCharacter: (String.Index) -> Bool = {
    !String.whitespaceAndNewlineCharacters.contains(self[$0])
  }

  let shouldBeTrimmed = positions.contains([.start, .end])

  if shouldBeTrimmed || positions.contains(.start) {
    guard
      let start = indices.first(where: isNeitherWhitespaceNorNewlineCharacter)
    else { return self[endIndex ..< endIndex] /* return an empty substring from the end */ }
    trimStart = start
  }

  if shouldBeTrimmed || positions.contains(.end) {
    guard
      let end = indices.lazy
        .reversed()
        .first(where: isNeitherWhitespaceNorNewlineCharacter)
    else { return self[startIndex ..< startIndex] /* return an empty substring from the start */ }
    trimEnd = end
  }

  return self[trimStart ... trimEnd]
}

public mutating func trim(from positions: TrimPositions = [.start, .end]) {
  self = String(self.trimmed(from: positions))
}

Last but not least I removed -> String from the mutating trim function since it should not return self. I think this is very important since if this is going into stdlib it should mimic other existing APIs like reverse() -> Void or sort() -> Void etc.


What about such an option?

public static let bothSides: TrimPositions = [.start, .end]

Such an option would allow us to express something like string.trimmed(from: .bothSides) instead of string.trimmed(from: [.start, .end]).

And why are we using start/end instead of left/right?


The else bodies from guard both return an empty Substring still referencing to self for cases like the following:

let test = " "
test.trimmed().count == 0
test.trimmed(from: .start).count == 0
test.trimmed(from: .end).count == 0

Edit: I added lazy to the second guard so that reversed does not copy the same array since we don‘t need it anyway.

1 Like
Terms of Service

Privacy Policy

Cookie Policy