You could even add a static var full: Trimming { return [.start, .end] }
to Trimming
which would allow you to use .full
as [.start, .end]
if you were worried about it
I wasn't actually sure what the standard library rule is for returning a String
vs a Substring
, so I looked into it. Given that drop(while:)
, dropFirst()
, dropFirst(_:)
, dropLast()
, dropLast(_:)
, subscript(_:)
, prefix(_:)
, prefix(through:)
, prefix(upTo:)
, prefix(while:)
, suffix(_:)
, suffix(from:)
, split(separator:maxSplits:omittingEmptySubsequences:)
and split(maxSplits:omittingEmptySubsequences:whereSeparator:)
all return Substring
(or an Array
thereof), it seems obvious to me that any trim
method(s) should also return Substring
.
Looking at the various remove
methods, it's also clear that the in-place version, trim(from:)
, probably shouldn't return self
.
A few more things:
-
LINE SEPARATOR
is U+2028, not U+0301 (made me really confused when I tried to use the\u{020}\u{301}abc\n\u{301}
example) - Some whitespace characters combine with other whitespace characters. For sure
\r\n
is a single character in Swift (which the current code does not work on). Because of this, I think it would make more sense to operate on theunicodeScalars
view and then adjust the bounds at the end so that they are valid characters. A working example of this is here - It would be nice if this was available at least to
Substring
if not all ofStringProtocol
. Unfortunately that would break my previous implementation so if anyone knows how to do something similar without usingsamePosition(in:)
that would be great
Most of these return Substring
because they come from Sequence
/Collection
. Semantically, I think trim
should maybe also go onto one of these rather than String
. That would also make it available on Substring
as well.
CharacterSet
isn't too bad but it really abides by the concept of "what happens in CharacterSet
, stays in CharacterSet
". I think the biggest modification I would like to see is the ability to get the characters used out of CharacterSet
.
I do wonder though if it would benefit from the added functionality in 211 if and when that gets accepted.
@Erica_Sadun first thank you that the implementation is now Foundation
free Furthermore I support the ideas about returning a
Substring
instead, this feels more natural and should potentially increase the performance as well. I also support the idea of an OptionSet
type instead of an enum here. To me trimmed(from: [.start, .end])
reads better then trimmed(from: .full)
, but since it's already defaulted to this it will be seldom. I made a few modifications to play around with those ideas.
- I'm using an
OptionSet
type in the implementationTrimmPositions
. - I moved the contains function into a closure for better readability and also aligned the
guard/else
blocks (the provided code follows the 80 char line width). - Non-mutating
trimmed
functions returnsSubstring
instead (even for the empty string).
public struct TrimPositions : OptionSet {
public var rawValue: UInt8
public init(rawValue: UInt8) {
self.rawValue = rawValue
}
public static let start = TrimmPositions(rawValue: 1 << 0)
public static let end = TrimmPositions(rawValue: 1 << 1)
}
public func trimmed(
from positions: TrimPositions = [.start, .end]
) -> Substring {
// Ensure that this implementation does not rely on the NSString
// implementation of trimmingCharacters(in: .whitespacesAndNewlines)
guard !isEmpty else { return self[...] }
var (trimStart, trimEnd) = (startIndex, index(before: endIndex))
let isNeitherWhitespaceNorNewlineCharacter: (String.Index) -> Bool = {
!String.whitespaceAndNewlineCharacters.contains(self[$0])
}
let shouldBeTrimmed = positions.contains([.start, .end])
if shouldBeTrimmed || positions.contains(.start) {
guard
let start = indices.first(where: isNeitherWhitespaceNorNewlineCharacter)
else { return self[endIndex ..< endIndex] /* return an empty substring from the end */ }
trimStart = start
}
if shouldBeTrimmed || positions.contains(.end) {
guard
let end = indices.lazy
.reversed()
.first(where: isNeitherWhitespaceNorNewlineCharacter)
else { return self[startIndex ..< startIndex] /* return an empty substring from the start */ }
trimEnd = end
}
return self[trimStart ... trimEnd]
}
public mutating func trim(from positions: TrimPositions = [.start, .end]) {
self = String(self.trimmed(from: positions))
}
Last but not least I removed -> String
from the mutating trim
function since it should not return self
. I think this is very important since if this is going into stdlib it should mimic other existing APIs like reverse() -> Void
or sort() -> Void
etc.
What about such an option?
public static let bothSides: TrimPositions = [.start, .end]
Such an option would allow us to express something like string.trimmed(from: .bothSides)
instead of string.trimmed(from: [.start, .end])
.
And why are we using start/end
instead of left/right
?
The else bodies from guard
both return an empty Substring
still referencing to self for cases like the following:
let test = " "
test.trimmed().count == 0
test.trimmed(from: .start).count == 0
test.trimmed(from: .end).count == 0
Edit: I added lazy
to the second guard so that reversed
does not copy the same array since we don‘t need it anyway.
I have no clue about Unicode, but for reference this is what Go considers a white space, which seems to be different from your list (ranges below are inclusive):
0x0009 - 0x000d
0x0020
0x0085
0x1680
0x2000 - 0x200a
0x2028 - 0x2029
0x202f
0x205f
0x3000
They also support trimming using a cutset string or a function:
Maybe some annoying name bikeshedding, but wouldn't be trimmed(_ sides: …)
be more appropriate?
I am not a native speaker but for me it seems like trim
does not use a preposition and it sounds wrong to "trim from" to me.
-> You trim the start and you trim the end.
Sorry for the annoyance
I don't think an OptionSet
parameter really works here because string.trimmed(from: [])
is nonsensical. A simple enum is probably better with start
, end
, and bothSides
cases; exact names TBD. The method definition is simply func trimmed(from directions: TrimDirections = .bothSides) -> String
.
That aside, I support the proposal to raise trimming functionality out of Foundation and into the standard library.
string.trimmed(from: [])
is nonsensical
It's not nonsensical. It means trim from neither end i.e. don't do any trimming. It's a no-op.
You may argue that it is useless, but I would argue that it is not useless in the same sense that an empty array is not useless. For example, if you have a piece of functionality which allows a user to trim leading, trailing, leading and trailing, or no spaces, then the code will be cleaner with the option set than with the enum.
That is correct, in most cases you don‘t need the empty option but it would be really handy in generic algortihms because it makes it cleaner and requires you to write less code to achieve the same thing.
The problem may just be an implementation detail that doesn't consider "\r\n" to be whitespace. Swift is nearing its 5th major version and we're stabilizing the ABI of a fixed-layout and fairly inlinable String, Character, String.Index, etc. Common sense dictates that the standard library should be able to make up its mind as to what Character
s are and are not whitespace
I'll spin off a separate thread to discuss this distinction (with my views and justification as well, when I've written them all down). It directly influences the Character Properties proposal and other future work (e.g. String.lines()
).
I do agree that it would make sense for this API to also be available on the UnicodeScalarView
, where it would operate scalar-by-scalar rather than grapheme-by-grapheme. This is very useful for compatibility with many other language's views of strings as unsegmented sequences of code units (or scalars).
I started with left right, considered trailing leading, and ended with start end to better mimic startIndex/endIndex.
I don't mind returning an empty substring but what is the best way to construct that?
I am against using an option set. I think supporting trim from the start or the end of the string is fine but they are never intended to be used simultaneously. The default behavior is "both" or "a full trim".
I decided to include it in the original design because it's easier to have it voted out than voted in.
For my first go at this, I directly copied the white space details from NSString
's trimmingCharacters(in:)
.
Still up for grabs:
- grapheme-by-grapheme or scalar-by-scalar (I want accented whitespace to be non-whitespace)
- What types/protocols adopt this: string, substring, stringprotocol, unicodescalarview
- A standards based decision of what constitutes whitespace, coordinated with character properties and other proposals.
I feel strongly about allowing the caller specify the Set<Character>
value because it'd be a regression over NSString
's trimmingCharacters(in:)
API, API which provides an expectation or precedent for expected behaviour. Failing to match its functionality in this case would increase residual dependence on Foundation and limit the portability of a codebase to other platforms.
Another reason I feel strongly about it is there are times when I've wanted to strip some whitespace to the exclusion of others. More relevant to the present discussion is what counts as whitespace, changes to that definition in the future and perhaps trivially, any concerns about performance and the number of members in the character set.
Also, in my opinion, there's nothing more infuriating than an API lacking simple affordances.
I think whitespace is by far the most common thing to be trimmed from a String
, and this opinion is supported by the survey quoted in the proposal, so I think there is a strong case for accepting it even without being able to customise the set of Character
s. There's a much more involved decision to be made about how to represent sets of Character
s generally (e.g. Set<Character>
would be a ridiculous way to represent the set of all emoji), how to specify such sets using Character
properties, how this relates to other sets that aren't feasible to represent by enumerating all their elements (e.g. sets of integers with particular properties, this ContainmentSet
proposal), etc. Once these decisions are made, the solution can be additively adopted by the trim
methods at the same time as the other String
methods adopt them.
I started with left right, considered trailing leading, and ended with start end to better mimic startIndex/endIndex.
The main point against left and right is that it has a different meaning depending on the particular script you are using. I assume that's why, in autolayout, we have leading and trailing and not left and right. I think you are correct to follow the prior art in this case.
I am against using an option set. I think supporting trim from the start or the end of the string is fine but they are never intended to be used simultaneously. The default behavior is "both" or "a full trim".
I don't understand this comment at all. Of course start and end can be used simultaneously, that's what "both" means. In fact, you can define "both" in terms of start and end.
struct Trimming: OptionSet
{
var rawValue: UInt8
static let start = Trimming(rawValue: 1)
static let end = Trimming(rawValue: 2)
static let both: Trimming = [start, end]
static let neither: Trimming = []
}
OptionSet is the obvious model for this type.
I don't see why OptionSet
is the obvious or natural model here. What does the raw value mean and why should it be a bitset? Why would you ever initialise this from the raw value? If you think this needs to be a set then just use Set<TrimEnd>
where TrimEnd
is an enum with .start
and .end
cases. I personally don't see the advantage of it being a set at all though, because it's unlikely that you're going to use set operations on it, the empty set doesn't really make any sense, etc. The three-valued enum in the proposal seems much more natural.
+1 for using an enum
.
Will whatever implementation terminology (see @avi's comment below) we choose be impacted by RTL languages?