A thread to discuss trim/pad/center/etc. cc @Michael_Ilseman @armcknight @Dante-Broggi @tkrajacic
Updated gist to remove errant mutating: trim.md · GitHub
What should be the result of the following?
" \u{301}abc\n\u{301}".trimmed()
I think this would be a view over the characters of a string, so that the Unicode grapheme clusters have already been resolved. I'm guessing that since Swift gave me back ́abc\ń for your example, the grapheme clustering pass happens before escape sequences like \n. So in this case, you get back exactly ́abc\ń, and any trimmable characters you added on either end of your input should be removed.
If you implement this using trimmingCharacters(in:), then that result is what NSString actually returns. Since we'll want this for the standard library itself, we'll need an implementation that doesn't depend on Foundation (edit: this is an implementation detail, not important to this discussion). Either way, we'll need to define the semantics in these situations.
More precisely:
Array(" \u{301}abc\n\u{301}".trimmed().unicodeScalars)
// => Array<Unicode.Scalar> = ["\u{0301}", "a", "b", "c", "\n", "\u{0301}"]
So it did drop the leading space. This means that:
let str = " \u{301}abc\n\u{301}"
str.count // => 6
str.trimmed().count // => 6
(str + str).count // => 12
(str + str.trimmed()).count // => 11
Which might not be intuitive, as trimming results in an isolated combining scalar. This is further complicated by the fact that Unicode pre-4.1 recommended the use of a space followed by a combining scalar as a technique for displaying isolated combining scalars. If someone were doing this, they certainly wouldn't want it trimmed! But this was a bad idea (because computers use spaces as separators), and the suggestion was reverted in 4.1 and later.
I'm not arguing either way, I'm just curious what the proper semantics should be. There might not be a good answer, so we'd have to figure out what the least bad answer is.
Sorry, I'm not sure my example translated well. The first character is actually a space with the acute accent superimposed on it, so it combined your leading space and the acute accent code point. Check out the Unicode scalars for your example:
Array(" \u{301}abc\n\u{301}".unicodeScalars)
// => $R1: [Unicode.Scalar] = 7 values {
// [0] = U' '
// [1] = U'́'
// [2] = U'a'
// [3] = U'b'
// [4] = U'c'
// [5] = U'\n'
// [6] = U'́'
// }
Those accents are, for my lack of vocabulary, before the empty character, haha. It's the difference between these:
7> " \u{0301}"
$R5: String = " ́"
8> "\u{0301}"
$R6: String = "́"
Also, though, ideally we could maintain some mapping from the original literal string definition to the Unicode-resolved version, and use those relationships to extract the correct ranges of characters from the original literal to return as a new set of backing Unicode code points. I hope that made any amount of sense.
Actually, when I plug those resulting strings from my second block back into Swift and extract Unicode scalars again, that information seems to be preserved:
9> let b = " ́"
b: String = " ́"
11> Array(b.unicodeScalars)
$R8: [Unicode.Scalar] = 2 values {
[0] = U' '
[1] = U'́'
}
12> let c = "́"
c: String = "́"
13> Array(c.unicodeScalars)
$R9: [Unicode.Scalar] = 1 value {
[0] = U'́'
}
@Erica_Sadun do we really want it to be in Foundation, or do we want to generalize it without any dependencies? If it's meant for the stdlib, then the implementation should not rely on Foundation.
It's clear to me from @Michael_Ilseman's example that the NSString implementation is incorrect for Swift anyway, regardless of the dependency issues. " \u{301}" is not whitespace when processed as a single Swift Character, and String functionality should generally operate on the Character level. Are there any exceptions in the current String API, setting aside the unicodeScalars, utf16, and utf8 views? I guess you could technically count some of the edge cases with append, etc., because adjacent Characters may merge.
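For example, something along these lines (assuming Swift's current behavior, where appending a combining scalar merges it into the preceding Character):
var s = "e"
s.count        // => 1
s += "\u{301}" // append a combining acute accent
s.count        // => still 1; "e" + U+0301 forms the single Character "é"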
Can we use CharacterSet or does that fit into "no, that needs a redesign too"?
There is nothing wrong with using CharacterSet if it gives the semantics we want. What I'm looking for is: what semantics do we want, and why?
IMHO this API (like trimmingCharacters(in:)) fits in Foundation, not in the Standard Library.
I wouldn't say that it is incorrect for Swift. I would say that it seems inconsistent with a view of String as a collection of graphemes. These semantics may be perfect for, e.g., String.UnicodeScalarView.trimmed().
Degenerate graphemes are a reality that we have to deal with, and we (unfortunately) cannot have a nice algebraic String because of them. But, it does seem odd for String to have a common-use API that produces them.
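For example:
// A lone combining scalar is a valid (degenerate) single-Character String...
"\u{301}".count               // => 1
// ...and concatenation can merge it into its neighbor, so lengths don't add up:
"a".count + "\u{301}".count   // => 2
("a" + "\u{301}").count       // => 1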
I don't think I communicated my point well. My example was meant to demonstrate some corner cases that we need to have a strategy for, even if a random package's implementation doesn't care about the details. I see (at least) 3 potential results, each of which has some upsides and downsides. The question is what the result should be.
I’ll use \u{020} to represent a space explicitly.
"\u{020}\u{301}abc\n\u{301}".trimmed()
could yield:
I feel like #3 makes sense for a trimmed() on the UnicodeScalarView or the code unit views. For String, we need to figure out which of these 3 (or perhaps an unknown 4th option) is the least harmful behavior.
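For reference, a scalar-level trimmed() on the UnicodeScalarView (roughly the #3 behavior) might be sketched like this; the whitespace test here is only a placeholder, not a concrete proposal:
extension String.UnicodeScalarView {
    // Sketch only: drop leading/trailing whitespace scalars,
    // ignoring grapheme cluster boundaries entirely.
    func trimmed() -> String.UnicodeScalarView.SubSequence {
        guard let start = firstIndex(where: { !$0.properties.isWhitespace }) else {
            return self[endIndex...]  // all-whitespace input
        }
        let end = lastIndex(where: { !$0.properties.isWhitespace })!
        return self[start...end]
    }
}

Array(" \u{301}abc\n\u{301}".unicodeScalars.trimmed())
// => ["\u{0301}", "a", "b", "c", "\n", "\u{0301}"]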
I've updated the gist. Is this more like what you want? I avoided CharacterSet, but that can easily be re-introduced.
(I went with #2, I believe, but can again adjust accordingly.)
I'm wondering if the trimming functions should return a String or a Substring. For example, if an algorithm were trimming a lot of strings, might memory usage increase undesirably?
I've no data showing this is a warranted concern; I'm just throwing the question out there. I would gladly accept the response that returning a Substring would be a case of premature optimisation. Perhaps we could find some data from usage of the Core Foundation APIs or the standard libraries of other general-purpose languages?
If those sources show undesirable memory usage, maybe the proposal should include a range API for locating the desired text, and the trimming functions could use those ranges?
Another thought: parameterise the Set<Character> so users can trim specific characters, such as spaces but not newlines? Obviously, it could still use Trim.whitespaceAndNewlineCharacters by default.
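For concreteness, a rough sketch combining both thoughts: a Character-level trim parameterised by a set and returning a Substring. The name and the default set here are just stand-ins for whatever the gist settles on (e.g. Trim.whitespaceAndNewlineCharacters):
extension String {
    // Sketch only: trim whole Characters from both ends using a
    // caller-supplied set; returning Substring shares the original storage.
    func trimmed(charactersIn trimSet: Set<Character> = [" ", "\t", "\n", "\r"]) -> Substring {
        guard let start = firstIndex(where: { !trimSet.contains($0) }) else {
            return self[endIndex...]  // nothing but trimmable Characters
        }
        let end = lastIndex(where: { !trimSet.contains($0) })!
        return self[start...end]
    }
}

"  hello\n".trimmed()                     // => "hello"
"--hello--".trimmed(charactersIn: ["-"])  // => "hello"
" \u{301}abc\n\u{301}".trimmed()          // unchanged: neither end Character is a plain whitespace Character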
I wrote it both ways for the non-mutating version and could support either consensus.
A few more thoughts.
The Trim enum has option semantics. If that's correct, maybe it should conform to OptionSet instead? In that case, the .full case/option is redundant, because full is just the set with the start and end options specified. Specifying [.start, .end] better corresponds to the concept of trimming, too (full feels awkward, like an attempt to say bothSides (or bothEnds) but with fewer characters).
Trivially, parameterising the ends of the string to trim as an OptionSet saves you the cost of instantiating both [.start, .full] and [.end, .full] during the course of the implementation. Again, of course, the default value for this endsToTrim (or whatever) parameter would obviously be [.start, .end].
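A minimal sketch of that OptionSet shape, with placeholder names and a stubbed-in Character-level trim:
// Placeholder names, not the gist's actual types.
struct TrimEnds: OptionSet {
    let rawValue: Int
    static let start = TrimEnds(rawValue: 1 << 0)
    static let end   = TrimEnds(rawValue: 1 << 1)
}

extension String {
    func trimmed(from ends: TrimEnds = [.start, .end],
                 in trimSet: Set<Character> = [" ", "\t", "\n", "\r"]) -> Substring {
        var result = self[...]
        if ends.contains(.start) {
            result = result.drop(while: { trimSet.contains($0) })
        }
        if ends.contains(.end) {
            while let last = result.last, trimSet.contains(last) {
                result = result.dropLast()
            }
        }
        return result
    }
}

// A single option doesn't need array-literal syntax, and the default covers both ends:
// string.trimmed(from: .start)
// string.trimmed()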
(It just occurred to me: my sincerest apologies if this is not the right place to give feedback about the trimming gist, but I'm not sure where else to put it.)
I added the directionality for @sbromberger. My first implementation of the side-trim used option sets. Then I started using the type and hated how it looked, so I re-did everything to not use option sets, justifying this with the readability of the two specialized use cases: start and end. I defaulted it to "full" so that it can be elided for normal use.
Compare:
string.trimmed()
string.trimmed(from: .start)
string.trimmed(from: .end)
with
string.trimmed(from: [.start])
string.trimmed(from: [.end])
and possibly this horror[1]:
string.trimmed(from: [.start, .end])
[1] I apply the "Lattner Rule of Gross™" here. cc @Chris_Lattner3
OptionSet doesn't seem to require the use of array literal notation for a single option. For example, I'd expect string.trimmed(from: .start) and string.trimmed(from: [.start]) to compile and achieve the same thing.
If the default is [.start, .end], there'd be less chance of seeing such horrors at the call site.
Quick question: is there any reason trimmed couldn't return a Substring? It only removes from the front and end of the string, so is there any time it would have to modify the original string?