String hygiene

Erica_Sadun · May 18, 2018, 11:06pm

A thread to discuss trim/pad/center/etc. cc @Michael_Ilseman @armcknight @Dante-Broggi @tkrajacic

Updated gist to remove errant mutating: trim.md · GitHub

Michael_Ilseman · May 18, 2018, 11:12pm

What should be the result of the following?

" \u{301}abc\n\u{301}".trimmed()

armcknight · May 18, 2018, 11:34pm

I think this would be a view over the characters of a string, so that the unicode grapheme clusters have already been resolved. I'm guessing that since swift gave me back ́abc\ń for your example, the grapheme clustering pass happens before escape sequences like \n.

So in this case, you get back exactly ́abc\ń, and any trimmable characters you added on either end of your input should be removed.

Michael_Ilseman · May 18, 2018, 11:50pm

If you implemented this using trimmingCharacters, then it's actually what NSString returned. ~~Since we'll want this for the standard library itself, we'll need an implementation that doesn't depend on Foundation~~ (edit: this is an implementation detail, not important to this discussion). We'll need to define the semantics in these situations.

More precisely:

Array(" \u{301}abc\n\u{301}".trimmed().unicodeScalars)
// => Array<Unicode.Scalar> = ["\u{0301}", "a", "b", "c", "\n", "\u{0301}"]

So it did drop the leading space. This means that:

let str = " \u{301}abc\n\u{301}"
str.count // => 6
str.trimmed().count // => 6
(str + str).count // => 12
(str + str.trimmed()).count // => 11

Which might not be intuitive, as trimming results in an isolated combining scalar. This is further complicated by the fact that Unicode pre-4.1 recommended the use of a space followed by a combining scalar as a technique for displaying isolated combining scalars. If someone was doing this, they certainly don't want it trimmed! But, this was a bad idea (because computers use spaces as separators ), and they reverted the suggestion in 4.1 and later.

I'm not arguing either way, I'm just curious what the proper semantics should be. There might not be a good answer, so we'd have to figure out what the least bad answer is.

armcknight · May 19, 2018, 12:01am

Sorry, I'm not sure my example translated well. The first character is actually a space with an acute accent superset, so it combined your leading space and acute accent codepoint. Check out the unicode scalars for your example:

Array(" \u{301}abc\n\u{301}".unicodeScalars)
// => $R1: [Unicode.Scalar] = 7 values {
//   [0] = U' '
//   [1] = U'́'
//   [2] = U'a'
//   [3] = U'b'
//   [4] = U'c'
//   [5] = U'\n'
//   [6] = U'́'
// }

Those accents are, for my lack of vocabulary, before the empty character, haha. It's the difference between these:

  7> " \u{0301}"
$R5: String = " ́"
  8> "\u{0301}"
$R6: String = "́"

Also though, ideally we could maintain some mapping from the original literal string definition to the unicode resolved version, and use those relationships to extract the correct ranges of characters from the original literal to return as a new set of backing unicode points. I hope that made any amount of sense.

armcknight · May 19, 2018, 12:14am

Actually when I plug those resulting strings from my second block back into swift and extract unicode scalars again, that information seems to be preserved:

  9> let b = " ́"
b: String = " ́"
 11> Array(b.unicodeScalars)
$R8: [Unicode.Scalar] = 2 values {
  [0] = U' '
  [1] = U'́'
}
 12> let c = "́"
c: String = "́"
 13> Array(c.unicodeScalars)
$R9: [Unicode.Scalar] = 1 value {
  [0] = U'́'
}

DevAndArtist · May 19, 2018, 6:06am

@Erica_Sadun do we really want it to be a in Foundation or do we want to generalize it without any dependencies? If it's meant for the stdlib then the implementation should not rely on Foundation.

jawbroken · May 20, 2018, 6:16am

It's clear to me from @Michael_Ilseman's example that the NSString implementation is incorrect for Swift anyway, regardless of the dependency issues. " \u{301}" is not whitespace when processed as a single Swift Character, and String functionality should generally operate on the Character level. Are there any exceptions in the current String API, setting aside the unicodeScalars, utf16 and utf8 views? I guess you could say technically some of the edge cases with append, etc. because adjacent Characters may merge.

Erica_Sadun · May 20, 2018, 8:30pm

Can we use CharacterSet or does that fit into "no, that needs a redesign too"?

Michael_Ilseman · May 20, 2018, 8:33pm

There is nothing wrong with using CharacterSet if it gives the semantics we want. What I'm looking for is: what semantics do we want, and why?

pvieito · May 20, 2018, 8:40pm

IMHO this API (like trimmingCharacters(in:)) fits in Foundation, not in the Standard Library.

Michael_Ilseman · May 20, 2018, 9:02pm

I wouldn't say that it is incorrect for Swift. I would say that it seems inconsistent with a view of String as a collection of graphemes. These semantics may be perfect for, e.g., String.UnicodeScalarView.trimmed().

Degenerate graphemes are a reality that we have to deal with, and we (unfortunately) cannot have a nice algebraic String because of them. But, it does seem odd for String to have a common-use API that produces them.

I don't think I communicated my point well. My example was meant to demonstrate some corner cases that we need to have a strategy for, even if a random package's implementation doesn't care about the details. I see (at least) 3 potential results, each of which has some upsides and downsides. The question is what the result should be.

I’ll use \u{020} to represent a space explicitly.

“\u{020}\u{301}abc\n\u{301}”.trimmed() could yield:

“abc”, under the view that String APIs operate on graphemes and a grapheme leading with whitespace is considered whitespace.
“\u{020}\u{301}abc\n\u{301}”, under the view that String APIs operate on graphemes and a grapheme with a combining character is not whitespace.
“\u{301}abc\n\u{301}”, which results in a degenerate leading grapheme.

I feel like #3 makes sense for a trimmed() on the UnicodeScalarView or the the code unit views. For String, we need to figure out which of these 3 (or perhaps an unknown 4th option) is the least harmful behavior.

Erica_Sadun · May 20, 2018, 10:24pm

I've updated the gist. Is this more like you want? I avoided CharacterSet but that can be easily re-introduced.

(I went with I believe #2, but can again adjust accordingly)

kiel · May 21, 2018, 12:48am

I'm wondering if the trimming functions should return a String or Substring. For example, if an algorithm was trimming a lot of strings, might memory usage increase undesirably?

I've no data for a warranted concern, just throwing the question out there. I would gladly accept the response that returning a Substring would be a case of premature optimisation. Perhaps we could find some data from usage of the Core Foundation APIs or the standard libraries of other general purpose languages?

If those sources show undesirable memory usage, maybe the proposal should include a range API for locating the desired text and the trimming functions use those?

kiel · May 21, 2018, 12:52am

Another thought: parameterise the Set<Character> so users can trim specific characters, such as spaces but not new lines? Obviously, it could still use Trim.whitespaceAndNewlineCharacters by default.

Erica_Sadun · May 21, 2018, 12:53am

I wrote it both ways for the non-mutating version and could support either consensus.

kiel · May 21, 2018, 1:26am

A few more thoughts.

The Trim enum has option semantics. If that's correct, maybe it should inherit from OptionSet instead? In that case, the .full case/option is redundant because full is just the set with start and end options specified. Specifying [.start, .end] better corresponds to the concept of trimming, too (full feels awkward, like an attempt to say bothSides (or bothEnds) but with less characters).

Trivially, parameterising the ends of the string to trim as a OptionSet saves you the cost of instantiating both [.start, .full] and [.end, .full] during the course of the implementation. Again, of course, the default value for this endsToTrim (or whatever) parameter would obviously be [.start, .end].

(It just occurred to me: my sincerest apologies if this is not the right place to give feedback about the trimming gist, but not sure where else).

Erica_Sadun · May 21, 2018, 1:34am

I added the directionality for @sbromberger. My first implementation of the side-trim used option sets. Then I started using the type and I hated how it looks, so I re-did everything to not use option sets, justifying the readability for the two specialized use cases: start and end. I defaulted it out with "full" so that can be elided for normal use.

Compare:

string.trimmed()
string.trimmed(from: .start)
string.trimmed(from: .end)

with

string.trimmed(from: [.start])
string.trimmed(from: [.end])

and possibly this horror^[1]:

string.trimmed(from: [.start, .end])

^[1] I apply the "Lattner Rule of Gross™" here. cc @Chris_Lattner3

kiel · May 21, 2018, 1:44am

OptionSet doesn't seem to require the use of the array literal notation for a single option. For example, I'd expect string.trimmed(from: .start) and string.trimmed(from: [.start]) to compile and achieve the same thing.

If the default is [.start, .end], there'd be less chance of seeing such horrors at the call site.

TellowKrinkle · May 21, 2018, 2:49am

Quick question, is there any reason trimmed couldn't return a Substring? It only removes from the front and end of the string, so is there any time it would have to modify the original string?