String hygiene

A thread to discuss trim/pad/center/etc. cc @Michael_Ilseman @armcknight @Dante-Broggi @tkrajacic

Updated gist to remove errant mutating: trim.md · GitHub

5 Likes

What should be the result of the following?

" \u{301}abc\n\u{301}".trimmed()

I think this would be a view over the characters of a string, so that the unicode grapheme clusters have already been resolved. I'm guessing that since swift gave me back ́abc\ń for your example, the grapheme clustering pass happens before escape sequences like \n.

So in this case, you get back exactly ́abc\ń, and any trimmable characters you added on either end of your input should be removed.

If you implemented this using trimmingCharacters, then it's actually what NSString returned. Since we'll want this for the standard library itself, we'll need an implementation that doesn't depend on Foundation (edit: this is an implementation detail, not important to this discussion). We'll need to define the semantics in these situations.

More precisely:

Array(" \u{301}abc\n\u{301}".trimmed().unicodeScalars)
// => Array<Unicode.Scalar> = ["\u{0301}", "a", "b", "c", "\n", "\u{0301}"]

So it did drop the leading space. This means that:

let str = " \u{301}abc\n\u{301}"
str.count // => 6
str.trimmed().count // => 6
(str + str).count // => 12
(str + str.trimmed()).count // => 11

Which might not be intuitive, as trimming results in an isolated combining scalar. This is further complicated by the fact that Unicode pre-4.1 recommended the use of a space followed by a combining scalar as a technique for displaying isolated combining scalars. If someone was doing this, they certainly don't want it trimmed! But, this was a bad idea (because computers use spaces as separators ), and they reverted the suggestion in 4.1 and later.

I'm not arguing either way, I'm just curious what the proper semantics should be. There might not be a good answer, so we'd have to figure out what the least bad answer is.

3 Likes

Sorry, I'm not sure my example translated well. The first character is actually a space with an acute accent superset, so it combined your leading space and acute accent codepoint. Check out the unicode scalars for your example:

Array(" \u{301}abc\n\u{301}".unicodeScalars)
// => $R1: [Unicode.Scalar] = 7 values {
//   [0] = U' '
//   [1] = U'́'
//   [2] = U'a'
//   [3] = U'b'
//   [4] = U'c'
//   [5] = U'\n'
//   [6] = U'́'
// }

Those accents are, for my lack of vocabulary, before the empty character, haha. It's the difference between these:

  7> " \u{0301}"
$R5: String = " ́"
  8> "\u{0301}"
$R6: String = "́"

Also though, ideally we could maintain some mapping from the original literal string definition to the unicode resolved version, and use those relationships to extract the correct ranges of characters from the original literal to return as a new set of backing unicode points. I hope that made any amount of sense.

Actually when I plug those resulting strings from my second block back into swift and extract unicode scalars again, that information seems to be preserved:

  9> let b = " ́"
b: String = " ́"
 11> Array(b.unicodeScalars)
$R8: [Unicode.Scalar] = 2 values {
  [0] = U' '
  [1] = U'́'
}
 12> let c = "́"
c: String = "́"
 13> Array(c.unicodeScalars)
$R9: [Unicode.Scalar] = 1 value {
  [0] = U'́'
}

@Erica_Sadun do we really want it to be a in Foundation or do we want to generalize it without any dependencies? If it's meant for the stdlib then the implementation should not rely on Foundation.

3 Likes

It's clear to me from @Michael_Ilseman's example that the NSString implementation is incorrect for Swift anyway, regardless of the dependency issues. " \u{301}" is not whitespace when processed as a single Swift Character, and String functionality should generally operate on the Character level. Are there any exceptions in the current String API, setting aside the unicodeScalars, utf16 and utf8 views? I guess you could say technically some of the edge cases with append, etc. because adjacent Characters may merge.

Can we use CharacterSet or does that fit into "no, that needs a redesign too"?

There is nothing wrong with using CharacterSet if it gives the semantics we want. What I'm looking for is: what semantics do we want, and why?

IMHO this API (like trimmingCharacters(in:)) fits in Foundation, not in the Standard Library.

I wouldn't say that it is incorrect for Swift. I would say that it seems inconsistent with a view of String as a collection of graphemes. These semantics may be perfect for, e.g., String.UnicodeScalarView.trimmed().

Degenerate graphemes are a reality that we have to deal with, and we (unfortunately) cannot have a nice algebraic String because of them. But, it does seem odd for String to have a common-use API that produces them.

I don't think I communicated my point well. My example was meant to demonstrate some corner cases that we need to have a strategy for, even if a random package's implementation doesn't care about the details. I see (at least) 3 potential results, each of which has some upsides and downsides. The question is what the result should be.

I’ll use \u{020} to represent a space explicitly.

“\u{020}\u{301}abc\n\u{301}”.trimmed() could yield:

  1. “abc”, under the view that String APIs operate on graphemes and a grapheme leading with whitespace is considered whitespace.
  2. “\u{020}\u{301}abc\n\u{301}”, under the view that String APIs operate on graphemes and a grapheme with a combining character is not whitespace.
  3. “\u{301}abc\n\u{301}”, which results in a degenerate leading grapheme.

I feel like #3 makes sense for a trimmed() on the UnicodeScalarView or the the code unit views. For String, we need to figure out which of these 3 (or perhaps an unknown 4th option) is the least harmful behavior.

1 Like

I've updated the gist. Is this more like you want? I avoided CharacterSet but that can be easily re-introduced.

(I went with I believe #2, but can again adjust accordingly)

1 Like

I'm wondering if the trimming functions should return a String or Substring. For example, if an algorithm was trimming a lot of strings, might memory usage increase undesirably?

I've no data for a warranted concern, just throwing the question out there. I would gladly accept the response that returning a Substring would be a case of premature optimisation. Perhaps we could find some data from usage of the Core Foundation APIs or the standard libraries of other general purpose languages?

If those sources show undesirable memory usage, maybe the proposal should include a range API for locating the desired text and the trimming functions use those?

1 Like

Another thought: parameterise the Set<Character> so users can trim specific characters, such as spaces but not new lines? Obviously, it could still use Trim.whitespaceAndNewlineCharacters by default.

I wrote it both ways for the non-mutating version and could support either consensus.

A few more thoughts.

The Trim enum has option semantics. If that's correct, maybe it should inherit from OptionSet instead? In that case, the .full case/option is redundant because full is just the set with start and end options specified. Specifying [.start, .end] better corresponds to the concept of trimming, too (full feels awkward, like an attempt to say bothSides (or bothEnds) but with less characters).

Trivially, parameterising the ends of the string to trim as a OptionSet saves you the cost of instantiating both [.start, .full] and [.end, .full] during the course of the implementation. Again, of course, the default value for this endsToTrim (or whatever) parameter would obviously be [.start, .end].

(It just occurred to me: my sincerest apologies if this is not the right place to give feedback about the trimming gist, but not sure where else).

I added the directionality for @sbromberger. My first implementation of the side-trim used option sets. Then I started using the type and I hated how it looks, so I re-did everything to not use option sets, justifying the readability for the two specialized use cases: start and end. I defaulted it out with "full" so that can be elided for normal use.

Compare:

string.trimmed()
string.trimmed(from: .start)
string.trimmed(from: .end)

with

string.trimmed(from: [.start])
string.trimmed(from: [.end])

and possibly this horror[1]:

string.trimmed(from: [.start, .end])

[1] I apply the "Lattner Rule of Gross™" here. cc @Chris_Lattner3

2 Likes

OptionSet doesn't seem to require the use of the array literal notation for a single option. For example, I'd expect string.trimmed(from: .start) and string.trimmed(from: [.start]) to compile and achieve the same thing.

If the default is [.start, .end], there'd be less chance of seeing such horrors at the call site.

6 Likes

Quick question, is there any reason trimmed couldn't return a Substring? It only removes from the front and end of the string, so is there any time it would have to modify the original string?

2 Likes