Strings in Swift 4

Perhaps we could separate the proposal from the manifesto, and then use the manifesto to collect future plans as well in all of the areas around string processing (which could later be broken off into smaller proposals), similar to how the Generics manifesto has been working?

My main concern with the Swift 4 process is that there are so many good ideas being thrown about, which are then deferred, but not really captured in a central/organized place… so we keep having the same discussion over and over. Ultimately, I am convinced we need a separate facility/process for brainstorming about future proposals, but in the meantime having a few manifestos capturing future ideas that we then use as guide posts that we are headed towards is a good step.

Thanks,
Jon

···

On Jan 20, 2017, at 2:20 PM, Dave Abrahams via swift-evolution <swift-evolution@swift.org> wrote:

on Fri Jan 20 2017, Gwendal Roué <swift-evolution@swift.org> wrote:

One ask - make string interpolation great again?

I have a dream that ExpressibleByStringInterpolation would allow one to distinguish literal segments
from embedded inputs.

Today, the documentation of this protocol [1] says:

  "One cookie: $\(price), \(number) cookies: $\(price * number)."
  // <=>
  let message = String(stringInterpolation:
    String(stringInterpolationSegment: "One cookie: $"),
    String(stringInterpolationSegment: price),
    String(stringInterpolationSegment: ", "),
    String(stringInterpolationSegment: number),
    String(stringInterpolationSegment: " cookies: $"),
    String(stringInterpolationSegment: price * number),
    String(stringInterpolationSegment: "."))

This means that ExpressibleByStringInterpolation can't distinguish "foo" from `bar` in "foo\(bar)".

If this distinction were possible, some nice features could emerge, such as context-sensitive
escaping:

  // func render(_ html: HTML)
  let title = "<script>boom();</script>"
  render("<h1>\(title)</h1>") // escapes input

  // func query(_ sql: SQL)
  let name = "Robert'); DROP TABLE students; --"
  query("SELECT * FROM students WHERE name = \(name)") // avoids SQL injection
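
Swift's interpolation protocol can't express this today, but here is a minimal sketch of how it might look, assuming a redesigned protocol in which literal segments and embedded inputs arrive through separate calls (the protocol shape and the HTML type here are hypothetical, not current API):

```swift
import Foundation

// Hypothetical HTML type: literal segments pass through untouched,
// interpolated inputs are escaped before being appended.
struct HTML: ExpressibleByStringInterpolation {
    var raw: String

    init(stringLiteral value: String) { raw = value }
    init(stringInterpolation: StringInterpolation) { raw = stringInterpolation.output }

    struct StringInterpolation: StringInterpolationProtocol {
        var output = ""
        init(literalCapacity: Int, interpolationCount: Int) {
            output.reserveCapacity(literalCapacity)
        }
        // Literal segments: trusted, appended verbatim.
        mutating func appendLiteral(_ literal: String) { output += literal }
        // Embedded inputs: escaped, so injection can't happen.
        mutating func appendInterpolation(_ value: String) {
            output += value
                .replacingOccurrences(of: "&", with: "&amp;")
                .replacingOccurrences(of: "<", with: "&lt;")
                .replacingOccurrences(of: ">", with: "&gt;")
        }
    }
}

let title = "<script>boom();</script>"
let page: HTML = "<h1>\(title)</h1>"
print(page.raw) // <h1>&lt;script&gt;boom();&lt;/script&gt;</h1>
```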

Ideally, a solution for multi-line literals (for strings and interpolated strings) would be found,
too.

I wish the manifesto would address these topics as well :-)

This is totally something we want to fix, but as part of a wholesale
reform of the ExpressibleByXXX protocols. It's outside the scope of the
manifesto.

--
-Dave

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution

I'm excited to see this taking shape. Thanks for all the hard work putting
this together!

A few random thoughts I had while reading it:

* You talk about an integer `codeUnitOffset` property for indexes. Since
the current String implementation can switch between backing storage of
ASCII or UTF-16 depending on the content of the string and how it's
obtained, presumably this means that integer is not necessarily the same as
the offset into the buffer, correct? (In other words, for a UTF-16-stored
string, you would have to multiply it by 2.)
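
Concretely, the same position maps to different integer offsets depending on the view's encoding, which is what would make a single codeUnitOffset ambiguous (spelled with current String APIs; names per recent Swift):

```swift
let s = "a👍b"                       // 👍 = U+1F44D
let bIndex = s.firstIndex(of: "b")!  // the same index value works in every view

// 👍 is four UTF-8 bytes but two UTF-16 code units, so the
// code-unit offset of "b" depends on which encoding you count in:
let utf8Offset  = s.utf8.distance(from: s.utf8.startIndex, to: bIndex)   // 5
let utf16Offset = s.utf16.distance(from: s.utf16.startIndex, to: bIndex) // 3
print(utf8Offset, utf16Offset)
```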

* You discuss the possibility of exposing some String methods, like
`uppercase()`, on Character. Since Swift abstracts away the encoding, it
seems like Characters are essentially Strings that are enforced at runtime
(and sometimes at compile time, in the case of initialization from
literals) to contain exactly 1 grapheme cluster. Given that, I think it
would be worthwhile for Character to support *any* method on String that
would be sensical to operate on a single character—case transformations
(though perhaps not titlecase?), accessing its UTF-8 or UTF-16 views, and
so forth. I would ask whether it makes sense to have a shared protocol
between Character and String that defines those methods, but I'll defer on
that because it feels like it would be a "bag of methods" rather than
semantically meaningful.
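
For example, the only spelling available today is the round trip through String, and note that the result of a case transformation need not be a single Character, which any shared protocol would have to accommodate:

```swift
// Case-converting a Character today means constructing a String first:
let ch: Character = "ß"
let upper = String(ch).uppercased()  // the heap-allocating round trip
print(upper)         // SS
// The result isn't a single Character:
print(upper.count)   // 2
```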

On that same point, if I have a lightweight (<= 63 bit) Character, many of
those operations can only currently be performed by constructing a String
from it, which incurs a time and heap allocation penalty. (And indeed,
there are TODOs in the code base to avoid doing such things internally, in
the case of Character comparisons.) Which leads me to my next thought,
since I've been doing a lot with Swift String performance lately...

* Currently, Character and String have divergent internal implementations.
A Character can be "small" (<= 63 bits in UTF-8 packed into an integer) or
"large" (> 63 bits with a heap-allocated buffer). Strings are just backed
by a heap-allocated buffer. In this write-up, you say "Many strings are
short enough to store in 64 bits"—not just characters. If that's the case,
can those optimizations be lowered into _StringCore (or its new-world
counterpart), which would allow both Characters *and* small Strings to reap
the benefits of the more efficient implementation? This would let
Characters get implementations of common methods like `uppercase()` for
free, and there would be a zero-cost conversion from Characters to Strings.
The only real difference between the types would be the APIs they vend, the
semantic concept that they represent to users, and validation.

* The talk about implicit conversions between Substring and String bums me
out, even though I see the importance of it in this context and know that
it outweighs the alternatives. Given that the Swift team seems to prefer
explicit to implicit conversions in general, I would hope that if they feel
it's important enough to make a special case for the standard library, it
could be a language feature that you'd consider making available to anyone.
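
For reference, this is the explicit-conversion seam in question, sketched in terms of the proposed Substring type (assuming, per the manifesto, that slicing and splitting yield Substrings that share the parent's storage):

```swift
// Slicing produces a Substring, a view into the original storage:
let line = "name,age,city"
let firstField = line.split(separator: ",").first!
// Handing it to String-taking API requires an explicit copy:
let owned = String(firstField)
print(owned, type(of: firstField))  // name Substring
```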

···

On Fri, Jan 20, 2017 at 7:35 AM Ben Cohen via swift-evolution <swift-evolution@swift.org> wrote:

On Jan 19, 2017, at 10:42 PM, Jose Cheyo Jimenez <cheyo@masters3d.com> wrote:

I just have one concern about the slice of a string being called
Substring. Why not StringSlice? The word substring can mean so many things,
especially in Cocoa.

This idea has a lot of merit, as does the option of not giving them a
top-level name at all e.g. they could be String.Slice or
String.SubSequence. It would underscore that they really aren’t meant to be
used except as the result of a slicing operation or to efficiently pass a
slice. OTOH, Substring is a term of art so can help with clarity.

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution

Perhaps we could separate the proposal from the manifesto, and then
use the manifesto to collect future plans as well in all of the areas
around string processing (which could later be broken off into smaller
proposals), similar to how the Generics manifesto has been working?

Yes, that's the idea. I guess my point is that while we want to flesh
these areas out, and they should go in the manifesto, I think now is the
wrong time to dive into the details. We should leave the necessary
notes and placeholders in the manifesto so that we'll be able to come
back later and fill them in.

···

on Fri Jan 20 2017, Jonathan Hull <swift-evolution@swift.org> wrote:

My main concern with the Swift 4 process is that there are so many
good ideas being thrown about, which are then deferred, but not really
captured in a central/organized place… so we keep having the same
discussion over and over. Ultimately, I am convinced we need a
separate facility/process for brainstorming about future proposals,
but in the mean-time having a few manifestos capturing future ideas
that we then use as guide posts that we are headed towards is a good
step.

Thanks,
Jon


--
-Dave

Question: why do you think integer indices are so desirable?

Integer indexing is simple, but it also encourages anti-patterns (tortured open-coded while loops with unexpected fencepost errors, and conflation of positions and distances into a single type). Our goal should be to make everyday higher-level operations, such as finding and tokenizing, so easy that Swift programmers don't feel they need to resort to such loops as often.

Examples where formIndex is so common yet so cumbersome that it would be worth the effort to create integer-indexed versions of String might indicate important missing features in our collection or string APIs, so do pass them along.

(There are definitely known gaps in them today – slicing needs improving as the manifesto mentions for things like slices from an index to n elements later. Also, we need support for in-place remove(where:) operations. But the more commonly needed cases we know about that aren’t covered, the better)
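
One such contrast, for the record: the open-coded index pattern versus the higher-level query over the same string (spelled against String as a Collection):

```swift
let s = "swift-evolution"

// Open-coded: mutate an opaque index by hand with formIndex(_:offsetBy:).
var i = s.startIndex
s.formIndex(&i, offsetBy: 5)
print(s[i])                      // -

// Higher-level: ask for the position directly.
let dash = s.firstIndex(of: "-")!
print(dash == i)                 // true
```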

···

On Jan 20, 2017, at 2:58 PM, Saagar Jha via swift-evolution <swift-evolution@swift.org> wrote:

Sorry if I wasn’t clear; I’m looking for indexing using Int, instead of using formIndex.

Along the lines of interpolation and formatting, any plans or ideas about localization/translations?

Only vague ideas at this point.

Formatting on interpolation will be nice, but it doesn’t help those of us who have to work with constant strings from a localization file.

Yes, we recognize that's the issue and would want to make sure the design integrated smoothly with the use of localization files.

···

Sent from my iPad

On Jan 20, 2017, at 3:06 AM, Kevin Nattinger <swift@nattinger.net> wrote:

On Jan 20, 2017, at 10:30 AM, Maxim Veksler via swift-evolution <swift-evolution@swift.org> wrote:

Great document! Pleasure to read and see the excellence design powers that go into Swift.

One ask - make string interpolation great again?

Taking from examples supplied at https://github.com/apple/swift/blob/master/docs/StringManifesto.md#string-interpolation

"Column 1: \(n.format(radix:16, width:8)) *** \(message)"

Why not use:

"Column 1: ${n.format(radix:16, width:8)} *** $message"

Which, for my preference, makes the syntax feel more readable and avoids the "double ))" where string-interpolation and function-call terminations meet. And, if that's not enough, it brings the "feel" of the language closer to the scripting interpreters (bash, sh, zsh, and co.) where this syntax is common, and it has been adopted as part of the ES6 interpolation syntax [1].

[1] Template literals (Template strings): https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Template_literals

On Fri, Jan 20, 2017 at 9:19 AM Rien via swift-evolution <swift-evolution@swift.org> wrote:
Wow, I fully support the intention (becoming better than Perl) but I cannot comment on the contents without studying it for a couple of days…

Regards,
Rien

Site: http://balancingrock.nl
Blog: http://swiftrien.blogspot.com
Github: https://github.com/Swiftrien
Project: http://swiftfire.nl

> On 20 Jan 2017, at 03:56, Ben Cohen via swift-evolution <swift-evolution@swift.org> wrote:
>
> Hi all,
>
> Below is our take on a design manifesto for Strings in Swift 4 and beyond.
>
> Probably best read in rendered markdown on GitHub:
> https://github.com/apple/swift/blob/master/docs/StringManifesto.md
>
> We’re eager to hear everyone’s thoughts.
>
> Regards,
> Ben and Dave
>
>
> # String Processing For Swift 4
>
> * Authors: [Dave Abrahams](https://github.com/dabrahams), [Ben Cohen](https://github.com/airspeedswift)
>
> The goal of re-evaluating Strings for Swift 4 has been fairly ill-defined thus
> far, with just this short blurb in the
> [list of goals](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160725/025676.html):
>
>> **String re-evaluation**: String is one of the most important fundamental
>> types in the language. The standard library leads have numerous ideas of how
>> to improve the programming model for it, without jeopardizing the goals of
>> providing a unicode-correct-by-default model. Our goal is to be better at
>> string processing than Perl!
>
> For Swift 4 and beyond we want to improve three dimensions of text processing:
>
> 1. Ergonomics
> 2. Correctness
> 3. Performance
>
> This document is meant to both provide a sense of the long-term vision
> (including undecided issues and possible approaches), and to define the scope of
> work that could be done in the Swift 4 timeframe.
>
> ## General Principles
>
> ### Ergonomics
>
> It's worth noting that ergonomics and correctness are mutually-reinforcing. An
> API that is easy to use—but incorrectly—cannot be considered an ergonomic
> success. Conversely, an API that's simply hard to use is also hard to use
> correctly. Achieving optimal performance without compromising ergonomics or
> correctness is a greater challenge.
>
> Consistency with the Swift language and idioms is also important for
> ergonomics. There are several places both in the standard library and in the
> foundation additions to `String` where patterns and practices found elsewhere
> could be applied to improve usability and familiarity.
>
> ### API Surface Area
>
> Primary data types such as `String` should have APIs that are easily understood
> given a signature and a one-line summary. Today, `String` fails that test. As
> you can see, the Standard Library and Foundation both contribute significantly to
> its overall complexity.
>
> **Method Arity** | **Standard Library** | **Foundation**
> ---|:---:|:---:
> 0: `ƒ()` | 5 | 7
> 1: `ƒ(:)` | 19 | 48
> 2: `ƒ(::)` | 13 | 19
> 3: `ƒ(:::)` | 5 | 11
> 4: `ƒ(::::)` | 1 | 7
> 5: `ƒ(:::::)` | - | 2
> 6: `ƒ(::::::)` | - | 1
>
> **API Kind** | **Standard Library** | **Foundation**
> ---|:---:|:---:
> `init` | 41 | 18
> `func` | 42 | 55
> `subscript` | 9 | 0
> `var` | 26 | 14
>
> **Total: 205 APIs**
>
> By contrast, `Int` has 80 APIs, none with more than two parameters.[0] String processing is complex enough; users shouldn't have
> to wade through API sprawl just to get started.
>
> Many of the choices detailed below contribute to solving this problem,
> including:
>
> * Restoring `Collection` conformance and dropping the `.characters` view.
> * Providing a more general, composable slicing syntax.
> * Altering `Comparable` so that parameterized
> (e.g. case-insensitive) comparison fits smoothly into the basic syntax.
> * Clearly separating language-dependent operations on text produced
> by and for humans from language-independent
> operations on text produced by and for machine processing.
> * Relocating APIs that fall outside the domain of basic string processing and
> discouraging the proliferation of ad-hoc extensions.
>
>
> ### Batteries Included
>
> While `String` is available to all programs out-of-the-box, crucial APIs for
> basic string processing tasks are still inaccessible until `Foundation` is
> imported. While it makes sense that `Foundation` is needed for domain-specific
> jobs such as
> [linguistic tagging](https://developer.apple.com/reference/foundation/nslinguistictagger),
> one should not need to import anything to, for example, do case-insensitive
> comparison.
>
> ### Unicode Compliance and Platform Support
>
> The Unicode standard provides a crucial objective reference point for what
> constitutes correct behavior in an extremely complex domain, so
> Unicode-correctness is, and will remain, a fundamental design principle behind
> Swift's `String`. That said, the Unicode standard is an evolving document, so
> this objective reference-point is not fixed.[1] While
> many of the most important operations—e.g. string hashing, equality, and
> non-localized comparison—will be stable, the semantics
> of others, such as grapheme breaking and localized comparison and case
> conversion, are expected to change as platforms are updated, so programs should
> be written so their correctness does not depend on precise stability of these
> semantics across OS versions or platforms. Although it may be possible to
> imagine static and/or dynamic analysis tools that will help users find such
> errors, the only sure way to deal with this fact of life is to educate users.
>
> ## Design Points
>
> ### Internationalization
>
> There is strong evidence that developers cannot determine how to use
> internationalization APIs correctly. Although documentation could and should be
> improved, the sheer size, complexity, and diversity of these APIs is a major
> contributor to the problem, causing novices to tune out, and more experienced
> programmers to make avoidable mistakes.
>
> The first step in improving this situation is to regularize all localized
> operations as invocations of normal string operations with extra
> parameters. Among other things, this means:
>
> 1. Doing away with `localizedXXX` methods
> 2. Providing a terse way to name the current locale as a parameter
> 3. Automatically adjusting defaults for options such
> as case sensitivity based on whether the operation is localized.
> 4. Removing correctness traps like `localizedCaseInsensitiveCompare` (see
> guidance in the
> [Internationalization and Localization Guide](https://developer.apple.com/library/content/documentation/MacOSX/Conceptual/BPInternational/InternationalizingYourCode/InternationalizingYourCode.html)).
>
> Along with appropriate documentation updates, these changes will make localized
> operations more teachable, comprehensible, and approachable, thereby lowering a
> barrier that currently leads some developers to ignore localization issues
> altogether.
>
> #### The Default Behavior of `String`
>
> Although this isn't well-known, the most accessible form of many operations on
> Swift `String` (and `NSString`) are really only appropriate for text that is
> intended to be processed for, and consumed by, machines. The semantics of the
> operations with the simplest spellings are always non-localized and
> language-agnostic.
>
> Two major factors play into this design choice:
>
> 1. Machine processing of text is important, so we should have first-class,
> accessible functions appropriate to that use case.
>
> 2. The most general localized operations require a locale parameter not required
> by their un-localized counterparts. This naturally skews complexity towards
> localized operations.
>
> Reaffirming that `String`'s simplest APIs have
> language-independent/machine-processed semantics has the benefit of clarifying
> the proper default behavior of operations such as comparison, and allows us to
> make [significant optimizations](#collation-semantics) that were previously
> thought to conflict with Unicode.
>
> #### Future Directions
>
> One of the most common internationalization errors is the unintentional
> presentation to users of text that has not been localized, but regularizing APIs
> and improving documentation can go only so far in preventing this error.
> Combined with the fact that `String` operations are non-localized by default,
> the environment for processing human-readable text may still be somewhat
> error-prone in Swift 4.
>
> For an audience of mostly non-experts, it is especially important that naïve
> code is very likely to be correct if it compiles, and that more sophisticated
> issues can be revealed progressively. For this reason, we intend to
> specifically and separately target localization and internationalization
> problems in the Swift 5 timeframe.
>
> ### Operations With Options
>
> There are three categories of common string operation that commonly need to be
> tuned in various dimensions:
>
> **Operation**|**Applicable Options**
> ---|---
> sort ordering | locale, case/diacritic/width-insensitivity
> case conversion | locale
> pattern matching | locale, case/diacritic/width-insensitivity
>
> The defaults for case-, diacritic-, and width-insensitivity are different for
> localized operations than for non-localized operations, so for example a
> localized sort should be case-insensitive by default, and a non-localized sort
> should be case-sensitive by default. We propose a standard “language” of
> defaulted parameters to be used for these purposes, with usage roughly like this:
>
> ```swift
> x.compared(to: y, case: .sensitive, in: swissGerman)
>
> x.lowercased(in: .currentLocale)
>
> x.allMatches(
>   somePattern, case: .insensitive, diacritic: .insensitive)
> ```
>
> This usage might be supported by code like this:
>
> ```swift
> enum StringSensitivity {
>   case sensitive
>   case insensitive
> }
>
> extension Locale {
>   static var currentLocale: Locale { ... }
> }
>
> extension Unicode {
>   // An example of the option language in declaration context,
>   // with nil defaults indicating unspecified, so defaults can be
>   // driven by the presence/absence of a specific Locale
>   func frobnicated(
>     case caseSensitivity: StringSensitivity? = nil,
>     diacritic diacriticSensitivity: StringSensitivity? = nil,
>     width widthSensitivity: StringSensitivity? = nil,
>     in locale: Locale? = nil
>   ) -> Self { ... }
> }
> ```
>
> ### Comparing and Hashing Strings
>
> #### Collation Semantics
>
> What Unicode says about collation—which is used in `<`, `==`, and hashing— turns
> out to be quite interesting, once you pick it apart. The full Unicode Collation
> Algorithm (UCA) works like this:
>
> 1. Fully normalize both strings
> 2. Convert each string to a sequence of numeric triples to form a collation key
> 3. “Flatten” the key by concatenating the sequence of first elements to the
> sequence of second elements to the sequence of third elements
> 4. Lexicographically compare the flattened keys
>
> While step 1 can usually
> be [done quickly](https://unicode.org/reports/tr15/) and
> incrementally, step 2 uses a collation table that maps matching *sequences* of
> unicode scalars in the normalized string to *sequences* of triples, which get
> accumulated into a collation key. Predictably, this is where the real costs
> lie.
>
> *However*, there are some bright spots to this story. First, as it turns out,
> string sorting (localized or not) should be done down to what's called
> the
> [“identical” level](https://www.unicode.org/reports/tr10/),
> which adds a step 3a: append the string's normalized form to the flattened
> collation key. At first blush this just adds work, but consider what it does
> for equality: two strings that normalize the same, naturally, will collate the
> same. But also, *strings that normalize differently will always collate
> differently*. In other words, for equality, it is sufficient to compare the
> strings' normalized forms and see if they are the same. We can therefore
> entirely skip the expensive part of collation for equality comparison.
>
> Next, naturally, anything that applies to equality also applies to hashing: it
> is sufficient to hash the string's normalized form, bypassing collation keys.
> This should provide significant speedups over the current implementation.
> Perhaps more importantly, since comparison down to the “identical” level applies
> even to localized strings, it means that hashing and equality can be implemented
> exactly the same way for localized and non-localized text, and hash tables with
> localized keys will remain valid across current-locale changes.
>
> Finally, once it is agreed that the *default* role for `String` is to handle
> machine-generated and machine-readable text, the default ordering of `String`s
> need no longer use the UCA at all. It is sufficient to order them in any way
> that's consistent with equality, so `String` ordering can simply be a
> lexicographical comparison of normalized forms,[4]
> (which is equivalent to lexicographically comparing the sequences of grapheme
> clusters), again bypassing step 2 and offering another speedup.
>
> This leaves us executing the full UCA *only* for localized sorting, and ICU's
> implementation has apparently been very well optimized.
>
> Following this scheme everywhere would also allow us to make sorting behavior
> consistent across platforms. Currently, we sort `String` according to the UCA,
> except that—*only on Apple platforms*—pairs of ASCII characters are ordered by
> unicode scalar value.
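>
> To illustrate: under this model, two canonically-equivalent strings are equal (and hash the same) without any collation key being built. A sketch of the expected behavior, using the `compared(to:)` method proposed below (hypothetical API, not today's standard library):
>
> ```swift
> let precomposed = "caf\u{E9}"    // "café", with é as a single scalar
> let decomposed  = "cafe\u{301}"  // "café", as e + COMBINING ACUTE ACCENT
>
> // Equality: compare normalized forms only; no collation tables touched.
> precomposed == decomposed               // true
>
> // Default ordering: lexicographic over normalized forms, hence
> // consistent with the equality above.
> precomposed.compared(to: decomposed)    // .same
> ```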
>
> #### Syntax
>
> Because the current `Comparable` protocol expresses all comparisons with binary
> operators, string comparisons—which may require
> additional [options](#operations-with-options)—do not fit smoothly into the
> existing syntax. At the same time, we'd like to solve other problems with
> comparison, as outlined
> in
> [this proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e)
> (implemented by changes at the head
> of
> [this branch](https://github.com/CodaFi/swift/commits/space-the-final-frontier)).
> We should adopt a modification of that proposal that uses a method rather than
> an operator `<=>`:
>
> ```swift
> enum SortOrder { case before, same, after }
>
> protocol Comparable : Equatable {
>   func compared(to: Self) -> SortOrder
>   ...
> }
> ```
>
> This change will give us a syntactic platform on which to implement methods with
> additional, defaulted arguments, thereby unifying and regularizing comparison
> across the library.
>
> ```swift
> extension String {
>   func compared(to: Self) -> SortOrder
> }
> ```
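>
> Usage might then read like this (a sketch against the proposed API; `compared(to:case:in:)` and `SortOrder` do not exist today):
>
> ```swift
> // Hypothetical: both the method and the SortOrder cases come from
> // this proposal rather than the current standard library.
> switch x.compared(to: y, case: .insensitive, in: swissGerman) {
> case .before: print("x sorts first")
> case .same:   print("x and y are equal under these options")
> case .after:  print("y sorts first")
> }
> ```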
>
> **Note:** `SortOrder` should bridge to `NSComparisonResult`. It's also possible
> that the standard library simply adopts Foundation's `ComparisonResult` as is,
> but we believe the community should at least consider alternate naming before
> that happens. There will be an opportunity to discuss the choices in detail
> when the modified
> [Comparison Proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e) comes
> up for review.
>
> ### `String` should be a `Collection` of `Character`s Again
>
> In Swift 2.0, `String`'s `Collection` conformance was dropped, because we
> convinced ourselves that its semantics differed from those of `Collection` too
> significantly.
>
> It was always well understood that if strings were treated as sequences of
> `UnicodeScalar`s, algorithms such as `lexicographicalCompare`, `elementsEqual`,
> and `reversed` would produce nonsense results. Thus, in Swift 1.0, `String` was
> a collection of `Character` (extended grapheme clusters). During 2.0
> development, though, we realized that correct string concatenation could
> occasionally merge distinct grapheme clusters at the start and end of combined
> strings.
>
> This quirk aside, every aspect of strings-as-collections-of-graphemes appears to
> comport perfectly with Unicode. We think the concatenation problem is tolerable,
> because the cases where it occurs all represent partially-formed constructs. The
> largest class—isolated combining characters such as ◌́ (U+0301 COMBINING ACUTE
> ACCENT)—are explicitly called out in the Unicode standard as
> “[degenerate](https://unicode.org/reports/tr29/)” or
> “[defective](http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf)”. The other
> cases—such as a string ending in a zero-width joiner or half of a regional
> indicator—appear to be equally transient and unlikely outside of a text editor.
>
> Admitting these cases encourages exploration of grapheme composition and is
> consistent with what appears to be an overall Unicode philosophy that “no
> special provisions are made to get marginally better behavior for… cases that
> never occur in practice.”[2] Furthermore, it seems
> unlikely to disturb the semantics of any plausible algorithms. We can handle
> these cases by documenting them, explicitly stating that the elements of a
> `String` are an emergent property based on Unicode rules.
>
> The benefits of restoring `Collection` conformance are substantial:
>
> * Collection-like operations encourage experimentation with strings to
> investigate and understand their behavior. This is useful for teaching new
> programmers, but also good for experienced programmers who want to
> understand more about strings/unicode.
>
> * Extended grapheme clusters form a natural element boundary for Unicode
> strings. For example, searching and matching operations will always produce
> results that line up on grapheme cluster boundaries.
>
> * Character-by-character processing is a legitimate thing to do in many real
> use-cases, including parsing, pattern matching, and language-specific
> transformations such as transliteration.
>
> * `Collection` conformance makes a wide variety of powerful operations
> available that are appropriate to `String`'s default role as the vehicle for
> machine processed text.
>
> * The methods `String` would inherit from `Collection`, where they parallel
> higher-level string algorithms, have the right semantics. For example,
> grapheme-wise `lexicographicalCompare`, `elementsEqual`, and application of
> `flatMap` with case-conversion, produce the same results one would expect
> from whole-string ordering comparison, equality comparison, and
> case-conversion, respectively. `reverse` operates correctly on graphemes,
> keeping diacritics moored to their base characters and leaving emoji intact.
> Other methods such as `indexOf` and `contains` make obvious sense. A few
> `Collection` methods, like `min` and `max`, may not be particularly useful
> on `String`, but we don't consider that to be a problem worth solving, in
> the same way that we wouldn't try to suppress `min` and `max` on a
> `Set([UInt8])` that was used to store IP addresses.
>
> * Many of the higher-level operations that we want to provide for `String`s,
> such as parsing and pattern matching, should apply to any `Collection`, and
> many of the benefits we want for `Collections`, such
> as unified slicing, should accrue
> equally to `String`. Making `String` part of the same protocol hierarchy
> allows us to write these operations once and not worry about keeping the
> benefits in sync.
>
> * Slicing strings into substrings is a crucial part of the vocabulary of
> string processing, and all other sliceable things are `Collection`s.
> Because of its collection-like behavior, users naturally think of `String`
> in collection terms, but run into frustrating limitations where it fails to
> conform and are left to wonder where all the differences lie. Many simply
> “correct” this limitation by declaring a trivial conformance:
>
> ```swift
> extension String : BidirectionalCollection {}
> ```
>
> Even if we removed indexing-by-element from `String`, users could still do
> this:
>
> ```swift
> extension String : BidirectionalCollection {
>   subscript(i: Index) -> Character { return characters[i] }
> }
> ```
>
> It would be much better to legitimize the conformance to `Collection` and
> simply document the oddity of any concatenation corner-cases, than to deny
> users the benefits on the grounds that a few cases are confusing.
>
> Note that the fact that `String` is a collection of graphemes does *not* mean
> that string operations will necessarily have to do grapheme boundary
> recognition. See the Unicode protocol section for details.
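> 
> To make the grapheme-wise semantics described above concrete, here is a small
> sketch written against the Swift 3 `characters` view; under this proposal the
> same operations would be available on `String` directly:
> 
> ```swift
> let decomposed = "noe\u{0308}l"   // "noël" with a combining diaeresis
> let precomposed = "no\u{EB}l"     // "noël" with precomposed U+00EB
> 
> // Equality compares grapheme clusters canonically:
> decomposed == precomposed         // true
> 
> // Reversal keeps the diacritic moored to its base character:
> String(decomposed.characters.reversed())  // "lëon", not a stray accent
> ```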
>
> ### `Character` and `CharacterSet`
>
> `Character`, which represents a
> Unicode
> [extended grapheme cluster](http://unicode.org/reports/tr29/),
> is a bit of a black box, requiring conversion to `String` in order to
> do any introspection, including interoperation with ASCII. To fix this, we should:
>
> - Add a `unicodeScalars` view much like `String`'s, so that the sub-structure
> of grapheme clusters is discoverable.
> - Add a failable `init` from sequences of scalars (returning nil for sequences
> that contain 0 or 2+ graphemes).
> - (Lower priority) expose some operations, such as `func uppercase() ->
> String`, `var isASCII: Bool`, and, to the extent they can be sensibly
> generalized, queries of unicode properties that should also be exposed on
> `UnicodeScalar` such as `isAlphabetic` and `isGraphemeBase`.
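> 
> A hypothetical sketch of how these additions might be used (none of these
> `Character` APIs exist yet; the names simply follow the list above):
> 
> ```swift
> let flag: Character = "🇨🇦"
> 
> // Discover the sub-structure of the grapheme cluster:
> for scalar in flag.unicodeScalars {
>   print(String(scalar.value, radix: 16))  // 1f1e8, 1f1e6
> }
> 
> // Failable init: nil unless the scalars form exactly one grapheme.
> let e = Character([UnicodeScalar(0x65)!, UnicodeScalar(0x301)!])   // "é"
> let none = Character([UnicodeScalar(0x61)!, UnicodeScalar(0x62)!]) // nil: "ab"
> ```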
>
> Despite its name, `CharacterSet` currently operates on the Swift `UnicodeScalar`
> type. This means it is usable on `String`, but only by going through the unicode
> scalar view. To deal with this clash in the short term, `CharacterSet` should be
> renamed to `UnicodeScalarSet`. In the longer term, it may be appropriate to
> introduce a `CharacterSet` that provides similar functionality for extended
> grapheme clusters.[5]
>
> ### Unification of Slicing Operations
>
> Creating substrings is a basic part of String processing, but the slicing
> operations that we have in Swift are inconsistent in both their spelling and
> their naming:
>
> * Slices with two explicit endpoints are done with subscript, and support
> in-place mutation:
>
> ```swift
> s[i..<j].mutate()
> ```
>
> * Slicing from an index to the end, or from the start to an index, is done
> with a method and does not support in-place mutation:
>
> ```swift
> s.prefix(upTo: i).readOnly()
> ```
>
> Prefix and suffix operations should be migrated to be subscripting operations
> with one-sided ranges i.e. `s.prefix(upTo: i)` should become `s[..<i]`, as
> in
> [this proposal](https://github.com/apple/swift-evolution/blob/9cf2685293108ea3efcbebb7ee6a8618b83d4a90/proposals/0132-sequence-end-ops.md).
> With generic subscripting in the language, that will allow us to collapse a wide
> variety of methods and subscript overloads into a single implementation, and
> give users an easy-to-use and composable way to describe subranges.
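> 
> For illustration, the migration would look like this (the one-sided forms
> below are proposed spellings, not yet valid Swift):
> 
> ```swift
> s.prefix(upTo: i)    // today's method
> s[..<i]              // proposed subscript with a one-sided range
> 
> s.suffix(from: i)    // today's method
> s[i...]              // proposed
> 
> s[i..<j]             // unchanged: two explicit endpoints
> ```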
>
> Further extending this EDSL to integrate use-cases like `s.prefix(maxLength: 5)`
> is an ongoing research project that can be considered part of the potential
> long-term vision of text (and collection) processing.
>
> ### Substrings
>
> When implementing substring slicing, languages are faced with three options:
>
> 1. Make the substrings the same type as string, and share storage.
> 2. Make the substrings the same type as string, and copy storage when making the substring.
> 3. Make substrings a different type, with a storage copy on conversion to string.
>
> We think number 3 is the best choice. A walk-through of the tradeoffs follows.
>
> #### Same type, shared storage
>
> In Swift 3.0, slicing a `String` produces a new `String` that is a view into a
> subrange of the original `String`'s storage. This is why `String` is 3 words in
> size (the start, length and buffer owner), unlike the similar `Array` type
> which is only one.
>
> This is a simple model with big efficiency gains when chopping up strings into
> multiple smaller strings. But it does mean that a stored substring keeps the
> entire original string buffer alive even after it would normally have been
> released.
>
> This arrangement has proven to be problematic in other programming languages,
> because applications sometimes extract small strings from large ones and keep
> those small strings long-term. That is considered a memory leak and was enough
> of a problem in Java that they changed from substrings sharing storage to
> making a copy in 1.7.
>
> #### Same type, copied storage
>
> Copying of substrings is also the choice made in C#, and in the default
> `NSString` implementation. This approach avoids the memory leak issue, but has
> obvious performance overhead in performing the copies.
>
> This in turn encourages trafficking in string/range pairs instead of in
> substrings, for performance reasons, leading to API challenges. For example:
>
> ```swift
> foo.compare(bar, range: start..<end)
> ```
>
> Here, it is not clear whether `range` applies to `foo` or `bar`. This
> relationship is better expressed in Swift as a slicing operation:
>
> ```swift
> foo[start..<end].compare(bar)
> ```
>
> Not only does this clarify to which string the range applies, it also brings
> this sub-range capability to any API that operates on `String` "for free". So
> these other combinations also work equally well:
>
> ```swift
> // apply range on argument rather than target
> foo.compare(bar[start..<end])
> // apply range on both
> foo[start..<end].compare(bar[start1..<end1])
> // compare two strings ignoring first character
> foo.dropFirst().compare(bar.dropFirst())
> ```
>
> In all three cases, an explicit range argument need not appear on the `compare`
> method itself. The implementation of `compare` does not need to know anything
> about ranges. Methods need only take range arguments when that was an
> integral part of their purpose (for example, setting the start and end of a
> user's current selection in a text box).
>
> #### Different type, shared storage
>
> The desire to share underlying storage while preventing accidental memory leaks
> occurs with slices of `Array`. For this reason we have an `ArraySlice` type.
> The inconvenience of a separate type is mitigated by most operations used on
> `Array` from the standard library being generic over `Sequence` or `Collection`.
>
> We should apply the same approach for `String` by introducing a distinct
> `SubSequence` type, `Substring`. Similar advice given for `ArraySlice` would apply to `Substring`:
>
>> Important: Long-term storage of `Substring` instances is discouraged. A
>> substring holds a reference to the entire storage of a larger string, not
>> just to the portion it presents, even after the original string's lifetime
>> ends. Long-term storage of a `Substring` may therefore prolong the lifetime
>> of large strings that are no longer otherwise accessible, which can appear
>> to be memory leakage.
>
> When assigning a `Substring` to a longer-lived variable (usually a stored
> property) explicitly of type `String`, a type conversion will be performed, and
> at this point the substring buffer is copied and the original string's storage
> can be released.
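> 
> A sketch of where that copy happens under the proposed model (`Substring` and
> the implicit conversion do not exist in Swift 3; `readLargeFile` and `Token`
> are purely illustrative):
> 
> ```swift
> let big = readLargeFile()            // a long String
> let piece = big.dropFirst(4)         // Substring: shares big's storage
> 
> struct Token {
>   var text: String                   // long-lived storage uses String
> }
> let t = Token(text: piece)           // Substring-to-String conversion copies
>                                      // just the slice, so big's buffer can
>                                      // be released
> ```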
>
> A `String` that was not its own `Substring` could be one word—a single tagged
> pointer—without requiring additional allocations. `Substring`s would be a view
> onto a `String`, so are 3 words: a pointer to the owner, a pointer to the
> start, and a length. The small string optimization for `Substring` would take
> advantage of
> the larger size, probably with a less compressed encoding for speed.
>
> The downside of having two types is the inconvenience of sometimes having a
> `Substring` when you need a `String`, and vice-versa. It is likely this would
> be a significantly bigger problem than with `Array` and `ArraySlice`, as
> slicing of `String` is such a common operation. It is especially relevant to
> existing code that assumes `String` is the currency type. To ease the pain of
> type mismatches, `Substring` should be a subtype of `String` in the same way
> that `Int` is a subtype of `Optional<Int>`. This would give users an implicit
> conversion from `Substring` to `String`, as well as the usual implicit
> conversions such as `[Substring]` to `[String]` that other subtype
> relationships receive.
>
> In most cases, type inference combined with the subtype relationship should
> make the type difference a non-issue and users will not care which type they
> are using. For flexibility and optimizability, most operations from the
> standard library will traffic in generic models of
> [`Unicode`](#the--code-unicode--code--protocol).
>
> ##### Guidance for API Designers
>
> In this model, **if a user is unsure about which type to use, `String` is always
> a reasonable default**. A `Substring` passed where `String` is expected will be
> implicitly copied. When compared to the “same type, copied storage” model, we
> have effectively deferred the cost of copying from the point where a substring
> is created until it must be converted to `String` for use with an API.
>
> A user who needs to optimize away copies altogether should use this guideline:
> if for performance reasons you are tempted to add a `Range` argument to your
> method as well as a `String` to avoid unnecessary copies, you should instead
> use `Substring`.
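> 
> Applying that guideline, a sketch (again assuming the proposed `Substring`
> type; `wordCount` is an illustrative name):
> 
> ```swift
> // Instead of threading a String and a Range together...
> func wordCount(in s: String, within range: Range<String.Index>) -> Int
> 
> // ...accept a Substring and let the caller slice:
> func wordCount(in s: Substring) -> Int
> 
> wordCount(in: line[start..<end])  // no copy: the slice shares line's storage
> ```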
>
> ##### The “Empty Subscript”
>
> To make it easy to call such an optimized API when you only have a `String` (or
> to call any API that takes a `Collection`'s `SubSequence` when all you have is
> the `Collection`), we propose the following “empty subscript” operation,
>
> ```swift
> extension Collection {
>   subscript() -> SubSequence {
>     return self[startIndex..<endIndex]
>   }
> }
> ```
>
> which allows the following usage:
>
> ```swift
> funcThatIsJustLooking(at: person.name[]) // pass person.name as Substring
> ```
>
> The `[]` syntax can be offered as a fixit when needed, similar to `&` for an
> `inout` argument. While it doesn't help a user to convert `[String]` to
> `[Substring]`, the need for such conversions is extremely rare, and they can be
> done with a simple `map` (which could also be offered by a fixit):
>
> ```swift
> takesAnArrayOfSubstring(arrayOfString.map { $0 })
> ```
>
> #### Other Options Considered
>
> As we have seen, all three options above have downsides, but it's possible
> these downsides could be eliminated/mitigated by the compiler. We are proposing
> one such mitigation—implicit conversion—as part of the "different type,
> shared storage" option, to help avoid the cognitive load on developers of
> having to deal with a separate `Substring` type.
>
> To avoid the memory leak issues of a "same type, shared storage" substring
> option, we considered whether the compiler could perform an implicit copy of
> the underlying storage when it detects the string is being "stored" for long
> term usage, say when it is assigned to a stored property. The trouble with this
> approach is it is very difficult for the compiler to distinguish between
> long-term storage versus short-term in the case of abstractions that rely on
> stored properties. For example, should the storing of a substring inside an
> `Optional` be considered long-term? Or the storing of multiple substrings
> inside an array? The latter would not work well in the case of a
> `components(separatedBy:)` implementation that intended to return an array of
> substrings. It would also be difficult to distinguish intentional medium-term
> storage of substrings, say by a lexer. There does not appear to be an effective
> consistent rule that could be applied in the general case for detecting when a
> substring is truly being stored long-term.
>
> To avoid the cost of copying substrings under "same type, copied storage", the
> optimizer could be enhanced to reduce the impact of some of those copies.
> For example, this code could be optimized to pull the invariant substring out
> of the loop:
>
> ```swift
> for _ in 0..<lots {
>   someFunc(takingString: bigString[bigRange])
> }
> ```
>
> It's worth noting that a similar optimization is needed to avoid an equivalent
> problem with implicit conversion in the "different type, shared storage" case:
>
> ```swift
> let substring = bigString[bigRange]
> for _ in 0..<lots { someFunc(takingString: substring) }
> ```
>
> However, in the case of "same type, copied storage" there are many use cases
> that cannot be optimized as easily. Consider the following simple definition of
> a recursive `contains` algorithm, which when substring slicing is linear makes
> the overall algorithm quadratic:
>
> ```swift
> extension String {
>   func containsChar(_ x: Character) -> Bool {
>     return !isEmpty && (first == x || dropFirst().containsChar(x))
>   }
> }
> ```
>
> Expecting the optimizer to eliminate this problem is unrealistic; instead, the
> user must remember to rewrite the code to avoid string slicing if they want it
> to be efficient:
>
> ```swift
> extension String {
>   // add optional argument tracking progress through the string
>   func containsCharacter(_ x: Character, atOrAfter idx: Index? = nil) -> Bool {
>     let idx = idx ?? startIndex
>     return idx != endIndex
>       && (self[idx] == x || containsCharacter(x, atOrAfter: index(after: idx)))
>   }
> }
> ```
>
> #### Substrings, Ranges and Objective-C Interop
>
> The pattern of passing a string/range pair is common in several Objective-C
> APIs, and is made especially awkward in Swift by the non-interchangeability of
> `Range<String.Index>` and `NSRange`.
>
> ```swift
> s2.find(s1, sourceRange: NSRange(j..<s2.endIndex, in: s2))
> ```
>
> In general, however, the Swift idiom for operating on a sub-range of a
> `Collection` is to *slice* the collection and operate on that:
>
> ```swift
> s2[j..<s2.endIndex].find(s1)
> ```
>
> Therefore, APIs that operate on an `NSString`/`NSRange` pair should be imported
> without the `NSRange` argument. The Objective-C importer should be changed to
> give these APIs special treatment so that when a `Substring` is passed, instead
> of being converted to a `String`, the full `NSString` and range are passed to
> the Objective-C method, thereby avoiding a copy.
>
> As a result, you would never need to pass an `NSRange` to these APIs, which
> solves the impedance problem by eliminating the argument, resulting in more
> idiomatic Swift code while retaining the performance benefit. To help users
> manually handle any cases that remain, Foundation should be augmented to allow
> the following syntax for converting to and from `NSRange`:
>
> ```swift
> let nsr = NSRange(i..<j, in: s) // An NSRange corresponding to s[i..<j]
> let iToJ = Range(nsr, in: s) // Equivalent to i..<j
> ```
>
> ### The `Unicode` protocol
>
> With `Substring` and `String` being distinct types and sharing almost all
> interface and semantics, and with the highest-performance string processing
> requiring knowledge of encoding and layout that the currency types can't
> provide, it becomes important to capture the common “string API” in a protocol.
> Since Unicode conformance is a key feature of string processing in Swift, we
> call that protocol `Unicode`:
>
> **Note:** The following assumes several features that are planned but not yet implemented in
> Swift, and should be considered a sketch rather than a final design.
>
> ```swift
> protocol Unicode
>   : Comparable, BidirectionalCollection where Element == Character {
>
>   associatedtype Encoding : UnicodeEncoding
>   var encoding: Encoding { get }
>
>   associatedtype CodeUnits
>     : RandomAccessCollection where Element == Encoding.CodeUnit
>   var codeUnits: CodeUnits { get }
>
>   associatedtype UnicodeScalars
>     : BidirectionalCollection where Element == UnicodeScalar
>   var unicodeScalars: UnicodeScalars { get }
>
>   associatedtype ExtendedASCII
>     : BidirectionalCollection where Element == UInt32
>   var extendedASCII: ExtendedASCII { get }
> }
>
> extension Unicode {
>   // ... define high-level non-mutating string operations, e.g. search ...
>
>   func compared<Other: Unicode>(
>     to rhs: Other,
>     case caseSensitivity: StringSensitivity? = nil,
>     diacritic diacriticSensitivity: StringSensitivity? = nil,
>     width widthSensitivity: StringSensitivity? = nil,
>     in locale: Locale? = nil
>   ) -> SortOrder { ... }
> }
>
> extension Unicode : RangeReplaceableCollection
>   where CodeUnits : RangeReplaceableCollection {
>   // Satisfy protocol requirement
>   mutating func replaceSubrange<C : Collection>(_: Range<Index>, with: C)
>     where C.Element == Element
>
>   // ... define high-level mutating string operations, e.g. replace ...
> }
>
> ```
>
> The goal is that `Unicode` exposes the underlying encoding and code units in
> such a way that for types with a known representation (e.g. a high-performance
> `UTF8String`) that information can be known at compile-time and can be used to
> generate a single path, while still allowing types like `String` that admit
> multiple representations to use runtime queries and branches to fast path
> specializations.
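> 
> As a sketch of the intent, a generic algorithm written against `Unicode` might
> branch on a property like the hypothetical `isKnownASCII` below; for a concrete
> type such as `UTF8String` the branch would be resolved at compile time, while
> `String` would test it at runtime:
> 
> ```swift
> func countLineBreaks<S: Unicode>(in s: S) -> Int {
>   if s.isKnownASCII {  // assumed query; not part of the sketch above
>     // Fast path: scan fixed-width code units directly.
>     return s.codeUnits.filter { $0 == 0x0A }.count
>   }
>   // General path: decode unicode scalars.
>   return s.unicodeScalars.filter { $0 == "\n" }.count
> }
> ```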
>
> **Note:** `Unicode` would make a fantastic namespace for much of
> what's in this proposal if we could get the ability to nest types and
> protocols in protocols.
>
>
> ### Scanning, Matching, and Tokenization
>
> #### Low-Level Textual Analysis
>
> We should provide convenient APIs for processing strings by character. For example,
> it should be easy to cleanly express, “if this string starts with `"f"`, process
> the rest of the string as follows…” Swift is well-suited to expressing this
> common pattern beautifully, but we need to add the APIs. Here are two examples
> of the sort of code that might be possible given such APIs:
>
> ```swift
> if let firstLetter = input.droppingPrefix(alphabeticCharacter) {
>   somethingWith(input) // process the rest of input
> }
>
> if let (number, restOfInput) = input.parsingPrefix(Int.self) {
>   ...
> }
> ```
>
> The specific spelling and functionality of APIs like this are TBD. The larger
> point is to make sure matching-and-consuming jobs are well-supported.
>
> #### Unified Pattern Matcher Protocol
>
> Many of the current methods that do matching are overloaded to do the same
> logical operations in different ways, with the following axes:
>
> - Logical Operation: `find`, `split`, `replace`, match at start
> - Kind of pattern: `CharacterSet`, `String`, a regex, a closure
> - Options, e.g. case/diacritic sensitivity, locale. Sometimes a part of
> the method name, and sometimes an argument
> - Whole string or subrange.
>
> We should represent these aspects as orthogonal, composable components,
> abstracting pattern matchers into a protocol like
> [this one](https://github.com/apple/swift/blob/master/test/Prototypes/PatternMatching.swift#L33),
> that can allow us to define logical operations once, without introducing
> overloads, massively reducing the API surface area.
>
> For example, using the strawman prefix `%` syntax to turn string literals into
> patterns, the following pairs would all invoke the same generic methods:
>
> ```swift
> if let found = s.firstMatch(%"searchString") { ... }
> if let found = s.firstMatch(someRegex) { ... }
>
> for m in s.allMatches((%"searchString"), case: .insensitive) { ... }
> for m in s.allMatches(someRegex) { ... }
>
> let items = s.split(separatedBy: ", ")
> let tokens = s.split(separatedBy: CharacterSet.whitespace)
> ```
>
> Note that, because Swift requires the indices of a slice to match the indices of
> the range from which it was sliced, operations like `firstMatch` can return a
> `Substring?` in lieu of a `Range<String.Index>?`: the indices of the match in
> the string being searched, if needed, can easily be recovered as the
> `startIndex` and `endIndex` of the `Substring`.
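> 
> For example, recovering and reusing the matched range (using the strawman
> API names from above):
> 
> ```swift
> if let found = s.firstMatch(%"cat") {
>   // found is a Substring; its indices are valid indices into s.
>   let range = found.startIndex..<found.endIndex
>   s.replaceSubrange(range, with: "dog")
> }
> ```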
>
> Note also that matching operations are useful for collections in general, and
> would fall out of this proposal:
>
> ```
> // replace subsequences of contiguous NaNs with zero
> forces.replace(oneOrMore([Float.nan]), [0.0])
> ```
>
> #### Regular Expressions
>
> Addressing regular expressions is out of scope for this proposal.
> That said, it is important to note that the pattern matching protocol mentioned
> above provides a suitable foundation for regular expressions, and types such as
> `NSRegularExpression` can easily be retrofitted to conform to it. In the
> future, support for regular expression literals in the compiler could allow for
> compile-time syntax checking and optimization.
>
> ### String Indices
>
> `String` currently has four views—`characters`, `unicodeScalars`, `utf8`, and
> `utf16`—each with its own opaque index type. The APIs used to translate indices
> between views add needless complexity, and the opacity of indices makes them
> difficult to serialize.
>
> The index translation problem has two aspects:
>
> 1. `String` views cannot consume one another's indices without a cumbersome
> conversion step. An index into a `String`'s `characters` must be translated
> before it can be used as a position in its `unicodeScalars`. Although these
> translations are rarely needed, they add conceptual and API complexity.
> 2. Many APIs in the core libraries and other frameworks still expose `String`
> positions as `Int`s and regions as `NSRange`s, which can only reference a
> `utf16` view and interoperate poorly with `String` itself.
>
> #### Index Interchange Among Views
>
> String's need for flexible backing storage and reasonably-efficient indexing
> (i.e. without dynamically allocating and reference-counting the indices
> themselves) means indices need an efficient underlying storage type. Although
> we do not wish to expose `String`'s indices *as* integers, `Int` offsets into
> underlying code unit storage make a good underlying storage type, provided
> `String`'s underlying storage supports random-access. We think random-access
> *code-unit storage* is a reasonable requirement to impose on all `String`
> instances.
>
> Making these `Int` code unit offsets conveniently accessible and constructible
> solves the serialization problem:
>
> ```swift
> clipboard.write(s.endIndex.codeUnitOffset)
> let offset = clipboard.read(Int.self)
> let i = String.Index(codeUnitOffset: offset)
> ```
>
> Index interchange between `String` and its `unicodeScalars`, `codeUnits`,
> and [`extendedASCII`](#parsing-ascii-structure) views can be made entirely
> seamless by having them share an index type (semantics of indexing a `String`
> between grapheme cluster boundaries are TBD—it can either trap or be forgiving).
> Having a common index allows easy traversal into the interior of graphemes,
> something that is often needed, without making it likely that someone will do it
> by accident.
>
> - `String.index(after:)` should advance to the next grapheme, even when the
> index points partway through a grapheme.
>
> - `String.index(before:)` should move to the start of the grapheme before
> the current position.
>
> Seamless index interchange between `String` and its UTF-8 or UTF-16 views is not
> crucial, as the specifics of encoding should not be a concern for most use
> cases, and would impose needless costs on the indices of other views. That
> said, we can make translation much more straightforward by exposing simple
> bidirectional converting `init`s on both index types:
>
> ```swift
> let u8Position = String.UTF8.Index(someStringIndex)
> let originalPosition = String.Index(u8Position)
> ```
>
> #### Index Interchange with Cocoa
>
> We intend to address `NSRange`s that denote substrings in Cocoa APIs as
> described [later in this document](#substrings--ranges-and-objective-c-interop).
> That leaves the interchange of bare indices with Cocoa APIs trafficking in
> `Int`. Hopefully such APIs will be rare, but when needed, the following
> extension, which would be useful for all `Collections`, can help:
>
> ```swift
> extension Collection {
>   func index(offset: IndexDistance) -> Index {
>     return index(startIndex, offsetBy: offset)
>   }
>   func offset(of i: Index) -> IndexDistance {
>     return distance(from: startIndex, to: i)
>   }
> }
> ```
>
> Then integers can easily be translated into offsets into a `String`'s `utf16`
> view for consumption by Cocoa:
>
> ```swift
> let cocoaIndex = s.utf16.offset(of: String.UTF16Index(i))
> let swiftIndex = s.utf16.index(offset: cocoaIndex)
> ```
>
> ### Formatting
>
> A full treatment of formatting is out of scope of this proposal, but
> we believe it's crucial for completing the text processing picture. This
> section details some of the existing issues and thinking that may guide future
> development.
>
> #### Printf-Style Formatting
>
> `String.format` is designed on the `printf` model: it takes a format string with
> textual placeholders for substitution, and an arbitrary list of other arguments.
> The syntax and meaning of these placeholders has a long history in
> C, but for anyone who doesn't use them regularly they are cryptic and complex,
> as the `printf(3)` man page attests.
>
> Aside from complexity, this style of API has two major problems: First, the
> spelling of these placeholders must match up to the types of the arguments, in
> the right order, or the behavior is undefined. Some limited support for
> compile-time checking of this correspondence could be implemented, but only for
> the cases where the format string is a literal. Second, there's no reasonable
> way to extend the formatting vocabulary to cover the needs of new types: you are
> stuck with what's in the box.
>
> #### Foundation Formatters
>
> The formatters supplied by Foundation are highly capable and versatile, offering
> both formatting and parsing services. When used for formatting, though, the
> design pattern demands more from users than it should:
>
> * Matching the type of data being formatted to a formatter type
> * Creating an instance of that type
> * Setting stateful options (`currency`, `dateStyle`) on the type. Note: the
> need for this step prevents the instance from being used and discarded in
> the same expression where it is created.
> * Overall, the introduction of needless verbosity into source code
>
> These may seem like small issues, but the experience of Apple localization
> experts is that the total drag of these factors on programmers is such that they
> tend to reach for `String.format` instead.
>
> #### String Interpolation
>
> Swift string interpolation provides a user-friendly alternative to printf's
> domain-specific language (just write ordinary Swift code!) and its type safety
> problems (put the data right where it belongs!), but the following issues prevent
> it from being useful for localized formatting (among other jobs):
>
> * [SR-2303](https://bugs.swift.org/browse/SR-2303) We are unable to restrict
> types used in string interpolation.
> * [SR-1260](https://bugs.swift.org/browse/SR-1260) String interpolation can't
> distinguish (fragments of) the base string from the string substitutions.
>
> In the long run, we should improve Swift string interpolation to the point where
> it can participate in most any formatting job. Mostly this centers around
> fixing the interpolation protocols per the previous item, and supporting
> localization.
>
> To be able to use formatting effectively inside interpolations, it needs to be
> both lightweight (because it all happens in-situ) and discoverable. One
> approach would be to standardize on `format` methods, e.g.:
>
> ```swift
> "Column 1: \(n.format(radix:16, width:8)) *** \(message)"
>
> "Something with leading zeroes: \(x.format(fill: zero, width:8))"
> ```
>
> ### C String Interop
>
> Our support for interoperation with nul-terminated C strings is scattered and
> incoherent, with six ways to transform a C string into a `String` and four ways
> to do the inverse. These APIs should be replaced with the following:
>
> ```swift
> extension String {
>   /// Constructs a `String` having the same contents as `nulTerminatedUTF8`.
>   ///
>   /// - Parameter nulTerminatedUTF8: a sequence of contiguous UTF-8 encoded
>   ///   bytes ending just before the first zero byte (NUL character).
>   init(cString nulTerminatedUTF8: UnsafePointer<CChar>)
>
>   /// Constructs a `String` having the same contents as
>   /// `nulTerminatedCodeUnits`.
>   ///
>   /// - Parameter nulTerminatedCodeUnits: a sequence of contiguous code units
>   ///   in the given `encoding`, ending just before the first zero code unit.
>   /// - Parameter encoding: describes the encoding in which the code units
>   ///   should be interpreted.
>   init<Encoding: UnicodeEncoding>(
>     cString nulTerminatedCodeUnits: UnsafePointer<Encoding.CodeUnit>,
>     encoding: Encoding)
>
>   /// Invokes the given closure on the contents of the string, represented as
>   /// a pointer to a null-terminated sequence of UTF-8 code units.
>   func withCString<Result>(
>     _ body: (UnsafePointer<CChar>) throws -> Result) rethrows -> Result
> }
> ```
>
> In both of the construction APIs, any invalid encoding sequence detected will
> have its longest valid prefix replaced by U+FFFD, the Unicode replacement
> character, per the Unicode specification. This covers the common case. The
> replacement is done *physically* in the underlying storage and the validity of
> the result is recorded in the `String`'s `encoding` such that future accesses
> need not be slowed down by possible error repair.
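> 
> The first initializer (and its repair behavior) can be previewed with the
> UTF-8 `init(cString:)` that already exists in Swift 3:
> 
> ```swift
> // 0xFF can never appear in well-formed UTF-8.
> let bytes: [CChar] = [0x68, 0x69, -1 /* 0xFF */, 0]
> let s = bytes.withUnsafeBufferPointer { String(cString: $0.baseAddress!) }
> // s is "hi\u{FFFD}": the invalid byte was replaced, not dropped.
> ```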
>
> Construction that is aborted when encoding errors are detected can be
> accomplished using APIs on the `encoding`. String types that retain their
> physical encoding even in the presence of errors and are repaired on-the-fly can
> be built as different instances of the `Unicode` protocol.
>
> ### Unicode 9 Conformance
>
> Unicode 9 (and MacOS 10.11) brought us support for family emoji, which changes
> the process of properly identifying `Character` boundaries. We need to update
> `String` to account for this change.
>
> ### High-Performance String Processing
>
> Many strings are short enough to store in 64 bits, many can be stored using only
> 8 bits per unicode scalar, others are best encoded in UTF-16, and some come to
> us already in some other encoding, such as UTF-8, that would be costly to
> translate. Supporting these formats while maintaining usability for
> general-purpose APIs demands that a single `String` type can be backed by many
> different representations.
>
> That said, the highest performance code always requires static knowledge of the
> data structures on which it operates, and for this code, dynamic selection of
> representation comes at too high a cost. Heavy-duty text processing demands a
> way to opt out of dynamism and directly use known encodings. Having this
> ability can also make it easy to cleanly specialize code that handles dynamic
> cases for maximal efficiency on the most common representations.
>
> To address this need, we can build models of the `Unicode` protocol that encode
> representation information into the type, such as `NFCNormalizedUTF16String`.
>
> ### Parsing ASCII Structure
>
> Although many machine-readable formats support the inclusion of arbitrary
> Unicode text, it is also common that their fundamental structure lies entirely
> within the ASCII subset (JSON, YAML, many XML formats). These formats are often
> processed most efficiently by recognizing ASCII structural elements as ASCII,
> and capturing the arbitrary sections between them in more-general strings. The
> current String API offers no way to efficiently recognize ASCII and skip past
> everything else without the overhead of full decoding into unicode scalars.
>
> For these purposes, strings should supply an `extendedASCII` view that is a
> collection of `UInt32`, where values less than `0x80` represent the
> corresponding ASCII character, and other values represent data that is specific
> to the underlying encoding of the string.
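>
> The idea can be sketched with a small function. This is a hypothetical illustration: the `extendedASCII` view does not exist yet, so the sketch takes the `UInt32` values directly; values below `0x80` are treated as the corresponding ASCII characters and everything else is skipped as opaque, encoding-specific data.
>
> ```swift
> // Hypothetical sketch: recognizing JSON-like ASCII structure in an
> // extendedASCII-style view ([UInt32], where values < 0x80 are ASCII).
> func structuralElements(in extendedASCII: [UInt32]) -> [Character] {
>     let structural: Set<UInt32> = [0x7B, 0x7D, 0x5B, 0x5D, 0x3A, 0x2C] // { } [ ] : ,
>     var result: [Character] = []
>     for unit in extendedASCII where unit < 0x80 {
>         if structural.contains(unit) {
>             result.append(Character(UnicodeScalar(unit)!))
>         }
>     }
>     return result
> }
> ```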
>
> ## Language Support
>
> This proposal depends on two new features in the Swift language:
>
> 1. **Generic subscripts**, to
> enable unified slicing syntax.
>
> 2. **A subtype relationship** between
> `Substring` and `String`, enabling framework APIs to traffic solely in
> `String` while still making it possible to avoid copies by handling
> `Substring`s where necessary.
>
> Additionally, **the ability to nest types and protocols inside
> protocols** could significantly shrink the footprint of this proposal
> on the top-level Swift namespace.
>
>
> ## Open Questions
>
> ### Must `String` be limited to storing UTF-16 subset encodings?
>
> - The ability to handle `UTF-8`-encoded strings (models of `Unicode`) is not in
> question here; this is about what encodings must be storable, without
> transcoding, in the common currency type called “`String`”.
> - ASCII, Latin-1, UCS-2, and UTF-16 are UTF-16 subsets. UTF-8 is not.
> - If we have a way to get at a `String`'s code units, we need a concrete type in
> which to express them in the API of `String`, which is a concrete type.
> - If String needs to be able to represent UTF-32, presumably the code units need
> to be `UInt32`.
> - Not supporting UTF-32-encoded text seems like one reasonable design choice.
> - Maybe we can allow UTF-8 storage in `String` and expose its code units as
> `UInt16`, just as we would for Latin-1.
> - Supporting only UTF-16-subset encodings would imply that `String` indices can
> be serialized without recording the `String`'s underlying encoding.
>
> ### Do we need a type-erasable base protocol for UnicodeEncoding?
>
> UnicodeEncoding has an associated type, but it may be important to be able to
> traffic in completely dynamic encoding values, e.g. for “tell me the most
> efficient encoding for this string.”
>
> ### Should there be a string “facade?”
>
> One possible design alternative makes `Unicode` a vehicle for expressing
> the storage and encoding of code units, but does not attempt to give it an API
> appropriate for `String`. Instead, string APIs would be provided by a generic
> wrapper around an instance of `Unicode`:
>
> ```swift
> struct StringFacade<U: Unicode> : BidirectionalCollection {
>
> // ...APIs for high-level string processing here...
>
> var unicode: U // access to lower-level unicode details
> }
>
> typealias String = StringFacade<StringStorage>
> typealias Substring = StringFacade<StringStorage.SubSequence>
> ```
>
> This design would allow us to de-emphasize lower-level `String` APIs such as
> access to the specific encoding, by putting them behind a `.unicode` property.
> A similar effect in a facade-less design would require a new top-level
> `StringProtocol` playing the role of the facade with an `associatedtype
> Storage : Unicode`.
>
> An interesting variation on this design is possible if defaulted generic
> parameters are introduced to the language:
>
> ```swift
> struct String<U: Unicode = StringStorage>
> : BidirectionalCollection {
>
> // ...APIs for high-level string processing here...
>
> var unicode: U // access to lower-level unicode details
> }
>
> typealias Substring = String<StringStorage.SubSequence>
> ```
>
> One advantage of such a design is that naïve users will always extend “the right
> type” (`String`) without thinking, and the new APIs will show up on `Substring`,
> `MyUTF8String`, etc. That said, it also has downsides that should not be
> overlooked, not least of which is the confusability of the meaning of the word
> “string.” Is it referring to the generic or the concrete type?
>
> ### `TextOutputStream` and `TextOutputStreamable`
>
> `TextOutputStreamable` is intended to provide a vehicle for
> efficiently transporting formatted representations to an output stream
> without forcing the allocation of storage. Its use of `String`, a
> type with multiple representations, at the lowest-level unit of
> communication, conflicts with this goal. It might be sufficient to
> change `TextOutputStream` and `TextOutputStreamable` to traffic in an
> associated type conforming to `Unicode`, but that is not yet clear.
> This area will require some design work.
>
> ### `description` and `debugDescription`
>
> * Should these be creating localized or non-localized representations?
> * Is returning a `String` efficient enough?
> * Is `debugDescription` pulling the weight of the API surface area it adds?
>
> ### `StaticString`
>
> `StaticString` was added as a byproduct of standard library development and kept
> around because it seemed useful, but it was never truly *designed* for client
> programmers. We need to decide what happens with it. Presumably *something*
> should fill its role, and that should conform to `Unicode`.
>
> ## Footnotes
>
> <b id="f0">0</b> The integers rewrite currently underway is expected to
> substantially reduce the scope of `Int`'s API by using more
> generics. [:leftwards_arrow_with_hook:](#a0)
>
> <b id="f1">1</b> In practice, these semantics will usually be tied to the
> version of the installed [ICU](http://icu-project.org) library, which
> programmatically encodes the most complex rules of the Unicode Standard and its
> de-facto extension, CLDR.[:leftwards_arrow_with_hook:](#a1)
>
> <b id="f2">2</b>
> See
> [UAX #29: Unicode Text Segmentation](http://unicode.org/reports/tr29/). Note
> that inserting Unicode scalar values to prevent merging of grapheme clusters would
> also constitute a kind of misbehavior (one of the clusters at the boundary would
> not be found in the result), so would be relatively costly to implement, with
> little benefit. [:leftwards_arrow_with_hook:](#a2)
>
> <b id="f4">4</b> The use of non-UCA-compliant ordering is fully sanctioned by
> the Unicode standard for this purpose. In fact there's
> a [whole chapter](http://www.unicode.org/versions/Unicode9.0.0/ch05.pdf)
> dedicated to it. In particular, §5.17 says:
>
>> When comparing text that is visible to end users, a correct linguistic sort
>> should be used, as described in _Section 5.16, Sorting and
>> Searching_. However, in many circumstances the only requirement is for a
>> fast, well-defined ordering. In such cases, a binary ordering can be used.
>
> [:leftwards_arrow_with_hook:](#a4)
>
>
> <b id="f5">5</b> The queries supported by `NSCharacterSet` map directly onto
> properties in a table that's indexed by unicode scalar value. This table is
> part of the Unicode standard. Some of these queries (e.g., “is this an
> uppercase character?”) may have fairly obvious generalizations to grapheme
> clusters, but exactly how to do it is a research topic and *ideally* we'd either
> establish the existing practice that the Unicode committee would standardize, or
> the Unicode committee would do the research and we'd implement their
> result.[:leftwards_arrow_with_hook:](#a5)
>
> _______________________________________________
> swift-evolution mailing list
> swift-evolution@swift.org
> https://lists.swift.org/mailman/listinfo/swift-evolution

I agree that it is important to make String[Slice] and Array[Slice] consistent. If there is an implicit conversion for one, it makes sense for there to be an implicit conversion for the other.

That said, an implicit conversion here is something that we need to consider very carefully. Adding them would definitely increase programmer convenience in some cases, but it comes with two potentially serious costs:

1) The conversion from a slice to a container is a copying and O(n) memory allocating operation. Swift tends to prefer keeping these sorts of operations explicit, in order to make it easier to reason about performance of code. For example, if you are forced to write:

   let x = … something that returns a slice.
   foo(String(x))
   foo(String(x))

then you’re likely to notice the fact that you’re doing two expensive operations, which are redundant. If the conversion is implicit, you’d never notice. Also, the best solution may not be to create a single local temporary, it might actually be to change “foo” to take a slice.
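
One way to realize the "change foo to take a slice" alternative is sketched below; using Swift 4's `StringProtocol` as the generic constraint is an assumption about the final API shape, not a settled design.

```swift
// A sketch of the alternative: let `foo` accept any string-like value,
// so slices pass through with no O(n) copying conversion.
func foo<S: StringProtocol>(_ s: S) -> Int {
    return s.count   // works uniformly on String and Substring
}

let whole = "one,two,three"
let slice = whole.split(separator: ",")[1]   // a Substring; no allocation
let viaSlice = foo(slice)           // no copy
let viaCopy  = foo(String(slice))   // explicit, visible O(n) copy
```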

2) Implicit conversions like this are known to slow down the type checker, sometimes substantially. I know that there are improvements planned, but this is exactly the sort of thing that increases the search space the constraint solver needs to evaluate, and it is already exponential. This sort of issue is the root cause of the embarrassing “expression too complex” errors.

-Chris

···

On Jan 20, 2017, at 2:23 PM, Jonathan Hull via swift-evolution <swift-evolution@swift.org> wrote:

Still digesting, but I definitely support the goal of string processing even better than Perl. Some random thoughts:

• I also like the suggestion of implicit conversion from substring slices to strings based on a subtype relationship, since I keep running into that issue when trying to use array slices.

Interesting. Could you offer some examples?

Nothing catastrophic. Mainly just having to wrap all of my slices in Array() to actually use them, which obfuscates the purpose of my code. It also took me an embarrassingly long time to figure out that was what I had to do to make it work. For the longest time, I couldn’t understand why anyone would use slices because I couldn’t actually use them with any API… and then someone mentioned wrapping it in Array() here on Evolution and I finally got it.
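
The friction described reduces to a minimal example like this (illustrative names only):

```swift
// ArraySlice doesn't satisfy an API declared in terms of Array,
// so every call site needs an explicit Array(...) wrapper.
func sum(_ values: [Int]) -> Int {
    return values.reduce(0, +)
}

let numbers = [10, 20, 30, 40]
let middle = numbers[1...2]        // ArraySlice<Int>
// sum(middle)                     // error: ArraySlice<Int> is not [Int]
let total = sum(Array(middle))     // works, but obscures intent
```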

Jordan points out that the generalized slicing syntax stomps on '...x'
and 'x...', which would be somewhat obvious candidates for variadic
splatting if that ever becomes a thing. Now, variadics are a much more
esoteric feature and slicing is much more important to day-to-day
programming, so this isn't the end of the world IMO, but it is
something we'd be giving up.

Good point, Jordan.

In my experiments with introducing one-sided operators in Swift 3, I was not able to find a case where you actually wanted to write `c[i...]`. Everything I tried needed to use `c[i..<]` instead. My conclusion was that there was no possible use for postfix `...`; after all, `c[i...]` means `c[i...c.endIndex]`, which means `c[i..<c.index(after: c.endIndex)]`, which violates a precondition on `index(after:)`.

Right, the only sensible semantics for a one sided range with an open end point is that it goes to the end of the collection. I see a few different potential colors to paint this bikeshed with, all of which would have the semantics “c[i..<c.endIndex]”:

1) Provide "c[i...]":
2) Provide "c[i..<]":
3) Provide both "c[i..<]” and "c[i…]":

Since all of these operations would have the same behavior, it comes down to subjective questions:

a) Do we want redundancy? IMO, no, which is why #3 is not very desirable.
b) Which is easier to explain to people? As you say, "i..< is shorthand for i..<endIndex" is nice and simple, which leans towards #2.
c) Which is subjectively nicer looking? IMO, #1 is much nicer typographically. The ..< formulation looks like symbol soup, particularly because most folks would not put a space before ].

There is no obvious winner, but to me, I tend to prefer #1. What do other folks think?
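
For concreteness, every spelling under discussion denotes the same operation, which can be written out with existing Swift API:

```swift
// All candidate spellings reduce to: slice from i up to endIndex.
let c = Array("abcdef")
let i = 2
let tail = c[i..<c.endIndex]   // what "c[i...]" / "c[i..<]" would mean
// tail is ["c", "d", "e", "f"]
```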

If that's the case, you can reserve postfix `...` for future variadics features, while using prefix `...` for these one-sided ranges.

I’m personally not very worried about this, the feature doesn’t exist yet and there are lots of ways to spell it. This is something that could and probably should deserve a more explicit/heavy syntax for clarity.

-Chris

···

On Jan 20, 2017, at 9:39 PM, Brent Royal-Gordon via swift-evolution <swift-evolution@swift.org> wrote:

On Jan 20, 2017, at 2:45 PM, Dave Abrahams via swift-evolution <swift-evolution@swift.org> wrote:
on Fri Jan 20 2017, Joe Groff <swift-evolution@swift.org> wrote:

doesn't necessarily mean that ignoring that case is the right thing to do. In fact, it means that Unicode won't do anything to protect programs against these, and if Swift doesn't, chances are that no one will. Isolated combining characters break a number of expectations that developers could reasonably have:

(a + b).count == a.count + b.count
(a + b).startsWith(a)
(a + b).endsWith(b)
(a + b).find(a) // or .find(b)

Of course, this can be documented, but people want easy, and documentation is hard.
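
In Swift 4 terms (where `String` is again a `Collection` of `Character`), the broken expectations above are easy to reproduce with a combining accent:

```swift
// Appending a string that begins with a combining character
// merges grapheme clusters across the boundary.
let a = "e"
let b = "\u{0301}"             // COMBINING ACUTE ACCENT
let combined = a + b           // "é": one grapheme cluster
// a.count + b.count == 2, but combined.count == 1,
// and combined no longer has a as a prefix.
```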

Yes. Unfortunately they also want the ability to append a string consisting of a combining character to another string and have it append. And they don't want to be prevented from forming valid-but-defective Unicode strings.

[…]

Can you suggest an alternative that doesn't violate the Unicode standard and supports the expected use-cases, somehow?

I'm not sure I understand. Did we go from "this is a [degenerate/defective](https://github.com/apple/swift/blob/master/docs/StringManifesto.md#string-should-be-a-collection-of-characters-again) case that we shouldn't bother with" to "this is a supported use case that needs to work as-is"? I've never seen anyone start a string with a combining character on purpose, though I'm familiar with just one natural language that needs combining characters. I can imagine that it could be a convenient feature in other natural languages.

For example, keyboard on iOS. To throw in a real-world example, on the Czech keyboard, you have two keys for diacritics - ´ for an acute and ˇ for a caron. The way it works on macOS is that you type the diacritics first and then the letter. On iOS, however, it's the other way around - you type the letter and then the diacritics modifier is applied. So in Swift words, I imagine it may work like this after the key is pressed:

textContainer.text += "\u{030c}"

Which then contains the correct string.

···

On Jan 23, 2017, at 6:54 AM, Félix Cloutier via swift-evolution <swift-evolution@swift.org> wrote:

However, if Swift Strings are now designed for machine processing and less for human language convenience, for me, it's easy enough to justify a safe default in the context of machine processing: `a+b` will not combine the end of `a` with the start of `b`. You could do this by inserting a ◌ that `b` could combine with if necessary. That solution would make half of the cases that I've mentioned work as expected and make the operation always safe, as far as I can tell.

In that world, it would be a good idea to have a `&+` fallback or something like that that will let characters combine. I would think that this is a much less common use case than serializing strings, though.

My second concern is with how easy it is to convert an Int to a String index. I've been vocal about this before: I'm concerned that Swift developers will equate Ints with random-access String iterators, which they are emphatically not. String.Index(100) is proposed as a constant-time operation

No, that has not been proposed. It would be

String.Index(codeUnitOffset: 100)

It's hard to strike a balance between keeping programmers from making mistakes and making the important use-cases easy. Do you have any suggestions for improving on what we've proposed?

That's still one extension away from String.Index(100), and one function away from an even more convenient form. I don't have a great solution, but I don't have a great understanding of the problem that this is solving either. I'm leaving it here because, AFAIK, Swift 3 imposes constraints that are hard to ignore and mostly beneficial to people outside of the English bubble, and it seems that the proposed index regresses on this.

I'm perfectly happy with interchanging indices between the different views of a String. It's getting the offset in or out of the index that I think lets people make incorrect assumptions about strings.

For the record, I'm not a great fan of the extendedASCII view either. I think that the problem that extendedASCII wants to solve is also solved by better pattern-matching, and the proposal lays a foundation for it. Mixing pretend-ASCII and Unicode is what gets you in the kind of trouble that I described in my first message.

Félix

On Jan 19, 2017, at 6:56 PM, Ben Cohen via swift-evolution <swift-evolution@swift.org <mailto:swift-evolution@swift.org>> wrote:

Hi all,

Below is our take on a design manifesto for Strings in Swift 4 and beyond.

Probably best read in rendered markdown on GitHub:
https://github.com/apple/swift/blob/master/docs/StringManifesto.md

We’re eager to hear everyone’s thoughts.

Regards,
Ben and Dave

# String Processing For Swift 4

* Authors: [Dave Abrahams](https://github.com/dabrahams), [Ben Cohen](https://github.com/airspeedswift)

The goal of re-evaluating Strings for Swift 4 has been fairly ill-defined thus
far, with just this short blurb in the
[list of goals](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160725/025676.html):

**String re-evaluation**: String is one of the most important fundamental
types in the language. The standard library leads have numerous ideas of how
to improve the programming model for it, without jeopardizing the goals of
providing a unicode-correct-by-default model. Our goal is to be better at
string processing than Perl!

For Swift 4 and beyond we want to improve three dimensions of text processing:

1. Ergonomics
2. Correctness
3. Performance

This document is meant to both provide a sense of the long-term vision
(including undecided issues and possible approaches), and to define the scope of
work that could be done in the Swift 4 timeframe.

## General Principles

### Ergonomics

It's worth noting that ergonomics and correctness are mutually-reinforcing. An
API that is easy to use—but incorrectly—cannot be considered an ergonomic
success. Conversely, an API that's simply hard to use is also hard to use
correctly. Achieving optimal performance without compromising ergonomics or
correctness is a greater challenge.

Consistency with the Swift language and idioms is also important for
ergonomics. There are several places both in the standard library and in the
foundation additions to `String` where patterns and practices found elsewhere
could be applied to improve usability and familiarity.

### API Surface Area

Primary data types such as `String` should have APIs that are easily understood
given a signature and a one-line summary. Today, `String` fails that test. As
you can see, the Standard Library and Foundation both contribute significantly to
its overall complexity.

**Method Arity** | **Standard Library** | **Foundation**
---|:---:|:---:
0: `ƒ()` | 5 | 7
1: `ƒ(:)` | 19 | 48
2: `ƒ(::)` | 13 | 19
3: `ƒ(:::)` | 5 | 11
4: `ƒ(::::)` | 1 | 7
5: `ƒ(:::::)` | - | 2
6: `ƒ(::::::)` | - | 1

**API Kind** | **Standard Library** | **Foundation**
---|:---:|:---:
`init` | 41 | 18
`func` | 42 | 55
`subscript` | 9 | 0
`var` | 26 | 14

**Total: 205 APIs**

By contrast, `Int` has 80 APIs, none with more than two parameters.[0] String processing is complex enough; users shouldn't have
to press through physical API sprawl just to get started.

Many of the choices detailed below contribute to solving this problem,
including:

* Restoring `Collection` conformance and dropping the `.characters` view.
* Providing a more general, composable slicing syntax.
* Altering `Comparable` so that parameterized
   (e.g. case-insensitive) comparison fits smoothly into the basic syntax.
* Clearly separating language-dependent operations on text produced
   by and for humans from language-independent
   operations on text produced by and for machine processing.
* Relocating APIs that fall outside the domain of basic string processing and
   discouraging the proliferation of ad-hoc extensions.

### Batteries Included

While `String` is available to all programs out-of-the-box, crucial APIs for
basic string processing tasks are still inaccessible until `Foundation` is
imported. While it makes sense that `Foundation` is needed for domain-specific
jobs such as
[linguistic tagging](https://developer.apple.com/reference/foundation/nslinguistictagger),
one should not need to import anything to, for example, do case-insensitive
comparison.

### Unicode Compliance and Platform Support

The Unicode standard provides a crucial objective reference point for what
constitutes correct behavior in an extremely complex domain, so
Unicode-correctness is, and will remain, a fundamental design principle behind
Swift's `String`. That said, the Unicode standard is an evolving document, so
this objective reference-point is not fixed.[1] While
many of the most important operations—e.g. string hashing, equality, and
non-localized comparison—will be stable, the semantics
of others, such as grapheme breaking and localized comparison and case
conversion, are expected to change as platforms are updated, so programs should
be written so their correctness does not depend on precise stability of these
semantics across OS versions or platforms. Although it may be possible to
imagine static and/or dynamic analysis tools that will help users find such
errors, the only sure way to deal with this fact of life is to educate users.

## Design Points

### Internationalization

There is strong evidence that developers cannot determine how to use
internationalization APIs correctly. Although documentation could and should be
improved, the sheer size, complexity, and diversity of these APIs is a major
contributor to the problem, causing novices to tune out, and more experienced
programmers to make avoidable mistakes.

The first step in improving this situation is to regularize all localized
operations as invocations of normal string operations with extra
parameters. Among other things, this means:

1. Doing away with `localizedXXX` methods
2. Providing a terse way to name the current locale as a parameter
3. Automatically adjusting defaults for options such
  as case sensitivity based on whether the operation is localized.
4. Removing correctness traps like `localizedCaseInsensitiveCompare` (see
   guidance in the
   [Internationalization and Localization Guide](https://developer.apple.com/library/content/documentation/MacOSX/Conceptual/BPInternational/InternationalizingYourCode/InternationalizingYourCode.html)).

Along with appropriate documentation updates, these changes will make localized
operations more teachable, comprehensible, and approachable, thereby lowering a
barrier that currently leads some developers to ignore localization issues
altogether.

#### The Default Behavior of `String`

Although this isn't well-known, the most accessible form of many operations on
Swift `String` (and `NSString`) are really only appropriate for text that is
intended to be processed for, and consumed by, machines. The semantics of the
operations with the simplest spellings are always non-localized and
language-agnostic.

Two major factors play into this design choice:

1. Machine processing of text is important, so we should have first-class,
  accessible functions appropriate to that use case.

2. The most general localized operations require a locale parameter not required
  by their un-localized counterparts. This naturally skews complexity towards
  localized operations.

Reaffirming that `String`'s simplest APIs have
language-independent/machine-processed semantics has the benefit of clarifying
the proper default behavior of operations such as comparison, and allows us to
make [significant optimizations](#collation-semantics) that were previously
thought to conflict with Unicode.

#### Future Directions

One of the most common internationalization errors is the unintentional
presentation to users of text that has not been localized, but regularizing APIs
and improving documentation can go only so far in preventing this error.
Combined with the fact that `String` operations are non-localized by default,
the environment for processing human-readable text may still be somewhat
error-prone in Swift 4.

For an audience of mostly non-experts, it is especially important that naïve
code is very likely to be correct if it compiles, and that more sophisticated
issues can be revealed progressively. For this reason, we intend to
specifically and separately target localization and internationalization
problems in the Swift 5 timeframe.

### Operations With Options

There are three categories of string operations that commonly need to be
tuned in various dimensions:

**Operation**|**Applicable Options**
---|---
sort ordering | locale, case/diacritic/width-insensitivity
case conversion | locale
pattern matching | locale, case/diacritic/width-insensitivity

The defaults for case-, diacritic-, and width-insensitivity are different for
localized operations than for non-localized operations, so for example a
localized sort should be case-insensitive by default, and a non-localized sort
should be case-sensitive by default. We propose a standard “language” of
defaulted parameters to be used for these purposes, with usage roughly like this:

 x.compared(to: y, case: .sensitive, in: swissGerman)

 x.lowercased(in: .currentLocale)

 x.allMatches(
   somePattern, case: .insensitive, diacritic: .insensitive)

This usage might be supported by code like this:

enum StringSensitivity {
case sensitive
case insensitive
}

extension Locale {
 static var currentLocale: Locale { ... }
}

extension Unicode {
 // An example of the option language in declaration context,
 // with nil defaults indicating unspecified, so defaults can be
 // driven by the presence/absence of a specific Locale
 func frobnicated(
   case caseSensitivity: StringSensitivity? = nil,
   diacritic diacriticSensitivity: StringSensitivity? = nil,
   width widthSensitivity: StringSensitivity? = nil,
   in locale: Locale? = nil
 ) -> Self { ... }
}

### Comparing and Hashing Strings

#### Collation Semantics

What Unicode says about collation—which is used in `<`, `==`, and hashing— turns
out to be quite interesting, once you pick it apart. The full Unicode Collation
Algorithm (UCA) works like this:

1. Fully normalize both strings
2. Convert each string to a sequence of numeric triples to form a collation key
3. “Flatten” the key by concatenating the sequence of first elements to the
  sequence of second elements to the sequence of third elements
4. Lexicographically compare the flattened keys

While step 1 can usually
be [done quickly](http://unicode.org/reports/tr15/) and
incrementally, step 2 uses a collation table that maps matching *sequences* of
unicode scalars in the normalized string to *sequences* of triples, which get
accumulated into a collation key. Predictably, this is where the real costs
lie.
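
Steps 2–4 can be pictured with toy collation elements. The weights below are made up, and real collation tables map scalar *sequences* rather than single scalars, but the flatten-then-compare shape is the same:

```swift
// Toy model of collation-key flattening (step 3) and comparison (step 4):
// concatenate all primary weights, then secondary, then tertiary, and
// compare the flattened keys lexicographically.
typealias CollationElement = (primary: UInt16, secondary: UInt16, tertiary: UInt16)

func flattenedKey(_ elements: [CollationElement]) -> [UInt16] {
    return elements.map { $0.primary }
         + elements.map { $0.secondary }
         + elements.map { $0.tertiary }
}

func collatesBefore(_ a: [CollationElement], _ b: [CollationElement]) -> Bool {
    return flattenedKey(a).lexicographicallyPrecedes(flattenedKey(b))
}
```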

*However*, there are some bright spots to this story. First, as it turns out,
string sorting (localized or not) should be done down to what's called
the
[“identical” level](http://unicode.org/reports/tr10/),
which adds a step 3a: append the string's normalized form to the flattened
collation key. At first blush this just adds work, but consider what it does
for equality: two strings that normalize the same, naturally, will collate the
same. But also, *strings that normalize differently will always collate
differently*. In other words, for equality, it is sufficient to compare the
strings' normalized forms and see if they are the same. We can therefore
entirely skip the expensive part of collation for equality comparison.
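
This is observable in Swift today: `String` equality already honors canonical equivalence, so comparing normalized forms is sufficient:

```swift
// Canonically equivalent forms compare equal even though their
// underlying scalar sequences differ, so equality never needs
// full collation keys.
let precomposed = "\u{00E9}"    // "é" as a single scalar
let decomposed  = "e\u{0301}"   // "e" + COMBINING ACUTE ACCENT
// precomposed == decomposed, yet their unicodeScalars counts are 1 and 2
```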

Next, naturally, anything that applies to equality also applies to hashing: it
is sufficient to hash the string's normalized form, bypassing collation keys.
This should provide significant speedups over the current implementation.
Perhaps more importantly, since comparison down to the “identical” level applies
even to localized strings, it means that hashing and equality can be implemented
exactly the same way for localized and non-localized text, and hash tables with
localized keys will remain valid across current-locale changes.

Finally, once it is agreed that the *default* role for `String` is to handle
machine-generated and machine-readable text, the default ordering of `String`s
need no longer use the UCA at all. It is sufficient to order them in any way
that's consistent with equality, so `String` ordering can simply be a
lexicographical comparison of normalized forms,[4]
(which is equivalent to lexicographically comparing the sequences of grapheme
clusters), again bypassing step 2 and offering another speedup.

This leaves us executing the full UCA *only* for localized sorting, and ICU's
implementation has apparently been very well optimized.

Following this scheme everywhere would also allow us to make sorting behavior
consistent across platforms. Currently, we sort `String` according to the UCA,
except that—*only on Apple platforms*—pairs of ASCII characters are ordered by
unicode scalar value.

#### Syntax

Because the current `Comparable` protocol expresses all comparisons with binary
operators, string comparisons—which may require
additional [options](#operations-with-options)—do not fit smoothly into the
existing syntax. At the same time, we'd like to solve other problems with
comparison, as outlined
in
[this proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e)
(implemented by changes at the head
of
[this branch](https://github.com/CodaFi/swift/commits/space-the-final-frontier)).
We should adopt a modification of that proposal that uses a method rather than
an operator `<=>`:

enum SortOrder { case before, same, after }

protocol Comparable : Equatable {
func compared(to: Self) -> SortOrder
...
}

This change will give us a syntactic platform on which to implement methods with
additional, defaulted arguments, thereby unifying and regularizing comparison
across the library.

extension String {
func compared(to: Self) -> SortOrder

}
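
A sketch of how today's operators could layer over the single method. The protocol is renamed `Ordered` here purely to avoid clashing with the real `Comparable`, and the default implementations are illustrative assumptions, not the proposal's final text:

```swift
enum SortOrder { case before, same, after }

// Stand-in for the proposed Comparable shape.
protocol Ordered: Equatable {
    func compared(to other: Self) -> SortOrder
}

// Existing operator syntax derived from the one requirement.
extension Ordered {
    static func < (lhs: Self, rhs: Self) -> Bool {
        return lhs.compared(to: rhs) == .before
    }
    static func == (lhs: Self, rhs: Self) -> Bool {
        return lhs.compared(to: rhs) == .same
    }
}

// Hypothetical conforming type for illustration.
struct Version: Ordered {
    var major: Int
    func compared(to other: Version) -> SortOrder {
        if major < other.major { return .before }
        if major > other.major { return .after }
        return .same
    }
}
```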

**Note:** `SortOrder` should bridge to `NSComparisonResult`. It's also possible
that the standard library simply adopts Foundation's `ComparisonResult` as is,
but we believe the community should at least consider alternate naming before
that happens. There will be an opportunity to discuss the choices in detail
when the modified
[Comparison Proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e) comes
up for review.

### `String` should be a `Collection` of `Character`s Again

In Swift 2.0, `String`'s `Collection` conformance was dropped, because we
convinced ourselves that its semantics differed from those of `Collection` too
significantly.

It was always well understood that if strings were treated as sequences of
`UnicodeScalar`s, algorithms such as `lexicographicalCompare`, `elementsEqual`,
and `reversed` would produce nonsense results. Thus, in Swift 1.0, `String` was
a collection of `Character` (extended grapheme clusters). During 2.0
development, though, we realized that correct string concatenation could
occasionally merge distinct grapheme clusters at the start and end of combined
strings.
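
The merging is easy to exhibit, for example with a regional-indicator pair (behavior as in Swift 4's grapheme segmentation):

```swift
// Concatenation merging distinct grapheme clusters: two halves of a
// regional-indicator pair combine into a single flag character.
let u = "\u{1F1FA}"    // REGIONAL INDICATOR SYMBOL LETTER U
let s = "\u{1F1F8}"    // REGIONAL INDICATOR SYMBOL LETTER S
let flag = u + s       // "🇺🇸"
// u.count == 1 and s.count == 1, but flag.count == 1 as well
```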

This quirk aside, every aspect of strings-as-collections-of-graphemes appears to
comport perfectly with Unicode. We think the concatenation problem is tolerable,
because the cases where it occurs all represent partially-formed constructs. The
largest class—isolated combining characters such as ◌́ (U+0301 COMBINING ACUTE
ACCENT)—are explicitly called out in the Unicode standard as
“[degenerate](http://unicode.org/reports/tr29/)” or
“[defective](http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf)”. The other
cases—such as a string ending in a zero-width joiner or half of a regional
indicator—appear to be equally transient and unlikely outside of a text editor.

Admitting these cases encourages exploration of grapheme composition and is
consistent with what appears to be an overall Unicode philosophy that “no
special provisions are made to get marginally better behavior for… cases that
never occur in practice.”[2] Furthermore, it seems
unlikely to disturb the semantics of any plausible algorithms. We can handle
these cases by documenting them, explicitly stating that the elements of a
`String` are an emergent property based on Unicode rules.

The benefits of restoring `Collection` conformance are substantial:

* Collection-like operations encourage experimentation with strings to
   investigate and understand their behavior. This is useful for teaching new
   programmers, but also good for experienced programmers who want to
   understand more about strings/unicode.

* Extended grapheme clusters form a natural element boundary for Unicode
   strings. For example, searching and matching operations will always produce
   results that line up on grapheme cluster boundaries.

* Character-by-character processing is a legitimate thing to do in many real
   use-cases, including parsing, pattern matching, and language-specific
   transformations such as transliteration.

* `Collection` conformance makes a wide variety of powerful operations
   available that are appropriate to `String`'s default role as the vehicle for
   machine processed text.

   The methods `String` would inherit from `Collection`, where they parallel
   higher-level string algorithms, have the right semantics. For example,
   grapheme-wise `lexicographicalCompare`, `elementsEqual`, and application of
   `flatMap` with case-conversion, produce the same results one would expect
   from whole-string ordering comparison, equality comparison, and
   case-conversion, respectively. `reverse` operates correctly on graphemes,
   keeping diacritics moored to their base characters and leaving emoji intact.
   Other methods such as `indexOf` and `contains` make obvious sense. A few
   `Collection` methods, like `min` and `max`, may not be particularly useful
   on `String`, but we don't consider that to be a problem worth solving, in
   the same way that we wouldn't try to suppress `min` and `max` on a
   `Set([UInt8])` that was used to store IP addresses.

* Many of the higher-level operations that we want to provide for `String`s,
   such as parsing and pattern matching, should apply to any `Collection`, and
   many of the benefits we want for `Collections`, such
   as unified slicing, should accrue
   equally to `String`. Making `String` part of the same protocol hierarchy
   allows us to write these operations once and not worry about keeping the
   benefits in sync.

* Slicing strings into substrings is a crucial part of the vocabulary of
   string processing, and all other sliceable things are `Collection`s.
   Because of its collection-like behavior, users naturally think of `String`
   in collection terms, but run into frustrating limitations where it fails to
   conform and are left to wonder where all the differences lie. Many simply
   “correct” this limitation by declaring a trivial conformance:

      extension String : BidirectionalCollection {}

   Even if we removed indexing-by-element from `String`, users could still do
   this:

     extension String : BidirectionalCollection {
       subscript(i: Index) -> Character { return characters[i] }
     }

   It would be much better to legitimize the conformance to `Collection` and
   simply document the oddity of any concatenation corner-cases, than to deny
   users the benefits on the grounds that a few cases are confusing.
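A few of these grapheme-aware behaviors, sketched with the Swift 4 spellings this proposal anticipates:

```swift
let cafe = "cafe\u{301}"                   // "café" with a combining accent
assert(cafe.count == 4)                    // counts graphemes, not the 5 scalars
assert(String(cafe.reversed()) == "éfac")  // the accent stays moored to its base
assert(cafe.contains("é"))                 // element search is grapheme-aware
</imports>
```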

Note that the fact that `String` is a collection of graphemes does *not* mean
that string operations will necessarily have to do grapheme boundary
recognition. See the Unicode protocol section for details.

### `Character` and `CharacterSet`

`Character`, which represents a
Unicode
[extended grapheme cluster](http://www.unicode.org/reports/tr29/),
is a bit of a black box, requiring conversion to `String` in order to
do any introspection, including interoperation with ASCII. To fix this, we should:

- Add a `unicodeScalars` view much like `String`'s, so that the sub-structure
  of grapheme clusters is discoverable.
- Add a failable `init` from sequences of scalars (returning nil for sequences
  that contain 0 or 2+ graphemes).
- (Lower priority) expose some operations, such as `func uppercase() ->
  String`, `var isASCII: Bool`, and, to the extent they can be sensibly
  generalized, queries of unicode properties that should also be exposed on
  `UnicodeScalar` such as `isAlphabetic` and `isGraphemeBase` .
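A hedged sketch of the proposed introspection; today the sub-structure is reachable only by round-tripping through `String` (a `unicodeScalars` view on `Character` itself later shipped in Swift 4.2):

```swift
let flag: Character = "🇨🇦"
// Proposed: flag.unicodeScalars. Today's workaround goes via String:
let scalars = Array(String(flag).unicodeScalars)
assert(scalars.count == 2)         // two regional indicators form one grapheme
assert(scalars[0] == "\u{1F1E8}")  // REGIONAL INDICATOR SYMBOL LETTER C
```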

Despite its name, `CharacterSet` currently operates on the Swift `UnicodeScalar`
type. This means it is usable on `String`, but only by going through the unicode
scalar view. To deal with this clash in the short term, `CharacterSet` should be
renamed to `UnicodeScalarSet`. In the longer term, it may be appropriate to
introduce a `CharacterSet` that provides similar functionality for extended
grapheme clusters.[5]

### Unification of Slicing Operations

Creating substrings is a basic part of String processing, but the slicing
operations that we have in Swift are inconsistent in both their spelling and
their naming:

* Slices with two explicit endpoints are done with subscript, and support
   in-place mutation:

       s[i..<j].mutate()

* Slicing from an index to the end, or from the start to an index, is done
   with a method and does not support in-place mutation:

       s.prefix(upTo: i).readOnly()

Prefix and suffix operations should be migrated to be subscripting operations
with one-sided ranges i.e. `s.prefix(upTo: i)` should become `s[..<i]`, as
in
[this proposal](https://github.com/apple/swift-evolution/blob/9cf2685293108ea3efcbebb7ee6a8618b83d4a90/proposals/0132-sequence-end-ops.md).
With generic subscripting in the language, that will allow us to collapse a wide
variety of methods and subscript overloads into a single implementation, and
give users an easy-to-use and composable way to describe subranges.
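The one-sided forms (later adopted in Swift 4 as SE-0172) read like this:

```swift
let s = "Hello, Swift"
let i = s.index(s.startIndex, offsetBy: 5)
assert(s[..<i] == s.prefix(upTo: i))   // "Hello"
assert(s[i...] == s.suffix(from: i))   // ", Swift"
assert(s[..<i] == "Hello")
```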

Further extending this EDSL to integrate use-cases like `s.prefix(maxLength: 5)`
is an ongoing research project that can be considered part of the potential
long-term vision of text (and collection) processing.

### Substrings

When implementing substring slicing, languages are faced with three options:

1. Make the substrings the same type as string, and share storage.
2. Make the substrings the same type as string, and copy storage when making the substring.
3. Make substrings a different type, with a storage copy on conversion to string.

We think number 3 is the best choice. A walk-through of the tradeoffs follows.

#### Same type, shared storage

In Swift 3.0, slicing a `String` produces a new `String` that is a view into a
subrange of the original `String`'s storage. This is why `String` is 3 words in
size (the start, length and buffer owner), unlike the similar `Array` type
which is only one.

This is a simple model with big efficiency gains when chopping up strings into
multiple smaller strings. But it does mean that a stored substring keeps the
entire original string buffer alive even after it would normally have been
released.

This arrangement has proven to be problematic in other programming languages,
because applications sometimes extract small strings from large ones and keep
those small strings long-term. That is considered a memory leak and was enough
of a problem in Java that they changed from substrings sharing storage to
making a copy in 1.7.

#### Same type, copied storage

Copying of substrings is also the choice made in C#, and in the default
`NSString` implementation. This approach avoids the memory leak issue, but has
obvious performance overhead in performing the copies.

This in turn encourages trafficking in string/range pairs instead of in
substrings, for performance reasons, leading to API challenges. For example:

```swift
foo.compare(bar, range: start..<end)
```

Here, it is not clear whether `range` applies to `foo` or `bar`. This
relationship is better expressed in Swift as a slicing operation:

```swift
foo[start..<end].compare(bar)
```

Not only does this clarify to which string the range applies, it also brings
this sub-range capability to any API that operates on `String` "for free". So
these other combinations also work equally well:

```swift
// apply range on argument rather than target
foo.compare(bar[start..<end])
// apply range on both
foo[start..<end].compare(bar[start1..<end1])
// compare two strings ignoring first character
foo.dropFirst().compare(bar.dropFirst())
```

In all three cases, an explicit range argument need not appear on the `compare`
method itself. The implementation of `compare` does not need to know anything
about ranges. Methods need only take range arguments when that was an
integral part of their purpose (for example, setting the start and end of a
user's current selection in a text box).

#### Different type, shared storage

The desire to share underlying storage while preventing accidental memory leaks
occurs with slices of `Array`. For this reason we have an `ArraySlice` type.
The inconvenience of a separate type is mitigated by most operations used on
`Array` from the standard library being generic over `Sequence` or `Collection`.

We should apply the same approach for `String` by introducing a distinct
`SubSequence` type, `Substring`. The advice given for `ArraySlice` would apply
equally to `Substring`:

> **Important:** Long-term storage of `Substring` instances is discouraged. A
> substring holds a reference to the entire storage of a larger string, not
> just to the portion it presents, even after the original string's lifetime
> ends. Long-term storage of a `Substring` may therefore prolong the lifetime
> of large strings that are no longer otherwise accessible, which can appear
> to be memory leakage.

When assigning a `Substring` to a longer-lived variable (usually a stored
property) explicitly of type `String`, a type conversion will be performed, and
at this point the substring buffer is copied and the original string's storage
can be released.
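In code, with the conversion spelled as an explicit `String` initializer (the form that eventually shipped):

```swift
let bigString = String(repeating: "x", count: 100_000) + "needle"
let slice = bigString.suffix(6)   // Substring: shares bigString's storage
let stored = String(slice)        // copies just "needle"; the large buffer can
                                  // then be released if otherwise unreferenced
assert(stored == "needle")
```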

A `String` that was not its own `Substring` could be one word—a single tagged
pointer—without requiring additional allocations. `Substring`s would be a view
onto a `String`, so would be three words: a pointer to the owner, a pointer to
the start, and a length. The small string optimization for `Substring` would
take advantage of the larger size, probably with a less compressed encoding for
speed.

The downside of having two types is the inconvenience of sometimes having a
`Substring` when you need a `String`, and vice-versa. It is likely this would
be a significantly bigger problem than with `Array` and `ArraySlice`, as
slicing of `String` is such a common operation. It is especially relevant to
existing code that assumes `String` is the currency type. To ease the pain of
type mismatches, `Substring` should be a subtype of `String` in the same way
that `Int` is a subtype of `Optional<Int>`. This would give users an implicit
conversion from `Substring` to `String`, as well as the usual implicit
conversions such as `[Substring]` to `[String]` that other subtype
relationships receive.

In most cases, type inference combined with the subtype relationship should
make the type difference a non-issue and users will not care which type they
are using. For flexibility and optimizability, most operations from the
standard library will traffic in generic models of
[`Unicode`](#the--code-unicode--code--protocol).

##### Guidance for API Designers

In this model, **if a user is unsure about which type to use, `String` is always
a reasonable default**. A `Substring` passed where `String` is expected will be
implicitly copied. When compared to the “same type, copied storage” model, we
have effectively deferred the cost of copying from the point where a substring
is created until it must be converted to `String` for use with an API.

A user who needs to optimize away copies altogether should use this guideline:
if for performance reasons you are tempted to add a `Range` argument to your
method as well as a `String` to avoid unnecessary copies, you should instead
use `Substring`.
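For example (a hedged sketch; `wordCount` is a hypothetical API following this guideline):

```swift
// Taking Substring lets callers slice at the call site instead of passing a
// (String, Range) pair: no copies, and no ambiguity about which string the
// range applies to.
func wordCount(of text: Substring) -> Int {
    return text.split(separator: " ").count
}

let line = "the quick brown fox"
assert(wordCount(of: line[...]) == 4)        // whole string via a one-sided slice
assert(wordCount(of: line.prefix(9)) == 2)   // "the quick"
```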

##### The “Empty Subscript”

To make it easy to call such an optimized API when you only have a `String` (or
to call any API that takes a `Collection`'s `SubSequence` when all you have is
the `Collection`), we propose the following “empty subscript” operation,

```swift
extension Collection {
  subscript() -> SubSequence {
    return self[startIndex..<endIndex]
  }
}
```

which allows the following usage:

```swift
funcThatIsJustLooking(at: person.name[]) // pass person.name as Substring
```

The `[]` syntax can be offered as a fixit when needed, similar to `&` for an
`inout` argument. While it doesn't help a user convert `[String]` to
`[Substring]`, the need for such conversions is extremely rare, and they can be
done with a simple `map` (which could also be offered by a fixit):

```swift
takesAnArrayOfSubstring(arrayOfString.map { $0[] })
```

#### Other Options Considered

As we have seen, all three options above have downsides, but it's possible
these downsides could be eliminated/mitigated by the compiler. We are proposing
one such mitigation—implicit conversion—as part of the "different type,
shared storage" option, to help avoid the cognitive load on developers of
having to deal with a separate `Substring` type.

To avoid the memory leak issues of a "same type, shared storage" substring
option, we considered whether the compiler could perform an implicit copy of
the underlying storage when it detects the string is being "stored" for long
term usage, say when it is assigned to a stored property. The trouble with this
approach is it is very difficult for the compiler to distinguish between
long-term storage versus short-term in the case of abstractions that rely on
stored properties. For example, should the storing of a substring inside an
`Optional` be considered long-term? Or the storing of multiple substrings
inside an array? The latter would not work well in the case of a
`components(separatedBy:)` implementation that intended to return an array of
substrings. It would also be difficult to distinguish intentional medium-term
storage of substrings, say by a lexer. There does not appear to be an effective
consistent rule that could be applied in the general case for detecting when a
substring is truly being stored long-term.

To avoid the cost of copying substrings under "same type, copied storage", the
optimizer could be enhanced to reduce the impact of some of those copies.
For example, this code could be optimized to pull the invariant substring out
of the loop:

```swift
for _ in 0..<lots {
  someFunc(takingString: bigString[bigRange])
}
```

It's worth noting that a similar optimization is needed to avoid an equivalent
problem with implicit conversion in the "different type, shared storage" case:

```swift
let substring = bigString[bigRange]
for _ in 0..<lots { someFunc(takingString: substring) }
```

However, in the case of "same type, copied storage" there are many use cases
that cannot be optimized as easily. Consider the following simple definition of
a recursive `contains` algorithm, which when substring slicing is linear makes
the overall algorithm quadratic:

```swift
extension String {
    func containsChar(_ x: Character) -> Bool {
        return !isEmpty && (first == x || dropFirst().containsChar(x))
    }
}
```

It is unrealistic to expect the optimizer to eliminate this problem, so the
user must remember to rewrite the code to avoid string slicing if they want it
to be efficient:

```swift
extension String {
    // add optional argument tracking progress through the string
    func containsCharacter(_ x: Character, atOrAfter idx: Index? = nil) -> Bool {
        let idx = idx ?? startIndex
        return idx != endIndex
            && (self[idx] == x || containsCharacter(x, atOrAfter: index(after: idx)))
    }
}
```

#### Substrings, Ranges and Objective-C Interop

The pattern of passing a string/range pair is common in several Objective-C
APIs, and is made especially awkward in Swift by the non-interchangeability of
`Range<String.Index>` and `NSRange`.

```swift
s2.find(s2, sourceRange: NSRange(j..<s2.endIndex, in: s2))
```

In general, however, the Swift idiom for operating on a sub-range of a
`Collection` is to *slice* the collection and operate on that:

```swift
s2.find(s2[j..<s2.endIndex])
```

Therefore, APIs that operate on an `NSString`/`NSRange` pair should be imported
without the `NSRange` argument. The Objective-C importer should be changed to
give these APIs special treatment so that when a `Substring` is passed, instead
of being converted to a `String`, the full `NSString` and range are passed to
the Objective-C method, thereby avoiding a copy.

As a result, you would never need to pass an `NSRange` to these APIs, which
solves the impedance problem by eliminating the argument, resulting in more
idiomatic Swift code while retaining the performance benefit. To help users
manually handle any cases that remain, Foundation should be augmented to allow
the following syntax for converting to and from `NSRange`:

```swift
let nsr = NSRange(i..<j, in: s) // An NSRange corresponding to s[i..<j]
let iToJ = Range(nsr, in: s)    // Equivalent to i..<j
```

### The `Unicode` protocol

With `Substring` and `String` being distinct types and sharing almost all
interface and semantics, and with the highest-performance string processing
requiring knowledge of encoding and layout that the currency types can't
provide, it becomes important to capture the common “string API” in a protocol.
Since Unicode conformance is a key feature of string processing in Swift, we
call that protocol `Unicode`:

**Note:** The following assumes several features that are planned but not yet implemented in
Swift, and should be considered a sketch rather than a final design.

```swift
protocol Unicode
  : Comparable, BidirectionalCollection where Element == Character {

  associatedtype Encoding : UnicodeEncoding
  var encoding: Encoding { get }

  associatedtype CodeUnits
    : RandomAccessCollection where Element == Encoding.CodeUnit
  var codeUnits: CodeUnits { get }

  associatedtype UnicodeScalars
    : BidirectionalCollection where Element == UnicodeScalar
  var unicodeScalars: UnicodeScalars { get }

  associatedtype ExtendedASCII
    : BidirectionalCollection where Element == UInt32
  var extendedASCII: ExtendedASCII { get }
}
```

```swift
extension Unicode {
  // ... define high-level non-mutating string operations, e.g. search ...

  func compared<Other: Unicode>(
    to rhs: Other,
    case caseSensitivity: StringSensitivity? = nil,
    diacritic diacriticSensitivity: StringSensitivity? = nil,
    width widthSensitivity: StringSensitivity? = nil,
    in locale: Locale? = nil
  ) -> SortOrder { ... }
}
```

```swift
extension Unicode : RangeReplaceableCollection where CodeUnits :
  RangeReplaceableCollection {
    // Satisfy protocol requirement
    mutating func replaceSubrange<C : Collection>(_: Range<Index>, with: C)
      where C.Element == Element

  // ... define high-level mutating string operations, e.g. replace ...
}
```

The goal is that `Unicode` exposes the underlying encoding and code units in
such a way that for types with a known representation (e.g. a high-performance
`UTF8String`) that information can be known at compile-time and can be used to
generate a single path, while still allowing types like `String` that admit
multiple representations to use runtime queries and branches to fast path
specializations.

**Note:** `Unicode` would make a fantastic namespace for much of
what's in this proposal if we could get the ability to nest types and
protocols in protocols.
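What shipped in Swift 4 in this spirit is `StringProtocol`, which already lets an operation be written once, generically, for both `String` and `Substring` (a minimal sketch; `isBlank` is a hypothetical example):

```swift
// One generic definition covers every conforming string type; no overloads.
func isBlank<S: StringProtocol>(_ s: S) -> Bool {
    return !s.contains { $0 != " " }
}

assert(isBlank("   "))
assert(!isBlank("a b"[...]))   // the same definition works on Substring
```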

### Scanning, Matching, and Tokenization

#### Low-Level Textual Analysis

We should provide convenient APIs for processing strings by character. For example,
it should be easy to cleanly express, “if this string starts with `"f"`, process
the rest of the string as follows…” Swift is well-suited to expressing this
common pattern beautifully, but we need to add the APIs. Here are two examples
of the sort of code that might be possible given such APIs:

```swift
if let firstLetter = input.droppingPrefix(alphabeticCharacter) {
  somethingWith(input) // process the rest of input
}

if let (number, restOfInput) = input.parsingPrefix(Int.self) {
  ...
}
```

The specific spelling and functionality of APIs like this are TBD. The larger
point is to make sure matching-and-consuming jobs are well-supported.
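A hedged approximation of the first example with today's API shows how close Swift already gets, and why dedicated APIs would still help:

```swift
let input = "function body"
// "If this string starts with a lowercase letter, process the rest..."
if let first = input.first, ("a"..."z").contains(first) {
    let rest = input.dropFirst()   // Substring: no copy, ready for more parsing
    assert(rest.first == "u")
}
```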

#### Unified Pattern Matcher Protocol

Many of the current methods that do matching are overloaded to do the same
logical operations in different ways, with the following axes:

- Logical Operation: `find`, `split`, `replace`, match at start
- Kind of pattern: `CharacterSet`, `String`, a regex, a closure
- Options, e.g. case/diacritic sensitivity, locale. Sometimes a part of
the method name, and sometimes an argument
- Whole string or subrange.

We should represent these aspects as orthogonal, composable components,
abstracting pattern matchers into a protocol like
[this one](https://github.com/apple/swift/blob/master/test/Prototypes/PatternMatching.swift#L33),
that can allow us to define logical operations once, without introducing
overloads, and massively reducing API surface area.

For example, using the strawman prefix `%` syntax to turn string literals into
patterns, the following pairs would all invoke the same generic methods:

```swift
if let found = s.firstMatch(%"searchString") { ... }
if let found = s.firstMatch(someRegex) { ... }

for m in s.allMatches((%"searchString"), case: .insensitive) { ... }
for m in s.allMatches(someRegex) { ... }

let items = s.split(separatedBy: ", ")
let tokens = s.split(separatedBy: CharacterSet.whitespace)
```

Note that, because Swift requires the indices of a slice to match the indices of
the range from which it was sliced, operations like `firstMatch` can return a
`Substring?` in lieu of a `Range<String.Index>?`: the indices of the match in
the string being searched, if needed, can easily be recovered as the
`startIndex` and `endIndex` of the `Substring`.
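The index-preservation guarantee this relies on can be checked directly:

```swift
let s = "abcdef"
let sub = s.dropFirst(2)                 // "cdef", a Substring
let r = sub.startIndex..<sub.endIndex    // stands in for a Range<String.Index>
assert(s[r] == "cdef")                   // slice indices are valid in the base string
```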

Note also that matching operations are useful for collections in general, and
would fall out of this proposal:

```swift
// replace subsequences of contiguous NaNs with zero
forces.replace(oneOrMore([Float.nan]), [0.0])
```

#### Regular Expressions

Addressing regular expressions is out of scope for this proposal.
That said, it is important to note that the pattern matching protocol mentioned
above provides a suitable foundation for regular expressions, and types such as
`NSRegularExpression` can easily be retrofitted to conform to it. In the
future, support for regular expression literals in the compiler could allow for
compile-time syntax checking and optimization.

### String Indices

`String` currently has four views—`characters`, `unicodeScalars`, `utf8`, and
`utf16`—each with its own opaque index type. The APIs used to translate indices
between views add needless complexity, and the opacity of indices makes them
difficult to serialize.

The index translation problem has two aspects:

1. `String` views cannot consume one another's indices without a cumbersome
   conversion step. An index into a `String`'s `characters` must be translated
   before it can be used as a position in its `unicodeScalars`. Although these
   translations are rarely needed, they add conceptual and API complexity.
2. Many APIs in the core libraries and other frameworks still expose `String`
   positions as `Int`s and regions as `NSRange`s, which can only reference a
   `utf16` view and interoperate poorly with `String` itself.

#### Index Interchange Among Views

String's need for flexible backing storage and reasonably-efficient indexing
(i.e. without dynamically allocating and reference-counting the indices
themselves) means indices need an efficient underlying storage type. Although
we do not wish to expose `String`'s indices *as* integers, `Int` offsets into
underlying code unit storage make a good underlying storage type, provided
`String`'s underlying storage supports random access. We think random-access
*code-unit storage* is a reasonable requirement to impose on all `String`
instances.

Making these `Int` code unit offsets conveniently accessible and constructible
solves the serialization problem:

```swift
clipboard.write(s.endIndex.codeUnitOffset)
let offset = clipboard.read(Int.self)
let i = String.Index(codeUnitOffset: offset)
```

Index interchange between `String` and its `unicodeScalars`, `codeUnits`,
and [`extendedASCII`](#parsing-ascii-structure) views can be made entirely
seamless by having them share an index type (semantics of indexing a `String`
between grapheme cluster boundaries are TBD—it can either trap or be forgiving).
Having a common index allows easy traversal into the interior of graphemes,
something that is often needed, without making it likely that someone will do it
by accident.

- `String.index(after:)` should advance to the next grapheme, even when the
  index points partway through a grapheme.

- `String.index(before:)` should move to the start of the grapheme before
  the current position.

Seamless index interchange between `String` and its UTF-8 or UTF-16 views is not
crucial, as the specifics of encoding should not be a concern for most use
cases, and would impose needless costs on the indices of other views. That
said, we can make translation much more straightforward by exposing simple
bidirectional converting `init`s on both index types:

```swift
let u8Position = String.UTF8.Index(someStringIndex)
let originalPosition = String.Index(u8Position)
```
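These conversions shipped in slightly different clothes: failable, since a position in one view need not exist in another. A sketch with today's API:

```swift
let s = "héllo"
let u8i = s.utf8.index(after: s.utf8.startIndex)   // byte offset 1: start of "é"
if let si = u8i.samePosition(in: s) {              // convert back to a String index
    assert(s[si] == "é")
}
```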

#### Index Interchange with Cocoa

We intend to address `NSRange`s that denote substrings in Cocoa APIs as
described [later in this document](#substrings--ranges-and-objective-c-interop).
That leaves the interchange of bare indices with Cocoa APIs trafficking in
`Int`. Hopefully such APIs will be rare, but when needed, the following
extension, which would be useful for all `Collections`, can help:

```swift
extension Collection {
  func index(offset: IndexDistance) -> Index {
    return index(startIndex, offsetBy: offset)
  }
  func offset(of i: Index) -> IndexDistance {
    return distance(from: startIndex, to: i)
  }
}
```

Then integers can easily be translated into offsets into a `String`'s `utf16`
view for consumption by Cocoa:

```swift
let cocoaIndex = s.utf16.offset(of: String.UTF16Index(i))
let swiftIndex = s.utf16.index(offset: cocoaIndex)
```

### Formatting

A full treatment of formatting is out of scope of this proposal, but
we believe it's crucial for completing the text processing picture. This
section details some of the existing issues and thinking that may guide future
development.

#### Printf-Style Formatting

`String.format` is designed on the `printf` model: it takes a format string with
textual placeholders for substitution, and an arbitrary list of other arguments.
The syntax and meaning of these placeholders has a long history in
C, but for anyone who doesn't use them regularly they are cryptic and complex,
as the `printf (3)` man page attests.

Aside from complexity, this style of API has two major problems: First, the
spelling of these placeholders must match up to the types of the arguments, in
the right order, or the behavior is undefined. Some limited support for
compile-time checking of this correspondence could be implemented, but only for
the cases where the format string is a literal. Second, there's no reasonable
way to extend the formatting vocabulary to cover the needs of new types: you are
stuck with what's in the box.

#### Foundation Formatters

The formatters supplied by Foundation are highly capable and versatile, offering
both formatting and parsing services. When used for formatting, though, the
design pattern demands more from users than it should:

* Matching the type of data being formatted to a formatter type
* Creating an instance of that type
* Setting stateful options (`currency`, `dateStyle`) on the type. Note: the
   need for this step prevents the instance from being used and discarded in
   the same expression where it is created.
* Overall, introduction of needless verbosity into source

These may seem like small issues, but the experience of Apple localization
experts is that the total drag of these factors on programmers is such that they
tend to reach for `String.format` instead.

#### String Interpolation

Swift string interpolation provides a user-friendly alternative to printf's
domain-specific language (just write ordinary swift code!) and its type safety
problems (put the data right where it belongs!) but the following issues prevent
it from being useful for localized formatting (among other jobs):

* [SR-2303](https://bugs.swift.org/browse/SR-2303\) We are unable to restrict
   types used in string interpolation.
* [SR-1260](https://bugs.swift.org/browse/SR-1260\) String interpolation can't
   distinguish (fragments of) the base string from the string substitutions.

In the long run, we should improve Swift string interpolation to the point where
it can participate in most any formatting job. Mostly this centers around
fixing the interpolation protocols per the previous item, and supporting
localization.

To be able to use formatting effectively inside interpolations, it needs to be
both lightweight (because it all happens in-situ) and discoverable. One
approach would be to standardize on `format` methods, e.g.:

```swift
"Column 1: \(n.format(radix:16, width:8)) *** \(message)"

"Something with leading zeroes: \(x.format(fill: zero, width:8))"
```
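Pending such `format` methods, the closest runnable spelling today combines interpolation with explicit conversions (a sketch; radix formatting via the standard `String(_:radix:)` initializer):

```swift
let n = 255
let message = "ready"
let line = "Column 1: \(String(n, radix: 16)) *** \(message)"
assert(line == "Column 1: ff *** ready")
```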

### C String Interop

Our support for interoperation with nul-terminated C strings is scattered and
incoherent, with six ways to transform a C string into a `String` and four ways
to do the inverse. These APIs should be replaced with the following:

```swift
extension String {
  /// Constructs a `String` having the same contents as `nulTerminatedUTF8`.
  ///
  /// - Parameter nulTerminatedUTF8: a sequence of contiguous UTF-8 encoded
  ///   bytes ending just before the first zero byte (NUL character).
  init(cString nulTerminatedUTF8: UnsafePointer<CChar>)

  /// Constructs a `String` having the same contents as `nulTerminatedCodeUnits`.
  ///
  /// - Parameter nulTerminatedCodeUnits: a sequence of contiguous code units in
  ///   the given `encoding`, ending just before the first zero code unit.
  /// - Parameter encoding: describes the encoding in which the code units
  ///   should be interpreted.
  init<Encoding: UnicodeEncoding>(
    cString nulTerminatedCodeUnits: UnsafePointer<Encoding.CodeUnit>,
    encoding: Encoding)

  /// Invokes the given closure on the contents of the string, represented as a
  /// pointer to a null-terminated sequence of UTF-8 code units.
  func withCString<Result>(
    _ body: (UnsafePointer<CChar>) throws -> Result) rethrows -> Result
}
```

In both of the construction APIs, any invalid encoding sequence detected will
have its longest valid prefix replaced by U+FFFD, the Unicode replacement
character, per Unicode specification. This covers the common case. The
replacement is done *physically* in the underlying storage, and the validity of
the result is recorded in the `String`'s `encoding` so that future accesses
need not be slowed down by separate error-repair checks.

Construction that is aborted when encoding errors are detected can be
accomplished using APIs on the `encoding`. String types that retain their
physical encoding even in the presence of errors and are repaired on-the-fly can
be built as different instances of the `Unicode` protocol.
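The first initializer above matches what the standard library ships today, including the U+FFFD repair behavior described:

```swift
let s = String(cString: [72, 105, 0] as [CChar])          // "Hi" + NUL
assert(s == "Hi")

// An invalid (truncated) UTF-8 sequence is repaired with the replacement
// character rather than failing:
let repaired = String(cString: [72, 0xC3, 0] as [UInt8])  // 0xC3 starts a 2-byte
assert(repaired == "H\u{FFFD}")                           // sequence that never ends
```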

### Unicode 9 Conformance

Unicode 9 (and macOS 10.11) brought us support for family emoji, which changes
the process of properly identifying `Character` boundaries. We need to update
`String` to account for this change.

### High-Performance String Processing

Many strings are short enough to store in 64 bits, many can be stored using only
8 bits per unicode scalar, others are best encoded in UTF-16, and some come to
us already in some other encoding, such as UTF-8, that would be costly to
translate. Supporting these formats while maintaining usability for
general-purpose APIs demands that a single `String` type can be backed by many
different representations.

That said, the highest performance code always requires static knowledge of the
data structures on which it operates, and for this code, dynamic selection of
representation comes at too high a cost. Heavy-duty text processing demands a
way to opt out of dynamism and directly use known encodings. Having this
ability can also make it easy to cleanly specialize code that handles dynamic
cases for maximal efficiency on the most common representations.

To address this need, we can build models of the `Unicode` protocol that encode
representation information into the type, such as `NFCNormalizedUTF16String`.
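As an illustration of what encoding-in-the-type buys (the name `ASCIIString` here is hypothetical, not part of the proposal):

```swift
// Hypothetical statically-ASCII string: because the encoding is known at
// compile time, operations compile down to simple byte arithmetic, with no
// Unicode tables and no dynamic representation checks.
struct ASCIIString {
    var units: [UInt8]  // invariant: every unit < 0x80

    func uppercased() -> ASCIIString {
        // 'a'...'z' differ from 'A'...'Z' by exactly 0x20 in ASCII.
        ASCIIString(units: units.map { (0x61...0x7A).contains($0) ? $0 - 0x20 : $0 })
    }
}

let s = ASCIIString(units: Array("swift4".utf8))
assert(String(decoding: s.uppercased().units, as: UTF8.self) == "SWIFT4")
```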

### Parsing ASCII Structure

Although many machine-readable formats support the inclusion of arbitrary
Unicode text, it is also common that their fundamental structure lies entirely
within the ASCII subset (JSON, YAML, many XML formats). These formats are often
processed most efficiently by recognizing ASCII structural elements as ASCII,
and capturing the arbitrary sections between them in more-general strings. The
current String API offers no way to efficiently recognize ASCII and skip past
everything else without the overhead of full decoding into unicode scalars.

For these purposes, strings should supply an `extendedASCII` view that is a
collection of `UInt32`, where values less than `0x80` represent the
corresponding ASCII character, and other values represent data that is specific
to the underlying encoding of the string.
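The idea can be approximated today with the `utf8` view, since UTF-8 continuation bytes are always ≥ 0x80 and so can never be mistaken for ASCII structure; the proposed `extendedASCII` view would generalize this to any underlying encoding:

```swift
// Split on an ASCII structural delimiter while passing arbitrary Unicode
// payload through untouched — no full decoding into unicode scalars required.
func splitFields(_ s: String, on delimiter: Unicode.Scalar = ",") -> [String] {
    precondition(delimiter.value < 0x80, "delimiter must be ASCII")
    var fields: [String] = []
    var current: [UInt8] = []
    for unit in s.utf8 {
        if unit == UInt8(delimiter.value) {
            fields.append(String(decoding: current, as: UTF8.self))
            current.removeAll(keepingCapacity: true)
        } else {
            current.append(unit)  // may be part of a multi-byte sequence
        }
    }
    fields.append(String(decoding: current, as: UTF8.self))
    return fields
}

assert(splitFields("café,naïve,ok") == ["café", "naïve", "ok"])
```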

## Language Support

This proposal depends on two new features in the Swift language:

1. **Generic subscripts**, to
  enable unified slicing syntax.

2. **A subtype relationship** between
  `Substring` and `String`, enabling framework APIs to traffic solely in
  `String` while still making it possible to avoid copies by handling
  `Substring`s where necessary.

Additionally, **the ability to nest types and protocols inside
protocols** could significantly shrink the footprint of this proposal
on the top-level Swift namespace.

## Open Questions

### Must `String` be limited to storing UTF-16 subset encodings?

- The ability to handle `UTF-8`-encoded strings (models of `Unicode`) is not in
question here; this is about what encodings must be storable, without
transcoding, in the common currency type called “`String`”.
- ASCII, Latin-1, UCS-2, and UTF-16 are UTF-16 subsets. UTF-8 is not.
- If we have a way to get at a `String`'s code units, we need a concrete type in
which to express them in the API of `String`, which is a concrete type
- If String needs to be able to represent UTF-32, presumably the code units need
to be `UInt32`.
- Not supporting UTF-32-encoded text seems like one reasonable design choice.
- Maybe we can allow UTF-8 storage in `String` and expose its code units as
`UInt16`, just as we would for Latin-1.
- Supporting only UTF-16-subset encodings would imply that `String` indices can
be serialized without recording the `String`'s underlying encoding.

### Do we need a type-erasable base protocol for UnicodeEncoding?

UnicodeEncoding has an associated type, but it may be important to be able to
traffic in completely dynamic encoding values, e.g. for “tell me the most
efficient encoding for this string.”

### Should there be a string “facade?”

One possible design alternative makes `Unicode` a vehicle for expressing
the storage and encoding of code units, but does not attempt to give it an API
appropriate for `String`. Instead, string APIs would be provided by a generic
wrapper around an instance of `Unicode`:

struct StringFacade<U: Unicode> : BidirectionalCollection {

 // ...APIs for high-level string processing here...

 var unicode: U // access to lower-level unicode details
}

typealias String = StringFacade<StringStorage>
typealias Substring = StringFacade<StringStorage.SubSequence>

This design would allow us to de-emphasize lower-level `String` APIs such as
access to the specific encoding, by putting them behind a `.unicode` property.
A similar effect in a facade-less design would require a new top-level
`StringProtocol` playing the role of the facade with an `associatedtype
Storage : Unicode`.

An interesting variation on this design is possible if defaulted generic
parameters are introduced to the language:

struct String<U: Unicode = StringStorage> 
 : BidirectionalCollection {

 // ...APIs for high-level string processing here...

 var unicode: U // access to lower-level unicode details
}

typealias Substring = String<StringStorage.SubSequence>

One advantage of such a design is that naïve users will always extend “the right
type” (`String`) without thinking, and the new APIs will show up on `Substring`,
`MyUTF8String`, etc. That said, it also has downsides that should not be
overlooked, not least of which is the confusability of the meaning of the word
“string.” Is it referring to the generic or the concrete type?

### `TextOutputStream` and `TextOutputStreamable`

`TextOutputStreamable` is intended to provide a vehicle for
efficiently transporting formatted representations to an output stream
without forcing the allocation of storage. Its use of `String`, a
type with multiple representations, at the lowest-level unit of
communication, conflicts with this goal. It might be sufficient to
change `TextOutputStream` and `TextOutputStreamable` to traffic in an
associated type conforming to `Unicode`, but that is not yet clear.
This area will require some design work.
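For reference, the existing protocol pair already supports custom streams; a small sketch of a stream that measures formatted output without storing it (note that, as the paragraph above observes, even such a stream currently receives its input as `String`):

```swift
// A TextOutputStream that counts UTF-16 code units instead of storing text.
struct CountingStream: TextOutputStream {
    var unitCount = 0
    mutating func write(_ string: String) {
        unitCount += string.utf16.count
    }
}

var counter = CountingStream()
print("héllo", terminator: "", to: &counter)
assert(counter.unitCount == 5)  // five UTF-16 units: h, é, l, l, o
```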

### `description` and `debugDescription`

* Should these be creating localized or non-localized representations?
* Is returning a `String` efficient enough?
* Is `debugDescription` pulling the weight of the API surface area it adds?

### `StaticString`

`StaticString` was added as a byproduct of standard library development and kept
around because it seemed useful, but it was never truly *designed* for client
programmers. We need to decide what happens with it. Presumably *something*
should fill its role, and that should conform to `Unicode`.

## Footnotes

<b id="f0">0</b> The integers rewrite currently underway is expected to
   substantially reduce the scope of `Int`'s API by using more
   generics. [:leftwards_arrow_with_hook:](#a0)

<b id="f1">1</b> In practice, these semantics will usually be tied to the
version of the installed [ICU](http://icu-project.org) library, which
programmatically encodes the most complex rules of the Unicode Standard and its
de-facto extension, CLDR.[:leftwards_arrow_with_hook:](#a1)

<b id="f2">2</b>
See
[UAX #29: Unicode Text Segmentation](http://unicode.org/reports/tr29/). Note
that inserting Unicode scalar values to prevent merging of grapheme clusters would
also constitute a kind of misbehavior (one of the clusters at the boundary would
not be found in the result), so would be relatively costly to implement, with
little benefit. [:leftwards_arrow_with_hook:](#a2)

<b id="f4">4</b> The use of non-UCA-compliant ordering is fully sanctioned by
the Unicode standard for this purpose. In fact there's
a [whole chapter](http://www.unicode.org/versions/Unicode9.0.0/ch05.pdf)
dedicated to it. In particular, §5.17 says:

> When comparing text that is visible to end users, a correct linguistic sort
> should be used, as described in _Section 5.16, Sorting and
> Searching_. However, in many circumstances the only requirement is for a
> fast, well-defined ordering. In such cases, a binary ordering can be used.

[:leftwards_arrow_with_hook:](#a4)

<b id="f5">5</b> The queries supported by `NSCharacterSet` map directly onto
properties in a table that's indexed by unicode scalar value. This table is
part of the Unicode standard. Some of these queries (e.g., “is this an
uppercase character?”) may have fairly obvious generalizations to grapheme
clusters, but exactly how to do it is a research topic and *ideally* we'd either
establish the existing practice that the Unicode committee would standardize, or
the Unicode committee would do the research and we'd implement their
result.[:leftwards_arrow_with_hook:](#a5)

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org <mailto:swift-evolution@swift.org>
https://lists.swift.org/mailman/listinfo/swift-evolution


doesn't necessarily mean that ignoring that case is the right thing to do. In fact, it means that Unicode won't do anything to protect programs against these, and if Swift doesn't, chances are that no one will. Isolated combining characters break a number of expectations that developers could reasonably have:

(a + b).count == a.count + b.count
(a + b).startsWith(a)
(a + b).endsWith(b)
(a + b).find(a) // or .find(b)
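These expectations can be checked concretely; a minimal demonstration of the quirk under discussion:

```swift
let a = "e"
let b = "\u{301}"  // U+0301 COMBINING ACUTE ACCENT, alone in its own string

// Each string is one grapheme on its own...
assert(a.count == 1 && b.count == 1)

// ...but concatenation merges them into the single grapheme "é",
// so (a + b).count != a.count + b.count,
assert((a + b).count == 1)

// and the result no longer starts with `a` at the grapheme level.
assert(!(a + b).hasPrefix(a))
```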

Of course, this can be documented, but people want easy, and documentation is hard.

These rules, while intuitive for some collections like Array, are not documented requirements of RangeReplaceableCollection & Equatable on which they rely.

The fact is, Unicode and human language in general are not mathematically consistent things, and trying to apply universal rules or find the downside-free “right” solution like this is bound to end up with some irreconcilable problems. Throughout the document, we tried to take the line of presenting these inconsistencies as engineering trade-offs, because that is what we are faced with in the real world.

I’m also not sure how useful being able to rely on (a + b).count == a.count + b.count is when you are dealing with a non-fixed-width non-random-access collection. It’s definitely useful for something like array, where you might want to do this kind of math to determine, for example, the size of a target buffer. For strings, where you cannot make that kind of assumption because of their variable-width elements, not so much.

Yes. Unfortunately they also want the ability to append a string consisting of a combining character to another string and have it append. And they don't want to be prevented from forming valid-but-defective Unicode strings.

[…]

Can you suggest an alternative that doesn't violate the Unicode standard and supports the expected use-cases, somehow?

I'm not sure I understand. Did we go from "this is a [degenerate/defective](https://github.com/apple/swift/blob/master/docs/StringManifesto.md#string-should-be-a-collection-of-characters-again) case that we shouldn't bother with" to "this is a supported use case that needs to work as-is"? I've never seen anyone start a string with a combining character on purpose, though I'm familiar with just one natural language that needs combining characters. I can imagine that it could be a convenient feature in other natural languages.

Today, if you want to add a combining mark, you can insert it into the string after the character you want it glommed on to. We could, perhaps, migrate this capability into the Character type. But that would complicate Character (which currently I think would be better kept simple), probably requiring it to present its unicode scalars as a range-replaceable collection – while still needing it to retain its invariants of only ever containing exactly one grapheme.

The goal of referring to the standard’s use of the term degenerate was to stay in line with its suggestion that “No special provisions are made to get marginally better behavior for degenerate cases that never occur in practice”.

Your point about combining marks and string sanitization is a good one. But there is a lot more to string sanitization than just this one example, and not a topic we’re addressing in general.

However, if Swift Strings are now designed for machine processing and less for human language convenience,

This isn’t what the document says. Rather, it acknowledges that String needs to be able to handle both well, as opposed to neither which is the case in some places today. In the Swift 5 timeframe, we would like to explore separation of the two further but in Swift 4, String must perform double duty. Restoring Collection conformance is part of acknowledging that strings are used for both machine and human language purposes, and that conformance is a key part of the former use case (it also has valid, if fewer, uses when processing human-readable text).

But as part of the redesign, the String API needs to make a cleaner separation between machine and human processing in the API, hence some of the recommendations for how compared(to:) will work. And you have to have a default, which we think ought to be the machine one, based on our guesses about people’s expectations and the most common use cases.

for me, it's easy enough to justify a safe default in the context of machine processing: `a+b` will not combine the end of `a` with the start of `b`. You could do this by inserting a ◌ that `b` could combine with if necessary. That solution would make half of the cases that I've mentioned work as expected and make the operation always safe, as far as I can tell.

This would violate another unwritten assumption of many, that a + b shouldn’t under the hood end up being a + c + b where c is some “just make it all OK” value. I’m not sure why violating this unwritten rule is better than violating the other unwritten ones.

Here’s another unwritten rule: a collection, split in two at a certain point, and then recombined, should be the same as when you started. If that point is halfway through a grapheme, should we violate that rule?

In that world, it would be a good idea to have a `&+` fallback or something like that that will let characters combine. I would think that this is a much less common use case than serializing strings, though.

My second concern is with how easy it is to convert an Int to a String index. I've been vocal about this before: I'm concerned that Swift developers will equate Ints with random-access String indices, which they are emphatically not. String.Index(100) is proposed as a constant-time operation

No, that has not been proposed. It would be

String.Index(codeUnitOffset: 100)

It's hard to strike a balance between keeping programmers from making mistakes and making the important use-cases easy. Do you have any suggestions for improving on what we've proposed?

That's still one extension away from String.Index(100), and one function away from an even more convenient form.

A non-random-access collection can be given integer indexing trivially with an extension. You can even do it for every collection in one shot:

extension Collection {
    subscript(i: IndexDistance) -> Iterator.Element {
        return self[index(startIndex, offsetBy: i)]
    }
}

Given this, I’m not sure why giving indices the stand-alone ability to extract their offset value is particularly worse. The above extension would be a terrible thing to do, but we shouldn’t jump through hoops and make the language harder to use just to prevent it – that is to descend into a “this is why we can’t have nice things” way of thinking where correct usage suffers excessively because it’s trying to prevent incorrect usage.

I don't have a great solution, but I don't have a great understanding of the problem that this is solving either.

Swift is unusual in its use of graphemes as the element of strings. It needs to interoperate smoothly with other languages, where the element is a UTF-16 code unit. Some APIs take arguments in terms of a UTF-16 offset into a string, and we need to make those APIs less painful to use.

···

On Jan 22, 2017, at 9:54 PM, Félix Cloutier <felixcca@yahoo.ca> wrote:

I'm leaving it here because, AFAIK, Swift 3 imposes constraints that are hard to ignore and mostly beneficial to people outside of the English bubble, and it seems that the proposed index regresses on this.

I'm perfectly happy with interchanging indices between the different views of a String. It's getting the offset in or out of the index that I think lets people do incorrect assumptions about strings.

For the record, I'm not a great fan of the extendedASCII view either. I think that the problem that extendedASCII wants to solve is also solved by better pattern-matching, and the proposal lays a foundation for it. Mixing pretend-ASCII and Unicode is what gets you in the kind of trouble that I described in my first message.

Félix

Le 19 janv. 2017 à 18:56, Ben Cohen via swift-evolution <swift-evolution@swift.org <mailto:swift-evolution@swift.org>> a écrit :

Hi all,

Below is our take on a design manifesto for Strings in Swift 4 and beyond.

Probably best read in rendered markdown on GitHub:
https://github.com/apple/swift/blob/master/docs/StringManifesto.md

We’re eager to hear everyone’s thoughts.

Regards,
Ben and Dave

# String Processing For Swift 4

* Authors: [Dave Abrahams](https://github.com/dabrahams), [Ben Cohen](https://github.com/airspeedswift)

The goal of re-evaluating Strings for Swift 4 has been fairly ill-defined thus
far, with just this short blurb in the
[list of goals](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160725/025676.html):

**String re-evaluation**: String is one of the most important fundamental
types in the language. The standard library leads have numerous ideas of how
to improve the programming model for it, without jeopardizing the goals of
providing a unicode-correct-by-default model. Our goal is to be better at
string processing than Perl!

For Swift 4 and beyond we want to improve three dimensions of text processing:

1. Ergonomics
2. Correctness
3. Performance

This document is meant to both provide a sense of the long-term vision
(including undecided issues and possible approaches), and to define the scope of
work that could be done in the Swift 4 timeframe.

## General Principles

### Ergonomics

It's worth noting that ergonomics and correctness are mutually-reinforcing. An
API that is easy to use—but incorrectly—cannot be considered an ergonomic
success. Conversely, an API that's simply hard to use is also hard to use
correctly. Achieving optimal performance without compromising ergonomics or
correctness is a greater challenge.

Consistency with the Swift language and idioms is also important for
ergonomics. There are several places both in the standard library and in the
foundation additions to `String` where patterns and practices found elsewhere
could be applied to improve usability and familiarity.

### API Surface Area

Primary data types such as `String` should have APIs that are easily understood
given a signature and a one-line summary. Today, `String` fails that test. As
you can see, the Standard Library and Foundation both contribute significantly to
its overall complexity.

**Method Arity** | **Standard Library** | **Foundation**
---|:---:|:---:
0: `ƒ()` | 5 | 7
1: `ƒ(:)` | 19 | 48
2: `ƒ(::)` | 13 | 19
3: `ƒ(:::)` | 5 | 11
4: `ƒ(::::)` | 1 | 7
5: `ƒ(:::::)` | - | 2
6: `ƒ(::::::)` | - | 1

**API Kind** | **Standard Library** | **Foundation**
---|:---:|:---:
`init` | 41 | 18
`func` | 42 | 55
`subscript` | 9 | 0
`var` | 26 | 14

**Total: 205 APIs**

By contrast, `Int` has 80 APIs, none with more than two parameters.[0] String processing is complex enough; users shouldn't have
to press through physical API sprawl just to get started.

Many of the choices detailed below contribute to solving this problem,
including:

* Restoring `Collection` conformance and dropping the `.characters` view.
* Providing a more general, composable slicing syntax.
* Altering `Comparable` so that parameterized
   (e.g. case-insensitive) comparison fits smoothly into the basic syntax.
* Clearly separating language-dependent operations on text produced
   by and for humans from language-independent
   operations on text produced by and for machine processing.
* Relocating APIs that fall outside the domain of basic string processing and
   discouraging the proliferation of ad-hoc extensions.

### Batteries Included

While `String` is available to all programs out-of-the-box, crucial APIs for
basic string processing tasks are still inaccessible until `Foundation` is
imported. While it makes sense that `Foundation` is needed for domain-specific
jobs such as
[linguistic tagging](https://developer.apple.com/reference/foundation/nslinguistictagger),
one should not need to import anything to, for example, do case-insensitive
comparison.

### Unicode Compliance and Platform Support

The Unicode standard provides a crucial objective reference point for what
constitutes correct behavior in an extremely complex domain, so
Unicode-correctness is, and will remain, a fundamental design principle behind
Swift's `String`. That said, the Unicode standard is an evolving document, so
this objective reference-point is not fixed.[1] While
many of the most important operations—e.g. string hashing, equality, and
non-localized comparison—will be stable, the semantics
of others, such as grapheme breaking and localized comparison and case
conversion, are expected to change as platforms are updated, so programs should
be written so their correctness does not depend on precise stability of these
semantics across OS versions or platforms. Although it may be possible to
imagine static and/or dynamic analysis tools that will help users find such
errors, the only sure way to deal with this fact of life is to educate users.

## Design Points

### Internationalization

There is strong evidence that developers cannot determine how to use
internationalization APIs correctly. Although documentation could and should be
improved, the sheer size, complexity, and diversity of these APIs is a major
contributor to the problem, causing novices to tune out, and more experienced
programmers to make avoidable mistakes.

The first step in improving this situation is to regularize all localized
operations as invocations of normal string operations with extra
parameters. Among other things, this means:

1. Doing away with `localizedXXX` methods
2. Providing a terse way to name the current locale as a parameter
3. Automatically adjusting defaults for options such
  as case sensitivity based on whether the operation is localized.
4. Removing correctness traps like `localizedCaseInsensitiveCompare` (see
   guidance in the
   [Internationalization and Localization Guide](https://developer.apple.com/library/content/documentation/MacOSX/Conceptual/BPInternational/InternationalizingYourCode/InternationalizingYourCode.html)).

Along with appropriate documentation updates, these changes will make localized
operations more teachable, comprehensible, and approachable, thereby lowering a
barrier that currently leads some developers to ignore localization issues
altogether.

#### The Default Behavior of `String`

Although this isn't well-known, the most accessible form of many operations on
Swift `String` (and `NSString`) are really only appropriate for text that is
intended to be processed for, and consumed by, machines. The semantics of the
operations with the simplest spellings are always non-localized and
language-agnostic.

Two major factors play into this design choice:

1. Machine processing of text is important, so we should have first-class,
  accessible functions appropriate to that use case.

2. The most general localized operations require a locale parameter not required
  by their un-localized counterparts. This naturally skews complexity towards
  localized operations.

Reaffirming that `String`'s simplest APIs have
language-independent/machine-processed semantics has the benefit of clarifying
the proper default behavior of operations such as comparison, and allows us to
make [significant optimizations](#collation-semantics) that were previously
thought to conflict with Unicode.

#### Future Directions

One of the most common internationalization errors is the unintentional
presentation to users of text that has not been localized, but regularizing APIs
and improving documentation can go only so far in preventing this error.
Combined with the fact that `String` operations are non-localized by default,
the environment for processing human-readable text may still be somewhat
error-prone in Swift 4.

For an audience of mostly non-experts, it is especially important that naïve
code is very likely to be correct if it compiles, and that more sophisticated
issues can be revealed progressively. For this reason, we intend to
specifically and separately target localization and internationalization
problems in the Swift 5 timeframe.

### Operations With Options

There are three categories of common string operations that need to be tuned
in various dimensions:

**Operation**|**Applicable Options**
---|---
sort ordering | locale, case/diacritic/width-insensitivity
case conversion | locale
pattern matching | locale, case/diacritic/width-insensitivity

The defaults for case-, diacritic-, and width-insensitivity are different for
localized operations than for non-localized operations, so for example a
localized sort should be case-insensitive by default, and a non-localized sort
should be case-sensitive by default. We propose a standard “language” of
defaulted parameters to be used for these purposes, with usage roughly like this:

 x.compared(to: y, case: .sensitive, in: swissGerman)

 x.lowercased(in: .currentLocale)

 x.allMatches(
   somePattern, case: .insensitive, diacritic: .insensitive)

This usage might be supported by code like this:

enum StringSensitivity {
case sensitive
case insensitive
}

extension Locale {
 static var currentLocale: Locale { ... }
}

extension Unicode {
 // An example of the option language in declaration context,
 // with nil defaults indicating unspecified, so defaults can be
 // driven by the presence/absence of a specific Locale
 func frobnicated(
   case caseSensitivity: StringSensitivity? = nil,
   diacritic diacriticSensitivity: StringSensitivity? = nil,
   width widthSensitivity: StringSensitivity? = nil,
   in locale: Locale? = nil
 ) -> Self { ... }
}

### Comparing and Hashing Strings

#### Collation Semantics

What Unicode says about collation—which is used in `<`, `==`, and hashing— turns
out to be quite interesting, once you pick it apart. The full Unicode Collation
Algorithm (UCA) works like this:

1. Fully normalize both strings
2. Convert each string to a sequence of numeric triples to form a collation key
3. “Flatten” the key by concatenating the sequence of first elements to the
  sequence of second elements to the sequence of third elements
4. Lexicographically compare the flattened keys

While step 1 can usually
be [done quickly](http://unicode.org/reports/tr15/) and
incrementally, step 2 uses a collation table that maps matching *sequences* of
unicode scalars in the normalized string to *sequences* of triples, which get
accumulated into a collation key. Predictably, this is where the real costs
lie.

*However*, there are some bright spots to this story. First, as it turns out,
string sorting (localized or not) should be done down to what's called
the
[“identical” level](http://unicode.org/reports/tr10/),
which adds a step 3a: append the string's normalized form to the flattened
collation key. At first blush this just adds work, but consider what it does
for equality: two strings that normalize the same, naturally, will collate the
same. But also, *strings that normalize differently will always collate
differently*. In other words, for equality, it is sufficient to compare the
strings' normalized forms and see if they are the same. We can therefore
entirely skip the expensive part of collation for equality comparison.
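The practical consequence is visible in Swift today, where `==` already compares canonical equivalence of normalized forms:

```swift
let precomposed = "\u{E9}"    // é as a single scalar, U+00E9
let decomposed  = "e\u{301}"  // e + U+0301 COMBINING ACUTE ACCENT

// Different scalar sequences...
assert(precomposed.unicodeScalars.count == 1)
assert(decomposed.unicodeScalars.count == 2)

// ...but they normalize identically, so they compare equal — no collation
// key is needed to decide this.
assert(precomposed == decomposed)
```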

Next, naturally, anything that applies to equality also applies to hashing: it
is sufficient to hash the string's normalized form, bypassing collation keys.
This should provide significant speedups over the current implementation.
Perhaps more importantly, since comparison down to the “identical” level applies
even to localized strings, it means that hashing and equality can be implemented
exactly the same way for localized and non-localized text, and hash tables with
localized keys will remain valid across current-locale changes.

Finally, once it is agreed that the *default* role for `String` is to handle
machine-generated and machine-readable text, the default ordering of `String`s
need no longer use the UCA at all. It is sufficient to order them in any way
that's consistent with equality, so `String` ordering can simply be a
lexicographical comparison of normalized forms,[4]
(which is equivalent to lexicographically comparing the sequences of grapheme
clusters), again bypassing step 2 and offering another speedup.

This leaves us executing the full UCA *only* for localized sorting, and ICU's
implementation has apparently been very well optimized.

Following this scheme everywhere would also allow us to make sorting behavior
consistent across platforms. Currently, we sort `String` according to the UCA,
except that—*only on Apple platforms*—pairs of ASCII characters are ordered by
unicode scalar value.

#### Syntax

Because the current `Comparable` protocol expresses all comparisons with binary
operators, string comparisons—which may require
additional [options](#operations-with-options)—do not fit smoothly into the
existing syntax. At the same time, we'd like to solve other problems with
comparison, as outlined
in
[this proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e)
(implemented by changes at the head
of
[this branch](https://github.com/CodaFi/swift/commits/space-the-final-frontier)).
We should adopt a modification of that proposal that uses a method rather than
an operator `<=>`:

enum SortOrder { case before, same, after }

protocol Comparable : Equatable {
func compared(to: Self) -> SortOrder
...
}

This change will give us a syntactic platform on which to implement methods with
additional, defaulted arguments, thereby unifying and regularizing comparison
across the library.

extension String {
func compared(to: Self) -> SortOrder

}

**Note:** `SortOrder` should bridge to `NSComparisonResult`. It's also possible
that the standard library simply adopts Foundation's `ComparisonResult` as is,
but we believe the community should at least consider alternate naming before
that happens. There will be an opportunity to discuss the choices in detail
when the modified
[Comparison Proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e) comes
up for review.
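A sketch of how the proposed shape might look for a simple conforming type (the names mirror the snippet above; this is illustrative, not the final design):

```swift
enum SortOrder { case before, same, after }

struct Version {
    var major: Int
    var minor: Int

    // Three-way comparison in the proposed style; defaulted option
    // parameters could be added to such a method without new operators.
    func compared(to other: Version) -> SortOrder {
        if major != other.major { return major < other.major ? .before : .after }
        if minor != other.minor { return minor < other.minor ? .before : .after }
        return .same
    }
}

let v1 = Version(major: 1, minor: 2)
let v2 = Version(major: 1, minor: 10)
assert(v1.compared(to: v2) == .before)
assert(v2.compared(to: v2) == .same)
```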

### `String` should be a `Collection` of `Character`s Again

In Swift 2.0, `String`'s `Collection` conformance was dropped, because we
convinced ourselves that its semantics differed from those of `Collection` too
significantly.

It was always well understood that if strings were treated as sequences of
`UnicodeScalar`s, algorithms such as `lexicographicalCompare`, `elementsEqual`,
and `reversed` would produce nonsense results. Thus, in Swift 1.0, `String` was
a collection of `Character` (extended grapheme clusters). During 2.0
development, though, we realized that correct string concatenation could
occasionally merge distinct grapheme clusters at the start and end of combined
strings.

This quirk aside, every aspect of strings-as-collections-of-graphemes appears to
comport perfectly with Unicode. We think the concatenation problem is tolerable,
because the cases where it occurs all represent partially-formed constructs. The
largest class—isolated combining characters such as ◌́ (U+0301 COMBINING ACUTE
ACCENT)—are explicitly called out in the Unicode standard as
“[degenerate](http://unicode.org/reports/tr29/)” or
“[defective](http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf\)”. The other
cases—such as a string ending in a zero-width joiner or half of a regional
indicator—appear to be equally transient and unlikely outside of a text editor.

Admitting these cases encourages exploration of grapheme composition and is
consistent with what appears to be an overall Unicode philosophy that “no
special provisions are made to get marginally better behavior for… cases that
never occur in practice.”[2] Furthermore, it seems
unlikely to disturb the semantics of any plausible algorithms. We can handle
these cases by documenting them, explicitly stating that the elements of a
`String` are an emergent property based on Unicode rules.

The benefits of restoring `Collection` conformance are substantial:

* Collection-like operations encourage experimentation with strings to
   investigate and understand their behavior. This is useful for teaching new
   programmers, but also good for experienced programmers who want to
   understand more about strings/unicode.

* Extended grapheme clusters form a natural element boundary for Unicode
   strings. For example, searching and matching operations will always produce
   results that line up on grapheme cluster boundaries.

* Character-by-character processing is a legitimate thing to do in many real
   use-cases, including parsing, pattern matching, and language-specific
   transformations such as transliteration.

* `Collection` conformance makes a wide variety of powerful operations
   available that are appropriate to `String`'s default role as the vehicle for
   machine processed text.

   The methods `String` would inherit from `Collection` have the right
   semantics where they parallel higher-level string algorithms. For example,
   grapheme-wise `lexicographicalCompare`, `elementsEqual`, and application of
   `flatMap` with case-conversion, produce the same results one would expect
   from whole-string ordering comparison, equality comparison, and
   case-conversion, respectively. `reverse` operates correctly on graphemes,
   keeping diacritics moored to their base characters and leaving emoji intact.
   Other methods such as `indexOf` and `contains` make obvious sense. A few
   `Collection` methods, like `min` and `max`, may not be particularly useful
   on `String`, but we don't consider that to be a problem worth solving, in
   the same way that we wouldn't try to suppress `min` and `max` on a
   `Set([UInt8])` that was used to store IP addresses.

* Many of the higher-level operations that we want to provide for `String`s,
   such as parsing and pattern matching, should apply to any `Collection`, and
   many of the benefits we want for `Collections`, such
   as unified slicing, should accrue
   equally to `String`. Making `String` part of the same protocol hierarchy
   allows us to write these operations once and not worry about keeping the
   benefits in sync.

* Slicing strings into substrings is a crucial part of the vocabulary of
   string processing, and all other sliceable things are `Collection`s.
   Because of its collection-like behavior, users naturally think of `String`
   in collection terms, but run into frustrating limitations where it fails to
   conform and are left to wonder where all the differences lie. Many simply
   “correct” this limitation by declaring a trivial conformance:

      extension String : BidirectionalCollection {}

   Even if we removed indexing-by-element from `String`, users could still do
   this:

     extension String : BidirectionalCollection {
       subscript(i: Index) -> Character { return characters[i] }
     }

   It would be much better to legitimize the conformance to `Collection` and
   simply document the oddity of any concatenation corner-cases, than to deny
   users the benefits on the grounds that a few cases are confusing.
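To make the grapheme-level semantics concrete, here is a sketch of how the restored conformance behaves (illustrative; assumes the `Collection` conformance this section argues for):

```swift
// "café" spelled with a combining accent: 5 Unicode scalars, 4 Characters.
let cafe = "cafe\u{0301}"
assert(cafe.count == 4)
assert(cafe.unicodeScalars.count == 5)
assert(cafe.last == "é")  // canonical equivalence: é == e + U+0301

// reversed() keeps diacritics moored to their base characters.
let noel = "noe\u{0301}l"
assert(String(noel.reversed()) == "léon")
```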

Note that the fact that `String` is a collection of graphemes does *not* mean
that string operations will necessarily have to do grapheme boundary
recognition. See the Unicode protocol section for details.

### `Character` and `CharacterSet`

`Character`, which represents a
Unicode
[extended grapheme cluster](http://www.unicode.org/reports/tr29/),
is a bit of a black box, requiring conversion to `String` in order to
do any introspection, including interoperation with ASCII. To fix this, we should:

- Add a `unicodeScalars` view much like `String`'s, so that the sub-structure
  of grapheme clusters is discoverable.
- Add a failable `init` from sequences of scalars (returning nil for sequences
  that contain 0 or 2+ graphemes).
- (Lower priority) expose some operations, such as `func uppercase() ->
  String`, `var isASCII: Bool`, and, to the extent they can be sensibly
  generalized, queries of Unicode properties that should also be exposed on
  `UnicodeScalar`, such as `isAlphabetic` and `isGraphemeBase`.

Despite its name, `CharacterSet` currently operates on the Swift `UnicodeScalar`
type. This means it is usable on `String`, but only by going through the Unicode
scalar view. To deal with this clash in the short term, `CharacterSet` should be
renamed to `UnicodeScalarSet`. In the longer term, it may be appropriate to
introduce a `CharacterSet` that provides similar functionality for extended
grapheme clusters.[5]

### Unification of Slicing Operations

Creating substrings is a basic part of String processing, but the slicing
operations that we have in Swift are inconsistent in both their spelling and
their naming:

* Slices with two explicit endpoints are done with subscript, and support
   in-place mutation:

       s[i..<j].mutate()

* Slicing from an index to the end, or from the start to an index, is done
   with a method and does not support in-place mutation:

       s.prefix(upTo: i).readOnly()

Prefix and suffix operations should be migrated to be subscripting operations
with one-sided ranges i.e. `s.prefix(upTo: i)` should become `s[..<i]`, as
in
[this proposal](https://github.com/apple/swift-evolution/blob/9cf2685293108ea3efcbebb7ee6a8618b83d4a90/proposals/0132-sequence-end-ops.md\).
With generic subscripting in the language, that will allow us to collapse a wide
variety of methods and subscript overloads into a single implementation, and
give users an easy-to-use and composable way to describe subranges.
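The one-sided spellings would read as follows (a sketch assuming generic subscripts and one-sided ranges are in place, matching what later shipped in Swift 4):

```swift
let s = "Hello, Swift"
let i = s.index(s.startIndex, offsetBy: 5)

// One-sided ranges replace the prefix/suffix method spellings:
assert(String(s[..<i]) == "Hello")    // was s.prefix(upTo: i)
assert(String(s[i...]) == ", Swift")  // was s.suffix(from: i)
assert(String(s[...])  == s)          // the whole string as a slice
```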

Further extending this EDSL to integrate use-cases like `s.prefix(maxLength: 5)`
is an ongoing research project that can be considered part of the potential
long-term vision of text (and collection) processing.

### Substrings

When implementing substring slicing, languages are faced with three options:

1. Make the substrings the same type as string, and share storage.
2. Make the substrings the same type as string, and copy storage when making the substring.
3. Make substrings a different type, with a storage copy on conversion to string.

We think number 3 is the best choice. A walk-through of the tradeoffs follows.

#### Same type, shared storage

In Swift 3.0, slicing a `String` produces a new `String` that is a view into a
subrange of the original `String`'s storage. This is why `String` is 3 words in
size (the start, length and buffer owner), unlike the similar `Array` type
which is only one.

This is a simple model with big efficiency gains when chopping up strings into
multiple smaller strings. But it does mean that a stored substring keeps the
entire original string buffer alive even after it would normally have been
released.

This arrangement has proven to be problematic in other programming languages,
because applications sometimes extract small strings from large ones and keep
those small strings long-term. That is considered a memory leak and was enough
of a problem in Java that they changed from substrings sharing storage to
making a copy in 1.7.

#### Same type, copied storage

Copying of substrings is also the choice made in C#, and in the default
`NSString` implementation. This approach avoids the memory leak issue, but has
obvious performance overhead in performing the copies.

This in turn encourages trafficking in string/range pairs instead of in
substrings, for performance reasons, leading to API challenges. For example:

    foo.compare(bar, range: start..<end)

Here, it is not clear whether `range` applies to `foo` or `bar`. This
relationship is better expressed in Swift as a slicing operation:

    foo[start..<end].compare(bar)

Not only does this clarify to which string the range applies, it also brings
this sub-range capability to any API that operates on `String` "for free". So
these other combinations also work equally well:

    // apply range on argument rather than target
    foo.compare(bar[start..<end])
    // apply range on both
    foo[start..<end].compare(bar[start1..<end1])
    // compare two strings ignoring first character
    foo.dropFirst().compare(bar.dropFirst())

In all three cases, an explicit range argument need not appear on the `compare`
method itself. The implementation of `compare` does not need to know anything
about ranges. Methods need only take range arguments when that was an
integral part of their purpose (for example, setting the start and end of a
user's current selection in a text box).

#### Different type, shared storage

The desire to share underlying storage while preventing accidental memory leaks
occurs with slices of `Array`. For this reason we have an `ArraySlice` type.
The inconvenience of a separate type is mitigated by most operations used on
`Array` from the standard library being generic over `Sequence` or `Collection`.

We should apply the same approach for `String` by introducing a distinct
`SubSequence` type, `Substring`. The advice given for `ArraySlice` would apply
equally to `Substring`:

> **Important:** Long-term storage of `Substring` instances is discouraged. A
> substring holds a reference to the entire storage of a larger string, not
> just to the portion it presents, even after the original string's lifetime
> ends. Long-term storage of a `Substring` may therefore prolong the lifetime
> of large strings that are no longer otherwise accessible, which can appear
> to be memory leakage.

When assigning a `Substring` to a longer-lived variable (usually a stored
property) explicitly of type `String`, a type conversion will be performed, and
at this point the substring buffer is copied and the original string's storage
can be released.
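A sketch of the intended usage pattern; the conversion is written explicitly with `String.init` here, which the subtype relationship proposed below would make implicit:

```swift
// Slicing and splitting yield Substrings that share the big string's storage.
let big = String(repeating: "x", count: 10_000) + ",tail"
let pieces = big.split(separator: ",")   // [Substring] — no copying yet

// Storing long-term: convert to String, copying just the bytes needed
// and allowing the 10,000-character buffer to be released.
let keep = String(pieces[1])
assert(keep == "tail")
```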

A `String` that was not its own `Substring` could be one word—a single tagged
pointer—without requiring additional allocations. A `Substring` would be a view
onto a `String`, so would be three words: a pointer to the owner, a pointer to
the start, and a length. The small string optimization for `Substring` would
take advantage of the larger size, probably with a less compressed encoding for
speed.

The downside of having two types is the inconvenience of sometimes having a
`Substring` when you need a `String`, and vice-versa. It is likely this would
be a significantly bigger problem than with `Array` and `ArraySlice`, as
slicing of `String` is such a common operation. It is especially relevant to
existing code that assumes `String` is the currency type. To ease the pain of
type mismatches, `Substring` should be a subtype of `String` in the same way
that `Int` is a subtype of `Optional<Int>`. This would give users an implicit
conversion from `Substring` to `String`, as well as the usual implicit
conversions such as `[Substring]` to `[String]` that other subtype
relationships receive.

In most cases, type inference combined with the subtype relationship should
make the type difference a non-issue and users will not care which type they
are using. For flexibility and optimizability, most operations from the
standard library will traffic in generic models of
[`Unicode`](#the--code-unicode--code--protocol).

##### Guidance for API Designers

In this model, **if a user is unsure about which type to use, `String` is always
a reasonable default**. A `Substring` passed where `String` is expected will be
implicitly copied. When compared to the “same type, copied storage” model, we
have effectively deferred the cost of copying from the point where a substring
is created until it must be converted to `String` for use with an API.

A user who needs to optimize away copies altogether should use this guideline:
if for performance reasons you are tempted to add a `Range` argument to your
method as well as a `String` to avoid unnecessary copies, you should instead
use `Substring`.

##### The “Empty Subscript”

To make it easy to call such an optimized API when you only have a `String` (or
to call any API that takes a `Collection`'s `SubSequence` when all you have is
the `Collection`), we propose the following “empty subscript” operation,

    extension Collection {
      subscript() -> SubSequence {
        return self[startIndex..<endIndex]
      }
    }

which allows the following usage:

    funcThatIsJustLooking(at: person.name[]) // pass person.name as Substring

The `[]` syntax can be offered as a fixit when needed, similar to `&` for an
`inout` argument. While it doesn't help a user convert `[String]` to
`[Substring]`, the need for such conversions is extremely rare, and they can be
done with a simple `map` (which could also be offered by a fixit):

    takesAnArrayOfSubstring(arrayOfString.map { $0[] })

#### Other Options Considered

As we have seen, all three options above have downsides, but it's possible
these downsides could be eliminated/mitigated by the compiler. We are proposing
one such mitigation—implicit conversion—as part of the "different type,
shared storage" option, to help avoid the cognitive load on developers of
having to deal with a separate `Substring` type.

To avoid the memory leak issues of a "same type, shared storage" substring
option, we considered whether the compiler could perform an implicit copy of
the underlying storage when it detects the string is being "stored" for long
term usage, say when it is assigned to a stored property. The trouble with this
approach is that it is very difficult for the compiler to distinguish between
long-term and short-term storage in the case of abstractions that rely on
stored properties. For example, should the storing of a substring inside an
`Optional` be considered long-term? Or the storing of multiple substrings
inside an array? The latter would not work well in the case of a
`components(separatedBy:)` implementation that intended to return an array of
substrings. It would also be difficult to distinguish intentional medium-term
storage of substrings, say by a lexer. There does not appear to be an effective
consistent rule that could be applied in the general case for detecting when a
substring is truly being stored long-term.

To avoid the cost of copying substrings under "same type, copied storage", the
optimizer could be enhanced to reduce the impact of some of those copies.
For example, this code could be optimized to pull the invariant substring out
of the loop:

    for _ in 0..<lots {
      someFunc(takingString: bigString[bigRange])
    }

It's worth noting that a similar optimization is needed to avoid an equivalent
problem with implicit conversion in the "different type, shared storage" case:

    let substring = bigString[bigRange]
    for _ in 0..<lots { someFunc(takingString: substring) }

However, in the case of "same type, copied storage" there are many use cases
that cannot be optimized as easily. Consider the following simple definition of
a recursive `contains` algorithm which, when substring slicing is linear-time,
makes the overall algorithm quadratic:

    extension String {
        func containsChar(_ x: Character) -> Bool {
            return !isEmpty && (first == x || dropFirst().containsChar(x))
        }
    }

Expecting the optimizer to eliminate this problem is unrealistic; instead, the
user must remember to avoid string slicing where efficiency matters, and
rewrite the code accordingly:

    extension String {
        // add optional argument tracking progress through the string
        func containsCharacter(_ x: Character, atOrAfter idx: Index? = nil) -> Bool {
            let idx = idx ?? startIndex
            return idx != endIndex
                && (self[idx] == x || containsCharacter(x, atOrAfter: index(after: idx)))
        }
    }

#### Substrings, Ranges and Objective-C Interop

The pattern of passing a string/range pair is common in several Objective-C
APIs, and is made especially awkward in Swift by the non-interchangeability of
`Range<String.Index>` and `NSRange`.

    s2.find(s2, sourceRange: NSRange(j..<s2.endIndex, in: s2))

In general, however, the Swift idiom for operating on a sub-range of a
`Collection` is to *slice* the collection and operate on that:

    s2.find(s2[j..<s2.endIndex])

Therefore, APIs that operate on an `NSString`/`NSRange` pair should be imported
without the `NSRange` argument. The Objective-C importer should be changed to
give these APIs special treatment so that when a `Substring` is passed, instead
of being converted to a `String`, the full `NSString` and range are passed to
the Objective-C method, thereby avoiding a copy.

As a result, you would never need to pass an `NSRange` to these APIs, which
solves the impedance problem by eliminating the argument, resulting in more
idiomatic Swift code while retaining the performance benefit. To help users
manually handle any cases that remain, Foundation should be augmented to allow
the following syntax for converting to and from `NSRange`:

    let nsr = NSRange(i..<j, in: s) // An NSRange corresponding to s[i..<j]
    let iToJ = Range(nsr, in: s)    // Equivalent to i..<j
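These two Foundation conversions can be exercised as follows (a usage sketch):

```swift
import Foundation

let s = "café au lait"
let r = s.range(of: "café")!
let nsr = NSRange(r, in: s)      // NSRange counts UTF-16 code units
assert(nsr.location == 0 && nsr.length == 4)

let back = Range(nsr, in: s)!    // back to Range<String.Index>
assert(String(s[back]) == "café")
```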

### The `Unicode` protocol

With `Substring` and `String` being distinct types and sharing almost all
interface and semantics, and with the highest-performance string processing
requiring knowledge of encoding and layout that the currency types can't
provide, it becomes important to capture the common “string API” in a protocol.
Since Unicode conformance is a key feature of string processing in Swift, we
call that protocol `Unicode`:

**Note:** The following assumes several features that are planned but not yet implemented in
Swift, and should be considered a sketch rather than a final design.

    protocol Unicode
      : Comparable, BidirectionalCollection where Element == Character {

      associatedtype Encoding : UnicodeEncoding
      var encoding: Encoding { get }

      associatedtype CodeUnits
        : RandomAccessCollection where Element == Encoding.CodeUnit
      var codeUnits: CodeUnits { get }

      associatedtype UnicodeScalars
        : BidirectionalCollection where Element == UnicodeScalar
      var unicodeScalars: UnicodeScalars { get }

      associatedtype ExtendedASCII
        : BidirectionalCollection where Element == UInt32
      var extendedASCII: ExtendedASCII { get }
    }

    extension Unicode {
      // ... define high-level non-mutating string operations, e.g. search ...

      func compared<Other: Unicode>(
        to rhs: Other,
        case caseSensitivity: StringSensitivity? = nil,
        diacritic diacriticSensitivity: StringSensitivity? = nil,
        width widthSensitivity: StringSensitivity? = nil,
        in locale: Locale? = nil
      ) -> SortOrder { ... }
    }

    extension Unicode : RangeReplaceableCollection where CodeUnits :
      RangeReplaceableCollection {
      // Satisfy protocol requirement
      mutating func replaceSubrange<C : Collection>(_: Range<Index>, with: C)
        where C.Element == Element

      // ... define high-level mutating string operations, e.g. replace ...
    }

The goal is that `Unicode` exposes the underlying encoding and code units in
such a way that for types with a known representation (e.g. a high-performance
`UTF8String`) that information can be known at compile-time and can be used to
generate a single path, while still allowing types like `String` that admit
multiple representations to use runtime queries and branches to fast path
specializations.

**Note:** `Unicode` would make a fantastic namespace for much of
what's in this proposal if we could get the ability to nest types and
protocols in protocols.

### Scanning, Matching, and Tokenization

#### Low-Level Textual Analysis

We should provide convenient APIs for processing strings by character. For example,
it should be easy to cleanly express, “if this string starts with `"f"`, process
the rest of the string as follows…” Swift is well-suited to expressing this
common pattern beautifully, but we need to add the APIs. Here are two examples
of the sort of code that might be possible given such APIs:

    if let firstLetter = input.droppingPrefix(alphabeticCharacter) {
      somethingWith(input) // process the rest of input
    }

    if let (number, restOfInput) = input.parsingPrefix(Int.self) {
      ...
    }

The specific spelling and functionality of APIs like this are TBD. The larger
point is to make sure matching-and-consuming jobs are well-supported.
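As an illustration, a minimal version of the first example can be sketched with today's standard library; `droppingPrefix` is the hypothetical name from the text, approximated here with a closure predicate in place of a character-class value:

```swift
// Hypothetical helper in the spirit of the proposed API (illustrative only).
extension Substring {
    /// If the first character satisfies `predicate`, return the remainder.
    func droppingPrefix(_ predicate: (Character) -> Bool) -> Substring? {
        guard let c = first, predicate(c) else { return nil }
        return dropFirst()
    }
}

let input = "f123"[...]  // work on a Substring to keep repeated slicing cheap
if let rest = input.droppingPrefix({ $0 == "f" }) {
    assert(String(rest) == "123")  // process the rest of input
}
```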

#### Unified Pattern Matcher Protocol

Many of the current methods that do matching are overloaded to do the same
logical operations in different ways, with the following axes:

- Logical Operation: `find`, `split`, `replace`, match at start
- Kind of pattern: `CharacterSet`, `String`, a regex, a closure
- Options, e.g. case/diacritic sensitivity, locale. Sometimes a part of
the method name, and sometimes an argument
- Whole string or subrange.

We should represent these aspects as orthogonal, composable components,
abstracting pattern matchers into a protocol like
[this one](https://github.com/apple/swift/blob/master/test/Prototypes/PatternMatching.swift#L33),
which would allow us to define logical operations once, without introducing
overloads, and massively reduce API surface area.

For example, using the strawman prefix `%` syntax to turn string literals into
patterns, the following pairs would all invoke the same generic methods:

    if let found = s.firstMatch(%"searchString") { ... }
    if let found = s.firstMatch(someRegex) { ... }

    for m in s.allMatches((%"searchString"), case: .insensitive) { ... }
    for m in s.allMatches(someRegex) { ... }

    let items = s.split(separatedBy: ", ")
    let tokens = s.split(separatedBy: CharacterSet.whitespace)

Note that, because Swift requires the indices of a slice to match the indices of
the range from which it was sliced, operations like `firstMatch` can return a
`Substring?` in lieu of a `Range<String.Index>?`: the indices of the match in
the string being searched, if needed, can easily be recovered as the
`startIndex` and `endIndex` of the `Substring`.

Note also that matching operations are useful for collections in general, and
would fall out of this proposal:

    // replace subsequences of contiguous NaNs with zero
    forces.replace(oneOrMore([Float.nan]), [0.0])

#### Regular Expressions

Addressing regular expressions is out of scope for this proposal.
That said, it is important to note that the pattern matching protocol mentioned
above provides a suitable foundation for regular expressions, and types such as
`NSRegularExpression` can easily be retrofitted to conform to it. In the
future, support for regular expression literals in the compiler could allow for
compile-time syntax checking and optimization.

### String Indices

`String` currently has four views—`characters`, `unicodeScalars`, `utf8`, and
`utf16`—each with its own opaque index type. The APIs used to translate indices
between views add needless complexity, and the opacity of indices makes them
difficult to serialize.

The index translation problem has two aspects:

1. `String` views cannot consume one another's indices without a cumbersome
   conversion step. An index into a `String`'s `characters` must be translated
   before it can be used as a position in its `unicodeScalars`. Although these
   translations are rarely needed, they add conceptual and API complexity.
2. Many APIs in the core libraries and other frameworks still expose `String`
   positions as `Int`s and regions as `NSRange`s, which can only reference a
   `utf16` view and interoperate poorly with `String` itself.

#### Index Interchange Among Views

String's need for flexible backing storage and reasonably-efficient indexing
(i.e. without dynamically allocating and reference-counting the indices
themselves) means indices need an efficient underlying storage type. Although
we do not wish to expose `String`'s indices *as* integers, `Int` offsets into
underlying code unit storage make a good underlying storage type, provided
`String`'s underlying storage supports random access. We think random-access
*code-unit storage* is a reasonable requirement to impose on all `String`
instances.

Making these `Int` code unit offsets conveniently accessible and constructible
solves the serialization problem:

    clipboard.write(s.endIndex.codeUnitOffset)
    let offset = clipboard.read(Int.self)
    let i = String.Index(codeUnitOffset: offset)

Index interchange between `String` and its `unicodeScalars`, `codeUnits`,
and [`extendedASCII`](#parsing-ascii-structure) views can be made entirely
seamless by having them share an index type (semantics of indexing a `String`
between grapheme cluster boundaries are TBD—it can either trap or be forgiving).
Having a common index allows easy traversal into the interior of graphemes,
something that is often needed, without making it likely that someone will do it
by accident.

- `String.index(after:)` should advance to the next grapheme, even when the
  index points partway through a grapheme.

- `String.index(before:)` should move to the start of the grapheme before
  the current position.

Seamless index interchange between `String` and its UTF-8 or UTF-16 views is not
crucial, as the specifics of encoding should not be a concern for most use
cases, and would impose needless costs on the indices of other views. That
said, we can make translation much more straightforward by exposing simple
bidirectional converting `init`s on both index types:

    let u8Position = String.UTF8.Index(someStringIndex)
    let originalPosition = String.Index(u8Position)

#### Index Interchange with Cocoa

We intend to address `NSRange`s that denote substrings in Cocoa APIs as
described [later in this document](#substrings--ranges-and-objective-c-interop).
That leaves the interchange of bare indices with Cocoa APIs trafficking in
`Int`. Hopefully such APIs will be rare, but when needed, the following
extension, which would be useful for all `Collections`, can help:

    extension Collection {
      func index(offset: IndexDistance) -> Index {
        return index(startIndex, offsetBy: offset)
      }
      func offset(of i: Index) -> IndexDistance {
        return distance(from: startIndex, to: i)
      }
    }

Then integers can easily be translated into offsets into a `String`'s `utf16`
view for consumption by Cocoa:

    let cocoaIndex = s.utf16.offset(of: String.UTF16Index(i))
    let swiftIndex = s.utf16.index(offset: cocoaIndex)

### Formatting

A full treatment of formatting is out of scope of this proposal, but
we believe it's crucial for completing the text processing picture. This
section details some of the existing issues and thinking that may guide future
development.

#### Printf-Style Formatting

`String.format` is designed on the `printf` model: it takes a format string with
textual placeholders for substitution, and an arbitrary list of other arguments.
The syntax and meaning of these placeholders has a long history in
C, but for anyone who doesn't use them regularly they are cryptic and complex,
as the `printf(3)` man page attests.

Aside from complexity, this style of API has two major problems: First, the
spelling of these placeholders must match up to the types of the arguments, in
the right order, or the behavior is undefined. Some limited support for
compile-time checking of this correspondence could be implemented, but only for
the cases where the format string is a literal. Second, there's no reasonable
way to extend the formatting vocabulary to cover the needs of new types: you are
stuck with what's in the box.

#### Foundation Formatters

The formatters supplied by Foundation are highly capable and versatile, offering
both formatting and parsing services. When used for formatting, though, the
design pattern demands more from users than it should:

* Matching the type of data being formatted to a formatter type
* Creating an instance of that type
* Setting stateful options (`currency`, `dateStyle`) on the type. Note: the
   need for this step prevents the instance from being used and discarded in
   the same expression where it is created.
* Overall, introduction of needless verbosity into source

These may seem like small issues, but the experience of Apple localization
experts is that the total drag of these factors on programmers is such that they
tend to reach for `String.format` instead.

#### String Interpolation

Swift string interpolation provides a user-friendly alternative to printf's
domain-specific language (just write ordinary Swift code!) and its type safety
problems (put the data right where it belongs!), but the following issues prevent
it from being useful for localized formatting (among other jobs):

* [SR-2303](https://bugs.swift.org/browse/SR-2303) We are unable to restrict
   types used in string interpolation.
* [SR-1260](https://bugs.swift.org/browse/SR-1260) String interpolation can't
   distinguish (fragments of) the base string from the string substitutions.

In the long run, we should improve Swift string interpolation to the point where
it can participate in most any formatting job. Mostly this centers around
fixing the interpolation protocols per the previous item, and supporting
localization.

To be able to use formatting effectively inside interpolations, it needs to be
both lightweight (because it all happens in-situ) and discoverable. One
approach would be to standardize on `format` methods, e.g.:

    "Column 1: \(n.format(radix:16, width:8)) *** \(message)"

    "Something with leading zeroes: \(x.format(fill: zero, width:8))"

### C String Interop

Our support for interoperation with nul-terminated C strings is scattered and
incoherent, with six ways to transform a C string into a `String` and four ways
to do the inverse. These APIs should be replaced with the following:

    extension String {
      /// Constructs a `String` having the same contents as `nulTerminatedUTF8`.
      ///
      /// - Parameter nulTerminatedUTF8: a sequence of contiguous UTF-8 encoded
      ///   bytes ending just before the first zero byte (NUL character).
      init(cString nulTerminatedUTF8: UnsafePointer<CChar>)

      /// Constructs a `String` having the same contents as `nulTerminatedCodeUnits`.
      ///
      /// - Parameter nulTerminatedCodeUnits: a sequence of contiguous code units in
      ///   the given `encoding`, ending just before the first zero code unit.
      /// - Parameter encoding: describes the encoding in which the code units
      ///   should be interpreted.
      init<Encoding: UnicodeEncoding>(
        cString nulTerminatedCodeUnits: UnsafePointer<Encoding.CodeUnit>,
        encoding: Encoding)

      /// Invokes the given closure on the contents of the string, represented as a
      /// pointer to a null-terminated sequence of UTF-8 code units.
      func withCString<Result>(
        _ body: (UnsafePointer<CChar>) throws -> Result) rethrows -> Result
    }

In both of the construction APIs, any invalid encoding sequence detected will
have its longest valid prefix replaced by U+FFFD, the Unicode replacement
character, per the Unicode specification. This covers the common case. The
replacement is done *physically* in the underlying storage, and the validity of
the result is recorded in the `String`'s `encoding` so that future accesses need
not be slowed down by separate error repair.

Construction that is aborted when encoding errors are detected can be
accomplished using APIs on the `encoding`. String types that retain their
physical encoding even in the presence of errors and are repaired on-the-fly can
be built as different instances of the `Unicode` protocol.

### Unicode 9 Conformance

Unicode 9 (and MacOS 10.11) brought us support for family emoji, which changes
the process of properly identifying `Character` boundaries. We need to update
`String` to account for this change.
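Concretely, under Unicode 9 segmentation a family emoji joined by zero-width joiners is a single `Character`; a sketch of the intended behavior after the update:

```swift
// 👩‍👩‍👧 — three people joined by U+200D ZERO WIDTH JOINER.
let family = "\u{1F469}\u{200D}\u{1F469}\u{200D}\u{1F467}"
assert(family.unicodeScalars.count == 5)  // scalars, including the joiners
assert(family.count == 1)                 // one grapheme under Unicode 9 rules
```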

### High-Performance String Processing

Many strings are short enough to store in 64 bits, many can be stored using only
8 bits per Unicode scalar, others are best encoded in UTF-16, and some come to
us already in some other encoding, such as UTF-8, that would be costly to
translate. Supporting these formats while maintaining usability for
general-purpose APIs demands that a single `String` type can be backed by many
different representations.

That said, the highest performance code always requires static knowledge of the
data structures on which it operates, and for this code, dynamic selection of
representation comes at too high a cost. Heavy-duty text processing demands a
way to opt out of dynamism and directly use known encodings. Having this
ability can also make it easy to cleanly specialize code that handles dynamic
cases for maximal efficiency on the most common representations.

To address this need, we can build models of the `Unicode` protocol that encode
representation information into the type, such as `NFCNormalizedUTF16String`.

### Parsing ASCII Structure

Although many machine-readable formats support the inclusion of arbitrary
Unicode text, it is also common that their fundamental structure lies entirely
within the ASCII subset (JSON, YAML, many XML formats). These formats are often
processed most efficiently by recognizing ASCII structural elements as ASCII,
and capturing the arbitrary sections between them in more-general strings. The
current String API offers no way to efficiently recognize ASCII and skip past
everything else without the overhead of full decoding into unicode scalars.

For these purposes, strings should supply an `extendedASCII` view that is a
collection of `UInt32`, where values less than `0x80` represent the
corresponding ASCII character, and other values represent data that is specific
to the underlying encoding of the string.

## Language Support

This proposal depends on two new features in the Swift language:

1. **Generic subscripts**, to
  enable unified slicing syntax.

2. **A subtype relationship** between
  `Substring` and `String`, enabling framework APIs to traffic solely in
  `String` while still making it possible to avoid copies by handling
  `Substring`s where necessary.

Additionally, **the ability to nest types and protocols inside
protocols** could significantly shrink the footprint of this proposal
on the top-level Swift namespace.

## Open Questions

### Must `String` be limited to storing UTF-16 subset encodings?

- The ability to handle `UTF-8`-encoded strings (models of `Unicode`) is not in
question here; this is about what encodings must be storable, without
transcoding, in the common currency type called “`String`”.
- ASCII, Latin-1, UCS-2, and UTF-16 are UTF-16 subsets. UTF-8 is not.
- If we have a way to get at a `String`'s code units, we need a concrete type in
which to express them in the API of `String`, which is a concrete type.
- If `String` needs to be able to represent UTF-32, presumably the code units need
to be `UInt32`.
- Not supporting UTF-32-encoded text seems like one reasonable design choice.
- Maybe we can allow UTF-8 storage in `String` and expose its code units as
`UInt16`, just as we would for Latin-1.
- Supporting only UTF-16-subset encodings would imply that `String` indices can
be serialized without recording the `String`'s underlying encoding.

### Do we need a type-erasable base protocol for UnicodeEncoding?

UnicodeEncoding has an associated type, but it may be important to be able to
traffic in completely dynamic encoding values, e.g. for “tell me the most
efficient encoding for this string.”

### Should there be a string “facade?”

One possible design alternative makes `Unicode` a vehicle for expressing
the storage and encoding of code units, but does not attempt to give it an API
appropriate for `String`. Instead, string APIs would be provided by a generic
wrapper around an instance of `Unicode`:

struct StringFacade<U: Unicode> : BidirectionalCollection {

 // ...APIs for high-level string processing here...

 var unicode: U // access to lower-level unicode details
}

typealias String = StringFacade<StringStorage>
typealias Substring = StringFacade<StringStorage.SubSequence>

This design would allow us to de-emphasize lower-level `String` APIs such as
access to the specific encoding, by putting them behind a `.unicode` property.
A similar effect in a facade-less design would require a new top-level
`StringProtocol` playing the role of the facade with an `associatedtype
Storage : Unicode`.

An interesting variation on this design is possible if defaulted generic
parameters are introduced to the language:

struct String<U: Unicode = StringStorage> 
 : BidirectionalCollection {

 // ...APIs for high-level string processing here...

 var unicode: U // access to lower-level unicode details
}

typealias Substring = String<StringStorage.SubSequence>

One advantage of such a design is that naïve users will always extend “the right
type” (`String`) without thinking, and the new APIs will show up on `Substring`,
`MyUTF8String`, etc. That said, it also has downsides that should not be
overlooked, not least of which is the confusability of the meaning of the word
“string.” Is it referring to the generic or the concrete type?

### `TextOutputStream` and `TextOutputStreamable`

`TextOutputStreamable` is intended to provide a vehicle for
efficiently transporting formatted representations to an output stream
without forcing the allocation of storage. Its use of `String`, a
type with multiple representations, at the lowest-level unit of
communication, conflicts with this goal. It might be sufficient to
change `TextOutputStream` and `TextOutputStreamable` to traffic in an
associated type conforming to `Unicode`, but that is not yet clear.
This area will require some design work.

### `description` and `debugDescription`

* Should these be creating localized or non-localized representations?
* Is returning a `String` efficient enough?
* Is `debugDescription` pulling the weight of the API surface area it adds?

### `StaticString`

`StaticString` was added as a byproduct of standard library development and kept
around because it seemed useful, but it was never truly *designed* for client
programmers. We need to decide what happens with it. Presumably *something*
should fill its role, and that should conform to `Unicode`.

## Footnotes

<b id="f0">0</b> The integers rewrite currently underway is expected to
   substantially reduce the scope of `Int`'s API by using more
   generics. [:leftwards_arrow_with_hook:](#a0)

<b id="f1">1</b> In practice, these semantics will usually be tied to the
version of the installed [ICU](http://icu-project.org) library, which
programmatically encodes the most complex rules of the Unicode Standard and its
de-facto extension, CLDR.[:leftwards_arrow_with_hook:](#a1)

<b id="f2">2</b>
See
[UAX #29: Unicode Text Segmentation](http://unicode.org/reports/tr29/). Note
that inserting Unicode scalar values to prevent merging of grapheme clusters would
also constitute a kind of misbehavior (one of the clusters at the boundary would
not be found in the result), so would be relatively costly to implement, with
little benefit. [:leftwards_arrow_with_hook:](#a2)

<b id="f4">4</b> The use of non-UCA-compliant ordering is fully sanctioned by
the Unicode standard for this purpose. In fact there's
a [whole chapter](http://www.unicode.org/versions/Unicode9.0.0/ch05.pdf)
dedicated to it. In particular, §5.17 says:

When comparing text that is visible to end users, a correct linguistic sort
should be used, as described in _Section 5.16, Sorting and
Searching_. However, in many circumstances the only requirement is for a
fast, well-defined ordering. In such cases, a binary ordering can be used.

[:leftwards_arrow_with_hook:](#a4)

<b id="f5">5</b> The queries supported by `NSCharacterSet` map directly onto
properties in a table that's indexed by unicode scalar value. This table is
part of the Unicode standard. Some of these queries (e.g., “is this an
uppercase character?”) may have fairly obvious generalizations to grapheme
clusters, but exactly how to do it is a research topic and *ideally* we'd either
establish the existing practice that the Unicode committee would standardize, or
the Unicode committee would do the research and we'd implement their
result.[:leftwards_arrow_with_hook:](#a5)

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution

We could have a pair of helper functions to search for the grapheme-cluster boundary relative to a given CodeUnits.Index:

/// Returns the index at the start of the grapheme-cluster containing the given code-unit.
func indexOfCharacterBoundary(at i: CodeUnits.Index) -> CodeUnits.Index

/// Returns the index at the start of the grapheme-cluster following the given code-unit.
func indexOfCharacterBoundary(after i: CodeUnits.Index) -> CodeUnits.Index

Actually, if we do forgiving conversion when sharing indexes between String views, it might be nice to expose these explicit index-adjusting functions anyway.
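The helpers above are hypothetical, but today's Swift can already detect (though not snap to) a mid-cluster position, which hints at why explicit adjusting functions would be useful. A minimal illustration using the existing `samePosition(in:)` API:

```swift
// "égalité" in fully decomposed form: each accented letter is
// "e" followed by U+0301 COMBINING ACUTE ACCENT.
let s = "e\u{0301}galite\u{0301}"

// Advance one position in the unicodeScalars view: this lands on the
// combining accent, *inside* the first grapheme cluster.
let scalarIndex = s.unicodeScalars.index(after: s.unicodeScalars.startIndex)

// A mid-cluster scalar index has no corresponding Character-level
// position, so samePosition(in:) returns nil.
print(scalarIndex.samePosition(in: s) == nil) // true
```

The proposed `indexOfCharacterBoundary(at:)` would instead round such a position down to the enclosing cluster's start rather than failing.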

···

On 23 Jan 2017, at 06:54, Félix Cloutier via swift-evolution <swift-evolution@swift.org> wrote:

doesn't necessarily mean that ignoring that case is the right thing to do. In fact, it means that Unicode won't do anything to protect programs against these, and if Swift doesn't, chances are that no one will. Isolated combining characters break a number of expectations that developers could reasonably have:

(a + b).count == a.count + b.count
(a + b).startsWith(a)
(a + b).endsWith(b)
(a + b).find(a) // or .find(b)

Of course, this can be documented, but people want easy, and documentation is hard.

Yes. Unfortunately they also want the ability to append a string consisting of a combining character to another string and have it append. And they don't want to be prevented from forming valid-but-defective Unicode strings.

[…]

Can you suggest an alternative that doesn't violate the Unicode standard and supports the expected use-cases, somehow?

I'm not sure I understand. Did we go from "this is a degenerate/defective <https://github.com/apple/swift/blob/master/docs/StringManifesto.md#string-should-be-a-collection-of-characters-again> case that we shouldn't bother with" to "this is a supported use case that needs to work as-is"? I've never seen anyone start a string with a combining character on purpose, though I'm familiar with just one natural language that needs combining characters. I can imagine that it could be a convenient feature in other natural languages.

However, if Swift Strings are now designed for machine processing and less for human language convenience, for me, it's easy enough to justify a safe default in the context of machine processing: `a+b` will not combine the end of `a` with the start of `b`. You could do this by inserting a ◌ that `b` could combine with if necessary. That solution would make half of the cases that I've mentioned work as expected and make the operation always safe, as far as I can tell.

In that world, it would be a good idea to have a `&+` fallback or something like that that will let characters combine. I would think that this is a much less common use case than serializing strings, though.

My second concern is with how easy it is to convert an Int to a String index. I've been vocal about this before: I'm concerned that Swift developers will equate Ints to random-access String iterators, which they are emphatically not. String.Index(100) is proposed as a constant-time operation

No, that has not been proposed. It would be

String.Index(codeUnitOffset: 100)

It's hard to strike a balance between keeping programmers from making mistakes and making the important use-cases easy. Do you have any suggestions for improving on what we've proposed?

That's still one extension away from String.Index(100), and one function away from an even more convenient form. I don't have a great solution, but I don't have a great understanding of the problem that this is solving either. I'm leaving it here because, AFAIK, Swift 3 imposes constraints that are hard to ignore and mostly beneficial to people outside of the English bubble, and it seems that the proposed index regresses on this.

I'm perfectly happy with interchanging indices between the different views of a String. It's getting the offset in or out of the index that I think lets people do incorrect assumptions about strings.

Taken from NSHipster <http://nshipster.com/nsregularexpression/>:
Happily, on one thing we can all agree. In NSRegularExpression, Cocoa has the most long-winded and byzantine regular expression interface you’re ever likely to come across.

There is no way to achieve the goal of being better at string processing than Perl without regular expressions being addressed. It just should not be ignored.

We’re certainly not ignoring the importance of regexes. But if there’s a key takeaway from your experiences with NSRegularExpression, it’s that a good regex implementation matters, a lot. That’s why we don’t want to rush one in alongside the rest of the overhaul of String. Instead, we should take our time to make it really great, and building on a solid foundation of a good String API that’s already in place should help ensure that.

I do think that there's some danger to focusing too narrowly on regular expressions as they appear in languages today. I think the industry has largely moved on to fully-structured formats that require proper parsing beyond what traditional regexes can handle. The decades of experience with Perl shows that making regexes too easy to use without an easy ramp up to more sophisticated string processing leads to people cutting corners trying to make regex-based designs kind-of work. The Perl 6 folks recognized this and developed their "regular expression" support into something that supported arbitrary grammars; I think we'd do well to start at that level by looking at what they've done.

Big +1 to this, which is why I fully support deferring this to the future. Let’s wait until we can devote the attention required to do it right.

···

On Jan 23, 2017, at 4:27 PM, Joe Groff via swift-evolution <swift-evolution@swift.org> wrote:

On Jan 23, 2017, at 2:06 PM, Ben Cohen via swift-evolution <swift-evolution@swift.org> wrote:

On Jan 23, 2017, at 7:49 AM, Joshua Alvarado <alvaradojoshua0@gmail.com> wrote:

-Joe


doesn't necessarily mean that ignoring that case is the right thing to do. In fact, it means that Unicode won't do anything to protect programs against these, and if Swift doesn't, chances are that no one will. Isolated combining characters break a number of expectations that developers could reasonably have:

(a + b).count == a.count + b.count
(a + b).startsWith(a)
(a + b).endsWith(b)
(a + b).find(a) // or .find(b)

Of course, this can be documented, but people want easy, and documentation is hard.

Yes. Unfortunately they also want the ability to append a string consisting of a combining character to another string and have it append. And they don't want to be prevented from forming valid-but-defective Unicode strings.

[…]

Can you suggest an alternative that doesn't violate the Unicode standard and supports the expected use-cases, somehow?

I'm not sure I understand. Did we go from "this is a degenerate/defective case that we shouldn't bother with" to "this is a supported use case that needs to work as-is"?

No. The Unicode standard says it's a valid string, so we shouldn't prohibit it. The standard also says it's a corner case for which it isn't worth making heroic efforts to create sensible semantics. It's totally in keeping with the Unicode standards that we treat it as proposed.

In a domain as complex as String processing, we need a guiding star, and that star is the Unicode standard. I'm very reluctant to do anything that clashes with the spirit of the standard.

I've never seen anyone start a string with a combining character on purpose,

It will occur as a byproduct of the process of attaching a diacritic to a base character.

though I'm familiar with just one natural language that needs combining characters. I can imagine that it could be a convenient feature in other natural languages.

However, if Swift Strings are now designed for machine processing and less for human language convenience, for me, it's easy enough to justify a safe default in the context of machine processing: `a+b` will not combine the end of `a` with the start of `b`. You could do this by inserting a ◌ that `b` could combine with if necessary.

You can do it, but it trades one semantic problem for a usability problem, without solving all the semantic problems: you end up with a.count + b.count == (a+b).count, sure, but you still don't satisfy the usual law of collections that (a+b).contains(b.first!) if b is non-empty, and now you've made it difficult to attach diacritics to base characters.

That solution would make half of the cases that I've mentioned work as expected and make the operation always safe, as far as I can tell.

In that world, it would be a good idea to have a `&+` fallback or something like that that will let characters combine. I would think that this is a much less common use case than serializing strings, though.

My second concern is with how easy it is to convert an Int to a String index. I've been vocal about this before: I'm concerned that Swift developers will equate Ints to random-access String iterators, which they are emphatically not. String.Index(100) is proposed as a constant-time operation

No, that has not been proposed. It would be

String.Index(codeUnitOffset: 100)

It's hard to strike a balance between keeping programmers from making mistakes and making the important use-cases easy. Do you have any suggestions for improving on what we've proposed?

That's still one extension away from String.Index(100), and one function away from an even more convenient form.

There's nothing we can do to prevent programmers from making inefficient things look efficient, and it never has to take more than a single function or extension, and we *do* need to be able to serialize and deserialize string indices.

I don't have a great solution, but I don't have a great understanding of the problem that this is solving either. I'm leaving it here because, AFAIK, Swift 3 imposes constraints that are hard to ignore and mostly beneficial to people outside of the English bubble, and it seems that the proposed index regresses on this.

I'm perfectly happy with interchanging indices between the different views of a String. It's getting the offset in or out of the index that I think lets people do incorrect assumptions about strings.

There's nothing we can do to prevent people getting that offset:

   let n = s.codeUnits.distance(from: s.codeUnits.startIndex, to: p)
   let p2 = s.codeUnits.index(s.codeUnits.startIndex, offsetBy: n)
   assert(p == p2)

As you say, these are only one function or extension away from being convenient.

For the record, I'm not a great fan of the extendedASCII view either. I think that the problem that extendedASCII wants to solve is also solved by better pattern-matching, and the proposal lays a foundation for it.

extendedASCII is an essential part of that foundation. When you know your pattern is entirely ASCII—which is very common—you can take advantage of extendedASCII to make pattern matching against general Unicode both correct and efficient. If you don't do something like this, pattern matching will never have efficiency competitive with hand-written parsers, and people will continue to use Array<CChar> instead of String/Unicode in order to get efficiency.

···

Sent from my iPad

On Jan 22, 2017, at 9:54 PM, Félix Cloutier <felixcca@yahoo.ca> wrote:

Mixing pretend-ASCII and Unicode is what gets you in the kind of trouble that I described in my first message.

Félix

Le 19 janv. 2017 à 18:56, Ben Cohen via swift-evolution <swift-evolution@swift.org> a écrit :

Hi all,

Below is our take on a design manifesto for Strings in Swift 4 and beyond.

Probably best read in rendered markdown on GitHub:
https://github.com/apple/swift/blob/master/docs/StringManifesto.md

We’re eager to hear everyone’s thoughts.

Regards,
Ben and Dave

# String Processing For Swift 4

* Authors: [Dave Abrahams](https://github.com/dabrahams), [Ben Cohen](https://github.com/airspeedswift)

The goal of re-evaluating Strings for Swift 4 has been fairly ill-defined thus
far, with just this short blurb in the
[list of goals](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160725/025676.html):

**String re-evaluation**: String is one of the most important fundamental
types in the language. The standard library leads have numerous ideas of how
to improve the programming model for it, without jeopardizing the goals of
providing a unicode-correct-by-default model. Our goal is to be better at
string processing than Perl!

For Swift 4 and beyond we want to improve three dimensions of text processing:

1. Ergonomics
2. Correctness
3. Performance

This document is meant to both provide a sense of the long-term vision
(including undecided issues and possible approaches), and to define the scope of
work that could be done in the Swift 4 timeframe.

## General Principles

### Ergonomics

It's worth noting that ergonomics and correctness are mutually-reinforcing. An
API that is easy to use—but incorrectly—cannot be considered an ergonomic
success. Conversely, an API that's simply hard to use is also hard to use
correctly. Achieving optimal performance without compromising ergonomics or
correctness is a greater challenge.

Consistency with the Swift language and idioms is also important for
ergonomics. There are several places both in the standard library and in the
foundation additions to `String` where patterns and practices found elsewhere
could be applied to improve usability and familiarity.

### API Surface Area

Primary data types such as `String` should have APIs that are easily understood
given a signature and a one-line summary. Today, `String` fails that test. As
you can see, the Standard Library and Foundation both contribute significantly to
its overall complexity.

**Method Arity** | **Standard Library** | **Foundation**
---|:---:|:---:
0: `ƒ()` | 5 | 7
1: `ƒ(:)` | 19 | 48
2: `ƒ(::)` | 13 | 19
3: `ƒ(:::)` | 5 | 11
4: `ƒ(::::)` | 1 | 7
5: `ƒ(:::::)` | - | 2
6: `ƒ(::::::)` | - | 1

**API Kind** | **Standard Library** | **Foundation**
---|:---:|:---:
`init` | 41 | 18
`func` | 42 | 55
`subscript` | 9 | 0
`var` | 26 | 14

**Total: 205 APIs**

By contrast, `Int` has 80 APIs, none with more than two parameters.[0] String processing is complex enough; users shouldn't have
to press through physical API sprawl just to get started.

Many of the choices detailed below contribute to solving this problem,
including:

* Restoring `Collection` conformance and dropping the `.characters` view.
* Providing a more general, composable slicing syntax.
* Altering `Comparable` so that parameterized
   (e.g. case-insensitive) comparison fits smoothly into the basic syntax.
* Clearly separating language-dependent operations on text produced
   by and for humans from language-independent
   operations on text produced by and for machine processing.
* Relocating APIs that fall outside the domain of basic string processing and
   discouraging the proliferation of ad-hoc extensions.

### Batteries Included

While `String` is available to all programs out-of-the-box, crucial APIs for
basic string processing tasks are still inaccessible until `Foundation` is
imported. While it makes sense that `Foundation` is needed for domain-specific
jobs such as
[linguistic tagging](https://developer.apple.com/reference/foundation/nslinguistictagger),
one should not need to import anything to, for example, do case-insensitive
comparison.

### Unicode Compliance and Platform Support

The Unicode standard provides a crucial objective reference point for what
constitutes correct behavior in an extremely complex domain, so
Unicode-correctness is, and will remain, a fundamental design principle behind
Swift's `String`. That said, the Unicode standard is an evolving document, so
this objective reference-point is not fixed.[1] While
many of the most important operations—e.g. string hashing, equality, and
non-localized comparison—will be stable, the semantics
of others, such as grapheme breaking and localized comparison and case
conversion, are expected to change as platforms are updated, so programs should
be written so their correctness does not depend on precise stability of these
semantics across OS versions or platforms. Although it may be possible to
imagine static and/or dynamic analysis tools that will help users find such
errors, the only sure way to deal with this fact of life is to educate users.

## Design Points

### Internationalization

There is strong evidence that developers cannot determine how to use
internationalization APIs correctly. Although documentation could and should be
improved, the sheer size, complexity, and diversity of these APIs is a major
contributor to the problem, causing novices to tune out, and more experienced
programmers to make avoidable mistakes.

The first step in improving this situation is to regularize all localized
operations as invocations of normal string operations with extra
parameters. Among other things, this means:

1. Doing away with `localizedXXX` methods
2. Providing a terse way to name the current locale as a parameter
3. Automatically adjusting defaults for options such
  as case sensitivity based on whether the operation is localized.
4. Removing correctness traps like `localizedCaseInsensitiveCompare` (see
   guidance in the
   [Internationalization and Localization Guide](https://developer.apple.com/library/content/documentation/MacOSX/Conceptual/BPInternational/InternationalizingYourCode/InternationalizingYourCode.html)).

Along with appropriate documentation updates, these changes will make localized
operations more teachable, comprehensible, and approachable, thereby lowering a
barrier that currently leads some developers to ignore localization issues
altogether.

#### The Default Behavior of `String`

Although this isn't well-known, the most accessible form of many operations on
Swift `String` (and `NSString`) are really only appropriate for text that is
intended to be processed for, and consumed by, machines. The semantics of the
operations with the simplest spellings are always non-localized and
language-agnostic.

Two major factors play into this design choice:

1. Machine processing of text is important, so we should have first-class,
  accessible functions appropriate to that use case.

2. The most general localized operations require a locale parameter not required
  by their un-localized counterparts. This naturally skews complexity towards
  localized operations.

Reaffirming that `String`'s simplest APIs have
language-independent/machine-processed semantics has the benefit of clarifying
the proper default behavior of operations such as comparison, and allows us to
make [significant optimizations](#collation-semantics) that were previously
thought to conflict with Unicode.

#### Future Directions

One of the most common internationalization errors is the unintentional
presentation to users of text that has not been localized, but regularizing APIs
and improving documentation can go only so far in preventing this error.
Combined with the fact that `String` operations are non-localized by default,
the environment for processing human-readable text may still be somewhat
error-prone in Swift 4.

For an audience of mostly non-experts, it is especially important that naïve
code is very likely to be correct if it compiles, and that more sophisticated
issues can be revealed progressively. For this reason, we intend to
specifically and separately target localization and internationalization
problems in the Swift 5 timeframe.

### Operations With Options

There are three categories of common string operation that commonly need to be
tuned in various dimensions:

**Operation**|**Applicable Options**
---|---
sort ordering | locale, case/diacritic/width-insensitivity
case conversion | locale
pattern matching | locale, case/diacritic/width-insensitivity

The defaults for case-, diacritic-, and width-insensitivity are different for
localized operations than for non-localized operations, so for example a
localized sort should be case-insensitive by default, and a non-localized sort
should be case-sensitive by default. We propose a standard “language” of
defaulted parameters to be used for these purposes, with usage roughly like this:

 x.compared(to: y, case: .sensitive, in: swissGerman)

 x.lowercased(in: .currentLocale)

 x.allMatches(
   somePattern, case: .insensitive, diacritic: .insensitive)

This usage might be supported by code like this:

enum StringSensitivity {
  case sensitive
  case insensitive
}

extension Locale {
 static var currentLocale: Locale { ... }
}

extension Unicode {
 // An example of the option language in declaration context,
 // with nil defaults indicating unspecified, so defaults can be
 // driven by the presence/absence of a specific Locale
 func frobnicated(
   case caseSensitivity: StringSensitivity? = nil,
   diacritic diacriticSensitivity: StringSensitivity? = nil,
   width widthSensitivity: StringSensitivity? = nil,
   in locale: Locale? = nil
 ) -> Self { ... }
}

### Comparing and Hashing Strings

#### Collation Semantics

What Unicode says about collation—which is used in `<`, `==`, and hashing—turns
out to be quite interesting, once you pick it apart. The full Unicode Collation
Algorithm (UCA) works like this:

1. Fully normalize both strings
2. Convert each string to a sequence of numeric triples to form a collation key
3. “Flatten” the key by concatenating the sequence of first elements to the
  sequence of second elements to the sequence of third elements
4. Lexicographically compare the flattened keys

While step 1 can usually
be [done quickly](http://unicode.org/reports/tr15/) and
incrementally, step 2 uses a collation table that maps matching *sequences* of
unicode scalars in the normalized string to *sequences* of triples, which get
accumulated into a collation key. Predictably, this is where the real costs
lie.

*However*, there are some bright spots to this story. First, as it turns out,
string sorting (localized or not) should be done down to what's called
the
[“identical” level](http://unicode.org/reports/tr10/),
which adds a step 3a: append the string's normalized form to the flattened
collation key. At first blush this just adds work, but consider what it does
for equality: two strings that normalize the same, naturally, will collate the
same. But also, *strings that normalize differently will always collate
differently*. In other words, for equality, it is sufficient to compare the
strings' normalized forms and see if they are the same. We can therefore
entirely skip the expensive part of collation for equality comparison.

Next, naturally, anything that applies to equality also applies to hashing: it
is sufficient to hash the string's normalized form, bypassing collation keys.
This should provide significant speedups over the current implementation.
Perhaps more importantly, since comparison down to the “identical” level applies
even to localized strings, it means that hashing and equality can be implemented
exactly the same way for localized and non-localized text, and hash tables with
localized keys will remain valid across current-locale changes.

Finally, once it is agreed that the *default* role for `String` is to handle
machine-generated and machine-readable text, the default ordering of `String`s
need no longer use the UCA at all. It is sufficient to order them in any way
that's consistent with equality, so `String` ordering can simply be a
lexicographical comparison of normalized forms[4]
(which is equivalent to lexicographically comparing the sequences of grapheme
clusters), again bypassing step 2 and offering another speedup.

This leaves us executing the full UCA *only* for localized sorting, and ICU's
implementation has apparently been very well optimized.

Following this scheme everywhere would also allow us to make sorting behavior
consistent across platforms. Currently, we sort `String` according to the UCA,
except that—*only on Apple platforms*—pairs of ASCII characters are ordered by
unicode scalar value.

#### Syntax

Because the current `Comparable` protocol expresses all comparisons with binary
operators, string comparisons—which may require
additional [options](#operations-with-options)—do not fit smoothly into the
existing syntax. At the same time, we'd like to solve other problems with
comparison, as outlined
in
[this proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e)
(implemented by changes at the head
of
[this branch](https://github.com/CodaFi/swift/commits/space-the-final-frontier)).
We should adopt a modification of that proposal that uses a method rather than
an operator `<=>`:

    enum SortOrder { case before, same, after }

    protocol Comparable : Equatable {
      func compared(to: Self) -> SortOrder
      ...
    }

This change will give us a syntactic platform on which to implement methods with
additional, defaulted arguments, thereby unifying and regularizing comparison
across the library.

    extension String {
      func compared(to: Self) -> SortOrder
    }
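With defaulted option arguments layered on top, call sites might read like the
following sketch. None of this API exists yet; the option labels are borrowed
from the `compared(to:case:diacritic:width:in:)` signature sketched in the
`Unicode` protocol section below.

```swift
// Hypothetical usage of the proposed comparison API:
if name.compared(to: other, case: .insensitive) == .same {
  // matched, ignoring case but respecting diacritics
}

switch lhs.compared(to: rhs, in: Locale.current) {
case .before: print("lhs sorts first under the current locale")
case .same:   print("equal under the current locale")
case .after:  print("rhs sorts first under the current locale")
}
```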

**Note:** `SortOrder` should bridge to `NSComparisonResult`. It's also possible
that the standard library simply adopts Foundation's `ComparisonResult` as is,
but we believe the community should at least consider alternate naming before
that happens. There will be an opportunity to discuss the choices in detail
when the modified
[Comparison Proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e) comes
up for review.

### `String` should be a `Collection` of `Character`s Again

In Swift 2.0, `String`'s `Collection` conformance was dropped, because we
convinced ourselves that its semantics differed from those of `Collection` too
significantly.

It was always well understood that if strings were treated as sequences of
`UnicodeScalar`s, algorithms such as `lexicographicalCompare`, `elementsEqual`,
and `reversed` would produce nonsense results. Thus, in Swift 1.0, `String` was
a collection of `Character` (extended grapheme clusters). During 2.0
development, though, we realized that correct string concatenation could
occasionally merge distinct grapheme clusters at the start and end of combined
strings.

This quirk aside, every aspect of strings-as-collections-of-graphemes appears to
comport perfectly with Unicode. We think the concatenation problem is tolerable,
because the cases where it occurs all represent partially-formed constructs. The
largest class—isolated combining characters such as ◌́ (U+0301 COMBINING ACUTE
ACCENT)—are explicitly called out in the Unicode standard as
“[degenerate](http://www.unicode.org/reports/tr29/)” or
“[defective](http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf)”. The other
cases—such as a string ending in a zero-width joiner or half of a regional
indicator—appear to be equally transient and unlikely outside of a text editor.

Admitting these cases encourages exploration of grapheme composition and is
consistent with what appears to be an overall Unicode philosophy that “no
special provisions are made to get marginally better behavior for… cases that
never occur in practice.”[2] Furthermore, it seems
unlikely to disturb the semantics of any plausible algorithms. We can handle
these cases by documenting them, explicitly stating that the elements of a
`String` are an emergent property based on Unicode rules.
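The grapheme-cluster element model gives intuitive results even for complex
content. Assuming the restored `Collection` conformance proposed here:

```swift
let flag = "\u{1F1FA}\u{1F1F8}"  // 🇺🇸: two regional-indicator scalars
flag.count                       // 1: a single Character (extended grapheme cluster)
flag.unicodeScalars.count        // 2

// reversed() operates on graphemes, keeping accents attached to base letters:
String("résumé".reversed())      // "émusér"
```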

The benefits of restoring `Collection` conformance are substantial:

* Collection-like operations encourage experimentation with strings to
   investigate and understand their behavior. This is useful for teaching new
   programmers, but also good for experienced programmers who want to
   understand more about strings/unicode.

* Extended grapheme clusters form a natural element boundary for Unicode
   strings. For example, searching and matching operations will always produce
   results that line up on grapheme cluster boundaries.

* Character-by-character processing is a legitimate thing to do in many real
   use-cases, including parsing, pattern matching, and language-specific
   transformations such as transliteration.

* `Collection` conformance makes a wide variety of powerful operations
   available that are appropriate to `String`'s default role as the vehicle for
   machine processed text.

   The methods `String` would inherit from `Collection`, where they parallel
   higher-level string algorithms, have the right semantics. For example,
   grapheme-wise `lexicographicalCompare`, `elementsEqual`, and application of
   `flatMap` with case-conversion, produce the same results one would expect
   from whole-string ordering comparison, equality comparison, and
   case-conversion, respectively. `reverse` operates correctly on graphemes,
   keeping diacritics moored to their base characters and leaving emoji intact.
   Other methods such as `indexOf` and `contains` make obvious sense. A few
   `Collection` methods, like `min` and `max`, may not be particularly useful
   on `String`, but we don't consider that to be a problem worth solving, in
   the same way that we wouldn't try to suppress `min` and `max` on a
   `Set([UInt8])` that was used to store IP addresses.

* Many of the higher-level operations that we want to provide for `String`s,
   such as parsing and pattern matching, should apply to any `Collection`, and
   many of the benefits we want for `Collections`, such
   as unified slicing, should accrue
   equally to `String`. Making `String` part of the same protocol hierarchy
   allows us to write these operations once and not worry about keeping the
   benefits in sync.

* Slicing strings into substrings is a crucial part of the vocabulary of
   string processing, and all other sliceable things are `Collection`s.
   Because of its collection-like behavior, users naturally think of `String`
   in collection terms, but run into frustrating limitations where it fails to
   conform and are left to wonder where all the differences lie. Many simply
   “correct” this limitation by declaring a trivial conformance:

      extension String : BidirectionalCollection {}

   Even if we removed indexing-by-element from `String`, users could still do
   this:

     extension String : BidirectionalCollection {
       subscript(i: Index) -> Character { return characters[i] }
     }

   It would be much better to legitimize the conformance to `Collection` and
   simply document the oddity of any concatenation corner-cases, than to deny
   users the benefits on the grounds that a few cases are confusing.
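Concretely, the inherited algorithms agree with whole-string semantics, and the
degenerate cases remain well-defined rather than erroneous. Assuming the
restored conformance:

```swift
let a = "caf\u{E9}"    // "café", precomposed é
let b = "cafe\u{301}"  // "café", decomposed é

// Grapheme-wise elementsEqual matches whole-string equality:
a.elementsEqual(b)     // true

// An isolated combining accent is a "defective" but valid string:
"\u{301}".count        // 1: one Character
```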

Note that the fact that `String` is a collection of graphemes does *not* mean
that string operations will necessarily have to do grapheme boundary
recognition. See the Unicode protocol section for details.

### `Character` and `CharacterSet`

`Character`, which represents a
Unicode
[extended grapheme cluster](http://www.unicode.org/reports/tr29/),
is a bit of a black box, requiring conversion to `String` in order to
do any introspection, including interoperation with ASCII. To fix this, we should:

- Add a `unicodeScalars` view much like `String`'s, so that the sub-structure
  of grapheme clusters is discoverable.
- Add a failable `init` from sequences of scalars (returning nil for sequences
  that contain 0 or 2+ graphemes).
- (Lower priority) expose some operations, such as `func uppercase() ->
  String`, `var isASCII: Bool`, and, to the extent they can be sensibly
  generalized, queries of unicode properties that should also be exposed on
  `UnicodeScalar` such as `isAlphabetic` and `isGraphemeBase`.
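A sketch of what this could look like. The `unicodeScalars` view, the
scalar-sequence initializer, and the property names here are proposed, not
existing, API:

```swift
// Hypothetical API, per the bullets above:
let c: Character = "é"
Array(c.unicodeScalars)               // inspect the scalar sub-structure

Character("e\u{301}".unicodeScalars)  // one grapheme: succeeds
Character("ab".unicodeScalars)        // two graphemes: returns nil
c.isASCII                             // false (lower-priority addition)
```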

Despite its name, `CharacterSet` currently operates on the Swift `UnicodeScalar`
type. This means it is usable on `String`, but only by going through the Unicode
scalar view. To deal with this clash in the short term, `CharacterSet` should be
renamed to `UnicodeScalarSet`. In the longer term, it may be appropriate to
introduce a `CharacterSet` that provides similar functionality for extended
grapheme clusters.[5]

### Unification of Slicing Operations

Creating substrings is a basic part of String processing, but the slicing
operations that we have in Swift are inconsistent in both their spelling and
their naming:

* Slices with two explicit endpoints are done with subscript, and support
   in-place mutation:

       s[i..<j].mutate()

* Slicing from an index to the end, or from the start to an index, is done
   with a method and does not support in-place mutation:

       s.prefix(upTo: i).readOnly()

Prefix and suffix operations should be migrated to be subscripting operations
with one-sided ranges i.e. `s.prefix(upTo: i)` should become `s[..<i]`, as
in
[this proposal](https://github.com/apple/swift-evolution/blob/9cf2685293108ea3efcbebb7ee6a8618b83d4a90/proposals/0132-sequence-end-ops.md).
With generic subscripting in the language, that will allow us to collapse a wide
variety of methods and subscript overloads into a single implementation, and
give users an easy-to-use and composable way to describe subranges.
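With one-sided ranges in place, the migration would look like this (behavior as
eventually shipped in Swift 4):

```swift
let s = "Hello, world"
let i = s.index(s.startIndex, offsetBy: 5)

s[..<i]  // "Hello", replacing s.prefix(upTo: i)
s[i...]  // ", world", replacing s.suffix(from: i)
```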

Further extending this EDSL to integrate use-cases like `s.prefix(maxLength: 5)`
is an ongoing research project that can be considered part of the potential
long-term vision of text (and collection) processing.

### Substrings

When implementing substring slicing, languages are faced with three options:

1. Make the substrings the same type as string, and share storage.
2. Make the substrings the same type as string, and copy storage when making the substring.
3. Make substrings a different type, with a storage copy on conversion to string.

We think number 3 is the best choice. A walk-through of the tradeoffs follows.

#### Same type, shared storage

In Swift 3.0, slicing a `String` produces a new `String` that is a view into a
subrange of the original `String`'s storage. This is why `String` is 3 words in
size (the start, length and buffer owner), unlike the similar `Array` type
which is only one.

This is a simple model with big efficiency gains when chopping up strings into
multiple smaller strings. But it does mean that a stored substring keeps the
entire original string buffer alive even after it would normally have been
released.

This arrangement has proven to be problematic in other programming languages,
because applications sometimes extract small strings from large ones and keep
those small strings long-term. That is considered a memory leak and was enough
of a problem in Java that they changed from substrings sharing storage to
making a copy in 1.7.

#### Same type, copied storage

Copying of substrings is also the choice made in C#, and in the default
`NSString` implementation. This approach avoids the memory leak issue, but has
obvious performance overhead in performing the copies.

This in turn encourages trafficking in string/range pairs instead of in
substrings, for performance reasons, leading to API challenges. For example:

    foo.compare(bar, range: start..<end)

Here, it is not clear whether `range` applies to `foo` or `bar`. This
relationship is better expressed in Swift as a slicing operation:

    foo[start..<end].compare(bar)

Not only does this clarify to which string the range applies, it also brings
this sub-range capability to any API that operates on `String` "for free". So
these other combinations also work equally well:

    // apply range on argument rather than target
    foo.compare(bar[start..<end])
    // apply range on both
    foo[start..<end].compare(bar[start1..<end1])
    // compare two strings ignoring first character
    foo.dropFirst().compare(bar.dropFirst())

In all three cases, an explicit range argument need not appear on the `compare`
method itself. The implementation of `compare` does not need to know anything
about ranges. Methods need only take range arguments when that was an
integral part of their purpose (for example, setting the start and end of a
user's current selection in a text box).

#### Different type, shared storage

The desire to share underlying storage while preventing accidental memory leaks
occurs with slices of `Array`. For this reason we have an `ArraySlice` type.
The inconvenience of a separate type is mitigated by most operations used on
`Array` from the standard library being generic over `Sequence` or `Collection`.

We should apply the same approach for `String` by introducing a distinct
`SubSequence` type, `Substring`. Similar advice given for `ArraySlice` would apply to `Substring`:

> **Important:** Long-term storage of `Substring` instances is discouraged. A
> substring holds a reference to the entire storage of a larger string, not
> just to the portion it presents, even after the original string's lifetime
> ends. Long-term storage of a `Substring` may therefore prolong the lifetime
> of large strings that are no longer otherwise accessible, which can appear
> to be memory leakage.

When assigning a `Substring` to a longer-lived variable (usually a stored
property) explicitly of type `String`, a type conversion will be performed, and
at this point the substring buffer is copied and the original string's storage
can be released.
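The intended usage pattern, illustrated with `Substring` and `split` as they
later shipped in Swift 4:

```swift
let message = "substring lifetime demo"

// `word` is a Substring: it shares message's storage, no copy is made.
let word: Substring = message.split(separator: " ")[0]

// Converting to String copies just "substring"; message's buffer
// is no longer kept alive by `kept`.
let kept: String = String(word)
```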

A `String` that was not its own `Substring` could be one word—a single tagged
pointer—without requiring additional allocations. `Substring`s would be a view
onto a `String`, so are 3 words: a pointer to the owner, a pointer to the
start, and a length. The small string optimization for `Substring` would take
advantage of the larger size, probably with a less compressed encoding for
speed.

The downside of having two types is the inconvenience of sometimes having a
`Substring` when you need a `String`, and vice-versa. It is likely this would
be a significantly bigger problem than with `Array` and `ArraySlice`, as
slicing of `String` is such a common operation. It is especially relevant to
existing code that assumes `String` is the currency type. To ease the pain of
type mismatches, `Substring` should be a subtype of `String` in the same way
that `Int` is a subtype of `Optional<Int>`. This would give users an implicit
conversion from `Substring` to `String`, as well as the usual implicit
conversions such as `[Substring]` to `[String]` that other subtype
relationships receive.

In most cases, type inference combined with the subtype relationship should
make the type difference a non-issue and users will not care which type they
are using. For flexibility and optimizability, most operations from the
standard library will traffic in generic models of
[`Unicode`](#the--code-unicode--code--protocol).

##### Guidance for API Designers

In this model, **if a user is unsure about which type to use, `String` is always
a reasonable default**. A `Substring` passed where `String` is expected will be
implicitly copied. When compared to the “same type, copied storage” model, we
have effectively deferred the cost of copying from the point where a substring
is created until it must be converted to `String` for use with an API.

A user who needs to optimize away copies altogether should use this guideline:
if for performance reasons you are tempted to add a `Range` argument to your
method as well as a `String` to avoid unnecessary copies, you should instead
use `Substring`.

##### The “Empty Subscript”

To make it easy to call such an optimized API when you only have a `String` (or
to call any API that takes a `Collection`'s `SubSequence` when all you have is
the `Collection`), we propose the following “empty subscript” operation,

    extension Collection {
      subscript() -> SubSequence {
        return self[startIndex..<endIndex]
      }
    }

which allows the following usage:

    funcThatIsJustLooking(at: person.name[]) // pass person.name as Substring

The `[]` syntax can be offered as a fixit when needed, similar to `&` for an
`inout` argument. While it doesn't help a user to convert `[String]` to
`[Substring]`, the need for such conversions is extremely rare, and they can be
done with a simple `map` (which could also be offered by a fixit):

    takesAnArrayOfSubstring(arrayOfString.map { $0[] })

#### Other Options Considered

As we have seen, all three options above have downsides, but it's possible
these downsides could be eliminated/mitigated by the compiler. We are proposing
one such mitigation—implicit conversion—as part of the "different type,
shared storage" option, to help avoid the cognitive load on developers of
having to deal with a separate `Substring` type.

To avoid the memory leak issues of a "same type, shared storage" substring
option, we considered whether the compiler could perform an implicit copy of
the underlying storage when it detects the string is being "stored" for long
term usage, say when it is assigned to a stored property. The trouble with this
approach is it is very difficult for the compiler to distinguish between
long-term storage versus short-term in the case of abstractions that rely on
stored properties. For example, should the storing of a substring inside an
`Optional` be considered long-term? Or the storing of multiple substrings
inside an array? The latter would not work well in the case of a
`components(separatedBy:)` implementation that intended to return an array of
substrings. It would also be difficult to distinguish intentional medium-term
storage of substrings, say by a lexer. There does not appear to be an effective
consistent rule that could be applied in the general case for detecting when a
substring is truly being stored long-term.

To avoid the cost of copying substrings under "same type, copied storage", the
optimizer could be enhanced to reduce the impact of some of those copies.
For example, this code could be optimized to pull the invariant substring out
of the loop:

    for _ in 0..<lots {
      someFunc(takingString: bigString[bigRange])
    }

It's worth noting that a similar optimization is needed to avoid an equivalent
problem with implicit conversion in the "different type, shared storage" case:

    let substring = bigString[bigRange]
    for _ in 0..<lots { someFunc(takingString: substring) }

However, in the case of "same type, copied storage" there are many use cases
that cannot be optimized as easily. Consider the following simple definition of
a recursive `contains` algorithm, which when substring slicing is linear makes
the overall algorithm quadratic:

    extension String {
      func containsChar(_ x: Character) -> Bool {
        return !isEmpty && (first == x || dropFirst().containsChar(x))
      }
    }

Expecting the optimizer to eliminate this problem is unrealistic, so the user
is forced to rewrite the code without string slicing to make it efficient
(assuming they remember to):

    extension String {
      // add optional argument tracking progress through the string
      func containsCharacter(_ x: Character, atOrAfter idx: Index? = nil) -> Bool {
        let idx = idx ?? startIndex
        return idx != endIndex
          && (self[idx] == x || containsCharacter(x, atOrAfter: index(after: idx)))
      }
    }

#### Substrings, Ranges and Objective-C Interop

The pattern of passing a string/range pair is common in several Objective-C
APIs, and is made especially awkward in Swift by the non-interchangeability of
`Range<String.Index>` and `NSRange`.

    s2.find(s2, sourceRange: NSRange(j..<s2.endIndex, in: s2))

In general, however, the Swift idiom for operating on a sub-range of a
`Collection` is to *slice* the collection and operate on that:

    s2.find(s2[j..<s2.endIndex])

Therefore, APIs that operate on an `NSString`/`NSRange` pair should be imported
without the `NSRange` argument. The Objective-C importer should be changed to
give these APIs special treatment so that when a `Substring` is passed, instead
of being converted to a `String`, the full `NSString` and range are passed to
the Objective-C method, thereby avoiding a copy.

As a result, you would never need to pass an `NSRange` to these APIs, which
solves the impedance problem by eliminating the argument, resulting in more
idiomatic Swift code while retaining the performance benefit. To help users
manually handle any cases that remain, Foundation should be augmented to allow
the following syntax for converting to and from `NSRange`:

    let nsr = NSRange(i..<j, in: s) // An NSRange corresponding to s[i..<j]
    let iToJ = Range(nsr, in: s)    // Equivalent to i..<j

### The `Unicode` protocol

With `Substring` and `String` being distinct types and sharing almost all
interface and semantics, and with the highest-performance string processing
requiring knowledge of encoding and layout that the currency types can't
provide, it becomes important to capture the common “string API” in a protocol.
Since Unicode conformance is a key feature of string processing in Swift, we
call that protocol `Unicode`:

**Note:** The following assumes several features that are planned but not yet implemented in
Swift, and should be considered a sketch rather than a final design.

    protocol Unicode
      : Comparable, BidirectionalCollection where Element == Character {

      associatedtype Encoding : UnicodeEncoding
      var encoding: Encoding { get }

      associatedtype CodeUnits
        : RandomAccessCollection where Element == Encoding.CodeUnit
      var codeUnits: CodeUnits { get }

      associatedtype UnicodeScalars
        : BidirectionalCollection where Element == UnicodeScalar
      var unicodeScalars: UnicodeScalars { get }

      associatedtype ExtendedASCII
        : BidirectionalCollection where Element == UInt32
      var extendedASCII: ExtendedASCII { get }
    }

    extension Unicode {
      // ... define high-level non-mutating string operations, e.g. search ...

      func compared<Other: Unicode>(
        to rhs: Other,
        case caseSensitivity: StringSensitivity? = nil,
        diacritic diacriticSensitivity: StringSensitivity? = nil,
        width widthSensitivity: StringSensitivity? = nil,
        in locale: Locale? = nil
      ) -> SortOrder { ... }
    }

    extension Unicode : RangeReplaceableCollection where CodeUnits :
      RangeReplaceableCollection {
      // Satisfy protocol requirement
      mutating func replaceSubrange<C : Collection>(_: Range<Index>, with: C)
        where C.Element == Element

      // ... define high-level mutating string operations, e.g. replace ...
    }

The goal is that `Unicode` exposes the underlying encoding and code units in
such a way that for types with a known representation (e.g. a high-performance
`UTF8String`) that information can be known at compile-time and can be used to
generate a single path, while still allowing types like `String` that admit
multiple representations to use runtime queries and branches to fast path
specializations.

**Note:** `Unicode` would make a fantastic namespace for much of
what's in this proposal if we could get the ability to nest types and
protocols in protocols.

### Scanning, Matching, and Tokenization

#### Low-Level Textual Analysis

We should provide convenient APIs for processing strings by character. For example,
it should be easy to cleanly express, “if this string starts with `"f"`, process
the rest of the string as follows…” Swift is well-suited to expressing this
common pattern beautifully, but we need to add the APIs. Here are two examples
of the sort of code that might be possible given such APIs:

    if let firstLetter = input.droppingPrefix(alphabeticCharacter) {
      somethingWith(input) // process the rest of input
    }

    if let (number, restOfInput) = input.parsingPrefix(Int.self) {
      ...
    }

The specific spelling and functionality of APIs like this are TBD. The larger
point is to make sure matching-and-consuming jobs are well-supported.

#### Unified Pattern Matcher Protocol

Many of the current methods that do matching are overloaded to do the same
logical operations in different ways, with the following axes:

- Logical Operation: `find`, `split`, `replace`, match at start
- Kind of pattern: `CharacterSet`, `String`, a regex, a closure
- Options, e.g. case/diacritic sensitivity, locale. Sometimes a part of
the method name, and sometimes an argument
- Whole string or subrange.

We should represent these aspects as orthogonal, composable components,
abstracting pattern matchers into a protocol like
[this one](https://github.com/apple/swift/blob/master/test/Prototypes/PatternMatching.swift#L33),
that can allow us to define logical operations once, without introducing
overloads, and massively reducing API surface area.

For example, using the strawman prefix `%` syntax to turn string literals into
patterns, the following pairs would all invoke the same generic methods:

    if let found = s.firstMatch(%"searchString") { ... }
    if let found = s.firstMatch(someRegex) { ... }

    for m in s.allMatches((%"searchString"), case: .insensitive) { ... }
    for m in s.allMatches(someRegex) { ... }

    let items = s.split(separatedBy: ", ")
    let tokens = s.split(separatedBy: CharacterSet.whitespace)

Note that, because Swift requires the indices of a slice to match the indices of
the range from which it was sliced, operations like `firstMatch` can return a
`Substring?` in lieu of a `Range<String.Index>?`: the indices of the match in
the string being searched, if needed, can easily be recovered as the
`startIndex` and `endIndex` of the `Substring`.
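This index-sharing guarantee is already visible with today's slicing
operations (Swift 4 semantics):

```swift
let s = "one two three"

// A Substring's indices are positions in the base string:
let two = s.split(separator: " ")[1]  // "two"
s[two.startIndex..<two.endIndex]      // "two": the match's range, applied to s
```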

Note also that matching operations are useful for collections in general, and
would fall out of this proposal:

    // replace subsequences of contiguous NaNs with zero
    forces.replace(oneOrMore([Float.nan]), [0.0])

#### Regular Expressions

Addressing regular expressions is out of scope for this proposal.
That said, it is important to note that the pattern matching protocol mentioned
above provides a suitable foundation for regular expressions, and types such as
`NSRegularExpression` can easily be retrofitted to conform to it. In the
future, support for regular expression literals in the compiler could allow for
compile-time syntax checking and optimization.

### String Indices

`String` currently has four views—`characters`, `unicodeScalars`, `utf8`, and
`utf16`—each with its own opaque index type. The APIs used to translate indices
between views add needless complexity, and the opacity of indices makes them
difficult to serialize.

The index translation problem has two aspects:

1. `String` views cannot consume one another's indices without a cumbersome
   conversion step. An index into a `String`'s `characters` must be translated
   before it can be used as a position in its `unicodeScalars`. Although these
   translations are rarely needed, they add conceptual and API complexity.
2. Many APIs in the core libraries and other frameworks still expose `String`
   positions as `Int`s and regions as `NSRange`s, which can only reference a
   `utf16` view and interoperate poorly with `String` itself.

#### Index Interchange Among Views

String's need for flexible backing storage and reasonably-efficient indexing
(i.e. without dynamically allocating and reference-counting the indices
themselves) means indices need an efficient underlying storage type. Although
we do not wish to expose `String`'s indices *as* integers, `Int` offsets into
underlying code unit storage makes a good underlying storage type, provided
`String`'s underlying storage supports random-access. We think random-access
*code-unit storage* is a reasonable requirement to impose on all `String`
instances.

Making these `Int` code unit offsets conveniently accessible and constructible
solves the serialization problem:

    clipboard.write(s.endIndex.codeUnitOffset)
    let offset = clipboard.read(Int.self)
    let i = String.Index(codeUnitOffset: offset)

Index interchange between `String` and its `unicodeScalars`, `codeUnits`,
and [`extendedASCII`](#parsing-ascii-structure) views can be made entirely
seamless by having them share an index type (semantics of indexing a `String`
between grapheme cluster boundaries are TBD—it can either trap or be forgiving).
Having a common index allows easy traversal into the interior of graphemes,
something that is often needed, without making it likely that someone will do it
by accident.

- `String.index(after:)` should advance to the next grapheme, even when the
  index points partway through a grapheme.

- `String.index(before:)` should move to the start of the grapheme before
  the current position.

Seamless index interchange between `String` and its UTF-8 or UTF-16 views is not
crucial, as the specifics of encoding should not be a concern for most use
cases, and would impose needless costs on the indices of other views. That
said, we can make translation much more straightforward by exposing simple
bidirectional converting `init`s on both index types:

    let u8Position = String.UTF8.Index(someStringIndex)
    let originalPosition = String.Index(u8Position)

#### Index Interchange with Cocoa

We intend to address `NSRange`s that denote substrings in Cocoa APIs as
described [later in this document](#substrings--ranges-and-objective-c-interop).
That leaves the interchange of bare indices with Cocoa APIs trafficking in
`Int`. Hopefully such APIs will be rare, but when needed, the following
extension, which would be useful for all `Collections`, can help:

    extension Collection {
      func index(offset: IndexDistance) -> Index {
        return index(startIndex, offsetBy: offset)
      }
      func offset(of i: Index) -> IndexDistance {
        return distance(from: startIndex, to: i)
      }
    }

Then integers can easily be translated into offsets into a `String`'s `utf16`
view for consumption by Cocoa:

    let cocoaIndex = s.utf16.offset(of: String.UTF16Index(i))
    let swiftIndex = s.utf16.index(offset: cocoaIndex)

### Formatting

A full treatment of formatting is out of scope of this proposal, but
we believe it's crucial for completing the text processing picture. This
section details some of the existing issues and thinking that may guide future
development.

#### Printf-Style Formatting

`String.format` is designed on the `printf` model: it takes a format string with
textual placeholders for substitution, and an arbitrary list of other arguments.
The syntax and meaning of these placeholders has a long history in
C, but for anyone who doesn't use them regularly they are cryptic and complex,
as the `printf(3)` man page attests.
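For reference, the Foundation API in question. Note that nothing ties the
placeholders to the types or order of the arguments at compile time:

```swift
import Foundation

// Each placeholder must match its argument's type and position,
// or the behavior is undefined:
String(format: "One cookie: $%.2f, %d cookies: $%.2f", 2.25, 4, 9.0)
// "One cookie: $2.25, 4 cookies: $9.00"

String(format: "%05d", 42)  // "00042"
```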

Aside from complexity, this style of API has two major problems: First, the
spelling of these placeholders must match up to the types of the arguments, in
the right order, or the behavior is undefined. Some limited support for
compile-time checking of this correspondence could be implemented, but only for
the cases where the format string is a literal. Second, there's no reasonable
way to extend the formatting vocabulary to cover the needs of new types: you are
stuck with what's in the box.

#### Foundation Formatters

The formatters supplied by Foundation are highly capable and versatile, offering
both formatting and parsing services. When used for formatting, though, the
design pattern demands more from users than it should:

* Matching the type of data being formatted to a formatter type
* Creating an instance of that type
* Setting stateful options (`currency`, `dateStyle`) on the type. Note: the
   need for this step prevents the instance from being used and discarded in
   the same expression where it is created.
* Overall, introduction of needless verbosity into source

These may seem like small issues, but the experience of Apple localization
experts is that the total drag of these factors on programmers is such that they
tend to reach for `String.format` instead.

#### String Interpolation

Swift string interpolation provides a user-friendly alternative to printf's
domain-specific language (just write ordinary swift code!) and its type safety
problems (put the data right where it belongs!) but the following issues prevent
it from being useful for localized formatting (among other jobs):

* [SR-2303](https://bugs.swift.org/browse/SR-2303) We are unable to restrict
   types used in string interpolation.
* [SR-1260](https://bugs.swift.org/browse/SR-1260) String interpolation can't
   distinguish (fragments of) the base string from the string substitutions.

In the long run, we should improve Swift string interpolation to the point where
it can participate in almost any formatting job. Mostly this centers on fixing
the interpolation protocols per the previous item and on supporting
localization.

To be able to use formatting effectively inside interpolations, it needs to be
both lightweight (because it all happens in-situ) and discoverable. One
approach would be to standardize on `format` methods, e.g.:

```swift
"Column 1: \(n.format(radix: 16, width: 8)) *** \(message)"

"Something with leading zeroes: \(x.format(fill: zero, width: 8))"
```
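
These `format` methods don't exist yet; here is a minimal sketch of what the
integer case could look like, built only on today's standard library (the
method name, parameters, and defaults are placeholders):

```swift
extension Int {
  /// Hypothetical formatting hook: radix, minimum width, and fill character.
  func format(radix: Int = 10, width: Int = 0, fill: Character = " ") -> String {
    let digits = String(self, radix: radix)
    let padding = max(0, width - digits.count)
    return String(repeating: String(fill), count: padding) + digits
  }
}

let n = 255
let line = "Column 1: \(n.format(radix: 16, width: 8, fill: "0"))"
// line == "Column 1: 000000ff"
```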

### C String Interop

Our support for interoperation with nul-terminated C strings is scattered and
incoherent, with six ways to transform a C string into a `String` and four ways
to do the inverse. These APIs should be replaced with the following:

```swift
extension String {
  /// Constructs a `String` having the same contents as `nulTerminatedUTF8`.
  ///
  /// - Parameter nulTerminatedUTF8: a sequence of contiguous UTF-8 encoded
  ///   bytes ending just before the first zero byte (NUL character).
  init(cString nulTerminatedUTF8: UnsafePointer<CChar>)

  /// Constructs a `String` having the same contents as `nulTerminatedCodeUnits`.
  ///
  /// - Parameter nulTerminatedCodeUnits: a sequence of contiguous code units in
  ///   the given `encoding`, ending just before the first zero code unit.
  /// - Parameter encoding: describes the encoding in which the code units
  ///   should be interpreted.
  init<Encoding: UnicodeEncoding>(
    cString nulTerminatedCodeUnits: UnsafePointer<Encoding.CodeUnit>,
    encoding: Encoding)

  /// Invokes the given closure on the contents of the string, represented as a
  /// pointer to a nul-terminated sequence of UTF-8 code units.
  func withCString<Result>(
    _ body: (UnsafePointer<CChar>) throws -> Result) rethrows -> Result
}
```

In both of the construction APIs, any invalid encoding sequence detected will
have its longest valid prefix replaced by U+FFFD, the Unicode replacement
character, per the Unicode specification. This covers the common case. The
replacement is done *physically* in the underlying storage, and the validity of
the result is recorded in the `String`'s `encoding` so that future accesses
need not be slowed down by separate error repair.
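
Today's `String(cString:)` already repairs ill-formed UTF-8 this way, e.g. (a
sketch):

```swift
// 0xFF can never appear in well-formed UTF-8, so the initializer replaces
// it with U+FFFD rather than failing.
let bytes: [CChar] = [72, 105, CChar(bitPattern: 0xFF), 0]  // "Hi", bad byte, NUL
let repaired = bytes.withUnsafeBufferPointer { String(cString: $0.baseAddress!) }
// repaired == "Hi\u{FFFD}"
```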

Construction that is aborted when encoding errors are detected can be
accomplished using APIs on the `encoding`. String types that retain their
physical encoding even in the presence of errors and are repaired on-the-fly can
be built as different instances of the `Unicode` protocol.

### Unicode 9 Conformance

Unicode 9 (and macOS 10.11) brought us support for family emoji, which changes
the process of properly identifying `Character` boundaries. We need to update
`String` to account for this change.
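
For example, a family emoji is several Unicode scalars joined by zero-width
joiners, and under Unicode 9 segmentation rules the whole sequence forms a
single grapheme cluster (a sketch):

```swift
// 👨‍👩‍👧 is MAN, ZWJ, WOMAN, ZWJ, GIRL: five scalars, one visible glyph.
let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}"
let scalarCount = family.unicodeScalars.count  // 5
// Pre-Unicode-9 grapheme rules split the sequence into several
// `Character`s; with updated rules it should count as exactly one.
```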

### High-Performance String Processing

Many strings are short enough to store in 64 bits, many can be stored using only
8 bits per Unicode scalar, others are best encoded in UTF-16, and some come to
us already in some other encoding, such as UTF-8, that would be costly to
translate. Supporting these formats while maintaining usability for
general-purpose APIs demands that a single `String` type can be backed by many
different representations.

That said, the highest performance code always requires static knowledge of the
data structures on which it operates, and for this code, dynamic selection of
representation comes at too high a cost. Heavy-duty text processing demands a
way to opt out of dynamism and directly use known encodings. Having this
ability can also make it easy to cleanly specialize code that handles dynamic
cases for maximal efficiency on the most common representations.

To address this need, we can build models of the `Unicode` protocol that encode
representation information into the type, such as `NFCNormalizedUTF16String`.

### Parsing ASCII Structure

Although many machine-readable formats support the inclusion of arbitrary
Unicode text, it is also common that their fundamental structure lies entirely
within the ASCII subset (JSON, YAML, many XML formats). These formats are often
processed most efficiently by recognizing ASCII structural elements as ASCII,
and capturing the arbitrary sections between them in more-general strings. The
current String API offers no way to efficiently recognize ASCII and skip past
everything else without the overhead of full decoding into Unicode scalars.

For these purposes, strings should supply an `extendedASCII` view that is a
collection of `UInt32`, where values less than `0x80` represent the
corresponding ASCII character, and other values represent data that is specific
to the underlying encoding of the string.
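
Until such a view exists, the idea can be approximated with the `utf8` view,
since code units below `0x80` are guaranteed to be genuine ASCII in UTF-8 (a
sketch; a real parser would also skip structural bytes inside quoted strings):

```swift
let json = "{\"name\": \"café\", \"id\": 42}"
var structure: [UnicodeScalar] = []
for byte in json.utf8 where byte < 0x80 {
  // Only ASCII bytes are inspected; multi-byte sequences (é) pass through.
  switch byte {
  case UInt8(ascii: "{"), UInt8(ascii: "}"), UInt8(ascii: ":"), UInt8(ascii: ","):
    structure.append(UnicodeScalar(byte))
  default:
    break
  }
}
// structure, rendered as a string, is "{:,:}"
```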

## Language Support

This proposal depends on two new features in the Swift language:

1. **Generic subscripts**, to
  enable unified slicing syntax.

2. **A subtype relationship** between
  `Substring` and `String`, enabling framework APIs to traffic solely in
  `String` while still making it possible to avoid copies by handling
  `Substring`s where necessary.

Additionally, **the ability to nest types and protocols inside
protocols** could significantly shrink the footprint of this proposal
on the top-level Swift namespace.

## Open Questions

### Must `String` be limited to storing UTF-16 subset encodings?

- The ability to handle `UTF-8`-encoded strings (models of `Unicode`) is not in
question here; this is about what encodings must be storable, without
transcoding, in the common currency type called “`String`”.
- ASCII, Latin-1, UCS-2, and UTF-16 are UTF-16 subsets. UTF-8 is not.
- If we have a way to get at a `String`'s code units, we need a concrete type in
which to express them in the API of `String`, which is a concrete type.
- If `String` needs to be able to represent UTF-32, presumably the code units need
to be `UInt32`.
- Not supporting UTF-32-encoded text seems like one reasonable design choice.
- Maybe we can allow UTF-8 storage in `String` and expose its code units as
`UInt16`, just as we would for Latin-1.
- Supporting only UTF-16-subset encodings would imply that `String` indices can
be serialized without recording the `String`'s underlying encoding.

### Do we need a type-erasable base protocol for UnicodeEncoding?

UnicodeEncoding has an associated type, but it may be important to be able to
traffic in completely dynamic encoding values, e.g. for “tell me the most
efficient encoding for this string.”

### Should there be a string “facade?”

One possible design alternative makes `Unicode` a vehicle for expressing
the storage and encoding of code units, but does not attempt to give it an API
appropriate for `String`. Instead, string APIs would be provided by a generic
wrapper around an instance of `Unicode`:

```swift
struct StringFacade<U: Unicode> : BidirectionalCollection {

  // ...APIs for high-level string processing here...

  var unicode: U // access to lower-level unicode details
}

typealias String = StringFacade<StringStorage>
typealias Substring = StringFacade<StringStorage.SubSequence>
```

This design would allow us to de-emphasize lower-level `String` APIs such as
access to the specific encoding, by putting them behind a `.unicode` property.
A similar effect in a facade-less design would require a new top-level
`StringProtocol` playing the role of the facade with an `associatedtype
Storage : Unicode`.

An interesting variation on this design is possible if defaulted generic
parameters are introduced to the language:

```swift
struct String<U: Unicode = StringStorage>
  : BidirectionalCollection {

  // ...APIs for high-level string processing here...

  var unicode: U // access to lower-level unicode details
}

typealias Substring = String<StringStorage.SubSequence>
```

One advantage of such a design is that naïve users will always extend “the right
type” (`String`) without thinking, and the new APIs will show up on `Substring`,
`MyUTF8String`, etc. That said, it also has downsides that should not be
overlooked, not least of which is the confusability of the meaning of the word
“string.” Is it referring to the generic or the concrete type?

### `TextOutputStream` and `TextOutputStreamable`

`TextOutputStreamable` is intended to provide a vehicle for
efficiently transporting formatted representations to an output stream
without forcing the allocation of storage. Its use of `String`, a
type with multiple representations, as the lowest-level unit of
communication, conflicts with this goal. It might be sufficient to
change `TextOutputStream` and `TextOutputStreamable` to traffic in an
associated type conforming to `Unicode`, but that is not yet clear.
This area will require some design work.

### `description` and `debugDescription`

* Should these be creating localized or non-localized representations?
* Is returning a `String` efficient enough?
* Is `debugDescription` pulling the weight of the API surface area it adds?

### `StaticString`

`StaticString` was added as a byproduct of standard library development and kept
around because it seemed useful, but it was never truly *designed* for client
programmers. We need to decide what happens with it. Presumably *something*
should fill its role, and that should conform to `Unicode`.

## Footnotes

<b id="f0">0</b> The integers rewrite currently underway is expected to
   substantially reduce the scope of `Int`'s API by using more
   generics. [:leftwards_arrow_with_hook:](#a0)

<b id="f1">1</b> In practice, these semantics will usually be tied to the
version of the installed [ICU](http://icu-project.org) library, which
programmatically encodes the most complex rules of the Unicode Standard and its
de-facto extension, CLDR.[:leftwards_arrow_with_hook:](#a1)

<b id="f2">2</b>
See
[UAX #29: Unicode Text Segmentation](http://www.unicode.org/reports/tr29/). Note
that inserting Unicode scalar values to prevent merging of grapheme clusters would
also constitute a kind of misbehavior (one of the clusters at the boundary would
not be found in the result), so would be relatively costly to implement, with
little benefit. [:leftwards_arrow_with_hook:](#a2)

<b id="f4">4</b> The use of non-UCA-compliant ordering is fully sanctioned by
the Unicode standard for this purpose. In fact there's
a [whole chapter](http://www.unicode.org/versions/Unicode9.0.0/ch05.pdf)
dedicated to it. In particular, §5.17 says:

> When comparing text that is visible to end users, a correct linguistic sort
> should be used, as described in _Section 5.16, Sorting and
> Searching_. However, in many circumstances the only requirement is for a
> fast, well-defined ordering. In such cases, a binary ordering can be used.

[:leftwards_arrow_with_hook:](#a4)

<b id="f5">5</b> The queries supported by `NSCharacterSet` map directly onto
properties in a table that's indexed by Unicode scalar value. This table is
part of the Unicode standard. Some of these queries (e.g., “is this an
uppercase character?”) may have fairly obvious generalizations to grapheme
clusters, but exactly how to do it is a research topic and *ideally* we'd either
establish the existing practice that the Unicode committee would standardize, or
the Unicode committee would do the research and we'd implement their
result.[:leftwards_arrow_with_hook:](#a5)

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution

I fully agree. I think we could learn something from Perl 6 grammars. As PCREs are to languages without regex, Perl 6 grammars are to languages with PCREs.

A lot of really crappy user interfaces and bad tools come down to half-assed parsers; maybe we can do better? (Another argument against rushing it).

Russ

···

On Jan 23, 2017, at 2:27 PM, Joe Groff via swift-evolution <swift-evolution@swift.org> wrote:

On Jan 23, 2017, at 2:06 PM, Ben Cohen via swift-evolution <swift-evolution@swift.org <mailto:swift-evolution@swift.org>> wrote:

On Jan 23, 2017, at 7:49 AM, Joshua Alvarado <alvaradojoshua0@gmail.com <mailto:alvaradojoshua0@gmail.com>> wrote:

Taken from NSHipster <http://nshipster.com/nsregularexpression/>:
Happily, on one thing we can all agree. In NSRegularExpression, Cocoa has the most long-winded and byzantine regular expression interface you’re ever likely to come across.

There is no way to achieve the goal of being better at string processing than Perl without regular expressions being addressed. It just should not be ignored.

We’re certainly not ignoring the importance of regexes. But if there’s a key takeaway from your experiences with NSRegularExpression, it’s that a good regex implementation matters, a lot. That’s why we don’t want to rush one in alongside the rest of the overhaul of String. Instead, we should take our time to make it really great, and building on a solid foundation of a good String API that’s already in place should help ensure that.

I do think that there's some danger to focusing too narrowly on regular expressions as they appear in languages today. I think the industry has largely moved on to fully-structured formats that require proper parsing beyond what traditional regexes can handle. The decades of experience with Perl shows that making regexes too easy to use without an easy ramp up to more sophisticated string processing leads to people cutting corners trying to make regex-based designs kind-of work. The Perl 6 folks recognized this and developed their "regular expression" support into something that supported arbitrary grammars; I think we'd do well to start at that level by looking at what they've done.

-Joe

SQL has the `collate` keyword:

  -- sort users by email, case insensitive
  select * from users order by email collate nocase
  -- look for a specific email, in a case insensitive way
  select * from users where email = 'foo@example.com' collate nocase

It is used as a decorator that modifies an existing sql snippet (a sort descriptor first, and a comparison last)

When designing an SQL builder in Swift, I chose the `nameColumn.collating(.nocase)` approach, because it allowed a common Swift syntax for both use cases:

  // sort users by email, case insensitive
  User.order(nameColumn.collating(.nocase))
  // look for a specific email, in a case insensitive way
  User.filter(nameColumn.collating(.nocase) == "foo@example.com")

Yes, it comes with extra operators so that nonsensical comparison are avoided.

But it just works.

Gwendal

···

On Jan 24, 2017, at 4:31 AM, Brent Royal-Gordon via swift-evolution <swift-evolution@swift.org> wrote:

The operands and sense of the comparison are kind of lost in all this garbage. You really want to see `foo < bar` in this code somewhere, but you don't.

Yeah, we thought about trying to build a DSL for that, but failed. I think the best possible option would be something like:

foo.comparison(case: .insensitive, locale: .current) < bar

The biggest problem is that you can build things like

   fu = foo.comparison(case: .insensitive, locale: .current)
   br = bar.comparison(case: .sensitive)
   fu < br // what does this mean?

We could even prevent such nonsense from compiling, but the cost in library API surface area is quite large.

Is it? I think we're talking, for each category of operation that can be localized like this:

* One type to carry an operand and its options.
* One method to construct this type.
* One alternate version of each operator which accepts an operand+options parameter. (I'm thinking it should always be the right-hand side, so the long stuff ends up at the end; Larry Wall noted this follows an "end-weight principle" in natural languages.)

I suspect that most solutions will at least require some sort of overload on the comparison operators, so this may be as parsimonious as we can get.

Thanks for all the hard work!

Still digesting, but I definitely support the goal of string processing even better than Perl.

Some random thoughts:

• I also like the suggestion of implicit conversion from substring

slices to strings based on a subtype relationship, since I keep
running into that issue when trying to use array slices.

Interesting. Could you offer some examples?

Nothing catastrophic. Mainly just having to wrap all of my slices in
Array() to actually use them, which obfuscates the purpose of my
code. It also took me an embarrassingly long time to figure out that
was what I had to do to make it work.

Is this because you're calling Cocoa APIs that traffic in Array? An
alternative is to make such APIs generic on Collection instead of
Array-specific.

For the longest time, I couldn’t understand why anyone would use
slices because I couldn’t actually use them with any API… and then
someone mentioned wrapping it in Array() here on Evolution and I
finally got it. Though it still feels like internal details that I
shouldn’t have to worry about leaking out...

Because they have a very real performance impact, they're not strictly
internal details.
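
The friction (and the copy it forces) is easy to reproduce (a sketch):

```swift
func total(_ xs: [Int]) -> Int { return xs.reduce(0, +) }

let numbers = [1, 2, 3, 4]
let middle = numbers[1...2]        // ArraySlice<Int>, shares storage with `numbers`
// total(middle)                   // error: ArraySlice<Int> is not [Int]
let sum = total(Array(middle))     // compiles, but copies just to satisfy the type

// The generic alternative accepts both array and slice without copying:
func genericTotal<S: Sequence>(_ xs: S) -> Int where S.Iterator.Element == Int {
  return xs.reduce(0, +)
}
// genericTotal(numbers) == 10, genericTotal(middle) == 5
```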

It would be nice to be able to specify that conversion behavior with
other types that have a similar subtype relationship.

Indeed.

• One thing that stood out was the interpolation format syntax, which seemed a bit convoluted and

difficult to parse:

"Something with leading zeroes: \(x.format(fill: zero, width:8))"

Have you considered treating the interpolation parenthesis more

like the function call syntax? It should be a familiar pattern and
easily parseable to someone versed in other areas of swift:

  “Something with leading zeroes: \(x, fill: .zero, width: 8)"

Yes, we've considered it:

1. "\(f(expr1, label2: expr2, label3: expr3))"

    String(describing: f(expr1, label2: expr2, label3: expr3))

2. "\(expr0 + expr1(label2: expr2, label3: expr3))"

    String(describing: expr0 + expr1(label2: expr2, label3: expr3))

3. "\((expr1, label2: expr2, label3: expr3))"

    String(describing: (expr1, label2: expr2, label3: expr3))

4. "\(expr1, label2: expr2, label3: expr3)"

    String(describing: expr1, label2: expr2, label3: expr3)

I think I'm primarily concerned with the differences among cases 1, 3,
and 4, which are extremely minor. 3 and 4 differ by just a set of
parentheses, though that might be mitigated by the ${...} suggestion
someone else posted. The point of using string interpolation is to
improve readability, and I fear these cases make too many things look
alike that have very different meanings. Using a common term like
"format" calls out what is being done.

It's possible to produce terser versions of the syntax that don't suffer
from this problem by using a dedicated operator:

"Column 1: \(n⛄(radix:16, width:8)) *** \(message)"
"Something with leading zeroes: \(x⛄(fill: zero, width:8))"

or even

"Column 1: \(n⛄radix:16⛄width:8) *** \(message)"
"Something with leading zeroes: \(x⛄fill:zero⛄width:8)”

There is still too much going on here to be readable, though I suppose
you could put part of it on a previous line. I really like Joe’s
suggestion of how to handle this using an
ExpressibleByStringInterpolation protocol + sugar. One of my favorite
things about the current \() is how readable it makes my strings
compared to every other language I have used! I definitely want to
make sure we keep that advantage...

+1

I think that should work for the common cases (e.g. padding,

truncating, and alignment), with string-returning methods on the type
(or even formatting objects ala NSNumberFormatter) being used for more
exotic formatting needs (e.g. outputting a number as Hex instead of
Decimal)

• Have you considered having an explicit .machine locale which

means that the function should treat the string as machine readable?
(as opposed to the lack of a locale)

No, we hadn't. What would be the goal of such a design?

Just to be able to explicitly spell the desired behavior (as opposed
to only being able to rely on the lack of something). The end usage
would most likely be the same in most places as .machine could
probably be the default value of the ‘locale’ parameter. However, it
would allow me to express my intention to other programmers (and
future Jon) that this function is explicitly handling the string as
machine readable. It is a difference in feeling and expression as
opposed to a difference in ability.

I think we still want to be able to call sort() on an Array<String>
without providing a predicate, which means there *will* be some default
comparison behavior.

Also, it may help future proof a bit. If we do end up adding the
concept of “human readable” strings down the line, one could see the
current locale being a really good default for them. There may still
be times where you want to override that and treat them as machine
readable for one reason or another. Having .machine allows you to
make that override explicitly, while still providing the reasonable
default.

I think if we have HumanReadableString one day we'll want a lightweight
way to convert between it and String, so people can just use that.

• I almost feel like the machine readableness vs human readableness
of a string is information that should travel with the string
itself. It would be nice to have an extremely terse way to specify
that a string is localizable (strawman syntax below), and that might
also classify the string as human readable.

let myLocalizedStr = $”This is localizable” //This gets used as the comment in the localization
file

Yes, there are also arguments for encoding "human readable" in the
type system. But as noted in
https://github.com/apple/swift/blob/master/docs/StringManifesto.md#future-directions
those ideas are scoped out of Swift 4.

Yes, I could see that too. I guess the main question is: Do we want
to be able to add methods that apply ONLY to human readable strings?
If so, then encoding it in the type system makes more sense.

Well, there exist lots of APIs that should only accept human-readable
strings, e.g. setting the title of a button.

···

on Fri Jan 20 2017, Jonathan Hull <jhull-AT-gbis.com> wrote:

On Jan 20, 2017, at 8:28 AM, Dave Abrahams <dabrahams@apple.com> wrote:

On Jan 20, 2017, at 5:48 AM, Jonathan Hull <jhull@gbis.com <mailto:jhull@gbis.com>> wrote:

--
-Dave

I'm going to trim out the bits where my answer is an uninteresting "Good" or "Okay, we'll leave that
for later" or what-have-you.

The operands and sense of the comparison are kind of lost in all this garbage. You really want to

see `foo < bar` in this code somewhere, but you don't.

Yeah, we thought about trying to build a DSL for that, but failed. I think the best possible

option would be something like:

  foo.comparison(case: .insensitive, locale: .current) < bar

The biggest problem is that you can build things like

    fu = foo.comparison(case: .insensitive, locale: .current)
    br = bar.comparison(case: .sensitive)
    fu < br // what does this mean?

We could even prevent such nonsense from compiling, but the cost in library API surface area is
quite large.

Is it? I think we're talking, for each category of operation that can be localized like this:

* One type to carry an operand and its options.
* One method to construct this type.
* One alternate version of each operator which accepts an
operand+options parameter. (I'm thinking it should always be the
right-hand side, so the long stuff ends up at the end; Larry Wall
noted this follows an "end-weight principle" in natural languages.)

I suspect that most solutions will at least require some sort of overload on the comparison
operators, so this may be as parsimonious as we can get.

Tell you what: why don't you prototype it and see what you can come up
with? Then we can think about the use-cases and see whether your
proposed API carries its weight.

I'm struggling a little with the naming and syntax, but as a general approach, I think we want
people to use something more like this:

   if StringOptions(case: .insensitive, locale: .current).compare(foo < bar) { … }

Yeah, we can't do that without making

  let a = foo < bar

ambiguous

Yeah, that's true. Perhaps we could introduce an attribute which can
be used to say "disfavor this overload compared to other
possibilities", but that seems disturbingly ad-hoc.

I think we want something a feature like that, some day for other
purposes anyway.

I know you want to defer this for now, so feel free to set this part
of the email aside,

I think I will :-)

but here's a quick list of solutions I've ballparked:

1. Your "one operand carries the options" solution.

2. As I mentioned, do something that effectively overloads comparison operators to return them in a
symbolic form. You're right about the ambiguity problem, though.

3. Like #2, but with slightly modified operators, e.g.:

  if localized(fu &< br, case: .insensitive) { … }

4. Reintroduce something like the old `BooleanType` and have *all* comparisons construct a symbolic
form that can be coerced to boolean. This is crazy, but actually probably useful in other places; I
once experimented with constructing NSPredicates like this.

  protocol BooleanProtocol { var boolValue: Bool { get } }

  struct Comparison<Operand: Comparable> {
    var negated: Bool
    var sortOrder: SortOrder
    var left: Operand
    var right: Operand

    func evaluate(_ actualSortOrder: SortOrder) -> Bool {
      // There's a circularity problem here, because `==` would itself return
      // a `Comparison`, but I think you get the idea.
      return (actualSortOrder == sortOrder) != negated
    }
  }
  extension Comparison: BooleanProtocol {
    var boolValue: Bool {
      return evaluate(left.compared(to: right))
    }
  }

  func < <ComparableType: Comparable>(lhs: ComparableType, rhs: ComparableType) -> Comparison {
    return Comparison(negated: false, sortOrder: .before, left: lhs, right: rhs)
  }
  func <= <ComparableType: Comparable>(lhs: ComparableType, rhs: ComparableType) -> Comparison {
    return Comparison(negated: true, sortOrder: .after, left: lhs, right: rhs)
  }
  // etc.

  // Now for our special String comparison thing:
  func localized(_ expr: Comparison<String>, case: StringCaseSensitivity? = nil, …) -> Bool {
    return expr.evaluate(expr.left.compare(expr.right, case: case, …))
  }

5. Actually add some all-new piece of syntax that allows you to add options to an operator. Bad part
is that this is ugly and kind of weird; good part is that this could probably be used in other
places as well. Strawman example:

  // Use:
  if fu < br %(case: .insensitive, locale: .current) { … }

  // Definition:
  func < (lhs: String, rhs: String, case: StringCaseSensitivity? = nil, …) -> Bool { … }

6. Punt on this until we have macros. Once we do, have the function be a macro which alters the
comparisons passed to it. Bad part is that this doesn't give us a solution for at least a version or
two.

However, is there a reason we're talking about using a separate
`Substring` type at all, instead of using `Slice<String>`?

Yes: we couldn't specialize its representation to store short
substrings inline, at least not without introducing an undesirable
level of complexity.

How important is that, though? If you're using a `Substring`, you
expect to keep the top-level `String` around and probably continue
sharing storage with it, so you're probably extending its lifetime
anyway. Or are you thinking of this as a speed optimization, rather
than a memory optimization?

It's both. It's true that it will rarely save space, but sometimes it
will. More importantly perhaps, it eliminates ARC traffic.

And is it worth not being able to have a `base` property on
`Substring` like we've added to `Slice`? I've occasionally thought it
might be useful to allow a slice's start and end indices to be
adjusted, essentially allowing you to "slide" the bounds of the slice
over the underlying collection; that wouldn't be possible with a
`Substring` design which sometimes inlined data.

I can't really picture what you have in mind, but the way I imagine
doing it isn't incompatible with the small string optimization.

ArraySlice is doomed :-)

Good news!

I've seen people struggle with the `Array`/`ArraySlice` issue when
writing recursive algorithms, so personally, I'd like to see a more
general solution that handles all `Collection`s.

The more general solution is "extend Unicode" or "extend Collection"
(and when a String parameter is needed, "make your method generic
over Collection/Unicode").

I know, but I know a lot of people really don't like doing that.

We need to fix that. Hopefully new generics features in Swift 4 will
make it a much more pleasant experience.

My usual practice is to use generics at almost any opportunity—when an
algorithm can work with any of a category of types, I'd rather take a
type parameter than hard-code the arbitrary type I happen to need
right now—but most people don't think that way. They'd prefer to
write:

  func doThing(to slice: inout ArraySlice<Int>) { … }
  func doThing(to array: inout Array<Int>) { doThing(to: array[0 ..< array.count]) }

(Yes, `array.startIndex ..< array.endIndex` would be slightly more proper, but we're not talking
about *my* style here.)

They're equally proper as long as you're dealing with concrete types.

Rather than:

  func doThing<C: RandomAccessCollection>(to collection: inout C)
    where C: RangeReplaceableCollection
  { … }

I haven't dug into this mindset that much; I suspect it comes from a
combination of believing that generics are difficult and scary

...which they are, a bit, right now, due our inability to state some of
the constraints we want to on protocols, and due to inadequate error messages.

not knowing the Collection protocols well enough to know which ones to
use, and simply not wanting to introduce additional complexity when
they don't need it.

In any case, though, I do understand why you would feel a` T` ->
`T.SubSequence` implicit coercion wouldn't carry its own weight,

It's less that it wouldn't carry weight than that we can only have the
implicit conversion in one direction.

If we had T -> T.SubSequence, the guidance for developers would be “only
store top-level Collections long-term, but otherwise, use SubSequences
everywhere.”

This basically forces everyone to be aware of the distinction between
String and Substring. That would be OK with me personally, but others
disagree with me. What do y'all think?

and `collection` *would* be a definite improvement on the status quo
for these developers.

That's the problem, right there, combined with the fact that we
don't have a terse syntax like s for going the other way. I think
it would be a much more elegant design, personally, but I don't see
the tradeoffs working out. If we can come up with a way to do it
that works, we should. So far, Ben and I have failed.

I guess what I'm saying is "keep trying; it's more valuable than you
might have anticipated". :^)

If you want to have one String type that you can “use everywhere without
thinking about it,” it has to be this way. As long as that is seen to
be important, I think we don't really have a choice.

A user who needs to optimize away copies altogether should use this guideline:
if for performance reasons you are tempted to add a `Range` argument to your
method as well as a `String` to avoid unnecessary copies, you should instead
use `Substring`.

I do like this as a guideline, though. There's definitely room in
the standard library for "a string and a range of that string to
operate upon".

I don't know what you mean. It's our intention that nothing but the
lowest level operations (e.g. replaceRange) would work on ranges
when they could instead be working on slices.

No, all I'm saying is that there's definitely a lot of value in
`Substring` or `Slice<String>`. Talking about a slice of a string is
something quite valuable that we don't currently support very well.

Ah.
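To make the guideline above concrete, here is a minimal sketch in terms of the proposed `Substring` type (the parsing function and the string being parsed are made up for illustration, not stdlib API):

```swift
// Instead of a (String, Range<String.Index>) pair, accept a Substring:
// it already carries both the storage and the bounds, and slicing a
// String to produce one is O(1) and copy-free.
func parseInteger(_ input: Substring) -> Int? {
    return Int(input) // Int has a StringProtocol-based initializer
}

let line = "width=42;height=7"
if let eq = line.firstIndex(of: "="),
   let semi = line.firstIndex(of: ";") {
    let slice = line[line.index(after: eq)..<semi] // a Substring, no copy
    print(parseInteger(slice) as Any) // Optional(42)
}
```

Because the slice shares storage with `line`, nothing is copied unless the callee explicitly requests a fresh `String`.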

**Note:** `Unicode` would make a fantastic namespace for much of
what's in this proposal if we could get the ability to nest types and
protocols in protocols.

I mean, sure, but then you imagine it being used generically:

   func parse<UnicodeType: Unicode>(_ source: UnicodeType) -> UnicodeType
   // which concrete types can `source` be???

All "string" types, including String, Substring, UTF8String, StaticString, etc.

I know that; my point is that it doesn't *read* well here.
Imagine that you are a workaday Swift programmer. You know the syntax
and the basic concrete types, but you have not read the standard
library top-to-bottom, and don't have detailed knowledge of the
protocols that it's built on. You read a source file with these three
declarations:

  func factor<Integer: BinaryInteger>(_ number: Integer) -> [Integer]

  func decode<Encoding: UnicodeEncoding> (_ data: Data, as encoding: Encoding.Type) -> String

  func parse<UnicodeType: Unicode>(_ source: UnicodeType) -> UnicodeType

Well, you chose a suboptimal spelling for the signature. I don't see a
problem with this:

  func parse<Source: Unicode>(_ source: Source) -> Source

I think you would be able to understand what `factor(_:)` and
`decode(_:as:)` do, even if you had never seen the `BinaryInteger` and
`UnicodeEncoding` protocols, because their names clearly and simply
say what sort of type would conform to the protocol. You would guess
that familiar types like `Int` could be used with `factor(_:)`, and
you might not know what the concrete `UnicodeEncoding` types were
called, but you'd guess they probably had names with terms of art like
`UTF8` in them somewhere.

But what about `parse(_:)`? Sure, `Unicode` suggests it has something
to do with string handling, but it doesn't suggest *a string*.

Then call the type parameter "Text," "String," or "Str"

As I said, I would assume it has something to do with the Unicode
standard—maybe a type that does Unicode table lookups, for instance. I
get that you're using it as an adjective, but it's such a specific
technical term that using it to describe any chunk of text data is
misleading, even if that text *is* required to be Unicode text.

Perhaps you could call it `StringProtocol`, or `Textual`, or
`UnicodeString`. But I really think just `Unicode` does not do a good
job of conveying the meaning of the type.

OK, noted. I'm not attached to "Unicode," but I think it works well. I
think I'd go with StringProtocol if I had to change it.

We think random-access
*code-unit storage* is a reasonable requirement to impose on all `String`
instances.

Wait, you do? Doesn't that mean either using UTF-32, inventing a
UTF-24 to use, or using some kind of complicated side table that
adjusts for all the multi-unit characters in a UTF-16 or UTF-8
string? None of these sound ideal.

No; I'm not sure why you would think that.

Oh, sorry. I read that as "random-access code-point
[i.e. UnicodeScalar] storage", which I don't think would be a
reasonable requirement. My mistake.

Then integers can easily be translated into offsets into a `String`'s `utf16`
view for consumption by Cocoa:

let cocoaIndex = s.utf16.offset(of: String.UTF16Index(i))
let swiftIndex = s.utf16.index(offset: cocoaIndex)

I worry that this conversion will be too obscure.

I very much hope it will be rare enough that it'll be OK, but if it isn't, we can always have

  let cocoaIndex = s.utf16Offset(of: i)

and/or take other measures to simplify it.

I think that would still be too obscure.

To give you an idea of what you're contending with here, take a look at a few Stack Overflow
questions:

* swift - Convert String.Index to Int or Range<String.Index> to NSRange - Stack Overflow
* ios - Convert Range<Int> to Range<String.Index> - Stack Overflow
* string - How to convert "Index" to type "Int" in Swift? - Stack Overflow

Objective-C programmers *do not know* that `NSInteger` and `NSRange`
indices are UTF-16 indices. They don't think about what the
"character" in `-characterAtIndex:` really means; they just take it at
face value. That means putting "UTF-16" in the name will not help them
identify the API as the correct one to use. It'd be like advertising a
clinic to people with colds by saying you do "otolaryngology"—you're
just not speaking the language of your audience.

I see two ways to make it really, really obvious which API is the
right one to use. The first is to explicitly refer to something like
"objc", "cocoa", "foundation", or "ns" in the name. The second is to
use full-width conversions, which people understand are the default
way to convert between two things. (Actually, a lot of developers
literally call these "casts" and assume they're extremely low cost.)

I think that, if there's a `String.Index.init(_: Int)` and an
`Int.init(_: String.Index)`, people will almost certainly identify
these as the right way to convert between Foundation's `Int` indices
as `String.Index`es. They certainly don't seem to be figuring it out
now.

You're bringing me around to agreeing with you on this. There are still
pitfalls, though. The following will trap for some Strings:

  assert(Int(s.startIndex) + 1 == Int(s.index(after: s.startIndex)))
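For illustration, here is roughly what those conversions amount to when written against the `utf16` view (free functions with hypothetical names, not the proposed initializers), along with an example of why an Int offset is not a Character count:

```swift
// Hypothetical helpers expressing Int (UTF-16 offset) <-> String.Index
// conversion via the utf16 view; in the unified-index model a
// String.Index is usable directly in string.utf16.
func utf16Offset(of index: String.Index, in string: String) -> Int {
    return string.utf16.distance(from: string.utf16.startIndex, to: index)
}

func stringIndex(atUTF16Offset offset: Int, in string: String) -> String.Index {
    return string.utf16.index(string.utf16.startIndex, offsetBy: offset)
}

let s = "🙂 hi"
let h = s.firstIndex(of: "h")!
// "h" is the third Character, but the emoji occupies two UTF-16 units:
print(utf16Offset(of: h, in: s)) // 3, not 2
```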

1. UTF-16 is the storage format, at least for an "ordinary" `Swift.String`.

It will be, in the common case, but many people seem to want plain String to be able to store
UTF-8, and I'm not yet prepared to rule that out.

I suppose it doesn't matter what the actual storage format is as long
as we get #2 (`UTF16View` indexed by `String.Index`).

If we allow String to store UTF-8, then that won't hold. There *would*
be a full-width conversion from String.Index to String.UTF16View.Index,
but the UTF16View will need to store additional information to track the
UTF-16 code units corresponding to the underlying UTF-8.

If we go with the facade design, I suppose it would simply be that the
default string storage also uses its `UTF16Index` for its
`CodeUnitIndex`. Other string storages could arrange their indices in
other ways.

Right.

2. `String.Index` is used down to the `UTF16View`. It stores a UTF-16 offset.

3. With just the standard library imported, `String.Index` does not
have any obvious way to convert to or from an `Int` offset; you use
`index(_:offsetBy:)` on one of the views. `utf16`'s implementation
is just faster than the others.

This is roughly where we are today.

Yes, except for index interchangeability between `CharacterView`,
`UnicodeScalarView`, and `UTF16View`. But the suggestion that we
provide `init(_:)`s is the key part of this request.

#### String Interpolation

Let's go to a separate thread for this, as you suggested.

Will do.

So you might end up having to wrap it in an `init(cString:)` anyway, just for convenience. Oh well, it was worth exploring.

I think you ended up where we did.

Unsurprising, I suppose. :^)

Going out of order briefly:

2. I don't really understand how you envision using the "data
specific to the underlying encoding" sections. Presumably you'll
want to convert that data into a string eventually, right?

It already is in a string. The point is that we have a way to scan
the string looking for ASCII patterns without transcoding it.

So, if I understand this properly, you're imagining that
`extendedASCII` has indices interchangeable with `codeUnits`, but
doesn't do any sort of complicated Unicode decoding, so you can rip
through the string with `extendedASCII` and then use the indices to
extract actual, fully decoded Unicode data from substrings of
`codeUnits`?

Yes.

1. It doesn't sound like you anticipate there being any way to
compare an element of the `extendedASCII` view to a character
literal. That seems like it'd be really useful.

We don't have character literals :-)

Excuse me, Unicode scalar literals. :^)

However, I agree that there needs to be a way to do it. The thing
would be to make it easy to construct a UInt8 from a string literal.

Honestly, I might consider having elements which are not plain
`UInt8`s, but `ASCIIScalar?`s, where `ASCIIScalar` looks something
like:

  struct ASCIIScalar {
    // Using a 7-bit integer means `ASCIIScalar?`'s tag bit can fit in the same byte.
    let _value: Builtin.Int7

    var value: UInt8 {
      return UInt8(Builtin.zext_Int7_Int8(_value))
    }
    init?(_convertingValue value: UInt8) {
      let result: (value: Builtin.Int7, error: Builtin.Int1) =
        Builtin.u_to_u_checked_trunc_Int8_Int7(value._value)
      guard Bool(result.error) == false else { return nil }
      _value = result.value
    }
    init?<Integer: BinaryInteger>(value: Integer) {
      guard let sizedValue = UInt8(exactly: value) else {
        return nil
      }
      self.init(_convertingValue: sizedValue)
    }
    init?(_ scalar: UnicodeScalar) {
      guard scalar.isASCII else { return nil }
      self.init(_convertingValue: UInt8(scalar.value))
    }
  }
  extension ASCIIScalar: ExpressibleByUnicodeScalarLiteral, ExpressibleByIntegerLiteral {
    // Notional, not necessarily actual, implementation
    init(unicodeScalarLiteral value: UnicodeScalar) {
      self.init(value)!
    }

    init(integerLiteral value: UInt8) {
      self.init(value: value)!
    }
  }

Then you could write something like (if I understand what you're envisioning for the
`ExtendedASCIIView`):

  for (char, i) in zip(source.extendedASCII, source.extendedASCII.indices) {
    switch (state, char) {
    …
    // Look for a single or double quote to start the string
    case (.expectingValue, "'"?), (.expectingValue, "\""?):
      state = .readingStringLiteral(quoteIndex: i)

    // Scan to the end of the string
    case (.readingStringLiteral(let quoteIndex), _):
      // Is this the terminator?
      if char == source.extendedASCII[quoteIndex] {
        let range = source.extendedASCII.index(after: quoteIndex) ..< i
        // Note that we extract the value here with `codeUnits`
        let value = String(source.codeUnits[range])

        consumeValue(value)
        state = .expectingComma
      }
      else {
        // Do nothing; just scan past this character.
      }
    …
    }
  }

Relying on the fact that you're switching against an `ASCIIScalar`,
rather than a `UInt8`, to allow Unicode scalar literals to be used.

(There are other possible designs as well; a generic
`ASCIIScalar<CodeUnit: UnsignedInteger>` which directly wrapped a code
unit without changing its storage at all would be one interesting
example.)

This is a good idea that we should explore.

If it *is* similar to `UnicodeCodec`, one thing I will note is that
the way `UnicodeCodec` works in code units is rather annoying for
I/O. It may make sense to have some sort of type-erasing wrapper
around `UnicodeCodec` which always uses bytes. (You then have to
worry about endianness, of course...)

Take a look at the branch and let me know how this looks like it would work for I/O.

I don't claim to understand everything I'm seeing, but at a quick
glance, I really like the overall design. It's nice to see it
encapsulating a stateless algorithm; I think that will make it more
flexible.

However, there's an important tweak needed for I/O: Having a truncated
character at the end of the collection needs to be detectable as a
condition distinct from other errors, because a buffer might contain
(say) two bytes of a three-byte UTF-8 character, with the third byte
expected to arrive later. For instance, you might have:

  public enum ParseResult<T, Index> {
    case valid(T, resumptionPoint: Index)
    case error(resumptionPoint: Index)
    case partial(resumptionPoint: Index)
    case emptyInput
  }

Or:

  public enum ParseResult<T, Index> {
    case valid(T, resumptionPoint: Index)
    case error(resumptionPoint: Index)
    case nothing(resumptionPoint: Index)
  }

Unlike `error`'s `resumptionPoint`, which is after the garbled character, `partial` or `nothing`'s
would be *before* the partial character.

I thought about this use-case, but I am not convinced we need it. An
algorithm that has to decode from buffers can always be written such
that, when it finds an error whose resumption point is at the end of the
buffer, it takes all the code units from the previous resumption point
to this one and shifts them to the beginning of the buffer. Why isn't
that adequate?
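For what it's worth, that shifting strategy can be sketched like this (a rough illustration with a hypothetical helper name, assuming UTF-8 input): before handing a buffer to the decoder, peel off any trailing bytes that begin a scalar the buffer doesn't finish, and prepend them to the next read:

```swift
// Sketch: split off a trailing incomplete UTF-8 scalar so it can be
// carried over to the front of the next buffer.
func splitIncompleteSuffix(_ buffer: [UInt8]) -> (complete: ArraySlice<UInt8>, carry: ArraySlice<UInt8>) {
    var i = buffer.endIndex
    var continuations = 0
    // Walk back over up to three trailing continuation bytes (0b10xxxxxx).
    while i > buffer.startIndex, continuations < 3,
          buffer[i - 1] & 0b1100_0000 == 0b1000_0000 {
        i -= 1
        continuations += 1
    }
    guard i > buffer.startIndex else { return (buffer[...], []) }
    let lead = buffer[i - 1]
    // Total scalar length implied by the lead byte's bit pattern.
    let expected: Int
    if lead & 0b1000_0000 == 0 { expected = 1 }               // ASCII
    else if lead & 0b1110_0000 == 0b1100_0000 { expected = 2 }
    else if lead & 0b1111_0000 == 0b1110_0000 { expected = 3 }
    else if lead & 0b1111_1000 == 0b1111_0000 { expected = 4 }
    else { expected = 1 } // invalid lead byte; let the decoder flag it
    if expected > continuations + 1 {
        // Truncated scalar: hold its bytes back for the next buffer.
        return (buffer[..<(i - 1)], buffer[(i - 1)...])
    }
    return (buffer[...], [])
}

// Two bytes of the three-byte "€" (E2 82 AC) arrive first:
let partial: [UInt8] = [0x68, 0x69, 0xE2, 0x82] // "hi" + partial "€"
let (complete, carry) = splitIncompleteSuffix(partial)
// complete == [0x68, 0x69], carry == [0xE2, 0x82]
```

Genuinely garbled sequences (e.g. a stray continuation byte) are deliberately left in `complete` so the decoder reports them as errors rather than being carried forward forever.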

I had a whole bunch of stuff here earlier where I discussed replacing
`Sequence` with a new design that had a `Collection`-like interface,
except that the start index was returned by a `makeStartIndex()`
method which could only be called once. By tracking the lifetimes of
indices, the sequence could figure out when a portion of its data was
no longer accessible and could be discarded. However, I've tweaked
that design a lot in the last day and haven't come up with anything
that's quite satisfactory, so I'll leave that discussion aside for
now.

IMO we really don't want to weigh indices down with anything that heavy,
and the best way to do this is to have a deque-like data structure and
drop parts off the front of it as they are no longer needed.

(A side note related to `UnicodeEncoding`'s all-static-member design:
I've taken advantage of this "types as tables of stateless methods and
associated types" pattern myself (see
https://github.com/brentdax/SQLKit/blob/master/Sources/SQLKit/SQLClient.swift),
and although it's very useful, it always feels like I'm fighting the
language. For these occasions, I wonder if it might make sense to
introduce a concept of "singleton types" or "static types" where the
instance and type member namespaces are unified, `T.Type` is the same
as `T`, `T.init()` is the same as `T.self`, and all stored properties
are treated as static (and thus shared by all instances). That's
properly the topic of a different thread, of course; it just occurred
to me as I was writing this.)

Mmm-hmm.

That way, if you just write `String`, you get something flexible; if
you write `String<NFCNormalizedUTF16StringStorage>`, you get
something fast.

This only works in the "facade" variant where you have a defaulted
generic parameter feature, but yes, that's the idea of that variant.

Yeah, I'm speaking specifically of the defaulted case, which frankly
is the only one I think is *really* extremely promising.

What does that mean for `String.Index` unification?

Not much. We never intended for indices to be interchangeable among
different specific string types (other than a string and its
SubSequence).

I'm more asking, is it possible that different string types would have
different interchangeability rules? For instance:

* When using `UTF8StringStorage`, `String.Index` and `String.UTF8View.Index` are interchangeable.
* When using `UTF16StringStorage` (or `NSString`?), `String.Index` and `String.UTF16View.Index` are
interchangeable.
* When using `UTF32StringStorage`, `String.Index` is *not* interchangeable with either of the
`UTFnView` indices.

It's possible.

`description` would have to change to be
localizable. (Specifically, it would have to take a locale.) This
is doable, of course, but it hasn't been done yet.

Well, it could use the current locale. These things are supposed to
remain lightweight.

I think that, if you're gonna go to the trouble of making your
`description` localizable, there should be a way to inject a
locale. That would make testing your localizations easier, for
instance.

Yes, well it should be possible to change the current locale, but that's
really a topic for another day.

(There's also the small matter of `LosslessStringConvertible`. Oops?)

What in particular are you concerned with here?

···

on Mon Jan 23 2017, Brent Royal-Gordon <swift-evolution@swift.org> wrote:

On Jan 21, 2017, at 3:49 AM, Brent Royal-Gordon <brent@architechies.com> wrote:

### `StaticString`

One complication there is that `Unicode` presumably supports
mutation, which `StaticString` doesn't.

No, Unicode doesn't support mutation. A mutable Unicode will
usually conform to Unicode and RangeReplaceableCollection (but not
MutableCollection, because replacing a grapheme is not an O(1)
operation).

Oh, of course, that makes a lot of sense. Hopefully we won't need
anything special from mutable `StringStorage`s. (That is, members that
are only needed if a type is *both* `StringStorage` *and*
`RangeReplaceableCollection`.)

--
-Dave

The ultimate model of strings is going to be complicated whether or not it’s on String itself, although I argue that regardless of that complexity, Swift inherently starts from a much better place than f.ex. Java from just having Array vs. 30 different Array-like things. That dovetails into the point I was trying to make up-thread, which is that complicating the overall type space to serve specific use cases practically results in less-experienced users not knowing about or not finding it, even when they need to. Furthermore, “use UTF8String when you need it to be super-fast (and don’t we all want to be super fast???)” is the kind of cargo-culting that sticks, not “when caveats A, B, C, and D apply and you want to be fast and you’ve considered all the Unicode implications and when the optimizer breaks down and you have observed a performance problem you should consider etc etc etc”.

···

On Jan 25, 2017, at 4:21 PM, Ben Cohen <ben_cohen@apple.com> wrote:

On Jan 24, 2017, at 8:16 PM, Zach Waldowski via swift-evolution <swift-evolution@swift.org <mailto:swift-evolution@swift.org>> wrote:

I strongly want Swift to have world-class string processing, but I believe even more strongly in the language's spirit of progressive disclosure. Newcomers to Swift's current String API find it difficult (something I personally disagree with, but that's neither here nor there); I don't think that difficulty is solved by aggressively use-specific type modeling. I instead think it gives rise to the same severe cargo-culting that gets us the scarily prevalent String.Index.init(offset:) extensions in the current model.

This cuts both ways though. In the spirit of progressive disclosure, should we complicate String’s model for users in order for it to accommodate both UTF8 and UTF16 backing stores?

If String can be UTF8-backed, that would mean that we could not tag the UTF16 collection view as conforming to RandomAccessCollection. That would mean you couldn’t use algorithms that relied on random access on it. It would exhibit random access characteristics sometimes – `UTF16View.index(_:offsetBy:)` would run in constant time when the string was backed by UTF16, but when backed by UTF8, it would run in linear time. Given, as we’ve discussed here, you need to do these kinds of index calculations sometimes to interoperate with APIs that traffic in code unit offsets, what do we need to tell users about performance when they need to do it? That "it’s probably OK unless caveat caveat caveat"?

On the other hand, if we separate UTF8-backed strings into another type, we can keep String simple. Then for those power users who really absolutely must operate on a UTF8-backed string because of their performance needs, they have another type, which they can progressively discover when they find they need it.

I’m not saying this is enough to rule out UTF8-backed strings, but I don’t think “it’ll be a simpler model for most users” is the argument in favor of it.

Could you include the latest ICU alongside the Swift standard library?

To what end?

When iOS 10 and macOS 10.12 were released (2016-09-13),
their "libicucore" was based on ICU 57 (2016-03-23),
with support for Unicode 8 (2015-06-17).

They were using a Unicode standard from 15 months ago,
instead of Unicode 9 from 3 months ago (2016-06-21).
This can only be fixed by changing the ICU schedule.

However, the Swift 4 libraries could include ICU 58 now.
They'd have Unicode 9 conformance during implementation,
and also when deployed back to iOS 7 or macOS 10.9.

...and would be inconsistent with Foundation on macOS and iOS, and until
Swift is embedded in the OS, would grow the size of iOS apps by a lot.
I'm pretty sure that's not an acceptable state of affairs.

···

on Wed Jan 25 2017, Ben Rimmington <me-AT-benrimmington.com> wrote:

On 25 Jan 2017, Dave Abrahams wrote:

on Tue Jan 24 2017, Ben Rimmington wrote:

That's assuming you need ICU 58 for Unicode 9 conformance:
<https://github.com/apple/swift/blob/master/docs/StringManifesto.md#unicode-9-conformance>

If Swift always uses the latest ICU it will sometimes behave
inconsistently with Foundation. If you want to use the latest ICU
yourself, you can always put it in your app bundle.

I think Linux apps can bundle ICU for swift-corelibs-foundation.
But a Swift 4 app deployed to iOS 7 or macOS 10.9 will be using
ICU 51 with Unicode 6.2 support.

-- Ben

--
-Dave

As suggested, I created a pull request for the String manifesto adding an
unsafe String API discussion.

I included in the comments a tentative implementation in Swift 3.

I focused for now on the most essential capabilities that, hopefully, are
not too controversial.

Regards,

Olivier

To: Olivier Tardieu/Watson/IBM@IBMUS
Cc: Ben Cohen <ben_cohen@apple.com>, swift-evolution <swift-
evolution@swift.org>
Date: 01/31/2017 02:24 PM
Subject: Re: [swift-evolution] Strings in Swift 4
Sent by: dabrahams@apple.com

> Thanks for the clarifications.
> More comments below.
>
>
>> Maybe it wasn't clear from the document, but the intention is that
>> String would be able to use any model of Unicode as a backing store, and
>> that you could easily build unsafe models of Unicode... but also that
>> you could use your unsafe model of Unicode directly, in string-ish ways.

>
> I see. If I understand correctly, it will be possible for instance to
> implement an unsafe model of Unicode with a UInt8 code unit and a
> maxLengthOfEncodedScalar equal to 1 by only keeping the 8 lowest bits of
> Unicode scalars.

Eh... I think you'd just use an unsafe Latin-1 for that; why waste a
bit?

Here's an example (work very much in-progress):
https://github.com/apple/swift/blob/9defe9ded43c6f480f82a28d866ec73d803688db/test/Prototypes/Unicode.swift#L877

>> > A lot of machine processing of strings continues to deal with 8-bit
>> > quantities (even 7-bit quantities, not UTF-8). Swift strings are
>> > not very good at that. I see progress in the manifesto but nothing
>> > to really close the performance gap with C. That's where "unsafe"
>> > mechanisms could come into play.
>>
>> extendedASCII is supposed to address that. Given a smart enough
>> optimizer, it should be possible to become competitive with C even
>> without using unsafe constructs. However, we recognize the importance
>> of being able to squeeze out that last bit of performance by dropping
>> down to unsafe storage.
>
> I doubt a 32-bit encoding can bridge the performance gap with C in
> particular because wire protocols will continue to favor compact
> encodings. Incoming strings will have to be expanded to the
> extendedASCII representation before processing and probably compacted
> afterwards. So while this may address the needs of computationally
> intensive string processing tasks, this does not help simple parsing
> tasks on simple strings.

I'm pretty sure it does; we're not going to change representations.

extendedASCII doesn't require anything to actually be expanded to
32-bits per code unit, except *maybe* in a register, and then only if
the optimizer isn't smart enough to eliminate zero-extension followed by
comparison with a known narrow value. You can always

  latin1.lazy.map { UInt32($0) }

to produce 32-bit code units. All the common encodings are ASCII
supersets, so this will “just work” for those. The only place where it
becomes more complicated is in encodings like Shift-JIS (which might not
even be important enough to support as a String backing-storage format).

>
>> > To guarantee Unicode correctness, a C string must be validated or
>> > transformed to be considered a Swift string.
>>
>> Not really. You can do error-correction on the fly. However, I think
>> pre-validation is often worthwhile because once you know something is
>> valid it's much cheaper to decode correctly (especially for UTF-8).
>
> Sure. Eager vs. lazy validation is a valuable distinction, but what I am
> after here is side-stepping validation altogether. I understand now that
> user-defined encodings will make side-stepping validation possible.

Right.

>
>> > If I understand the C String interop section correctly, in Swift 4,
>> > this should not force a copy, but traversing the string is still
>> > required.
>>
>> *What* should not force a copy?
>
> I would like to have a constructor that takes a pointer to a
> null-terminated sequence of bytes (or a sequence of bytes and a length)
> and turns it into a Swift string without allocation of a new backing store
> for the string and without copying the bytes in the sequence from one
> place in memory to another.

We probably won't expose this at the top level of String, but you should
be able to construct an UnsafeCString (which is-a Unicode) and then, if
you really need the String type, construct a String from that:

   String(UnsafeCString(ntbs))

That would not do any copying.

> I understand this may require the programmer to handle memory
> management for the backing store.
>
>> > I hope I am correct about the no-copy thing, and I would also like to
>> > permit promoting C strings to Swift strings without validation. This
>> > is obviously unsafe in general, but I know my strings... and I care
>> > about performance. ;)
>>
>> We intend to support that use-case. That's part of the reason for the
>> ValidUTF8 and ValidUTF16 encodings you see here:
>> https://github.com/apple/swift/blob/unicode-rethink/stdlib/public/core/Unicode2.swift#L598
>> and here:
>> https://github.com/apple/swift/blob/unicode-rethink/stdlib/public/core/Unicode2.swift#L862
>
> OK
>
>> > More importantly, it is not possible to mutate bytes in a Swift string
>> > at will. Again it makes sense from the point of view of always
>> > correct Unicode sequences. But it does not for machine processing of
>> > C strings with C-like performance. Today, I can cheat using a
>> > "_public" API for this, i.e., myString._core._baseAddress!. This
>> > should be doable from an official "unsafe" API.
>>
>> We intend to support that use-case.
>>
>> > Memory safety is also at play here, as well as ownership. A proper
>> > API could guarantee the backing store is writable for instance, that
>> > it is not shared. A memory-safe but not unicode-safe API could do
>> > bounds checks.
>> >
>> > While low-level C string processing can be done using unsafe memory
>> > buffers with performance, the lack of bridging with "real" Swift
>> > strings kills the deal. No literals syntax (or costly coercions),
>> > none of the many useful string APIs.
>> >
>> > To illustrate these points here is a simple experiment: code written
>> > to synthesize an HTTP date string from a bunch of integers. There are
>> > four versions of the code going from nice high-level Swift code to
>> > low-level C-like code. (Some of this code is also about avoiding ARC
>> > overheads, and string interpolation overheads, hence the four
>> > versions.)
>> >
>> > On my macbook pro (swiftc -O), the performance is as follows:
>> >
>> > interpolation + func: 2.303032365s
>> > interpolation + array: 1.224858418s
>> > append: 0.918512377s
>> > memcpy: 0.182104674s
>> >
>> > While the benchmarking could be done more carefully, I think the main
>> > observation is valid. The nice code is more than 10x slower than the
>> > C-like code. Moreover, the ugly-but-still-valid-Swift code is still
>> > about 5x slower than the C-like code. For some applications, e.g. web
>> > servers, these kinds of numbers matter...
>> >
>> > Some of the proposed improvements would help with this, e.g., small
>> > strings optimization, and maybe changes to the concatenation
>> > semantics. But it seems to me that a big performance gap will remain.
>> > (Concatenation even with strncat is significantly slower than memcpy
>> > for fixed-size strings.)
>> >
>> > I believe there is a need and an opportunity for a fast "less safe"
>> > String API. I hope it will be on the roadmap soon.
>>
>> I think it's already in the roadmap...the one that's in my head. If you
>> want to submit a PR with amendments to the manifesto, that'd be great.

···

dabrahams@apple.com wrote on 01/31/2017 02:23:49 PM:
> From: Dave Abrahams <dabrahams@apple.com>

on Mon Jan 30 2017, Olivier Tardieu <tardieu-AT-us.ibm.com> wrote:
> dabrahams@apple.com wrote on 01/24/2017 05:50:59 PM:
>> Also thanks very much for the example below; we'll definitely
>> be referring to it as we proceed forward.
>
> Here is a gist for the example code:
> Several ways to compose an HTTP date in Swift · GitHub
>
> I can sketch key elements of an unsafe String API and some motivating
> arguments in a pull request. Is this what you are asking for?

That would be awesome, thanks!

--
-Dave

Please see discussion inline.

>
> One ask - make string interpolation great again?
>
> Taking from examples supplied at https://github.com/apple/swift/blob/master/docs/StringManifesto.md#string-interpolation
>
> "Column 1: \(n.format(radix:16, width:8)) *** \(message)"
>
> Why not use:
>
> "Column 1: ${n.format(radix:16, width:8)} *** $message"
>
> Which for my preference makes the syntax feel more readable, avoids the "double ))" in terms of string interpolation termination and function termination points. And if that's not enough brings the "feel" of the language to be scriptable in nature common in bash, sh, zsh and co.. scripting interpreters and has been adopted as part of ES6 interpolation syntax[1].
>

This idea came up once before on Swift Evo. The arguments against are:

1. Swift already has an “escape” character for inserting non literal stuff into strings - the “\” character. Either you have two - increasing complexity for both the developer and the Swift compiler’s tokeniser - or you have to change everything that uses “\” to use $ e.g. $t $n instead of \t \n.

I would claim that this serves as a reinforcement of the distinction. "\t" is not the same behavior as "\(someVariable)", both conceptually - I think there is a clear distinction between inserting a "constant symbol" and inserting "the string content of a variable" - and semantically - while you would use \t to insert a tab, you are mandated by the semantics to use \( .. ) to insert the contents of a variable.

Hi Maxim,

there was quite a discussion on this matter a few months ago - I can't find the thread right now, but the consensus of the majority here seemed to be that the current status is desirable by most and that it reflects a unified philosophy covering anything that will "not print as typed". I believe that this is something that has been discussed here several times - keep in mind Swift 4 is supposed to be as backward compatible as possible, with breaking changes requiring severe justification - and there is unlikely to be one for this other than "I like it better this way".

···

On Jan 20, 2017, at 12:55 PM, Maxim Veksler via swift-evolution <swift-evolution@swift.org> wrote:
On Fri, Jan 20, 2017 at 1:09 PM Jeremy Pereira <jeremy.j.pereira@googlemail.com <mailto:jeremy.j.pereira@googlemail.com>> wrote:
> On 20 Jan 2017, at 10:30, Maxim Veksler via swift-evolution <swift-evolution@swift.org <mailto:swift-evolution@swift.org>> wrote:

2. The dollar sign is a disastrous symbol to use for a special character, especially in the USA where it is commonly used to signify the local currency. Yes, I know it is used for interpolation in Perl, Shell and Javascript and others, but “this other language I like does X, therefore Swift should do X” is not a good argument.

Please name concrete examples? I would expect literal occurrences of $variableName to be rare enough to justify requiring the developer to escape them as \$variableName, and likewise for ${variableName}; if the expected output is plain text, I wouldn't imagine "\$\{variableName\}" to be a far-reaching expectation.

The use of the $ symbol is more far-reaching[1], and is being adopted constantly as the selected pattern for even recent developments such as Facebook's GraphQL query syntax[2], which to the best of my knowledge was invented in the US.

3. There is already quite a lot of code that uses \( … ) for interpolation, this would be a massive breaking change.

True, but going forward that would enable "better readable" code for a larger number of users. Additionally I would suggest that automatic conversion using the Swift Migration Assistant should be possible.
